Speech to Text (STT) Benchmark

Which STT engine in 2025 is most suitable for dyslexic users aged 6 to 12 in Hong Kong?

Short Abstract

This study compares the leading STT engines of 2025 under four speech conditions: clean speech, noisy speech, accented speech, and specialist speech. OpenAI Whisper and Gemini performed comparably overall, but OpenAI Whisper was better suited to WordCraft's user base. Whisper excels in noisy environments, while Gemini leads in accented and specialist speech. Google Cloud ASR and Microsoft Azure performed poorly in most conditions.

Research Context

Many STT engines are on the market, each with different strengths. Selecting the most appropriate one for WordCraft, whose users are children with dyslexia in Hong Kong, requires a head-to-head comparison under a framework that simulates real-world challenges. WordCraft must handle user input spoken with a Hong Kong accent and recorded in noisy classroom or roadside environments, so it needs an API that performs well under these conditions.

Research Methods

To achieve the research objectives, a custom dataset containing 10 minutes of speech covering four different conditions was assembled.

Clean Speech

Speech from a TED talk, representing a noise-free environment.

Noisy Speech

Clean speech with added noise from a school environment.

Accent Speech

Speech from TVB News spoken with a Hong Kong accent.

Specialist Speech

Speech from a YouTube video by a well-known surgeon, representing technical vocabulary.

The performance of each API is evaluated using the Word Error Rate (WER) as the main metric.
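As an illustration of the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the engine's output, divided by the number of words in the reference. The sketch below is a minimal version of that standard definition, not the study's actual evaluation script. The function name word_error_rate and the normalization (lowercasing and splitting on whitespace) are assumptions, since the study does not state how transcripts were normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1 when the engine inserts many extra words, because the denominator is the reference length only.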

However, because the study compares engines across several dimensions, raw WER values are awkward to read: they are small decimals, and a higher WER means worse performance, which is counterintuitive. The reciprocal, 1/WER, was therefore used as the score for each STT engine under each speech condition, so that a higher score indicates better performance.

Because the target users are dyslexic children aged 6 to 12 in Hong Kong who need to use the product in a classroom environment, the four speech conditions were weighted differently: 0.15 for clean speech, 0.4 for noisy speech, 0.4 for accented speech, and 0.05 for specialist speech. Each engine's overall score was obtained by weighting its per-condition scores accordingly.
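The scoring described here can be sketched as follows, assuming the overall score is a weighted sum of the per-condition 1/WER scores (the study does not spell out the exact aggregation). The names WEIGHTS, condition_score, and weighted_score are hypothetical, and the WER values in the example are placeholders chosen for easy arithmetic, not results from the study.

```python
from typing import Mapping

# Weights stated in the study for each speech condition.
WEIGHTS = {"clean": 0.15, "noisy": 0.40, "accent": 0.40, "specialist": 0.05}


def condition_score(wer: float) -> float:
    """Score for one condition: the reciprocal of the WER."""
    if wer <= 0:
        # 1/WER is undefined for a perfect transcript; how the study
        # handled that case is not stated, so this sketch rejects it.
        raise ValueError("WER must be positive to compute 1/WER")
    return 1.0 / wer


def weighted_score(wer_by_condition: Mapping[str, float]) -> float:
    """Assumed aggregation: weighted sum of per-condition scores."""
    return sum(
        weight * condition_score(wer_by_condition[condition])
        for condition, weight in WEIGHTS.items()
    )


# Placeholder WERs for a hypothetical engine (not measured data).
example_wer = {"clean": 0.05, "noisy": 0.20, "accent": 0.10, "specialist": 0.25}
print(weighted_score(example_wer))
```

With these placeholder values the per-condition scores are 20, 5, 10, and 4, so the weighted total is 0.15 × 20 + 0.4 × 5 + 0.4 × 10 + 0.05 × 4 = 9.2. Because the reciprocal grows quickly as WER approaches zero, a very low WER in one heavily weighted condition can dominate the total.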

Study Results

Whisper and Gemini perform comparably overall, but Whisper is better suited to WordCraft's users. Whisper performs well in noisy environments, while Gemini leads in accented and specialist speech. Google Cloud and Microsoft Azure perform poorly in most conditions. After weighting the scores, OpenAI Whisper emerges as the most suitable STT engine for WordCraft.

ISDN2001/2002: Second Year Design Project
