
Inclusive & Assistive Products
ISDN2001/2002: Second Year Design Project

Short Abstract
This study compares the leading speech-to-text (STT) engines of 2025 and evaluates them under four speech conditions: clean speech, noisy speech, accented speech, and specialist speech. OpenAI Whisper and Gemini performed equally well, but OpenAI Whisper was better suited to WordCraft's user base. Whisper excels in noisy environments, whereas Gemini leads in accented and technical speech. Google Cloud ASR and Microsoft Azure performed poorly in most conditions.

Research Context
A variety of STT engines are on the market, each with different strengths. Selecting the most appropriate STT engine for WordCraft on a peer-to-peer basis for children with dyslexia in Hong Kong requires a more complex framework that simulates real-world challenges. WordCraft must handle speech input with Hong Kong accents and from noisy classroom and roadside environments, so it needs an API that performs well under these conditions.

Research Methods
To achieve the research objectives, a custom dataset containing 10 minutes of speech covering four different conditions was assembled.

Clean Speech
Speech from a TED talk, representing a noise-free environment.

Noisy Speech
Clean speech with added noise from a school environment.

Accent Speech
Speech from TVB News spoken with a Hong Kong accent.

Specialist Speech
Speech from a YouTube video by a well-known surgeon.

The performance of each API was evaluated using the word error rate (WER) as the main metric.
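
For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the engine's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows one way to compute it with a word-level edit distance. The function name and the normalization (lowercasing and splitting on whitespace) are assumptions made for illustration; the study does not state how its transcripts were normalized or which tool was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference.

    Normalization here (lowercase, whitespace split) is an assumption,
    not the procedure used in the study.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only, not taken from the study's dataset:
# one substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted, WER can exceed 1 when the transcript contains many extra words.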
However, raw WER values are small decimals, which are impractical and unintuitive for comparing engines across several conditions, and it is counterintuitive that a higher WER indicates worse performance. Therefore, the score of each STT engine under each speech condition was calculated as 1/WER, so that a higher score indicates better performance.
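
A minimal sketch of the 1/WER score follows. The function name is hypothetical, and the handling of a perfect transcript (WER of 0, where 1/WER is undefined) is an assumption, since the study does not say how that case was treated.

```python
def inverse_wer_score(wer: float) -> float:
    """Convert a WER into a score where higher means better (score = 1 / WER)."""
    if wer < 0:
        raise ValueError("WER cannot be negative")
    if wer == 0:
        # Assumption: treat a perfect transcript as an unbounded score.
        # A real pipeline might cap the score or add a small constant instead.
        return float("inf")
    return 1.0 / wer


# Hypothetical WER values, not results from the study:
# a WER of 0.05 scores about 20, a WER of 0.20 scores about 5.
low_error_score = inverse_wer_score(0.05)
high_error_score = inverse_wer_score(0.20)
```

Note that the score is nonlinear: at low WER, small differences in WER produce large differences in score.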
Because WordCraft's target users are dyslexic children aged 6 to 12 in Hong Kong who use the product in a classroom environment, different weights were assigned to the four speech conditions: 0.15 for clean speech, 0.4 for noisy speech, 0.4 for accented speech, and 0.05 for specialist speech. Each engine's overall score was calculated by weighting its scores under the four speech conditions.
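
The weighting step can be sketched as follows, assuming the overall score is a weighted sum of the per-condition scores. The weights are those given above; the dictionary keys, the function name, and the example scores are hypothetical and are not results from the study.

```python
from typing import Mapping

# Weights stated in the study for the four speech conditions.
WEIGHTS = {"clean": 0.15, "noisy": 0.40, "accent": 0.40, "specialist": 0.05}


def weighted_score(scores: Mapping[str, float]) -> float:
    """Combine per-condition scores (for example 1/WER) into one overall score.

    Assumes a weighted sum; the condition keys are illustrative names.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for conditions: {sorted(missing)}")
    return sum(WEIGHTS[condition] * scores[condition] for condition in WEIGHTS)


# Hypothetical per-condition scores for one engine, for illustration only.
example_scores = {"clean": 20.0, "noisy": 10.0, "accent": 5.0, "specialist": 4.0}
overall = weighted_score(example_scores)
```

With these weights, the noisy and accented conditions together account for 80% of the overall score, so an engine that is strong only on clean or specialist speech cannot rank highly.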

Study Result
Whisper and Gemini performed equally well, but Whisper is better suited to WordCraft's users. Whisper performs well in noisy environments, whereas Gemini leads in accented and technical speech. Google Cloud and Microsoft Azure performed poorly in most conditions. After weighting the scores, OpenAI Whisper emerged as the most suitable STT engine for WordCraft.
