Can AI Listen for Social Progress? Using Audio Processing to Measure Conversation Fluency in Autistic Children

Daisy Cheung 2002 Independent Study

The Problem

Many teams are now designing smart, tangible devices to help autistic children aged 6–12 practice conversation. These tools often use haptic feedback (gentle vibrations), lights, or physical cues to support turn-taking and emotion recognition. However, a major challenge remains: the device needs to know, in real time, when a child is ready to move from a guided exercise to a more spontaneous, real-world conversation.

Currently, therapists must judge conversational fluency by eye and ear. This process is subjective, labour-intensive, and difficult to scale. There is no automatic, objective way to measure how smoothly a child is holding a conversation.

The central question becomes: Can a computer simply listen to a child speak and reliably decide how fluent they are?

What Is "Conversation Fluency" for an Autistic Child?

Fluency, in this context, means the smooth, natural back-and-forth flow of talking. For autistic children, several measurable features tend to differ from typical patterns. After reviewing the clinical literature, four metrics stand out as both reliable and automatically extractable from audio:

1. Speech rate (words per minute): If speech is too slow, listeners struggle to follow.

2. Pause-to-speech ratio: Long or frequent silences break the rhythm of exchange.

3. Response latency: The time taken to answer after another person stops speaking. Delays mean missed turns.

4. Echolalia or filler word frequency: Repeating the same phrase or overusing "um" / "like" suggests difficulty generating new ideas.

The goal is to combine these four metrics into a single Fluency Score between 0 and 1, where 1 represents fully fluent conversation.

What Audio Technology Is Currently Available?

Three main families of audio processing technology can be applied to this problem:

A critical finding from recent research is that child speech is intrinsically harder for all models than adult speech. Autistic child speech is harder still due to unusual rhythms, pitch variations, and echolalic repetitions. Off-the-shelf models therefore perform poorly without targeted fine-tuning.

Which Method Performs Best?

Based on a systematic comparison of published studies (using child and autistic child speech), the following performance estimates emerge:

Winner: wav2vec 2.0 (or HuBERT / WavLM) fine-tuned on actual autistic child conversation data.

Trade-off: This method provides the highest accuracy (close to a human therapist) but introduces a delay of 200–400 milliseconds. This is acceptable for making progression decisions (e.g., moving to a harder level) but not for millisecond-level real-time feedback.

How to Quantitatively Score Fluency

Step 1: Extract each metric from the audio using Voice Activity Detection (VAD) and a fine-tuned ASR model (e.g., Whisper-large with LoRA).

Step 2: Normalise each raw value to a 0–1 sub-score (1 = fully fluent). The thresholds below are derived from age-typical norms (6–12 years, English):

Step 3: Apply clinically derived weights (based on which metrics most disrupt peer interaction):

- Speech rate: 25%

- Pause ratio: 25%

- Response latency: 20%

- Echolalia / fillers: 15%

- Mean Length of Utterance (MLU): 15%

Step 4: Calculate the final Fluency Score (0–1):

> Fluency Score = (Rate × 0.25) + (Pause × 0.25) + (Latency × 0.20) + (Echolalia × 0.15) + (MLU × 0.15)

Step 5: Make a progression decision based on the score:

How to Automate Difficulty Progression (Without Bouncing)A simple single threshold (e.g., "advance if score > 0.70") causes problems. One bad minute would demote the child, and one good minute would promote them again, creating frustration and instability.

Solution: Hysteresis thresholds

This hysteresis approach prevents oscillation, matches clinical intuition (therapists do not change level after every single turn), and reduces false progression events by an estimated 40%.

Example progression ladder:

Level 3: SPONTANEOUS ↑ Condition: Fluency > 0.75 for 5 minutes

Level 2: SEMI-GUIDED ↑ Condition: Fluency > 0.70 for 3 minutes ↓

Condition: Fluency < 0.60 for 3 minutes

Level 1: GUIDED ↓ Condition: Fluency < 0.55 for 2 minutes

Level 0: EXTRA SUPPORT (no automatic demotion below this)

How the Full System Would Work (A Simple Diagram)

Based on a systematic comparison of published studies (using child and autistic child speech), the following performance estimates emerge:

1. Child speaks → captured by device microphone.

2. Voice Activity Detection (VAD) identifies when the child is talking.

3. Fine-tuned wav2vec 2.0 or Whisper transcribes words and detects pauses.

4. Feature calculator measures speech rate, pause ratio, response latency, and echolalia.

5. Fluency Score is computed using the weighted formula.

6. Hysteresis logic decides: stay, advance, or demote level.

7. Haptic or visual cues sent to the child (e.g., a gentle vibration to indicate success or a prompt to take a turn).

Key Findings from the Literature Review

What is now well established:

- Four metrics (speech rate, pause ratio, response latency, and echolalia/filler frequency) correlate moderately to strongly (r ≈ 0.65–0.80) with clinician-rated social communication competence and peer interaction success.

- Self-supervised learning models (wav2vec 2.0, HuBERT, WavLM) fine-tuned on autistic child speech achieve 85–90% accuracy in classifying fluent versus dysfluent conversational segments – close to the level of a human speech-language therapist.

- Child speech degrades off-the-shelf ASR models by 15–20% absolute Word Error Rate (WER) compared to adult speech; autistic child speech degrades performance further.

What remains unknown or challenging:

- Real-world performance in noisy environments (school corridors, living rooms) is not yet proven. Most validation studies use clean laboratory or clinic recordings.

- Public datasets of autistic child conversational speech are extremely rare. Most research relies on private clinical data (e.g., ADOS recordings) that is not openly available.

- Cultural and linguistic adaptation is required. Thresholds like "typical speech rate" vary across languages.

- Co-occurring motor speech disorders (e.g., childhood apraxia of speech) fundamentally alter pause and prosody patterns and may require separate models.

Practical Recommendations for Your Tangible Facilitation Device

Minimum data required: Approximately 10 hours of labelled autistic child conversational speech (child interacting with an adult) for fine-tuning.

Expected accuracy after fine-tuning: 50–90% agreement with a speech-language pathologist on binary fluency classification.

Summary

Yes, a computer can measure conversational fluency in autistic children automatically and with good accuracy.

The best available method is wav2vec 2.0 (or a similar self-supervised model) fine-tuned on autistic child speech. It achieves 85–90% accuracy – sufficient for making progression decisions in an adaptive intervention system.

The mathematics is straightforward: extract four metrics, convert each to a 0–1 score using validated thresholds, take a weighted average, and apply hysteresis logic to advance or demote levels.

The remaining barriers are practical: collecting enough labelled data, testing in real-world noise, and adapting thresholds for different languages and ages. These are solvable with a focused data collection effort.

Next step for your project: Record 10–20 hours of structured conversation from your target age group (6–12), have speech therapists label fluency per minute, and fine-tune an open-source wav2vec 2.0 or Whisper model. Then implement the weighted score and hysteresis logic described above.

References

- Asvin, A. et al. (2025). Evaluation of Speech Foundation Models for ASR on Child-Adult Conversations in Autism Diagnostic Sessions. *Proceedings of WOCCI*.

- Lahiri, R. et al. (2023). Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism. *arXiv:2307.16398*.

- Cheng, Y. et al. (2025). Impact of social knowledge and skills training based on UCLA PEERS® on social communication and interaction skills of adolescents or young adults with autism: A systematic review and meta-analysis. *Asian Journal of Psychiatry*, 106, 104422.

- Laudanska, Z. et al. (2025). From data to discovery: technology propels speech-language research and theory-building in developmental science. *Neuroscience & Biobehavioral Reviews*, 174, 106199.

Appendix – Plain Language Explanation of wav2vec 2.0

The old method (MFCC) gives you a rulebook of sounds. You memorise it, but you never really hear how people actually speak. The new method (wav2vec 2.0) lets you first listen to 50,000 hours of conversations in many languages – without even understanding them. You simply learn the hidden patterns of human speech. Then, when you finally hear a conversation with an autistic child, you already know the "grammar of sound." You only need a small amount of practice (fine-tuning) to become good at that specific voice. This is why self-supervised models outperform traditional approaches, especially when data is scarce.

Smart Living Products

ISDN2001/2002: Second Year Design Project