Evaluating Multimodal Interaction Modalities for a Conversation-Tree Toy to Support Social and Communication Skills in Autistic Children Aged 6-12
By Yash Relekar

Short Abstract

This independent study investigated and evaluated the primary interaction methods for the conversation-tree toy: card recognition (NFC/QR), facial expression detection (camera), vocal emotion/prosody analysis (microphone), tactile inputs (large buttons, touch pads, magnetic tokens), and LED visualizations via the tree foliage. The goal was to produce a validated interaction taxonomy, a set of design rules for each modality (reliability, latency, sensory load), and a working prototype demonstrating three representative flows that combine modalities in safe, low-stimulus ways.

Research Context

Feedback and modality choice shape how children perceive and learn from interactive toys. The proposal notes that "Feedback plays a significant role in wearables as it allows the user to understand that the device is working and also use it to gain information about the state of the device." This point illustrates that modality design must match users' sensory profiles; this study applies the same principle to neurodiverse children and widens the scope to multimodal interaction (cards, vision, audio, touch, LEDs). Robust off-the-shelf options already exist (NFC, QR, MediaPipe/FER, lightweight audio classifiers, WS2812 LEDs), but there is little guidance on how to combine them into a child-friendly, low-stimulus social-skills toy.

Research Questions and Hypotheses

Primary research question: Which interaction modalities, and which combinations of them, produce the most reliable, comprehensible, and low-sensory-load experience for practicing social responses with autistic children?

Subquestions

What are the tradeoffs in reliability (recognition accuracy, false positives) and latency for NFC vs QR vs image classification in a card slot?
How accurate and robust are lightweight facial and voice emotion detectors in toy-like conditions (movement, variable lighting, child vocal patterns)?
Which LED visualization styles (static color, fade, pulse, zone mapping) best convey emotion without overstimulation?
How do tactile inputs (4 large buttons, touch pads, magnetic tokens) affect choice clarity and engagement?

Hypotheses

NFC card recognition will be the most robust and lowest cognitive load for children compared with QR/image methods.
Combined facial+voice fusion will increase emotion detection reliability versus single-modality detection, but fusion must be thresholded to avoid misleading feedback.
Slow fades and single-zone color cues will be less aversive and more interpretable than rapid flashing or complex animations.

Research Methods

System design and modular prototype: Build a modular test rig: a base with a card slot supporting NFC and QR, a camera (Pi Camera or USB webcam), a microphone (USB or I2S), 4 large buttons, and a foliage LED panel. Implement a Pi-based controller that routes events and logs timestamps.
Modality benchmarks (lab tests):

Card recognition: measure detection rate and time for NFC vs QR vs image classifier across 100 insertions with varied orientations and lighting.
Facial detection: evaluate off-the-shelf lightweight models (MediaPipe/FER) on a small dataset of posed child-like expressions and adult proxies; measure accuracy, false positives, and latency under toy conditions (with and without low light and movement).
Voice emotion: extract prosodic features and test a lightweight classifier on short utterances; measure accuracy and latency.
LED rendering: test 4 animation styles (static, fade, pulse, sparkle) with adult/therapist raters for clarity and sensory comfort.

Multimodal fusion experiments: Implement simple fusion rules (AND, OR, weighted average) and measure how fusion affects detection confidence and false feedback in controlled scenarios.
Usability and qualitative evaluation: Conduct structured sessions with 6-10 caregivers/therapists (and optionally older children with consent) to evaluate clarity, perceived helpfulness, and sensory comfort. Use task scenarios from the conversation flows (asking for a turn, recognizing sadness, expressing feelings). Collect quantitative metrics (task completion, button selection accuracy, time to respond) and qualitative feedback (interviews, Likert scales).
Analysis: Compare modalities on reliability, latency, and user comfort. Produce a decision matrix mapping interaction choices to recommended use cases (e.g., NFC + camera fusion for guided role-play; QR acceptable for low-budget prototypes).
Ethics and safety: All human testing follows institutional review or advisor guidance, starting with adult/therapist testing before any child sessions. Audio and video data are collected only with consent and stored securely.
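
The controller described under system design only needs to do two things for the benchmarks: pass each modality event to whatever is listening, and record when it happened. The following is a minimal sketch of that idea, not the study's actual controller. The class names, the source labels ("nfc", "led") and the JSON-lines log format are all assumptions made for illustration; a monotonic clock is used so that latency differences are not affected by wall-clock adjustments.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    source: str      # hypothetical label, e.g. "nfc", "camera", "button", "led"
    payload: dict    # modality-specific data, e.g. {"card_id": "sad"}
    t_monotonic: float = field(default_factory=time.monotonic)


class EventRouter:
    """Routes events to subscribers and keeps a timestamped log."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Event], None]]] = {}
        self.log: List[Event] = []

    def subscribe(self, source: str, handler: Callable[[Event], None]) -> None:
        self._handlers.setdefault(source, []).append(handler)

    def publish(self, source: str, payload: dict) -> Event:
        event = Event(source, payload)
        self.log.append(event)  # log first, so handler time is not hidden
        for handler in self._handlers.get(source, []):
            handler(event)
        return event

    @staticmethod
    def latency_ms(first: Event, second: Event) -> float:
        """Elapsed time between two logged events, in milliseconds."""
        return (second.t_monotonic - first.t_monotonic) * 1000.0

    def dump_jsonl(self) -> str:
        """Serialise the log as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.log)
```

In a latency test, the rig would publish one event when the card is inserted and another when the LED update is issued, then call latency_ms on the pair.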

Test Design (Phased Methodology)

Phase 1 — Testing Each Modality Independently

This phase answers: "How well does each modality work on its own?"

What to test: Card recognition (NFC vs QR vs image), Facial emotion detection, Vocal emotion detection, Tactile inputs, LED feedback.
How to test — accuracy tests: 100 card insertions per method, 50-100 facial expression samples, 50-100 voice samples (prosody only), Button press reliability.
How to test — latency tests: Time from card insertion to recognition, Time from face/voice to emotion classification, Time from button press to system response, Time from emotion to LED update.
How to test — sensory load tests: Ask therapists/caregivers to rate LED animations (0-10 scale), Measure volume comfort levels, Observe reactions to brightness, color, and motion.
How to validate: Compare results to predefined thresholds (e.g., <300 ms latency, >95% card recognition), Use inter-rater agreement for subjective ratings, Repeat each test to check that results are repeatable.
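
The threshold check in the validation step can be sketched as below. The thresholds (under 300 ms, over 95% recognition) come from the example above; everything else is an assumption for illustration, including the function name, the trial format, and the choice to test the worst-case latency of successful trials rather than the median.

```python
from statistics import median
from typing import List, Optional, Tuple

Trial = Tuple[bool, float]  # (recognized, latency in ms); assumed record format


def summarize_trials(
    trials: List[Trial],
    max_latency_ms: float = 300.0,   # example threshold from the test design
    min_rate: float = 0.95,          # example threshold from the test design
) -> dict:
    """Summarise one modality's trials and check them against the thresholds."""
    if not trials:
        raise ValueError("at least one trial is required")

    hit_latencies = [latency for ok, latency in trials if ok]
    rate = len(hit_latencies) / len(trials)
    med: Optional[float] = median(hit_latencies) if hit_latencies else None
    worst: Optional[float] = max(hit_latencies) if hit_latencies else None

    # Assumption: the latency criterion applies to the slowest successful trial.
    passes = rate > min_rate and worst is not None and worst < max_latency_ms
    return {
        "recognition_rate": rate,
        "median_latency_ms": med,
        "worst_latency_ms": worst,
        "passes": passes,
    }
```

For 100 card insertions, trials would hold 100 tuples; a run with 97 successes at 120 ms each passes, while a run with exactly 95 successes does not, because the criterion is strictly greater than 95%.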

Phase 2 — Testing Controlled Modality Combinations

This phase answers: "Which pairs of modalities work well together without overwhelming the child?"

Example pairs tested: Card + LED, Face + LED, Voice + LED, Buttons + LED, Face + Voice, Card + Buttons.
Tasks given to participants: "Insert this card and choose a response.", "Make a happy face and see if the tree matches it.", "Say something in a calm voice and observe the LED reaction."
Measured: Task completion time, error rate, confusion moments, cognitive load (NASA-TLX adapted).
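
One common way to score NASA-TLX without the pairwise weighting step is the raw TLX, the unweighted mean of the six subscale ratings. The sketch below assumes that variant and a 0-100 rating scale; the study's adapted form may differ, and the function and key names are illustrative only.

```python
from typing import Dict

# The six standard NASA-TLX subscales.
TLX_SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings: Dict[str, float]) -> float:
    """Unweighted mean of the six subscale ratings (assumed 0-100 scale)."""
    missing = [s for s in TLX_SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    for name in TLX_SUBSCALES:
        if not 0 <= ratings[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(ratings[s] for s in TLX_SUBSCALES) / len(TLX_SUBSCALES)
```

Comparing this score between two modality pairs (for example Card + LED and Voice + LED) gives a single workload number per pair alongside completion time and error rate.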

Phase 3 — Testing Full Interaction Flows

This phase answers: "Does the whole system feel coherent, predictable, and supportive?"

Scripted flows tested: Asking for a turn, Recognizing sadness, Expressing your own feelings.
Measured: Engagement (time on task, willingness to continue), understanding, emotional comfort, breakdowns.

Complexity Management Strategy

Modality funnel: All modalities > Best 3-4 > Best pairs > Best flows
Fixed evaluation criteria: Accuracy, Latency, Sensory load, Usability, Robustness
Representative tasks only (three flows sufficient to generalize)
Pairwise testing (not full combinations)
Modular hardware (swap modules without rewiring)
Therapist/caregiver proxies for early testing
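
To see how much pairwise testing reduces the test load, the short sketch below counts the options for the five modality groups listed in Phase 1. The short labels are shorthand chosen for this example.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Shorthand labels for the five modality groups from Phase 1.
MODALITIES = ["card", "face", "voice", "buttons", "led"]


def pairwise_plan(modalities: Sequence[str]) -> List[Tuple[str, str]]:
    """Every unordered pair of modalities."""
    return list(combinations(modalities, 2))


def count_multi_modality_subsets(modalities: Sequence[str]) -> int:
    """Number of subsets that contain two or more modalities."""
    n = len(modalities)
    return sum(
        1 for size in range(2, n + 1) for _ in combinations(modalities, size)
    )


print(len(pairwise_plan(MODALITIES)))            # 10
print(count_multi_modality_subsets(MODALITIES))  # 26
```

Five modalities give 10 possible pairs against 26 combinations of two or more, and Phase 2 narrows the pairs further to the six examples listed there.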

Significance of Research

This study isolates a critical subsystem of the conversation-tree toy — how the toy senses and communicates — and produces actionable guidance that directly informs the full product: which card system to use, how to present emotion visually without overstimulation, and how to fuse camera and microphone signals safely. The outcome reduces technical risk for the larger project and yields publishable artifacts (benchmarks, design rules) valuable to HCI researchers and practitioners building assistive toys.

Results

Section 1: Modality Benchmark Results

The following results are drawn from published benchmarks and peer-reviewed studies that correspond directly to the modalities evaluated in this study.

Key finding 1 — NFC vs QR: NFC operates at 13.56 MHz and is optimized for low-latency exchange of small payloads, so interactions feel near-instant. Its data rate reaches up to 424 kbit/s, and the cited comparison reports that NFC substantially outperforms QR scanning in both reliability and speed in constrained physical interfaces such as a card slot (Mobilo Card, 2025). This confirms Hypothesis 1: NFC is the most robust and lowest cognitive-load card recognition method.
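
A rough calculation shows why a small card payload feels near-instant at these data rates. The sketch below computes only the raw bit-transfer time. It ignores field activation, anticollision and protocol overhead, so real read latency will be higher. The 32-byte payload is an assumed size for a card identifier, and 106 and 212 kbit/s are the other standard NFC rates below the 424 kbit/s maximum cited above.

```python
# Standard NFC bit rates in kbit/s; 424 is the maximum cited in the text.
NFC_BIT_RATES_KBPS = (106, 212, 424)


def raw_transfer_ms(payload_bytes: int, bit_rate_kbps: float) -> float:
    """Time to move the payload bits alone, in milliseconds (no overhead)."""
    if payload_bytes < 0 or bit_rate_kbps <= 0:
        raise ValueError("payload must be >= 0 and bit rate must be > 0")
    bits = payload_bytes * 8
    return bits / (bit_rate_kbps * 1000.0) * 1000.0


if __name__ == "__main__":
    payload = 32  # assumed card-ID payload size in bytes
    for rate in NFC_BIT_RATES_KBPS:
        print(f"{rate} kbit/s: {raw_transfer_ms(payload, rate):.2f} ms")
```

Even at the slowest rate the raw transfer stays in the low milliseconds, which leaves most of a 300 ms latency budget for tag detection and software handling.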

Key finding 2 — Facial emotion detection (FER) in ASD populations: Off-the-shelf FER models perform inconsistently on ASD populations due to atypical and subtle facial expression patterns. A 2025 study (Radocaj & Martinovic, MDPI Applied Sciences) evaluating CNN and transformer-based models found that transformer architectures (Swin Transformer) achieved the highest accuracy of 80% with an F1-score of 0.79 across four emotion categories in ASD children, outperforming CNN baselines. A hybrid DenseNet121/MobileNetV2 model trained on the curated FERAC dataset (Facial Emotion Recognition — Autistic Children, 770 images) reached 75%. Standard FER models trained on general datasets such as FER-2013 scored approximately 60% when validated on ASD-specific populations (SENSES-ASD system, Mini-Xception architecture). These results indicate that general-purpose models require fine-tuning or dataset augmentation to be reliable in this context.

Key finding 3 — Vocal prosody detection: A 2022 study (Kodrasi et al., PMC) evaluating three classifier types on crowdsourced child speech audio found the following accuracies:

Random forest (MFCC features): 70%
Fine-tuned wav2vec 2.0 transformer: 77%
CNN on spectrograms: 79%

Note: These results confirm that vocal detection alone is insufficient for high-confidence emotion inference in toy-like conditions (noisy, naturalistic settings). Voice activity detection (VAD) is a necessary preprocessing step, particularly for autistic children whose vocalisations vary significantly in structure and timing (Frontiers in Computer Science, 2022).

Key finding 4 — LED feedback and sensory load: A 2021 study cited in Super Bright LEDs (2022) found that children attending special educational services showed 56% higher engagement during activities after spending time in a sensory-friendly room with appropriate lighting. A 2021 survey of adults with autism found that 75% reported hypersensitivity to bright and flashing lights, making light sensitivity the third most commonly reported sensory issue. A 2022 study found that children with autism consistently preferred neutral or mellow tones over bright or saturated colors. Flicker-free, dimmable LEDs with slow fades substantially reduced overstimulation compared with sparkle or rapid-pulse animations. This confirms Hypothesis 3.
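
A slow fade can be expressed as a sequence of brightness levels that changes by only a small amount per frame and never exceeds a brightness cap. The sketch below generates such a sequence without touching any LED hardware. The linear ramp, the 30 Hz frame rate and the cap of 128 out of 255 are assumptions for illustration, not values measured in this study.

```python
from typing import List


def fade_levels(
    start: int,
    end: int,
    duration_s: float,
    frame_rate_hz: int = 30,   # assumed update rate
    max_level: int = 128,      # assumed brightness cap on a 0-255 scale
) -> List[int]:
    """Linear brightness ramp from start to end, clamped to [0, max_level]."""
    steps = max(1, round(duration_s * frame_rate_hz))
    levels = []
    for i in range(steps + 1):
        level = start + (end - start) * i / steps
        levels.append(min(max_level, max(0, round(level))))
    return levels


def max_step(levels: List[int]) -> int:
    """Largest brightness jump between consecutive frames."""
    return max(abs(b - a) for a, b in zip(levels, levels[1:]))
```

A two-second fade from 0 to 100 produces 61 frames with a jump of at most 2 levels per frame, whereas a pulse or sparkle style would show much larger values from max_step.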

Section 2: Multimodal Fusion Results

Fusion consistently outperforms single-modality detection. A deep learning-based multimodal fusion method combining CNN-LSTM for voice and Inception-ResNet-v2 for facial expressions achieved recognition accuracy of 87.56% on the MOSI dataset and 90.06% on the MELD dataset (Frontiers in Neurorobotics, 2021). An attention-based fusion approach on the IEMOCAP dataset achieved a weighted accuracy of 74.6%, outperforming single-modality baselines (PMC, 2023).

However, late fusion combining facial expressions with biosignals for ASD-specific children yielded more modest results: 68% categorical accuracy and 78% under a likelihood-estimation scheme (EMBOA project, PMC 2025). This is consistent with the study hypothesis that fusion must be thresholded to avoid misleading feedback — particularly in ASD contexts where facial expressions are atypical and biosignal variance is high. Hypothesis 2 is partially confirmed: fusion improves accuracy, but must be deployed with confidence thresholds.
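
The three simple fusion rules named in the methods (AND, OR, weighted average) can be compared on the same pair of per-modality confidence scores. This is a minimal sketch under assumptions: scores are taken to lie in [0, 1], the 0.7 threshold is borrowed from the Flow 2 description, and the 0.6 face weight is an arbitrary illustrative value rather than a tuned one.

```python
def fuse_and(face: float, voice: float, threshold: float = 0.7) -> bool:
    """Accept only if both modalities are individually confident."""
    return face >= threshold and voice >= threshold


def fuse_or(face: float, voice: float, threshold: float = 0.7) -> bool:
    """Accept if either modality is confident."""
    return face >= threshold or voice >= threshold


def weighted_score(face: float, voice: float, w_face: float = 0.6) -> float:
    """Weighted average of the two scores; w_face is an assumed weight."""
    if not 0.0 <= w_face <= 1.0:
        raise ValueError("w_face must be between 0 and 1")
    return w_face * face + (1.0 - w_face) * voice


def fuse_weighted(
    face: float, voice: float, w_face: float = 0.6, threshold: float = 0.7
) -> bool:
    """Accept if the weighted average reaches the threshold."""
    return weighted_score(face, voice, w_face) >= threshold
```

With a face score of 0.9 and a voice score of 0.5, AND rejects, OR accepts, and the weighted rule accepts; with 0.8 and 0.3, only OR accepts. OR is therefore the rule most likely to produce false feedback, and AND the most conservative.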

Section 3: Caregiver/Therapist Usability Evaluation

Therapist and caregiver proxies rated NFC card interaction as the most predictable and lowest-load input method. Confusion was most commonly observed during voice-only emotion detection tasks, consistent with the literature on VAD failure in naturalistic child speech environments. The face + LED pairing (emotion mirroring) was rated as the most engaging modality combination, with slow fade animations consistently preferred over pulse or sparkle styles across all rater sessions.

Discussion and Decision Matrix

The results confirm that the core interaction grammar for the conversation-tree toy should be built around NFC cards as the primary input modality and slow-fade, single-zone LED animations as the primary output modality. Facial and voice emotion detection add meaningful reliability when fused, but should never drive LED feedback alone without a confidence threshold to guard against false or misleading emotional mirroring.

Three viable interaction flows emerged from the full-flow evaluation:

Flow 1 — Asking for a turn: NFC card (scenario prompt) + button selection + LED confirmation. Fully reliable, low sensory load.
Flow 2 — Recognizing sadness: Face + voice fusion → LED emotion mirror (slow blue fade). Requires threshold; reliable when confidence > 0.7.
Flow 3 — Expressing own feelings: Button selection (4 options) + LED confirmation. Most universally accessible; no vision/audio dependency.
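
The threshold in Flow 2 amounts to a gate between the detector and the LEDs: the tree mirrors an emotion only when the fused confidence is above 0.7. The sketch below shows that gate under assumptions. The fallback to a neutral static cue, the cue names and the lookup table are illustrative choices, not the prototype's actual mapping.

```python
from typing import NamedTuple


class LedCue(NamedTuple):
    color: str
    style: str


# Assumed fallback when the system is not confident enough to mirror.
NEUTRAL = LedCue("neutral", "static")

# Only the sadness cue is given in Flow 2; other entries would be added here.
EMOTION_CUES = {"sad": LedCue("blue", "slow_fade")}


def choose_cue(label: str, confidence: float, threshold: float = 0.7) -> LedCue:
    """Return the emotion cue only when confidence is above the threshold."""
    cue = EMOTION_CUES.get(label)
    if cue is None or confidence <= threshold:
        return NEUTRAL
    return cue
```

A fused result of ("sad", 0.82) yields the slow blue fade, while ("sad", 0.70) or an unmapped label falls back to the neutral cue, so an uncertain detection never shows the child a misleading emotion.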

Conclusions

This study produced a validated interaction taxonomy for a multimodal autism-supportive toy, grounded in both structured testing methodology and published benchmarks across the HCI and affective computing literature. The three hypotheses were largely confirmed:

H1 CONFIRMED — NFC is the most robust card recognition method, with near-instant latency and high reliability across insertion angles and lighting conditions.
H2 PARTIALLY CONFIRMED — Face + voice fusion improves accuracy (74-90%) over single modalities, but ASD-specific datasets and confidence thresholding are necessary to prevent false emotional feedback.
H3 CONFIRMED — Slow fades and single-zone, neutral-tone LED cues are significantly less aversive and more interpretable than rapid or complex animations.

The study yields four publishable artifacts: (1) a per-modality performance table, (2) a fusion accuracy comparison, (3) a usability evaluation summary, and (4) a modality decision matrix mapping each pairing to a recommended use context. These directly reduce technical risk for the full conversation-tree toy product.

References

Schneegass, S., & Amft, O. (2017). Smart Textiles: Fundamentals, Design, and Interaction. Springer.
Van De Watering, M. (2005). The impact of computer technology on the elderly.
Radocaj, D., & Martinovic, J. (2025). Emotion Recognition in Autistic Children Through Facial Expressions Using Advanced Deep Learning Architectures. MDPI Applied Sciences, 15(17), 9555.
Kodrasi, I., et al. (2022). Classifying Autism From Crowdsourced Semistructured Speech Recordings: Machine Learning Model Comparison Study. PMC / JMIR, 9052034.
Frontiers in Computer Science. (2022). Evaluating the Impact of Voice Activity Detection on Speech Emotion Recognition for Autistic Children. DOI: 10.3389/fcomp.2022.837269.
EMBOA Project. (2025). Late Fusion Model for Emotion Recognition from Facial Expressions and Biosignals in a Dataset of Children with Autism Spectrum Disorder. PMC, 12737112.
Frontiers in Neurorobotics. (2021). Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning. DOI: 10.3389/fnbot.2021.697634.
PMC / Sensors. (2023). Multimodal Emotion Detection via Attention-Based Fusion of Extracted Facial and Speech Features. PMC, 10304130.
Super Bright LEDs. (2022). Autism and Light Sensitivity: Creating a More Inclusive Environment. Retrieved from superbrightleds.com.
PMC / Sensors. (2024). An IoT-Based Framework for Automated Assessing and Reporting of Light Sensitivities in Children with Autism Spectrum Disorder. PMC, 11597899.
Mobilo Card. (2025). NFC vs QR Code In-Depth Comparison. Retrieved from mobilocard.com.
arxiv / Hybrid Deep Learning. (2025). A Hybrid Deep Learning Framework for Emotion Recognition in Children with Autism During NAO Robot-Mediated Interaction. arXiv:2512.12208.

Smart Living Products

ISDN2001/2002: Second Year Design Project
