Smart Living Products
ISDN2001/2002: Second Year Design Project

A Study about Clarifying Ambiguous Human Input in HCI Systems by Leveraging Camera-Captured Environmental Cues and Language Analysis
Koil's Independent Study

(The Subsystem Diagram of the Data Handling Part)
Natural language interaction (NLI) has become the dominant interface for smart home devices and human-computer interaction (HCI) systems, but inherent ambiguity in human language creates bidirectional misunderstandings between users and systems, especially in dynamic, context-rich environments like kitchens. Existing ambiguity resolution systems rely primarily on unimodal audio or text input, failing to leverage critical spatial and physical cues available in the real world.
This study presents "RATATOUILLE", a multimodal HCI system that fuses real-time camera-captured environmental cues (object localization, spatial relationships, and hand gestures) with natural voice input and an output-constrained large language model (LLM) to resolve multidimensional ambiguity in kitchen instructions.
The Problem We Faced
Ambiguity is a fundamental and unresolved challenge in natural language human-computer interaction (HCI), occurring bidirectionally in both user instructions and system responses. It manifests in multiple interconnected forms, including linguistic, referential, and pragmatic ambiguity.

Unaddressed bidirectional ambiguity generates mutual misunderstanding between users and systems, which can lead to incorrect actions, user frustration, and even irreversible consequences in safety-critical environments such as kitchens, laboratories, and industrial settings. As natural language interfaces become increasingly ubiquitous in smart home and assistive technologies, the need for robust ambiguity resolution mechanisms has become more pressing than ever.
Existing Solutions
Current approaches to ambiguity resolution in HCI fall into two primary categories, both with significant limitations:
Early Multimodal Systems

Recent research has introduced multimodal systems that combine vision and language for instruction following. Notable examples include kitchen robot platforms that use object detection to ground language in physical objects. However, these systems suffer from three critical shortcomings:
- They process visual and audio data sequentially rather than fusing them in real time.
- They ignore dynamic spatial relationships, such as an object's position relative to the user's hand.
- They use unconstrained large language models that can introduce new ambiguities in system outputs.
Unimodal Language Processing Systems

The vast majority of commercial voice assistants (Amazon Alexa, Google Home) and text-based interfaces rely exclusively on audio or text input. These systems use syntactic parsing and semantic analysis to resolve linguistic ambiguities but are fundamentally unable to address referential and pragmatic ambiguity that requires physical environmental context. For example, they cannot distinguish between two identical objects on a counter when a user says "pass the cup."
Our Solution
RATATOUILLE addresses these limitations through three integrated core technologies that work together to resolve multidimensional ambiguity:
Camera-Captured Environmental Cues
Captures real-time physical property data of objects in the environment, including:
- Spatial position relative to the cook's hand and other objects
- Visual characteristics such as color, size, and shape
- User gestural cues including pointing, reaching, and grasping
This data is stored in a real-time database and used to ground ambiguous language phrases in the physical world.
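As an illustration of how such cues can ground a phrase like "the cup", the following minimal sketch picks the matching object nearest to the user's hand. It is not the project's implementation: the `SceneObject` fields, the `ground_reference` helper, the counter-plane coordinates in metres, and the 5 cm ambiguity margin are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class SceneObject:
    """One detected object; all field names are illustrative assumptions."""
    object_id: str
    label: str   # detector class name, e.g. "cup"
    color: str   # coarse visual attribute
    x: float     # position on the counter plane, in metres (assumed frame)
    y: float


def ground_reference(
    label: str,
    objects: Sequence[SceneObject],
    hand_xy: Tuple[float, float],
    margin: float = 0.05,
) -> Optional[SceneObject]:
    """Pick the object with the given label that is nearest to the hand.

    Returns None when nothing matches, or when the two nearest candidates
    are within `margin` metres of each other (the reference stays ambiguous).
    """
    def distance(obj: SceneObject) -> float:
        return hypot(obj.x - hand_xy[0], obj.y - hand_xy[1])

    candidates = sorted((o for o in objects if o.label == label), key=distance)
    if not candidates:
        return None
    if len(candidates) > 1 and distance(candidates[1]) - distance(candidates[0]) < margin:
        return None
    return candidates[0]


# Hypothetical scene: two identical cups and one knife.
scene = [
    SceneObject("cup_1", "cup", "white", 0.10, 0.20),
    SceneObject("cup_2", "cup", "white", 0.60, 0.20),
    SceneObject("knife_1", "knife", "silver", 0.40, 0.50),
]
```

With the hand at `(0.15, 0.25)`, `ground_reference("cup", scene, (0.15, 0.25))` selects `cup_1`; with the hand midway between the two cups it returns `None`, which a system could treat as a cue to ask the user for clarification.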
Natural Voice Interaction
Enables completely hands-free natural interaction, which is essential in kitchen environments where users' hands are often occupied with food preparation tasks. The system seamlessly integrates voice input with camera-captured contextual information to mitigate ambiguity that cannot be resolved through language alone.
Output-Constrained Large Language Model
Leverages the language understanding capabilities of large language models while avoiding the ambiguity of unconstrained natural language generation. The system uses a strictly defined output schema that ensures all system responses are structured, precise, and unambiguous.
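The source does not specify the output schema, so the sketch below only illustrates the general idea of constraining LLM output: the model's reply is accepted only if it is a JSON object with exactly the expected fields, an action from a closed set, and a target that exists in the scene. The field names, the action vocabulary, and the `parse_constrained_output` helper are hypothetical.

```python
import json
from typing import Any, Dict, Set

# Hypothetical closed vocabulary of actions the system may emit.
ALLOWED_ACTIONS = {"pick_up", "put_down", "cut", "pour"}

# Hypothetical schema: every response must contain exactly these string fields.
REQUIRED_FIELDS = {"action": str, "target_id": str, "spoken_text": str}


def parse_constrained_output(raw: str, known_ids: Set[str]) -> Dict[str, Any]:
    """Validate a raw LLM reply against the schema; raise ValueError if it fails."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if set(data) != set(REQUIRED_FIELDS):
        raise ValueError("reply has missing or unexpected fields")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} has the wrong type")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError("action is outside the allowed vocabulary")
    if data["target_id"] not in known_ids:
        raise ValueError("target does not exist in the current scene")
    return data
```

A reply such as `{"action": "pick_up", "target_id": "cup_1", "spoken_text": "Pick up the left cup."}` passes when `cup_1` is in the scene, while free text like "Sure, grab the cup near you." is rejected. Rejected replies could be retried or turned into a clarification request; that policy is a design choice the source does not describe.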
Summary
By combining these three components, RATATOUILLE generates precise outputs that specify both the exact target object and the required behavior, eliminating mutual misunderstanding between users and systems.
System Explanation
The RATATOUILLE system runs on an NVIDIA Jetson Nano edge computing device and follows a closed-loop data processing pipeline, as shown in the subsystem diagram below:

The data flows through the following stages:
- Audio Input Capture: The user's voice instruction is captured by a microphone and converted to text using speech recognition technology.
- Command Modulation: The raw text command is processed and modulated into a structured format suitable for LLM input.
- Environmental Perception: Simultaneously, the camera module performs real-time ingredient detection, capturing the position and identity of all objects in the kitchen scene.
- Data Storage: Environmental perception data is stored in a real-time database for access by the LLM.
- Multimodal Disambiguation: The LLM combines the structured user command with environmental data from the database to resolve ambiguities and generate clear guidance.
- Dual Output Generation: The system produces two complementary outputs:
  - Audio guidance delivered through a speaker
  - Laser projection guidance that directly projects instructions onto the kitchen counter, showing the exact location of target objects and required actions
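The stages above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the system's actual code: `SceneStore` is an in-memory stand-in for the real-time database, `fake_llm` replaces the real model call, speech recognition is assumed to have already produced a transcript, and every function and field name is hypothetical.

```python
import json
from typing import Any, Callable, Dict, List


class SceneStore:
    """In-memory stand-in for the real-time database of detected objects."""

    def __init__(self) -> None:
        self._objects: Dict[str, Dict[str, Any]] = {}

    def upsert(self, object_id: str, label: str, x: float, y: float) -> None:
        # Called by the perception stage whenever a detection is updated.
        self._objects[object_id] = {"id": object_id, "label": label, "x": x, "y": y}

    def snapshot(self) -> List[Dict[str, Any]]:
        # Sorted so that the prompt is deterministic for a given scene.
        return [self._objects[key] for key in sorted(self._objects)]


def modulate_command(transcript: str) -> Dict[str, str]:
    """Normalise the recognised text into a structured command."""
    return {"utterance": " ".join(transcript.lower().split())}


def build_prompt(command: Dict[str, str], scene: List[Dict[str, Any]]) -> str:
    """Combine the structured command with the scene snapshot for the LLM."""
    return json.dumps({"command": command, "scene": scene}, sort_keys=True)


def run_once(transcript: str, store: SceneStore, llm: Callable[[str], str]) -> Dict[str, Any]:
    """Run one pass: command modulation, disambiguation, dual output."""
    scene = store.snapshot()
    decision = json.loads(llm(build_prompt(modulate_command(transcript), scene)))
    target = {obj["id"]: obj for obj in scene}[decision["target_id"]]
    return {
        "audio": decision["spoken_text"],  # text to be spoken through the speaker
        "projection": {                    # where the projector should point
            "x": target["x"],
            "y": target["y"],
            "action": decision["action"],
        },
    }


def fake_llm(prompt: str) -> str:
    """Stand-in for the real model: always selects the first object in the scene."""
    scene = json.loads(prompt)["scene"]
    return json.dumps({
        "action": "pick_up",
        "target_id": scene[0]["id"],
        "spoken_text": "Pick up the highlighted cup.",
    })
```

For example, after `store = SceneStore()` and `store.upsert("cup_1", "cup", 0.1, 0.2)`, calling `run_once("Pass the cup", store, fake_llm)` returns an audio message together with the projection coordinates of `cup_1`. Passing the LLM as a callable keeps the pipeline testable without a model, which is one reasonable design choice among several.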
Future Work
Future development of the RATATOUILLE system will focus on four key areas:
1. Expanding the system's capabilities to handle more complex kitchen tasks and recipes.
2. Integrating additional sensory modalities to enhance environmental perception.
3. Extending the system to other dynamic environments beyond kitchens, such as laboratories and workshops.
4. Developing adaptive interaction models that can learn from user behavior and preferences over time.





