
Evaluating AI Reliability in Detecting Conversational Speech Levels Through Comparative Human Analysis

By: Fiona Keira Prajitno


Abstract

This exploratory study investigates whether an AI language model can evaluate communication complexity in short spoken responses in a way that aligns with human assessment. Rather than diagnosing speech or communication ability, the study focuses on observable features such as response length, elaboration, reasoning, and reflection. A small pilot dataset of 20 conversational responses was analyzed using a five-level communication complexity rubric. Each response was rated by both a human evaluator and ChatGPT, and the two sets of ratings were compared using percentage agreement and Spearman correlation.


Research Question

Can AI-generated evaluations of conversational responses produce communication-level ratings similar to human ratings?

Methodology

          Participants answered short conversational prompts designed to produce responses of varying complexity. Each verbal response was transcribed into text. A human evaluator then rated each transcript using a five-level communication complexity rubric. ChatGPT evaluated the same transcripts using the same rubric.

         The rubric was informed by pragmatic language assessment, where communication is evaluated not only by grammar or vocabulary, but also by social use, elaboration, and communicative effectiveness [1]. The study also used principles from language sample analysis, where measurable features such as response length, utterance complexity, and word count are commonly used to evaluate spoken language development [2]. Because the rating scale uses ordered categories from Level 1 to Level 5, Spearman correlation was used to compare human and AI ratings [3].
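As a rough illustration of the Spearman comparison described above, the sketch below ranks each set of ratings (tied ratings share the average of their rank positions) and then computes the Pearson correlation of those ranks. The function names and the example ratings are assumptions made for illustration; they are not the study's data or the exact procedure the author used.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sum((a - mx) ** 2 for a in rx)
    spread_y = sum((b - my) ** 2 for b in ry)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("undefined when one rater gives the same rating to every item")
    return covariance / (spread_x * spread_y) ** 0.5


# Hypothetical ratings for six responses, NOT the study data.
human_levels = [1, 2, 3, 4, 5, 3]
ai_levels = [1, 2, 4, 4, 5, 3]
print(round(spearman_rho(human_levels, ai_levels), 3))
```

Using average ranks matters here because a five-level scale applied to 20 responses necessarily produces many tied ratings.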

Communication Complexity Rubric

Level 1: Single word or very short response
Level 2: Simple sentence or preference statement
Level 3: Descriptive response with some elaboration
Level 4: Explanation, reasoning, or problem-solving
Level 5: Reflection, emotional insight, or perspective-taking
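For anyone wishing to apply the rubric programmatically, it could be stored as a simple lookup table. The sketch below is one possible representation; the descriptors are taken from the rubric, while the names `RUBRIC`, `describe_level`, and `rubric_text` and the plain-text layout are assumptions for illustration, not the format actually given to ChatGPT in this study.

```python
# Level descriptors copied from the five-level rubric used in this study.
RUBRIC = {
    1: "Single word or very short response",
    2: "Simple sentence or preference statement",
    3: "Descriptive response with some elaboration",
    4: "Explanation, reasoning, or problem-solving",
    5: "Reflection, emotional insight, or perspective-taking",
}


def describe_level(level):
    """Return the rubric descriptor for a level, rejecting out-of-range ratings."""
    if level not in RUBRIC:
        raise ValueError(f"level must be one of {sorted(RUBRIC)}, got {level!r}")
    return RUBRIC[level]


def rubric_text():
    """Render the rubric as plain text, one level per line (hypothetical layout)."""
    return "\n".join(f"Level {level}: {RUBRIC[level]}" for level in sorted(RUBRIC))


print(rubric_text())
```

Keeping the descriptors in one place means the same wording can be shown to a human rater and included in a model prompt, which helps ensure both raters work from identical criteria.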

Results

         A total of 20 conversational responses were analyzed. ChatGPT’s ratings matched the human ratings in 17 out of 20 responses, giving an overall agreement rate of 85%.
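The agreement rate is the number of exact matches divided by the number of responses, here 17 / 20 = 0.85. A minimal sketch of that calculation follows; the rating lists are placeholders constructed only to reproduce 17 matches out of 20 and do not represent the individual ratings collected in the study.

```python
def agreement_rate(human, ai):
    """Fraction of responses where the human and AI ratings are identical."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(1 for h, a in zip(human, ai) if h == a)
    return matches / len(human)


# Placeholder ratings (NOT the study data): 20 responses, 3 deliberate mismatches.
human_ratings = [1, 2, 3, 4, 5] * 4
ai_ratings = list(human_ratings)
for index in (2, 7, 12):  # hypothetical positions of the three disagreements
    ai_ratings[index] = 4

print(f"{agreement_rate(human_ratings, ai_ratings):.0%}")  # prints 85%
```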

         Agreement was strongest at the lowest and highest levels of the rubric. ChatGPT identified very simple responses, such as single-word answers, as Level 1, and longer reflective responses with emotional insight as Level 5. Most disagreements occurred between Level 3 and Level 4, where responses included moderate elaboration but limited reasoning or emotional reflection.
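One way to see where disagreements cluster is to tally each (human level, AI level) pair in which the two ratings differ. The sketch below assumes the ratings are available as two parallel lists; the function name and the example values are hypothetical and are not the study's data.

```python
from collections import Counter


def disagreement_pairs(human, ai):
    """Count each (human level, AI level) pair where the two ratings differ."""
    if len(human) != len(ai):
        raise ValueError("rating lists must have the same length")
    return Counter((h, a) for h, a in zip(human, ai) if h != a)


# Hypothetical ratings, NOT the study data.
example_human = [1, 3, 3, 4, 5]
example_ai = [1, 4, 4, 3, 5]

for (human_level, ai_level), count in disagreement_pairs(example_human, example_ai).most_common():
    print(f"human={human_level} ai={ai_level}: {count}")
```

A tally like this also shows the direction of each disagreement, for example whether the AI tends to rate higher or lower than the human at a given level.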

         The Spearman correlation coefficient between human and AI ratings was 0.82, indicating a strong positive relationship between the two sets of ratings. A positive trend was also observed between communication complexity and response length: higher-level responses were generally longer and contained more detailed and reflective language.
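The relationship between response length and rated level can be inspected by counting the words in each transcript and averaging within each level. The following is a minimal sketch that assumes whitespace-separated word counts; the example transcripts are invented for illustration and are not taken from the study's dataset.

```python
from collections import defaultdict
from statistics import mean


def word_count(transcript):
    """Count whitespace-separated words in a transcript."""
    return len(transcript.split())


def mean_length_by_level(transcripts, levels):
    """Average word count of the transcripts assigned to each rubric level."""
    if len(transcripts) != len(levels):
        raise ValueError("each transcript needs exactly one level")
    grouped = defaultdict(list)
    for text, level in zip(transcripts, levels):
        grouped[level].append(word_count(text))
    return {level: mean(counts) for level, counts in sorted(grouped.items())}


# Invented transcripts and levels, NOT the study data.
example_transcripts = ["Yes", "I like dogs", "No"]
example_levels = [1, 2, 1]
print(mean_length_by_level(example_transcripts, example_levels))
```

Word count is only a surface measure, so a summary like this describes length but says nothing about reasoning or reflection in a response.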


Discussion

        The findings suggest that AI-based transcript analysis can approximate human evaluation of communication complexity in short conversational responses. This supports the idea that AI may be useful as an assistive tool for analyzing language samples, especially when evaluation is based on structured rubric criteria and observable conversational features [2].

        In professional speech and language assessment, communication ability is not evaluated solely through vocabulary or sentence length. Speech-language pathologists typically assess multiple qualitative features, including conversational turn-taking, topic maintenance, pragmatic appropriateness, elaboration, emotional understanding, and perspective-taking during interaction [1][4][5]. These professional evaluation methods influenced the five-level communication complexity rubric used in this study.

        The results showed that ChatGPT performed well when identifying very simple responses as well as highly reflective responses involving emotional insight. However, it sometimes overrated responses that were longer but lacked meaningful reasoning or reflection. This suggests that word count alone is insufficient for determining communication complexity. Qualitative elements such as explanation, emotional awareness, contextual relevance, and perspective-taking remain important factors in communication evaluation [1][5].

        Most disagreements occurred between Level 3 and Level 4 responses, where communication complexity became more subjective and context-dependent. The human evaluator was generally better at recognizing subtle emotional meaning and conversational nuance that the AI occasionally overlooked. This aligns with existing research suggesting that while language models can identify linguistic patterns effectively, they may still struggle with pragmatic interpretation and socially contextualized communication [7].

        Despite these limitations, the relatively high agreement rate and strong Spearman correlation suggest that AI may still have potential as a supportive analytical tool for communication research, transcript evaluation, and adaptive conversational systems [3][7].

Conclusion

        This exploratory study suggests that ChatGPT can provide communication complexity ratings that broadly align with human evaluation. Across 20 conversational responses, the AI achieved an 85% agreement rate with human scoring, and its ratings showed a strong positive Spearman correlation (0.82) with the human ratings.

        The findings also reflect several characteristics commonly used in professional communication assessment, including response elaboration, conversational reasoning, emotional reflection, and pragmatic appropriateness [1][4]. While AI cannot replace professional judgement, it may be useful as a lightweight support tool for analyzing conversational responses and informing adaptive communication technologies.

        Future studies should include a larger and more diverse dataset, multiple human evaluators, and additional conversational features such as tone, hesitation, pauses, and emotional expression to improve reliability and evaluation depth [2][5].

References

[1] Adams, C. (2002). Practitioner review: The assessment of language pragmatics. Journal of Child Psychology and Psychiatry.

[2] Klatte, I. S., et al. (2022). Language sample analysis in clinical practice: Speech-language pathologists’ barriers, facilitators, and needs. Language, Speech, and Hearing Services in Schools.

[3] Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology.

[4] Tager-Flusberg, P. (1996). Pragmatic language in autism. Autism and Asperger Syndrome.

[5] Paul, R., & Norbury, C. (2012). Language Disorders From Infancy Through Adolescence (4th ed.). Elsevier.

[6] Vygotsky, L. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.

[7] Brown, T. B., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems.

ISDN2001/2002: Second Year Design Project
