Open Voice-Chat Agentic Model Design, Evaluation and Benchmarking

Special Session at IEEE ASRU 2025

Key Topics and Subtopics

To structure the session, we will focus on several interconnected subtopics, each crucial for establishing comprehensive benchmarks for voice-chat agents:

Voice-Chat Model Design

Neural network architectures for voice-chat functions and tasks.

Real-Time Agent Evaluation

Metrics and methods for measuring system latency, turn-taking speed, and real-time interaction quality.

Benchmarking Frameworks for Voice-Chat Systems

Design of open benchmarks and evaluation toolkits that cover the voice-chat capabilities of modern chat agents (speech, text, and possibly vision).

Integration of Speech Models with LLMs

Strategies for integrating speech models with large language models to create agentic conversational AI, including how to evaluate the combined system.

Session Format

The session will be organized as a sequence of invited talks and paper presentations, followed by an interactive panel discussion. We plan to feature 6 to 8 contributed papers, selected via peer review of this special session's submissions. Each paper will present a novel voice-chat model design, a voice-chat task, an evaluation methodology, a benchmarking dataset, or an implemented system related to the session theme.

Expected Impact

By defining currently undefined tasks and metrics for voice-chat agents, this special session will lay the groundwork for standardized evaluations in our community. In the long term, having common benchmarks will accelerate progress by enabling researchers to compare systems directly and track improvements over time. We expect lively participation from both academic researchers and industrial practitioners (e.g., teams working on digital assistants and customer service bots), fostering collaborations that bridge speech and language disciplines. Ultimately, this session aligns with IEEE ASRU's goal of advancing speech understanding technologies: it will help answer "How do we know if one conversational system is better than another?" in a rigorous way. Establishing these evaluation criteria and benchmarks now will shape research and development of next-generation conversational AI that is more accurate, real-time, privacy-preserving, and robust in the open world.

Session Overview

This special session aims to foster the design of open voice-chat agentic systems and the definition of evaluation benchmarks for such agents. Recent advances in speech technology and large language models (LLMs) have enabled voice-based conversational agents that can engage in free-form dialogue and carry out complex tasks. However, the field currently lacks well-defined tasks and standardized metrics to evaluate these multimodal dialogue systems' performance.

Objectives

The primary objective is to define and accelerate the development of open evaluation frameworks for voice-interactive AI agents.

Organizers and Contact Information

Dr. Huck Yang – Senior Research Scientist, NVIDIA Research

Email: huckyang@nvidia.com

Prof. Yun-Nung (Vivian) Chen – Professor, Department of Computer Science & Information Engineering, National Taiwan University

Email: y.v.chen@ieee.org

Prof. Larry Heck – Professor (ECE & Interactive Computing), Georgia Institute of Technology; Georgia Research Alliance Eminent Scholar

Email: larryheck@gatech.edu

Organizer Biographies

Dr. Huck Yang

Dr. Huck Yang is a Senior Research Scientist at NVIDIA Research focusing on speech-language cross-modal alignment, with several representative works on input-based model adaptation (i.e., speech-based in-context learning, textual prompting, and system reprogramming). He received his Ph.D. from the Georgia Institute of Technology and has served as an area chair or committee member on parameter-efficient learning at IEEE ICASSP 2022 to 2025, EMNLP 2024, and ACL 2025. Dr. Yang has organized special sessions and tutorials at Interspeech 2023, ICASSP 2022 to 2024, and EMNLP 2025.

Prof. Yun-Nung (Vivian) Chen

Prof. Yun-Nung (Vivian) Chen is a Professor in the Department of Computer Science and Information Engineering at National Taiwan University, specializing in spoken dialogue systems, language understanding, and multimodal language processing. She earned her Ph.D. from Carnegie Mellon University. She leads the MiU Lab at NTU and has published extensively on conversational AI; her expertise in dialogue system evaluation and user-centric AI design will inform the session's focus on benchmarking conversational agents.

Prof. Larry Heck

Prof. Larry Heck is a Professor with joint appointments in Electrical & Computer Engineering and Interactive Computing at the Georgia Institute of Technology, and holds the Rhesa S. Farmer Chair of Advanced Computing Concepts as a Georgia Research Alliance Eminent Scholar. An IEEE Fellow, Prof. Heck brings extensive industry experience in creating and evaluating voice assistants, making him ideally suited to co-organize this special session on voice-agent benchmarking.

Submit Your Paper

We invite researchers and practitioners to submit their work on voice-chat agentic models, evaluation methods, and benchmarking frameworks.

Submit Paper to IEEE ASRU 2025