Spoken Conversational Agents with Large Language Models
A half-day tutorial on the convergence of spoken conversational agents toward voice-native LLMs, tracing the path from cascaded ASR/NLU pipelines to end-to-end, retrieval- and vision-grounded systems. It covers cross-modal alignment, joint speech–text training, and practical recipes for building robust conversational AI.
Tutorial Schedule
Half-day program featuring main talks and spotlight presentations
Conversational Systems & Agents
Foundations and research directions for spoken conversational agents with LLMs
Controllable Conversational AI & Task-Oriented Dialogue
Spotlight talk on controllable dialogue systems
Speech and Language Modeling
Techniques and perspectives on integrating speech models with LLMs
Bidirectional Human-AI Alignment & Value Alignment
Spotlight talk on spoken conversational agent alignment
End-to-End Spoken Language Models
SpeechGPT, SpeechTokenizer, and MiMo-Audio architecture
Multi-Modal Speech Agents & Reasoning
Integration of multimodal inputs and reasoning in speech-based agent design
Closing Remarks & Q&A
Wrap-up and pointers for further reading
Tutorial Materials
Download slides and supplementary materials from each speaker
All Presentation Materials
Access the complete collection of slides, demos, and resources from our Google Drive folder.
Citation
If you find this tutorial useful, please cite our work
BibTeX Reference
@inproceedings{yang-etal-2025-spoken,
    title = "Spoken Conversational Agents with Large Language Models",
    author = "Yang, Huck and
      Stolcke, Andreas and
      Heck, Larry P.",
    editor = "Pyatkin, Valentina and
      Vlachos, Andreas",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-tutorials.3/",
    doi = "10.18653/v1/2025.emnlp-tutorials.3",
    pages = "7--8",
    ISBN = "979-8-89176-336-4",
}
Tutorial Recording
Watch the full tutorial presentation from EMNLP 2025
Watch Tutorial on YouTube