IEEE ASRU 2025 Tutorial

Multimodal Speech Modeling

From Understanding to Generation

December 6, 2025
9:00 AM – 12:00 PM
Honolulu, Hawaii

This tutorial offers a detailed journey through multimodal speech modeling, organized into three parts: (1) Understanding, (2) Generation, and (3) Real-Time and Interactive Systems.

We begin by reviewing how traditional speech processing has evolved with the advent of foundation models and audio-visual learning, and then delve into cutting-edge research that bridges audio, language, and visual modalities.

The goal is to equip participants with the conceptual frameworks and technical know-how needed to understand current audio-visual foundation models, and to inspire new research directions at the nexus of speech and multimodality.

Part 1: Understanding

Generative perspective of multimodal speech understanding with LMs

Part 2: Generation

Visual generation toward video generation, and multimodal conditional audio generation

Part 3: Towards a Future of Interactive Systems

Open opportunities in multimodal speech modeling

Schedule

A comprehensive half-day tutorial covering the latest advances

9:00 – 9:05
Opening
David M. Chan, UC Berkeley
9:05 – 9:40
A Generative Perspective of Multimodal Speech Understanding with LMs
Huck Yang, NVIDIA
9:40 – 9:50
Q&A for Part 1
Interactive Discussion
9:50 – 10:20
From Unified Multimodal Models to Visual Reasoning and Video Generation
XuDong Wang, Meta
10:20 – 10:55
Multimodal and Conditional AudioGen
Apoorv Vyas, Meta
10:55 – 11:10
Joint Q&A for Part 2
Interactive Discussion
11:10 – 11:50
Towards a Future of Interactive Systems
David M. Chan, UC Berkeley
11:50 – 12:00
Q&A + Closing
David M. Chan, UC Berkeley

Speakers

World-class researchers from leading institutions

Dr. David M. Chan

UC Berkeley

David M. Chan, Ph.D., is a postdoctoral scholar at the University of California, Berkeley, specializing in multimodal learning, human-computer interaction, and generative AI, including large language models (LLMs) and foundation models. His research focuses on developing efficient and scalable methods for integrating vision, audio, and language models, improving AI-human collaboration (e.g., video dubbing), and mitigating hallucination in generative AI systems. In industry, he has worked with leading organizations including Amazon, Google, and NASA. He is also the maintainer of TSNE-CUDA, an open-source GPU-accelerated tool for high-dimensional data visualization used by over 40,000 researchers. He holds a Ph.D. and an M.S. in Computer Science from UC Berkeley and dual B.Sc. degrees in Computer Science and Mathematics from the University of Denver.

Related Publications
  • Multi-Modal Pre-Training for Automated Speech Recognition, ICASSP 2022
  • Clair: Evaluating Image Captions with Large Language Models, EMNLP 2023
  • Multimodal Attention Merging for Improved Speech Recognition, ICASSP 2024
  • ANIM-400K: Large-Scale Dataset for Automated End-To-End Dubbing, ICASSP 2024
Dr. Huck Yang

NVIDIA Research

Dr. Huck Yang is a Senior Research Scientist at NVIDIA Research working on robust speech recognition and its connections to large-scale multimodal models. Prior to joining NVIDIA, he was a scientist on the Amazon Alexa ASR-LM team and a research intern at Google Speech/Brain. His research focuses on speech–language alignment, large-scale multilingual speech modeling, and adapting LLMs for speech processing tasks. Dr. Yang developed the first cross-modal acoustic alignment techniques (ICML 2021) to integrate temporal signals into LLMs. He obtained his Ph.D. and M.Sc. from the Georgia Institute of Technology under Prof. Chin-Hui Lee and his B.Sc. from National Taiwan University. He has served on the IEEE SPS Technical Committee on Applied Signal Processing Systems since 2022, and received a Best Student Paper Award nomination at Interspeech 2023 and a Best Industry Paper Honorable Mention at ACL 2025.

Related Publications
  • Test-Time Alignment for Large Language Models via Textual Model Predictive Control, ArXiv 2025
  • OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM, NVIDIA Tech Report
  • Audio Flamingo 3: Fully Open Large Audio Language Models, NeurIPS 2025 Spotlight
  • NeKo: Cross-Modal Generative Correction LLMs, ACL 2025 (oral)
  • OWLS: Scaling Laws for Multilingual Speech Models, ICML 2025
Dr. XuDong Wang

Meta Superintelligence Labs

Dr. XuDong Wang is a Research Scientist at Meta Superintelligence Labs and an incoming Assistant Professor at Duke University. He received his Ph.D. in EECS from UC Berkeley, where he was affiliated with the Berkeley AI Research (BAIR) lab under the mentorship of Prof. Trevor Darrell. Before joining BAIR, he was a staff member of the International Computer Science Institute (ICSI). He received his Master's degree in Intelligent Systems, Robotics, and Control from UC San Diego, advised by Prof. Nuno Vasconcelos. His research focuses on visual generation and multimodal learning.

Related Publications
  • InstanceDiffusion: Instance-level Control for Image Generation, CVPR 2024
  • Segment Anything without Supervision, NeurIPS 2024
  • Visual Lexicon: Rich Image Features in Language Space, CVPR 2025
  • Reconstruction Alignment Improves Unified Multimodal Models, ArXiv 2025
  • Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens, ArXiv 2025
Dr. Apoorv Vyas

Meta FAIR

Dr. Apoorv Vyas is a Research Scientist on Meta's Fundamental AI Research (FAIR) team, specializing in audio generation and multimodal learning. He earned his Ph.D. in Electrical Engineering from École Polytechnique Fédérale de Lausanne (EPFL), where he focused on improving speech recognition for low-resource languages through unsupervised learning. His research interests span deep learning, automatic speech recognition, audio generation, and computer vision. Prior to his doctoral studies, he gained industry experience at Intel as a System Engineer and at Oracle as an Applications Engineer. He holds a Bachelor's degree in Electronics and Electrical Engineering from the Indian Institute of Technology Guwahati. At Meta, he has contributed to groundbreaking work on audio generation models, including AudioCraft and multimodal conditional generation systems.

Related Publications
  • Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale, NeurIPS 2023
  • Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound, Meta Tech Report 2025
  • Audiobox: Unified Audio Generation with Natural Language Prompts, Meta Tech Report 2025