Speech, Audio, and Talking-Head AI
Speech-language alignment, paralinguistic evaluation, empathetic speech benchmarks, and audio-driven video generation.
The speech and audio line extends the lab's model ecosystem from text and images to how people actually communicate: voice, prosody, empathy, paralinguistic cues, and audio-driven face or talking-head generation.
Research Storyline
The first question is how little speech supervision is needed to ground LLMs in spoken input and output.
Speech-to-speech systems need to follow paralinguistic instructions, not just transcribe or answer textually.
Empathetic speech requires multi-level evaluation of emotion, response style, and conversational alignment.
Audio-driven video generation connects speech understanding to visible, embodied communication.
Project Threads
Studies speech-text alignment for LLMs, asking how much speech supervision is actually needed for strong speech-language grounding.
Evaluates paralinguistic instruction following in speech-to-speech systems, moving beyond word accuracy to how something is spoken.
Benchmarks empathetic speech language models with multi-level evaluation of affective and conversational signals.
Provides a diversified dataset for audio-driven talking-head synthesis across multiple languages and video generation settings.
Paper Trail
Studies efficient speech-text alignment so LLMs can connect spoken signals and language behavior.
Repository
Benchmarks paralinguistic instruction following and empathetic speech model behavior.
S2S-Arena
Provides diversified audio-driven talking-head data so voice can control visible speech behavior.
Dataset
Why It Matters
- Human-facing agents need to understand how speech is delivered, not only the transcript.
- Medical, education, and companionship scenarios require empathy, timing, tone, and paralinguistic control.
- Audio-driven video generation connects speech understanding with embodied and visual communication.
Resource Map
Large-scale diversified dataset for audio-driven talking-head synthesis.
Dataset
Evaluation benchmark for paralinguistic instruction following in speech-to-speech models.
Repository
Benchmark site for empathetic speech language model evaluation.
Project