Speech, Audio, and Talking-Head AI

Soundwave, TalkVid, EchoMind, S2S-Arena, audio-driven video

The speech and audio research line extends the lab's model ecosystem from text and images to the ways people actually communicate: voice, prosody, empathy, and paralinguistic cues, as well as audio-driven face and talking-head generation.

Research Storyline

Align

Soundwave connects speech and text

The first question is how little speech supervision is needed to ground LLMs in spoken input and output.

Evaluate

S2S-Arena tests how things are spoken

Speech-to-speech systems need to follow paralinguistic instructions, not just transcribe speech or produce a textually correct answer.

Empathy

EchoMind measures affective speech behavior

Empathetic speech requires multi-level evaluation of emotion, response style, and conversational alignment.

Embody

TalkVid turns speech into talking-head generation

Audio-driven video generation connects speech understanding to visible, embodied communication.

Project Threads

Soundwave

Studies speech-text alignment for LLMs, asking how much speech supervision is actually needed for strong speech-language grounding.

S2S-Arena

Evaluates paralinguistic instruction following in speech-to-speech systems, moving beyond word-level accuracy to how something is spoken.

EchoMind

Benchmarks empathetic speech language models with multi-level evaluation of affective and conversational signals.

TalkVid

Provides a diversified dataset for audio-driven talking-head synthesis across multiple languages and video generation settings.

Display Figures

Speech and interactive AI pipeline — The speech line shares a common pipeline with interactive agents: perceive, reason, respond, and evaluate behavior.
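
As a rough illustration of the four stages named in this caption, the following Python sketch wires perceive, reason, respond, and evaluate into a single spoken turn. Everything in it is hypothetical: the `Turn` record, the `run_turn` helper, and the stub stage functions are placeholders invented for this example and do not come from any of the projects listed here. Real stages would wrap a speech encoder, an LLM, a speech synthesizer, and a benchmark scorer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    """One spoken exchange. All fields are illustrative placeholders."""
    audio_in: bytes
    percept: Optional[dict] = None    # what was said and how (hypothetical schema)
    plan: Optional[str] = None        # the textual response plan
    audio_out: Optional[bytes] = None # the spoken reply
    scores: Optional[dict] = None     # behavior-level evaluation results


def run_turn(
    audio_in: bytes,
    perceive: Callable[[bytes], dict],
    reason: Callable[[dict], str],
    respond: Callable[[str], bytes],
    evaluate: Callable[[Turn], dict],
) -> Turn:
    """Run the four stages in order and keep every intermediate result."""
    turn = Turn(audio_in=audio_in)
    turn.percept = perceive(turn.audio_in)
    turn.plan = reason(turn.percept)
    turn.audio_out = respond(turn.plan)
    turn.scores = evaluate(turn)
    return turn


# Stub stages that stand in for real models (assumptions, not real APIs).
def perceive(audio: bytes) -> dict:
    return {"transcript": audio.decode("utf-8"), "tone": "neutral"}


def reason(percept: dict) -> str:
    return f"reply to: {percept['transcript']}"


def respond(plan: str) -> bytes:
    return plan.encode("utf-8")


def evaluate(turn: Turn) -> dict:
    return {"responded": turn.audio_out is not None}


turn = run_turn(b"hello", perceive, reason, respond, evaluate)
```

Keeping each stage as a separate callable means an evaluation step can inspect the perceived tone and the spoken output together, rather than only the transcript.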

Open source impact — Open datasets and benchmarks make speech and embodied interaction research easier to reproduce.

Paper Trail

Speech

Soundwave: Less is More for Speech-Text Alignment in LLMs

Studies efficient speech-text alignment so that LLMs can ground spoken signals in their language understanding.

Repository

Eval

S2S-Arena and EchoMind

Benchmarks paralinguistic instruction following and empathetic speech model behavior.

S2S-Arena

Video

TalkVid

Provides diversified audio-driven talking-head data so that a voice track can drive a speaker's visible speech behavior.

Dataset

Why It Matters

Human-facing agents need to understand how speech is delivered, not only the transcript.
Medical, educational, and companionship scenarios require empathy, timing, tone, and paralinguistic control.
Audio-driven video generation connects speech understanding with embodied and visual communication.

Resource Map

TalkVid

Large-scale diversified dataset for audio-driven talking-head synthesis.

Dataset

S2S-Arena

Evaluation benchmark for paralinguistic instruction following in speech-to-speech models.

Repository

EchoMind

Benchmark site for empathetic speech language model evaluation.

Project