Speech, Audio, and Talking-Head AI

Speech-language alignment, paralinguistic evaluation, empathetic speech benchmarks, and audio-driven video generation.

The speech and audio research line extends the lab's model ecosystem from text and images to how people actually communicate: through voice, prosody, empathy, and paralinguistic cues, and through audio-driven talking-head generation that makes speech visible.

Research Storyline

Align
Soundwave connects speech and text

The first question is how little speech supervision is needed to ground LLMs in spoken input and output.

Evaluate
S2S-Arena tests how things are spoken

Speech-to-speech systems need to follow paralinguistic instructions, not just transcribe or answer textually.

Empathy
EchoMind measures affective speech behavior

Empathetic speech requires multi-level evaluation of emotion, response style, and conversational alignment.

Embody
TalkVid turns speech into talking-head generation

Audio-driven video generation connects speech understanding to visible, embodied communication.

Project Threads

Soundwave

Studies speech-text alignment for LLMs, asking how much speech supervision is actually needed for strong speech-language grounding.

S2S-Arena

Evaluates paralinguistic instruction following in speech-to-speech systems, moving beyond word accuracy to how something is spoken.

EchoMind

Benchmarks empathetic speech language models with multi-level evaluation of affective and conversational signals.

TalkVid

Provides a diversified dataset for audio-driven talking-head synthesis across multiple languages and video generation settings.

Paper Trail

Speech
Soundwave: Less is More for Speech-Text Alignment in LLMs

Studies efficient speech-text alignment so that LLMs can relate spoken signals to their language behavior.

Repository

Eval
S2S-Arena and EchoMind

Benchmarks paralinguistic instruction following and empathetic speech model behavior.

S2S-Arena

Video
TalkVid

Provides diversified audio-driven talking-head data so voice can control visible speech behavior.

Dataset

Why It Matters

  • Human-facing agents need to understand how speech is delivered, not only the transcript.
  • Medical, educational, and companionship scenarios require empathy, timing, tone, and paralinguistic control.
  • Audio-driven video generation connects speech understanding with embodied and visual communication.

Resource Map

TalkVid

Large-scale diversified dataset for audio-driven talking-head synthesis.

Dataset

S2S-Arena

Evaluation benchmark for paralinguistic instruction following in speech-to-speech models.

Repository

EchoMind

Benchmark site for empathetic speech language model evaluation.

Project