TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

1The Chinese University of Hong Kong, Shenzhen 2Sun Yat-sen University 3The Hong Kong University of Science and Technology

Examples from our TalkVid dataset, showcasing the diversity in identity, ethnicity, and head pose.

Abstract

Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they do not generalize to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of fundamental limitations in existing training data, which lack the necessary scale, quality, and diversity.

To address this foundational challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability.

Furthermore, to enable more rigorous and equitable evaluation, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our comprehensive experiments demonstrate that a SOTA model trained on TalkVid significantly outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization and fairness.

Crucially, our analysis on TalkVid-Bench reveals critical performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. We will release TalkVid and TalkVid-Bench to the community to catalyze the development of the next generation of robust, generalizable, and equitable talking head models. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid.

Data Construction Pipeline


The TalkVid construction pipeline. The process starts with (1) video collection and clip segmentation. Each candidate clip then undergoes (2) a multi-stage filtering cascade to enforce quality across aesthetics, motion, and facial detail. Finally, the pipeline’s effectiveness is (3) validated against human judgments.
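The multi-stage filtering cascade can be pictured as a sequence of threshold checks that every candidate clip must clear. The sketch below is illustrative only: the `Clip` fields and the `THRESHOLDS` values are hypothetical placeholders, not the actual scorers or cutoffs used in the TalkVid pipeline.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """Hypothetical per-clip quality scores; the real pipeline
    computes these with dedicated motion, aesthetic, and
    face-detail models."""
    clip_id: str
    motion_stability: float   # higher = steadier head/camera motion
    aesthetic_score: float    # higher = better visual quality
    face_detail: float        # higher = sharper facial region

# Illustrative cutoffs; the actual values are pipeline-specific.
THRESHOLDS = {
    "motion_stability": 0.7,
    "aesthetic_score": 0.5,
    "face_detail": 0.6,
}

def passes_cascade(clip: Clip) -> bool:
    """A clip survives only if it clears every stage of the cascade."""
    return (
        clip.motion_stability >= THRESHOLDS["motion_stability"]
        and clip.aesthetic_score >= THRESHOLDS["aesthetic_score"]
        and clip.face_detail >= THRESHOLDS["face_detail"]
    )

def filter_clips(clips: list[Clip]) -> list[Clip]:
    """Apply the full cascade, keeping only clips that pass all stages."""
    return [c for c in clips if passes_cascade(c)]
```

Running the cascade as a conjunction of per-stage checks means a clip is discarded as soon as any single quality axis falls below its cutoff, which is what keeps the surviving data uniformly high-quality across all three axes.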

Data Statistics


Statistical distributions of the TalkVid dataset. Top row: technical quality metrics for the final, filtered dataset. Bottom row: distributions of the high-level characteristics, including video categories, language, and speaker demographics.

Comparison with Open-Source Datasets


Comparison of open-source datasets for audio-driven talking-head generation. TalkVid contains 7,729 speakers, 1,244 hours of video, and resolutions up to 2160p, covering 15 languages and ages 0–60+, while uniquely providing both full-body content and captions. These characteristics establish TalkVid as the most comprehensive and diverse open-source dataset in this domain.

Video Demonstrations Across Four Dimensions

Language

Arabic

Chinese

English

Hindi

Korean

Polish

Portuguese

Thai


Ethnicity

African #001

African #002

African #003

White #001

White #002

Asian #001

Asian #002

Asian #003


Gender

Male #001

Male #002

Male #003

Male #004

Female #001

Female #002

Female #003

Female #004


Age

19-30 #001

19-30 #002

31-45 #001

31-45 #002

46-60 #001

46-60 #002

60+ #001

60+ #002

TalkVid Bench

1. Benchmark Design

TalkVid-Bench comprises 500 carefully sampled and stratified video clips along four critical demographic and language dimensions: age, gender, ethnicity, and language. This stratified design enables granular analysis of model performance across diverse subgroups, mitigating biases hidden in traditional aggregate evaluations. Each dimension is divided into balanced categories:

  • Age: 0–19, 19–30, 31–45, 46–60, 60+, with a total of 105 samples.
  • Gender: Male, Female, with a total of 100 samples.
  • Ethnicity: Black, White, Asian, with a total of 100 samples.
  • Language: English, Chinese, Arabic, Polish, German, Russian, French, Korean, Portuguese, Japanese, Thai, Spanish, Italian, Hindi, and Other languages, with a total of 195 samples.
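The stratified design above amounts to balanced per-category sampling within each dimension. A minimal sketch of that idea follows; the `stratified_sample` helper and its parameters are hypothetical illustrations, not the released benchmark-construction code.

```python
import random
from collections import defaultdict

def stratified_sample(clips, key, per_category, seed=0):
    """Draw an equal number of clips from each category under `key`.

    `clips` is a list of dicts with metadata (e.g. {"id": ..., "gender": ...});
    `per_category` clips are sampled uniformly from every category so that
    no subgroup dominates the resulting evaluation set.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible benchmark
    buckets = defaultdict(list)
    for clip in clips:
        buckets[clip[key]].append(clip)
    sample = []
    for category, members in sorted(buckets.items()):
        if len(members) < per_category:
            raise ValueError(f"not enough clips for category {category!r}")
        sample.extend(rng.sample(members, per_category))
    return sample
```

Sampling an equal number of clips per category is what lets subgroup-level metrics (e.g. per-language or per-age-group scores) be compared directly, instead of being swamped by whichever subgroup happens to be largest in the raw data.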

2. Linguistic and Demographic Distributions


Distribution of TalkVid-Bench across the language dimension (15 languages, 195 samples). Language abbreviations: ar (Arabic), pl (Polish), de (German), ru (Russian), fr (French), ko (Korean), pt (Portuguese), other (other languages), ja (Japanese), th (Thai), es (Spanish), it (Italian), hi (Hindi), en (English), zh (Chinese).

Demographic Distributions

Distribution of TalkVid-Bench across three demographic dimensions. These statistics illustrate the diversity of TalkVid-Bench in terms of participant demographics, providing a comprehensive benchmark for evaluating models under varied demographic conditions.

3. Case Study


Case study on generation results from TalkVid-Bench. Left: qualitative examples from the TalkVid-trained model, evaluated on diverse samples spanning language, ethnicity, gender, and age. Right: qualitative comparison on an unseen clip from TalkVid-Bench, showing outputs from V-Express fine-tuned on HDTF, Hallo3, TalkVid-Core (ours), and the Ground Truth (GT).
