Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they fail to generalize across the full spectrum of human diversity in ethnicity, language, and age. We argue this generalization gap is a direct symptom of fundamental limitations in existing training data, which lack the necessary scale, quality, and diversity.
To address this foundational challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1,244 hours of video from 7,729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability.
Furthermore, to enable more rigorous and equitable evaluation, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our comprehensive experiments demonstrate that a SOTA model trained on TalkVid significantly outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization and fairness.
Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. We will release TalkVid and TalkVid-Bench to the community to catalyze the development of the next generation of robust, generalizable, and equitable talking head models. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid.
The TalkVid construction pipeline. The process starts with (1) video collection and clip segmentation. Each candidate clip then undergoes (2) a multi-stage filtering cascade to enforce quality across aesthetics, motion, and facial detail. Finally, the pipeline’s effectiveness is (3) validated against human judgments.
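As a rough illustration of how such a filtering cascade can be organized, the sketch below applies the three quality gates in sequence, keeping only clips that pass all of them. The Clip fields, metric names, and threshold values are illustrative assumptions, not the exact TalkVid implementation.

# A minimal sketch of a multi-stage filtering cascade (assumed field names and thresholds).
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Clip:
    path: str
    aesthetic: float    # e.g. score from an aesthetic-quality predictor
    motion: float       # e.g. inter-frame head-motion stability score
    face_detail: float  # e.g. sharpness / quality score on the face crop


def filter_cascade(clips: Iterable[Clip],
                   min_aesthetic: float = 0.5,
                   min_motion: float = 0.6,
                   min_face_detail: float = 0.7) -> List[Clip]:
    """Apply the three quality gates in sequence; a clip must pass every stage."""
    kept = []
    for clip in clips:
        if clip.aesthetic < min_aesthetic:
            continue  # stage 1: aesthetic quality
        if clip.motion < min_motion:
            continue  # stage 2: motion stability
        if clip.face_detail < min_face_detail:
            continue  # stage 3: facial detail
        kept.append(clip)
    return kept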
Statistical distributions of the TalkVid dataset. Top row: technical quality metrics for the final, filtered dataset. Bottom row: distributions of the high-level characteristics, including video categories, language, and speaker demographics.
Comparison of open-source datasets for audio-driven talking-head generation. TalkVid contains 7,729 speakers, 1,244 hours of video, and resolutions up to 2160p, covering 15 languages and ages 0–60+, while uniquely providing both full-body content and captions. These characteristics establish TalkVid as the most comprehensive and diverse open-source dataset in this domain.
Sample clip gallery, grouped by language (Arabic, Chinese, English, Hindi, Korean, Polish, Portuguese, Thai), ethnicity (African, White, Asian), gender (Male, Female), and age group (19-30, 31-45, 46-60, 60+).
TalkVid-Bench comprises 500 carefully sampled and stratified video clips along four critical dimensions: age, gender, ethnicity, and language. This stratified design enables granular analysis of model performance across diverse subgroups, mitigating biases hidden in traditional aggregate evaluations. Each dimension is divided into balanced categories.
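To make the stratification concrete, the sketch below draws an equal number of clips from every category of one dimension. The record fields, category key, and per-category count are hypothetical and do not reproduce the exact TalkVid-Bench sampling recipe.

# A minimal sketch of balanced sampling along a single dimension
# (assumed metadata keys such as "age_group", "gender", "ethnicity", "language").
import random
from collections import defaultdict


def stratified_sample(records, key, per_category, seed=0):
    """Draw the same number of clips from every category of one dimension."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec)
    sample = []
    for category, items in sorted(buckets.items()):
        rng.shuffle(items)
        sample.extend(items[:per_category])
    return sample


# Example: a balanced subset over the age dimension.
# bench_age = stratified_sample(all_clips, key="age_group", per_category=25)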
Distribution of TalkVid-Bench across the language dimension (15 languages, 195 samples). Language abbreviations: ar (Arabic), pl (Polish), de (German), ru (Russian), fr (French), ko (Korean), pt (Portuguese), other (other languages), ja (Japanese), th (Thai), es (Spanish), it (Italian), hi (Hindi), en (English), zh (Chinese).
Distribution of TalkVid-Bench across three demographic dimensions. These statistics illustrate the diversity of TalkVid-Bench in terms of participant demographics, providing a comprehensive benchmark for evaluating models under varied demographic conditions.
Case study on generation results from TalkVid-Bench. Left: qualitative examples from the TalkVid-trained model, evaluated on diverse samples spanning language, ethnicity, gender, and age. Right: qualitative comparison on an unseen clip from TalkVid-Bench, showing outputs from V-Express fine-tuned on HDTF, Hallo3, TalkVid-Core (ours), and the Ground Truth (GT).