Medical Evaluation Benchmarks

Benchmark infrastructure for Chinese medical QA, multimodal medical AI, live clinical testing, and doctor workflows.

Medical AI cannot be judged by generic chat quality alone. The lab's benchmark work builds domain-specific tests for Chinese medical knowledge, multimodal medical perception, leakage control in live clinical evaluation, diagnostic reasoning, and doctor-centered workflows.

Evaluation Storyline

Knowledge
CMB starts from Chinese medical exams and clinical QA

It tests whether models know medicine in the language and assessment structure used by Chinese medical education and practice.

Dialogue
CMB-Clin adds complex consultation cases

Clinical question answering forces models to reason over patient histories and multi-turn information rather than recall answers to isolated multiple-choice questions.

Vision
GMAI-MMBench moves evaluation into multimodality

Medical AI must combine images, reports, and clinical knowledge, so the benchmark stack expands beyond text-only questions.

Reality
LiveClin, DxBench, and workflow tests move evaluation beyond comfortable static benchmarks

Later evaluation directions stress leakage control, diagnostic reasoning, and doctor-centered tasks closer to deployed clinical workflows.

Benchmark Layers

CMB

A comprehensive Chinese medical benchmark covering medical exams, clinical QA, and Chinese medical knowledge, with code and Hugging Face data.
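To make the exam-style part concrete, the following is a minimal sketch of how multiple-choice accuracy could be scored. It is not CMB's own evaluation code: the `ExamItem` fields, the `extract_choice` helper, the A to E option range, and the rule that a multi-answer question counts as correct only on an exact match are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class ExamItem:
    """Hypothetical exam record; field names are illustrative, not CMB's schema."""
    question: str
    options: dict[str, str]
    answer: str  # gold option letters, e.g. "B" or "AC"


# Accept replies made only of option letters plus separators, e.g. "B" or "a, c".
_CHOICE_ONLY = re.compile(r"[A-E](?:[\s,]*[A-E])*")


def extract_choice(response: str) -> str:
    """Return sorted unique option letters, or "" if the reply is not letters only."""
    text = response.strip().upper()
    if not _CHOICE_ONLY.fullmatch(text):
        return ""
    return "".join(sorted(set(re.findall(r"[A-E]", text))))


def accuracy(items: list[ExamItem], responses: list[str]) -> float:
    """Exact-match accuracy: every gold letter selected and no extra letter."""
    if not items:
        return 0.0
    correct = 0
    for item, response in zip(items, responses):
        gold = "".join(sorted(set(item.answer.upper())))
        if extract_choice(response) == gold:
            correct += 1
    return correct / len(items)
```

The strict parser treats any free-text reply as wrong, which assumes the model is prompted to answer with option letters only. A more lenient extractor would raise scores for chatty models, so the parsing rule should be reported alongside the accuracy.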

GMAI-MMBench

A multimodal benchmark for general medical AI, testing whether models can combine medical vision, text, and domain reasoning.

LiveClin and DxBench

Clinical evaluation directions that reduce leakage risk and stress diagnostic reasoning under more realistic clinical settings.
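One common way to reduce leakage risk is to evaluate only on cases that appeared after a model's training cutoff. The sketch below illustrates that idea under stated assumptions; it does not describe how LiveClin or DxBench are actually built, and `ClinicalCase`, its fields, and `leakage_safe_cases` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClinicalCase:
    """Hypothetical case record; not the schema of LiveClin or DxBench."""
    case_id: str
    published: date
    question: str


def leakage_safe_cases(
    cases: list[ClinicalCase], training_cutoff: date
) -> list[ClinicalCase]:
    """Keep only cases published strictly after the model's training cutoff."""
    return [case for case in cases if case.published > training_cutoff]
```

Because the cutoff differs per model, the evaluated subset differs too, so scores are only comparable across models when computed on the intersection of their leakage-safe subsets.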

Workflow-aligned tasks

Doctor-centric evaluation reframes medical AI around tasks clinicians actually perform, rather than around static QA accuracy alone.

Evaluation Philosophy

Benchmarks should not reward fluent medical-sounding text alone. Strong medical evaluation asks whether the model uses evidence, handles uncertainty, follows workflow constraints, avoids memorized leakage, and supports the clinician's actual decision process.

Paper Trail

CMB
CMB: A Comprehensive Medical Benchmark in Chinese

Builds a Chinese medical benchmark with exam and clinical components, giving Chinese medical LLMs a domain-specific yardstick.

Repository
Vision
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Extends the evaluation stack to general medical AI scenarios that require multimodal perception and domain reasoning.

Project site
Clinical
LiveClin and DxBench

Push medical evaluation toward live, leakage-resistant, diagnostic, and workflow-aligned settings.

LiveClin

Resource Map

CMB code

Core Chinese medical benchmark repository and evaluation tooling.

Repository
CMB dataset

Hugging Face release for reproducible Chinese medical benchmark experiments.

Dataset
GMAI-MMBench

Multimodal general medical AI benchmark and public project resources.

Project