Medical Evaluation Benchmarks

CMB, GMAI-MMBench, LiveClin, DxBench, Doctor workflow

Generic chat quality is not enough to judge medical AI. The lab's benchmark work builds domain-specific tests for Chinese medical knowledge, multimodal medical perception, leakage-controlled live clinical evaluation, diagnostic reasoning, and doctor-centered workflows.

Evaluation Storyline

Knowledge

CMB starts from Chinese medical exams and clinical QA

CMB tests whether models know medicine in the language and assessment structure used by Chinese medical education and practice.

Dialogue

CMB-Clin adds complex consultation cases

Clinical question answering forces models to reason over patient histories and multi-turn information rather than recall memorized answers to isolated multiple-choice questions.

Vision

GMAI-MMBench moves evaluation into multimodality

Medical AI must combine images, reports, and clinical knowledge, so the benchmark stack expands beyond text-only questions.

Reality

LiveClin, DxBench, and workflow tests push evaluation beyond comfortable static benchmarks

Later evaluation directions stress leakage control, diagnostic reasoning, and doctor-centered tasks closer to deployed clinical workflows.

Benchmark Layers

CMB

A comprehensive Chinese medical benchmark covering medical exams, clinical QA, and Chinese medical knowledge, released with public code and a Hugging Face dataset.
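Exam-style questions are commonly scored by exact match on the selected option letters. The following is a minimal sketch of that idea, not CMB's actual evaluation tooling: the function names and the answer format (letters such as "A" or "AC") are assumptions made for illustration.

```python
import string

OPTION_LETTERS = set(string.ascii_uppercase)

def normalize(answer: str) -> frozenset:
    """Turn an answer string such as 'c, a' into the option set {'A', 'C'}."""
    return frozenset(ch for ch in answer.upper() if ch in OPTION_LETTERS)

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of questions where the predicted option set equals the gold set.

    Multi-answer questions count as correct only if every option matches.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical model outputs and gold answers for three questions.
preds = ["A", "c, a", "B"]
golds = ["A", "AC", "D"]
accuracy = exact_match_accuracy(preds, golds)
print(f"{accuracy:.3f}")  # 0.667
```

Comparing sets rather than raw strings keeps the score insensitive to option order and separators, which matters when model outputs are free-form text.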

GMAI-MMBench

A multimodal benchmark for general medical AI, testing whether models can combine medical vision, text, and domain reasoning.

LiveClin and DxBench

Clinical evaluation directions that reduce leakage risk and stress diagnostic reasoning under more realistic clinical settings.
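One common way to reduce leakage risk is to evaluate a model only on cases published after its training cutoff. The sketch below illustrates that general idea under stated assumptions: the `ClinicalCase` fields, the sample data, and the cutoff date are hypothetical, and this is not a description of how LiveClin or DxBench are actually constructed.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ClinicalCase:
    # Hypothetical schema for illustration only.
    case_id: str
    published: date
    question: str

def leakage_safe_cases(cases, training_cutoff: date):
    """Keep only cases published strictly after the model's training cutoff."""
    return [case for case in cases if case.published > training_cutoff]

# Made-up example cases.
cases = [
    ClinicalCase("case-001", date(2023, 11, 2), "Adult with chest pain ..."),
    ClinicalCase("case-002", date(2024, 9, 15), "Child with persistent fever ..."),
]

safe = leakage_safe_cases(cases, training_cutoff=date(2024, 6, 30))
print([case.case_id for case in safe])  # ['case-002']
```

A date filter like this cannot rule out every form of contamination, so it complements rather than replaces checks on the diagnostic reasoning itself.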

Workflow-aligned tasks

Doctor-centric evaluation reframes medical AI around the tasks clinicians actually perform, rather than around static QA accuracy alone.

Display Figures

CMB benchmark structure — CMB combines exam-style medical knowledge with clinical diagnostic questions and expert-aligned scoring.

Medical model evaluation loop — Model building and benchmark design stay connected: evaluation reveals where medical adaptation still fails.

Evaluation Philosophy

Benchmarks should not reward fluent medical-sounding text alone. Strong medical evaluation asks whether the model uses evidence, handles uncertainty, follows workflow constraints, avoids memorized leakage, and supports the clinician's actual decision process.

Paper Trail

CMB

CMB: A Comprehensive Medical Benchmark in Chinese

Builds a Chinese medical benchmark with exam and clinical components, giving Chinese medical LLMs a domain-specific yardstick.

Repository

Vision

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Extends the evaluation stack to general medical AI scenarios that require multimodal perception and domain reasoning.

Project site

Clinical

LiveClin and DxBench

Push medical evaluation toward live, leakage-resistant, diagnostic, and workflow-aligned settings.

LiveClin

Resource Map

CMB code

Core Chinese medical benchmark repository and evaluation tooling.

Repository

CMB dataset

Hugging Face release for reproducible Chinese medical benchmark experiments.

Dataset

GMAI-MMBench

Multimodal general medical AI benchmark and public project resources.

Project