RAG and Instruction Data Infrastructure

Open data infrastructure

RAG-Instruct LLMZoo InstructionZoo Huatuo-26M ApolloCorpus

FreedomIntelligence open source repository overview

Many FreedomAI projects are useful because they release the substrate that other builders need: instruction data, domain corpora, retrieval-augmented tasks, medical QA datasets, multimodal feedback data, and reproducible model checkpoints.

Research Storyline

Collect

Build reusable domain corpora

Huatuo-26M, ApolloCorpus, PubMedVision, and related releases give medical and multilingual projects reusable, domain-specific training data.

Ground

Teach models to use retrieved evidence

RAG-Instruct turns retrieval into an instruction-following skill rather than a post-hoc wrapper around memorized model answers.

Package

Release models with data and tasks

LLMZoo, InstructionZoo, and Hugging Face checkpoints make the data useful to people who want to reproduce or extend the recipe.

Feed back

Connect infrastructure to benchmarks

Datasets are most useful when paired with CMB, Apollo evaluation, multimodal medical benchmarks, and downstream project pages.

Infrastructure Pieces

RAG-Instruct

Retrieval-augmented instruction data designed to teach models to use retrieved evidence rather than rely only on parametric knowledge.
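
To make "using retrieved evidence" concrete, here is a minimal sketch of how one retrieval-augmented training record could be assembled. The RetrievedDoc schema, the prompt wording, and the build_rag_example helper are illustrative assumptions, not the format RAG-Instruct actually ships.

```python
from dataclasses import dataclass


@dataclass
class RetrievedDoc:
    """One retrieved passage (hypothetical schema, not the RAG-Instruct format)."""
    title: str
    text: str


def build_rag_example(instruction: str, docs: list[RetrievedDoc], answer: str) -> dict:
    """Pack retrieved passages and an instruction into one training record."""
    # Number each passage so the target answer can point back to its evidence.
    evidence = "\n".join(
        f"[{i}] {doc.title}: {doc.text}" for i, doc in enumerate(docs, start=1)
    )
    # Assumed prompt template: evidence first, then the instruction.
    prompt = (
        "Use the retrieved documents to answer. Cite passages by number.\n\n"
        f"Documents:\n{evidence}\n\n"
        f"Instruction: {instruction}"
    )
    return {"prompt": prompt, "response": answer}


example = build_rag_example(
    instruction="Which vitamin deficiency causes scurvy?",
    docs=[
        RetrievedDoc("Scurvy", "Scurvy results from a lack of vitamin C."),
        RetrievedDoc("Rickets", "Rickets is linked to vitamin D deficiency."),
    ],
    answer="Vitamin C deficiency causes scurvy [1].",
)
print(example["prompt"])
```

Including a distractor passage, as the second document does here, is one way such data might push a model to select evidence rather than copy whatever was retrieved.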

LLMZoo and InstructionZoo

Curated model, data, evaluation, and instruction-tuning resources that made the lab's early open LLM ecosystem easier to reproduce.

Domain corpora

Huatuo-26M, ApolloCorpus, PubMedVision, and other datasets provide domain-specific supervision for medicine, multilingual AI, and visual reasoning.

Model releases

Small and mid-size checkpoints such as RAG-Instruct-Llama3-3B let researchers test data recipes without rebuilding the whole pipeline.

Paper Trail

Medical data

Huatuo-26M

Large-scale Chinese medical QA data that feeds medical LLM instruction tuning and evaluation.
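
As a rough illustration of how QA pairs can feed instruction tuning, the sketch below maps question/answer rows to a generic instruction record. The field names (question, answer, instruction, input, output), the fixed instruction string, and the sample row are assumptions for illustration and are not taken from the Huatuo-26M release.

```python
import json


def qa_to_instruction_record(question: str, answer: str) -> dict:
    """Map one QA pair to a generic instruction-tuning record."""
    return {
        # Hypothetical task instruction; a real recipe may vary it per source.
        "instruction": "Answer the patient's medical question.",
        "input": question.strip(),
        "output": answer.strip(),
    }


def convert(qa_pairs: list[dict]) -> list[dict]:
    """Convert QA pairs, dropping rows with an empty question or answer."""
    records = []
    for row in qa_pairs:
        question = row.get("question", "")
        answer = row.get("answer", "")
        if question.strip() and answer.strip():
            records.append(qa_to_instruction_record(question, answer))
    return records


# Made-up placeholder rows, not real dataset content.
sample = [
    {"question": " What is hypertension? ",
     "answer": "Hypertension is persistently elevated blood pressure."},
    {"question": "", "answer": "Row with a missing question."},
]
records = convert(sample)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Passing ensure_ascii=False keeps Chinese text readable when the same conversion is run over Chinese QA rows.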

Repository

RAG

RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

Teaches models how to condition answers on retrieved evidence across diverse tasks.

Repository

Multilingual

ApolloCorpus and PubMedVision

Domain corpora for multilingual medical modeling and medical visual-language supervision.

ApolloCorpus

Design Principles

Data releases should be paired with model checkpoints, paper context, and task definitions so they are usable rather than merely downloadable.
Retrieval-augmented instruction data should stress evidence use, citation-sensitive reasoning, and answer grounding.
Domain corpora should connect to benchmarks so downstream work can measure whether adaptation helped.
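
The last principle can be checked with very little code. The sketch below compares a base and an adapted model on the same multiple-choice benchmark items; the function names and the toy predictions are hypothetical and do not reflect any published result.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of benchmark items where the prediction matches the gold label."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def adaptation_gain(base_preds: list[str], adapted_preds: list[str], gold: list[str]) -> float:
    """Accuracy of the adapted model minus accuracy of the base model."""
    return accuracy(adapted_preds, gold) - accuracy(base_preds, gold)


# Toy labels for four multiple-choice questions (illustrative only).
gold = ["A", "C", "B", "D"]
base = ["A", "B", "B", "A"]      # 2 of 4 correct
adapted = ["A", "C", "B", "A"]   # 3 of 4 correct

gain = adaptation_gain(base, adapted, gold)
print(f"accuracy gain: {gain:+.2f}")  # prints "accuracy gain: +0.25"
```

Scoring both models on identical items keeps the comparison about the data recipe rather than about differences in the evaluation set.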

Resource Map

RAG-Instruct model

A Llama3-based checkpoint trained with retrieval-augmented instruction data.

Model

Huatuo-26M

Large Chinese medical QA data used across medical LLM research.

Repository

ApolloCorpus

Multilingual medical corpus for the Apollo model family.

Dataset