RAG and Instruction Data Infrastructure
Open data, instruction tuning, and retrieval-augmented resources for downstream LLM builders.
Many FreedomAI projects are useful because they release the substrate that other builders need: instruction data, domain corpora, retrieval-augmented tasks, medical QA datasets, multimodal feedback data, and reproducible model checkpoints.
Research Storyline
Huatuo-26M, ApolloCorpus, PubMedVision, and related releases give medical and multilingual projects a stronger data substrate.
RAG-Instruct turns retrieval into an instruction-following skill rather than a post-hoc wrapper around memorized model answers.
LLMZoo, InstructionZoo, and Hugging Face checkpoints make the data useful to people who want to reproduce or extend the recipe.
Datasets are most useful when paired with CMB, Apollo evaluation, multimodal medical benchmarks, and downstream project pages.
Infrastructure Pieces
Retrieval-augmented instruction data designed to teach models how to use retrieved evidence rather than only memorize parametric knowledge.
Curated model, data, evaluation, and instruction-tuning resources that made the lab's early open LLM ecosystem easier to reproduce.
Huatuo-26M, ApolloCorpus, PubMedVision, and other datasets provide domain-specific supervision for medicine, multilingual AI, and visual reasoning.
Small and mid-size checkpoints such as RAG-Instruct-Llama3-3B let researchers test data recipes without rebuilding the whole pipeline.
Paper Trail
Large-scale Chinese medical QA data that feeds medical LLM instruction tuning and evaluation.
Repository
Teaches models how to condition answers on retrieved evidence across diverse tasks.
Repository
Domain corpora for multilingual medical modeling and medical visual-language supervision.
ApolloCorpus
Design Principles
- Data releases should be paired with model checkpoints, paper context, and task definitions so they are usable rather than merely downloadable.
- Retrieval-augmented instruction data should stress evidence use, citation-sensitive reasoning, and answer grounding.
- Domain corpora should connect to benchmarks so downstream work can measure whether adaptation helped.
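The second principle can be made concrete with a small sketch of a retrieval-augmented training record. Everything here is an illustrative assumption: the `RagExample` fields, the `[n]` citation convention, and the `build_prompt` and `is_grounded` helpers are hypothetical and are not taken from the RAG-Instruct release, whose actual schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class RagExample:
    """Hypothetical record layout; not the actual RAG-Instruct schema."""
    instruction: str
    documents: list[str]          # retrieved passages shown to the model
    answer: str                   # target answer, expected to cite passages as [n]
    cited: list[int] = field(default_factory=list)  # 1-based indices the answer relies on


def build_prompt(example: RagExample) -> str:
    """Place numbered retrieved passages before the instruction so the answer can cite them."""
    lines = [f"[{i}] {doc}" for i, doc in enumerate(example.documents, start=1)]
    lines.append("")
    lines.append(f"Instruction: {example.instruction}")
    lines.append("Answer using the documents above and cite them as [n].")
    return "\n".join(lines)


def is_grounded(example: RagExample) -> bool:
    """Return True only if every cited index names a supplied passage and appears in the answer."""
    if not example.cited:
        return False
    return all(
        1 <= i <= len(example.documents) and f"[{i}]" in example.answer
        for i in example.cited
    )


# Placeholder content only; no real corpus text is used here.
example = RagExample(
    instruction="Summarize what the first passage states.",
    documents=["Placeholder passage A.", "Placeholder passage B."],
    answer="The first passage states ... [1]",
    cited=[1],
)
prompt = build_prompt(example)
```

A filter like `is_grounded` is one simple way to drop records whose answers never reference the retrieved evidence; a real pipeline would likely need a stronger check than string matching on citation markers.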
Resource Map
A Llama3-based checkpoint trained with retrieval-augmented instruction data.
Model
Large Chinese medical QA data used across medical LLM research.
Repository
Multilingual medical corpus for the Apollo model family.
Dataset