RAG and Instruction Data Infrastructure

Open data, instruction tuning, and retrieval-augmented resources for downstream LLM builders.

Open data infrastructure
RAG-Instruct, LLMZoo, InstructionZoo, Huatuo-26M, ApolloCorpus
FreedomIntelligence open source repository overview

Many FreedomAI projects are useful because they release the substrate that other builders need: instruction data, domain corpora, retrieval-augmented tasks, medical QA datasets, multimodal feedback data, and reproducible model checkpoints.

Research Storyline

Collect
Build reusable domain corpora

Huatuo-26M, ApolloCorpus, PubMedVision, and related releases give medical and multilingual projects a shared, domain-specific data foundation.

Ground
Teach models to use retrieved evidence

RAG-Instruct turns retrieval into an instruction-following skill rather than a post-hoc wrapper around memorized model answers.

Package
Release models with data and tasks

LLMZoo, InstructionZoo, and Hugging Face checkpoints make the data useful to people who want to reproduce or extend the recipe.

Feed back
Connect infrastructure to benchmarks

Datasets are most useful when paired with CMB (the Comprehensive Medical Benchmark in Chinese), Apollo evaluation, multimodal medical benchmarks, and downstream project pages.

Infrastructure Pieces

RAG-Instruct

Retrieval-augmented instruction data designed to teach models how to use retrieved evidence rather than rely only on parametric knowledge.
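
As a rough illustration of what one retrieval-augmented instruction example might look like, the sketch below pairs retrieved passages with an instruction and a grounded target answer. The `RagExample` fields, the prompt template, and the helper names are assumptions made for illustration; they are not the format RAG-Instruct itself uses, and the sample passages are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RagExample:
    """One hypothetical retrieval-augmented instruction example."""
    instruction: str
    passages: List[str]  # retrieved evidence, possibly including distractors
    answer: str          # target response grounded in the passages


def build_rag_prompt(example: RagExample) -> str:
    """Render numbered passages followed by the instruction (illustrative template)."""
    numbered = [f"[{i}] {p.strip()}" for i, p in enumerate(example.passages, start=1)]
    docs = "\n".join(numbered)
    return (
        "Retrieved documents:\n"
        f"{docs}\n\n"
        f"Instruction: {example.instruction.strip()}\n"
        "Answer using the documents above where they are relevant."
    )


def to_training_pair(example: RagExample) -> dict:
    """Pair the rendered prompt with its target answer for supervised fine-tuning."""
    return {"prompt": build_rag_prompt(example), "completion": example.answer}


# Made-up sample: one relevant passage and one distractor.
example = RagExample(
    instruction="Which vitamin deficiency causes scurvy?",
    passages=[
        "Scurvy results from a prolonged lack of vitamin C.",
        "Rickets is associated with vitamin D deficiency.",
    ],
    answer="Scurvy is caused by vitamin C deficiency [1].",
)
pair = to_training_pair(example)
print(pair["prompt"])
```

Including a distractor passage next to the relevant one is one way to make the model choose evidence rather than copy whatever was retrieved. How the actual dataset mixes relevant and irrelevant passages is not specified here.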

LLMZoo and InstructionZoo

Curated model, data, evaluation, and instruction-tuning resources that made the lab's early open LLM ecosystem easier to reproduce.

Domain corpora

Huatuo-26M, ApolloCorpus, PubMedVision, and other datasets provide domain-specific supervision for medicine, multilingual AI, and visual reasoning.

Model releases

Small and mid-size checkpoints such as RAG-Instruct-Llama3-3B let researchers test data recipes without rebuilding the whole pipeline.

Paper Trail

Medical data
Huatuo-26M

Large-scale Chinese medical QA data that feeds medical LLM instruction tuning and evaluation.

Repository

RAG
RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

Teaches models how to condition answers on retrieved evidence across diverse tasks.

Repository

Multilingual
ApolloCorpus and PubMedVision

Domain corpora for multilingual medical modeling and medical visual-language supervision.

ApolloCorpus

Design Principles

  • Data releases should be paired with model checkpoints, paper context, and task definitions so they are usable rather than merely downloadable.
  • Retrieval-augmented instruction data should stress evidence use, citation-sensitive reasoning, and answer grounding.
  • Domain corpora should connect to benchmarks so downstream work can measure whether adaptation helped.
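
The principles above can be read as a checklist. The sketch below is one hypothetical way to encode it: a manifest records what ships with a dataset, `missing_pieces` reports which companion artifacts are absent, and `adaptation_gain` compares benchmark scores before and after adaptation. All names, fields, and scores are placeholders for illustration, not part of any released tooling or reported results.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReleaseManifest:
    """Hypothetical description of one data release and what ships with it."""
    dataset: str
    checkpoints: List[str] = field(default_factory=list)
    paper: str = ""
    task_definitions: List[str] = field(default_factory=list)
    benchmarks: List[str] = field(default_factory=list)


def missing_pieces(manifest: ReleaseManifest) -> List[str]:
    """List the companion artifacts a release still lacks."""
    checks = {
        "checkpoints": bool(manifest.checkpoints),
        "paper": bool(manifest.paper),
        "task_definitions": bool(manifest.task_definitions),
        "benchmarks": bool(manifest.benchmarks),
    }
    return [name for name, present in checks.items() if not present]


def adaptation_gain(base: Dict[str, float], adapted: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark score change, restricted to benchmarks both models were scored on."""
    shared = sorted(set(base) & set(adapted))
    return {name: round(adapted[name] - base[name], 4) for name in shared}


# Placeholder names and scores, not real releases or measurements.
manifest = ReleaseManifest(
    dataset="example-corpus",
    checkpoints=["example-model-3b"],
    benchmarks=["example-benchmark"],
)
gaps = missing_pieces(manifest)
gain = adaptation_gain({"example-benchmark": 0.50}, {"example-benchmark": 0.60})
print(gaps)
print(gain)
```

A release with an empty `gaps` list is paired with everything the first principle asks for, and a positive entry in `gain` is the kind of signal the third principle wants a benchmark to provide.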

Resource Map

RAG-Instruct model

A Llama3-based checkpoint trained with retrieval-augmented instruction data.

Model

Huatuo-26M

Large Chinese medical QA data used across medical LLM research.

Repository

ApolloCorpus

Multilingual medical corpus for the Apollo model family.

Dataset