LongLLaVA and MileBench
Long-context multimodal models and benchmarks for reasoning over many images and videos.
LongLLaVA asks how multimodal LLMs can reason over hundreds or thousands of images without exploding memory or latency. MileBench provides the benchmark pressure: long-context visual QA, retrieval, counting, ordering, and video tasks that expose whether a model truly uses long visual context.
Research Storyline
Many-image and video tasks expose whether a multimodal model can retrieve, count, order, and reason across long visual contexts.
The model combines hybrid architecture, data construction, and progressive training so a single model can handle many images efficiently.
Token reduction is the practical counterpart of long-context modeling: longer context only matters if inference remains affordable.
Real agents inspect many screenshots, browser states, UI panels, and videos; long-context multimodal reasoning becomes infrastructure for agentic workflows.
Technical Shape
The model combines Transformer-style reasoning with sequence-efficient components, targeting long visual contexts that ordinary MLLM attention struggles to scale.
Training moves from single-image alignment to multi-image and video-style instruction tuning so the model can retain both ordinary VQA ability and long-context behavior.
MileBench makes progress measurable across long visual sequences, multi-image retrieval, temporal context, and mixed image-video reasoning tasks.
Display Figures
Why It Matters
- Real multimodal agents need to inspect many screenshots, frames, pages, or documents, not just answer about a single image.
- Long visual context is a systems problem: architecture, data mixture, token reduction, and evaluation all have to co-evolve.
- The LongLLaVA/MileBench pairing creates both a model direction and a public yardstick for future long-context MLLMs.
Paper Trail
Creates the public pressure test for many-image and video reasoning, making long-context claims measurable.
DatasetResponds to the benchmark pressure with a hybrid architecture and progressive multimodal training recipe.
RepositoryComplements LongLLaVA by reducing redundant visual tokens so long-context systems can be deployed more cheaply.
TRIMResource Map
Model architecture, training recipe, and release material for long-context multimodal understanding.
RepositoryLong-context multimodal benchmark data for evaluating image, video, and cross-context capabilities.
DatasetTRIM and token-reduction work connect to the same goal: make multimodal context longer without making inference unusable.
TRIM