LongLLaVA and MileBench

Long-context multimodal models and benchmarks for reasoning over many images and videos.

Long-context multimodal AI
LongLLaVA, MileBench, many-image reasoning, video understanding, efficient MLLMs
Figure: LongLLaVA architecture

LongLLaVA asks how multimodal LLMs can reason over hundreds of images, or even a thousand, without memory use or latency growing out of control. MileBench supplies the benchmark pressure: long-context visual QA, retrieval, counting, ordering, and video tasks that reveal whether a model actually uses its long visual context.

Research Storyline

Benchmark
MileBench defines the pressure

Many-image and video tasks expose whether a multimodal model can retrieve, count, order, and reason across long visual contexts.

Model
LongLLaVA scales the context window

The model combines a hybrid architecture, data construction, and progressive training so that a single model can handle many images efficiently.

Efficiency
TRIM attacks token waste

Token reduction is the practical counterpart of long-context modeling: longer context only matters if inference remains affordable.
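As a rough illustration of what token reduction can look like, the sketch below keeps only the visual tokens most similar to a text query embedding. It is a generic top-k selection written for this page, not a reproduction of the TRIM procedure; the function names, the toy vectors, and the `keep` budget are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_tokens(visual_tokens, query, keep):
    """Return indices of the `keep` tokens most similar to `query`, in original order."""
    scored = [(cosine(tok, query), i) for i, tok in enumerate(visual_tokens)]
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:keep]
    # Restore positional order so the kept tokens still read as a sequence.
    return sorted(i for _, i in top)

# Toy example: four 2-d "visual tokens" and a query pointing along the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept = select_tokens(tokens, query=[1.0, 0.0], keep=2)
reduced = [tokens[i] for i in kept]
```

Halving the visual tokens in this way halves the sequence length the language model has to process for each image, which is where the inference savings would come from.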

Agents
Phone and web agents need this capability

Real agents inspect many screenshots, browser states, UI panels, and videos; long-context multimodal reasoning becomes infrastructure for agentic workflows.

Technical Shape

Hybrid language backbone

The model combines Transformer-style reasoning with sequence-efficient components, targeting long visual contexts where the full attention used by ordinary MLLMs scales poorly.
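To make the scaling argument concrete, here is a back-of-envelope sketch of key/value cache size. It assumes that attention layers cache keys and values for every token, while sequence-efficient layers keep a fixed-size state that is small enough to ignore here. Every number below (layer counts, head sizes, tokens per image) is an illustrative assumption, not LongLLaVA's actual configuration.

```python
def kv_cache_bytes(num_tokens, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values across all attention layers."""
    # Factor 2: one cached tensor for keys and one for values.
    return 2 * attn_layers * kv_heads * head_dim * num_tokens * bytes_per_value

# Hypothetical workload: 1000 images at an assumed 144 visual tokens each.
images = 1000
tokens_per_image = 144
num_tokens = images * tokens_per_image

# Hypothetical stacks: every layer is attention vs. only a few attention layers.
full = kv_cache_bytes(num_tokens, attn_layers=32, kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(num_tokens, attn_layers=4, kv_heads=8, head_dim=128)

print(f"all-attention stack: {full / 2**30:.2f} GiB")
print(f"hybrid stack:        {hybrid / 2**30:.2f} GiB")
```

Under these assumptions the cache shrinks in proportion to the number of attention layers that remain, which is the basic motivation for replacing most of them with sequence-efficient layers.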

Progressive multimodal training

Training moves from single-image alignment to multi-image and video-style instruction tuning so the model can retain both ordinary VQA ability and long-context behavior.
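One way to picture such a curriculum is as a list of stages, each with its own data mixture, where later stages keep some single-image data in the mix so earlier skills are not forgotten. The sketch below is a hypothetical schedule: the stage names, data sources, and weights are invented for illustration and are not the published LongLLaVA recipe.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    mixture: dict  # data source -> integer sampling weight (hypothetical values)

STAGES = [
    Stage("single_image_alignment", {"single_image": 1}),
    Stage("multi_image_instruction", {"single_image": 3, "multi_image": 5, "video": 2}),
]

def batch_composition(stage, batch_size):
    """Split a batch across data sources in proportion to the stage's weights."""
    total = sum(stage.mixture.values())
    raw = {k: batch_size * w / total for k, w in stage.mixture.items()}
    counts = {k: int(v) for k, v in raw.items()}
    # Hand out any leftover slots to the sources with the largest fractional parts.
    leftover = batch_size - sum(counts.values())
    by_fraction = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts

for stage in STAGES:
    print(stage.name, batch_composition(stage, batch_size=10))
```

Keeping a nonzero single-image weight in the later stage is the part that corresponds to retaining ordinary VQA ability while long-context behavior is being learned.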

Benchmark-driven iteration

MileBench makes progress measurable across long visual sequences, multi-image retrieval, temporal context, and mixed image-video reasoning tasks.
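A minimal sketch of how results on such a benchmark might be summarized: score each task category separately, then take an unweighted mean so that large categories do not dominate. The record fields (`task`, `prediction`, `answer`) and the exact-match scoring are assumptions made for this example, not MileBench's official data format or metric.

```python
from collections import defaultdict

def per_task_accuracy(records):
    """Exact-match accuracy per task category (case and whitespace insensitive)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            hits[r["task"]] += 1
    return {task: hits[task] / totals[task] for task in totals}

def macro_average(scores):
    """Unweighted mean over task categories."""
    return sum(scores.values()) / len(scores)

# Hypothetical model outputs on three examples.
records = [
    {"task": "retrieval", "prediction": "B", "answer": "B"},
    {"task": "retrieval", "prediction": "A", "answer": "C"},
    {"task": "counting", "prediction": " 4 ", "answer": "4"},
]
scores = per_task_accuracy(records)
overall = macro_average(scores)
```

Reporting the per-category scores alongside the average makes it visible when a model does well on, say, retrieval but fails at counting or ordering over long sequences.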

Figure: LongLLaVA training strategy

Why It Matters

  • Real multimodal agents need to inspect many screenshots, frames, pages, or documents, not just answer about a single image.
  • Long visual context is a systems problem: architecture, data mixture, token reduction, and evaluation all have to co-evolve.
  • The LongLLaVA/MileBench pairing creates both a model direction and a public yardstick for future long-context MLLMs.

Paper Trail

Benchmark
MileBench: Benchmarking MLLMs in Long Context

Creates the public pressure test for many-image and video reasoning, making long-context claims measurable.

Dataset

Model
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Responds to the benchmark pressure with a hybrid architecture and a progressive multimodal training recipe.

Repository

Efficiency
TRIM and efficient multimodal context

Complements LongLLaVA by reducing redundant visual tokens so long-context systems can be deployed more cheaply.

TRIM

Resource Map

LongLLaVA

Model architecture, training recipe, and release material for long-context multimodal understanding.

Repository

MileBench dataset

Long-context multimodal benchmark data for evaluating image, video, and cross-context capabilities.

Dataset

Related efficiency line

TRIM and token-reduction work connect to the same goal: make multimodal context longer without making inference unusable.

TRIM