LongLLaVA and MileBench

Long-context multimodal AI

LongLLaVA, MileBench, Many-image reasoning, Video understanding, Efficient MLLMs

LongLLaVA asks how multimodal LLMs can reason over hundreds of images, up to roughly a thousand, without memory use or latency growing out of control. MileBench supplies the benchmark pressure: long-context visual QA, retrieval, counting, ordering, and video tasks that expose whether a model actually uses its long visual context.

Research Storyline

Benchmark

MileBench defines the pressure

Many-image and video tasks expose whether a multimodal model can retrieve, count, order, and reason across long visual contexts.

Model

LongLLaVA scales the context window

The model combines a hybrid architecture, data construction, and progressive training so that a single model can handle many images efficiently.

Efficiency

TRIM attacks token waste

Token reduction is the practical counterpart of long-context modeling: longer context only matters if inference remains affordable.
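
To make "token reduction" concrete, here is a minimal sketch that keeps only the visual tokens most similar to a text query embedding. It is a generic similarity-based top-k selection, not necessarily the procedure TRIM uses; the names `select_tokens` and `keep_ratio` and the toy vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tokens(visual_tokens, query, keep_ratio=0.25):
    """Return indices of the visual tokens to keep, in their original order.

    Hypothetical helper: ranks tokens by similarity to the query embedding
    and keeps the top `keep_ratio` fraction (at least one token).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(visual_tokens) * keep_ratio))
    ranked = sorted(
        range(len(visual_tokens)),
        key=lambda i: cosine(visual_tokens[i], query),
        reverse=True,
    )
    # Restore original order so positional structure is preserved.
    return sorted(ranked[:k])


# Toy data: four 2-d "visual tokens" and one "text query" embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept = select_tokens(tokens, query=[1.0, 0.0], keep_ratio=0.5)
```

With a keep ratio of 0.5, the language model would see half as many visual tokens for this image, which is where the inference savings come from.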

Agents

Phone and web agents need this capability

Real agents inspect many screenshots, browser states, UI panels, and videos; long-context multimodal reasoning becomes infrastructure for agentic workflows.

Technical Shape

Hybrid language backbone

The model combines Transformer attention layers with sequence-efficient (Mamba-style state-space) layers, targeting long visual contexts where full self-attention, whose cost grows quadratically with sequence length, scales poorly.
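
A back-of-the-envelope calculation shows why mixing layer types helps. The sketch below counts token interactions only, treating attention layers as quadratic and sequence-efficient layers as linear in sequence length, and ignores hidden size and constant factors. The tokens-per-image value and the layer counts are hypothetical numbers chosen for illustration, not LongLLaVA's published configuration.

```python
def sequence_length(num_images, tokens_per_image, text_tokens=0):
    """Total tokens the language backbone must process."""
    return num_images * tokens_per_image + text_tokens


def mixing_cost(n_tokens, attention_layers, linear_layers):
    """Rough token-interaction count: O(n^2) per attention layer,
    O(n) per sequence-efficient layer. Constant factors are ignored."""
    return attention_layers * n_tokens ** 2 + linear_layers * n_tokens


# Hypothetical setting: 1000 images at 144 visual tokens each.
n = sequence_length(num_images=1000, tokens_per_image=144)

# Hypothetical 32-layer backbones: all attention vs. 4 attention + 28 linear.
full_attention = mixing_cost(n, attention_layers=32, linear_layers=0)
hybrid = mixing_cost(n, attention_layers=4, linear_layers=28)

ratio = full_attention / hybrid
```

At this sequence length the quadratic term dominates, so under these assumptions the saving is set almost entirely by how many attention layers remain: `ratio` comes out just under 8, the ratio of 32 to 4 attention layers.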

Progressive multimodal training

Training moves from single-image alignment to multi-image and video-style instruction tuning so the model can retain both ordinary VQA ability and long-context behavior.
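
One way to picture a progressive schedule is as an ordered list of stages, each with a cap on images per sample and a data mixture. The sketch below is illustrative only: the stage names, image caps, and mixture weights are assumptions, and replaying single-image data in the last stage is shown as one plausible way to retain ordinary VQA ability, not as the paper's confirmed recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    max_images: int   # largest number of images allowed in one sample
    mixture: dict     # data source -> sampling weight (should sum to 1)


# Hypothetical curriculum: names and numbers are placeholders.
CURRICULUM = [
    Stage("single_image_alignment", 1, {"image_caption": 1.0}),
    Stage("single_image_instruction", 1, {"single_image_vqa": 1.0}),
    Stage(
        "multi_image_instruction",
        64,
        {"multi_image": 0.5, "video_frames": 0.3, "single_image_vqa": 0.2},
    ),
]


def validate(curriculum):
    """Check that weights sum to 1 and context length never shrinks."""
    previous_cap = 0
    for stage in curriculum:
        if abs(sum(stage.mixture.values()) - 1.0) > 1e-9:
            raise ValueError(f"{stage.name}: mixture weights must sum to 1")
        if stage.max_images < previous_cap:
            raise ValueError(f"{stage.name}: max_images must not decrease")
        previous_cap = stage.max_images
    return True


def examples_per_source(stage, budget):
    """Split a per-stage example budget according to the mixture weights."""
    return {source: round(budget * w) for source, w in stage.mixture.items()}


validate(CURRICULUM)
final_split = examples_per_source(CURRICULUM[-1], budget=1000)
```

Keeping a share of single-image data in the final stage is the design choice that corresponds to "retain both ordinary VQA ability and long-context behavior".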

Benchmark-driven iteration

MileBench makes progress measurable across long visual sequences, multi-image retrieval, temporal context, and mixed image-video reasoning tasks.
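
Measuring progress across task types usually comes down to per-task scores plus an aggregate. The sketch below assumes a simple record format and exact-match scoring; the field names and task labels are hypothetical, and MileBench's own metrics and data schema may differ.

```python
from collections import defaultdict


def score_by_task(records):
    """Return (per-task accuracy, macro-average accuracy).

    Each record is assumed to be a dict with "task", "prediction",
    and "answer" keys; matching is case-insensitive exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        hit = record["prediction"].strip().lower() == record["answer"].strip().lower()
        correct[task] += int(hit)
    per_task = {task: correct[task] / total[task] for task in total}
    macro = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return per_task, macro


# Toy predictions for two hypothetical task types.
records = [
    {"task": "retrieval", "prediction": "Image 7", "answer": "image 7"},
    {"task": "retrieval", "prediction": "image 2", "answer": "image 5"},
    {"task": "counting", "prediction": "3", "answer": "3"},
]
per_task, macro = score_by_task(records)
```

Reporting the macro average alongside per-task scores keeps a large task from hiding a weak result on a smaller one, such as ordering or counting.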

Display Figures

LongLLaVA hybrid architecture. The architecture story: scale multimodal context without letting attention cost dominate the whole model.

LongLLaVA progressive training. The training story: retain single-image ability while teaching the model to reason over many images and video frames.

PhoneHarness mobile agent architecture. Downstream agent settings, such as phone control, turn long visual context into an operational requirement.

Why It Matters

Real multimodal agents need to inspect many screenshots, frames, pages, or documents, not just answer about a single image.
Long visual context is a systems problem: architecture, data mixture, token reduction, and evaluation all have to co-evolve.
The LongLLaVA/MileBench pairing creates both a model direction and a public yardstick for future long-context MLLMs.

Paper Trail

Benchmark

MileBench: Benchmarking MLLMs in Long Context

Creates the public pressure test for many-image and video reasoning, making long-context claims measurable.

Dataset

Model

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Responds to the benchmark pressure with a hybrid architecture and a progressive multimodal training recipe.

Repository

Efficiency

TRIM and efficient multimodal context

Complements LongLLaVA by reducing redundant visual tokens so long-context systems can be deployed more cheaply.

TRIM

Resource Map

LongLLaVA

Model architecture, training recipe, and release material for long-context multimodal understanding.

Repository

MileBench dataset

Long-context multimodal benchmark data for evaluating image, video, and cross-context capabilities.

Dataset

Related efficiency line

TRIM and token-reduction work connect to the same goal: make multimodal context longer without making inference unusable.

TRIM