LLM Reasoning & Agentic RL

Verifiable reasoning, agentic reinforcement learning, policy optimization, path pruning, code-integrated thinking, and multimodal R1-style training.

Reasoning and agentic RL
Policy optimization, Agentic RL, Path pruning, Code reasoning, Video reasoning

This project organizes papers where reasoning is trained, optimized, pruned, or grounded in executable feedback. The scope covers policy optimization, role-aware agent RL, adaptive compute, code-integrated thinking, and multimodal R1-style learning.

Research Storyline

Optimize
Move from SFT to policy optimization

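OnePO and QFFT study more direct ways to adapt models: OnePO optimizes the policy without a separate SFT stage, and QFFT targets efficient, adaptive reasoning fine-tuning. Both reduce dependence on heavy multi-stage pipelines while preserving exploration.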

Act
Make RL objectives agent-aware

CRPO turns GRPO-style optimization toward role-playing agents, balancing task utility with persona fidelity and style consistency.

Spend
Allocate reasoning compute where it matters

STOP learns to prune doomed reasoning paths early, making parallel reasoning more accurate under a fixed compute budget.

Ground
Ground reasoning in code and perception

CoRT and Video-R1 extend reasoning feedback to executable computation and temporal multimodal understanding, respectively.

Paper Trail

RL
OnePO: Direct One-stage Policy Optimization for SFT-free Domain Adaptation

Directly optimizes policies for domain adaptation without a separate SFT stage.

Publication list

Role
CRPO: Character-centric Group Relative Policy Optimization

Adapts group-relative policy optimization to role-aware reasoning agents.

Paper

Adapt
QFFT: Question-Free Fine-Tuning

Studies efficient and adaptive reasoning fine-tuning for large language models.

Paper

Prune
STOP: Cut Your Losses!

Learns internal signals that prune low-value reasoning paths early during parallel reasoning.

Paper

Code
CoRT: Code-integrated Reasoning within Thinking

Lets models use executable computation as part of the reasoning process.

Paper

Video
Video-R1: Reinforcing Video Reasoning in MLLMs

Applies R1-style reinforcement learning to temporal video reasoning in multimodal LLMs.

Paper

Project Clusters

Policy optimization

OnePO, CRPO, and QFFT organize RL-style post-training around direct optimization, role fidelity, and efficient adaptation.
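
As a rough illustration of what "group-relative" means in GRPO-style training, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own group. The blended_reward helper, its weight parameter, and the sample scores are hypothetical stand-ins for a task-utility versus persona-fidelity trade-off; they are not taken from the CRPO paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def blended_reward(task_score, persona_score, weight=0.5):
    """Hypothetical mix of task utility and persona fidelity (an assumption)."""
    return (1.0 - weight) * task_score + weight * persona_score


# Four sampled responses to one prompt: (task score, persona score).
# The numbers are invented for illustration only.
samples = [(1.0, 0.2), (0.0, 0.9), (1.0, 0.8), (0.0, 0.1)]
rewards = [blended_reward(t, p) for t, p in samples]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.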

Adaptive compute

STOP and UPFT focus on spending less compute without losing reasoning quality, especially for long or parallel reasoning traces.
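
The idea of pruning parallel reasoning paths can be sketched as a checkpoint that scores partial paths and keeps only the most promising ones. This is a toy under stated assumptions: toy_score is a hypothetical placeholder for a learned pruning signal, and the "stuck" step label is invented for the example. It is not the scoring method used by STOP.

```python
def prune_paths(paths, score_fn, keep):
    """Keep the `keep` highest-scoring partial paths, preserving their order."""
    ranked = sorted(
        range(len(paths)), key=lambda i: score_fn(paths[i]), reverse=True
    )
    survivors = sorted(ranked[:keep])
    return [paths[i] for i in survivors]


def toy_score(path):
    """Hypothetical scorer: fraction of steps not labeled "stuck"."""
    return sum(1 for step in path if step != "stuck") / max(len(path), 1)


# Four partial reasoning paths sampled in parallel (invented data).
paths = [
    ["parse", "plan", "compute"],
    ["parse", "stuck", "stuck"],
    ["parse", "plan", "stuck"],
    ["stuck", "stuck", "stuck"],
]

# Continue only two of the four paths, freeing budget for the survivors.
kept = prune_paths(paths, toy_score, keep=2)
```

Pruning early means the tokens that would have been spent finishing low-scoring paths can go to the surviving ones instead.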

Tool and code grounding

CoRT connects reasoning to code execution, making intermediate calculations easier to verify and debug.
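
One way to picture code-grounded reasoning is a loop that runs code embedded in a reasoning trace and writes the result back into the trace. The sketch below assumes hypothetical <code> and <output> markers and runs snippets in a plain subprocess with no sandboxing. It illustrates the general pattern only, not CoRT's actual format or pipeline.

```python
import re
import subprocess
import sys

# Hypothetical marker for code embedded in a reasoning trace.
CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_snippet(source, timeout=5):
    """Run a Python snippet in a child interpreter and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        return "error: " + result.stderr.strip()
    return result.stdout.strip()


def ground_reasoning(trace):
    """Append each snippet's execution result right after the snippet."""

    def attach_output(match):
        output = run_snippet(match.group(1))
        return match.group(0) + "<output>" + output + "</output>"

    return CODE_TAG.sub(attach_output, trace)


trace = "The total is 17 * 23. <code>print(17 * 23)</code> So the answer follows."
grounded = ground_reasoning(trace)
```

Because the intermediate value comes from an interpreter rather than from the model, a reader can check it by rerunning the snippet. Untrusted model output would need an isolated execution environment, which this sketch omits.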

Multimodal reasoning RL

Video-R1 extends reinforcement learning for reasoning beyond text-only math into temporal video understanding.
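
R1-style training typically relies on rule-based rewards that can be checked automatically. The sketch below combines a format check with an exact-match answer check. The <think> and <answer> tags, the format_bonus value, and the example responses are assumptions for illustration; the page does not describe Video-R1's actual reward design.

```python
import re

# Assumed response template: reasoning in <think>, final choice in <answer>.
TEMPLATE = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def verifiable_reward(response, reference, format_bonus=0.5):
    """Return format_bonus for a well-formed response, plus 1.0 if correct."""
    match = TEMPLATE.match(response.strip())
    if match is None:
        # Malformed responses earn nothing, even if the answer appears in them.
        return 0.0
    answer = match.group(1).strip().lower()
    correct = 1.0 if answer == reference.strip().lower() else 0.0
    return format_bonus + correct


# Invented multiple-choice example about object motion in a clip.
good = "<think>The ball moves left to right across frames.</think><answer>B</answer>"
wrong = "<think>The ball looks stationary.</think><answer>A</answer>"
untagged = "The answer is B."

scores = [verifiable_reward(r, "B") for r in (good, wrong, untagged)]
```

Keeping the reward a pure function of the response and a reference answer is what makes it verifiable: no learned judge is involved.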

Resource Map

STOP / Cut Your Losses

Project page for early pruning in efficient parallel reasoning.

Project page

Video-R1

Code and datasets for R1-style video reasoning.

Repository

CoRT

Code-integrated reasoning resources.

Repository