LLM Reasoning & Agentic RL

Topics: policy optimization, agentic RL, path pruning, code reasoning, video reasoning

This project organizes papers where reasoning is trained, optimized, pruned, or grounded in executable feedback. The scope covers policy optimization, role-aware agent RL, adaptive compute, code-integrated thinking, and multimodal R1-style learning.

Research Storyline

Optimize

Move from SFT to policy optimization

OnePO and QFFT (Question-Free Fine-Tuning) study more direct ways to adapt models, reducing dependence on heavy multi-stage pipelines while preserving exploration.

Act

Make RL objectives agent-aware

CRPO turns GRPO-style optimization toward role-playing agents, balancing task utility with persona fidelity and style consistency.
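As a rough illustration of the GRPO-style ingredient mentioned above, the sketch below normalizes rewards within a group of sampled responses and uses a weighted sum of task, persona, and style scores as the reward. The weighted-sum form, the weights, and the names `composite_reward` and `group_relative_advantages` are assumptions made for illustration; they are not taken from the CRPO paper.

```python
from statistics import mean, pstdev


def composite_reward(task, persona, style, weights=(1.0, 0.5, 0.5)):
    """Hypothetical reward: weighted sum of task utility, persona fidelity, and style."""
    w_task, w_persona, w_style = weights
    return w_task * task + w_persona * persona + w_style * style


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std would also be a valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt, scored on three axes (made-up numbers).
group = [
    {"task": 1.0, "persona": 0.9, "style": 0.8},
    {"task": 1.0, "persona": 0.2, "style": 0.3},
    {"task": 0.0, "persona": 0.9, "style": 0.9},
    {"task": 0.0, "persona": 0.1, "style": 0.2},
]
rewards = [composite_reward(**scores) for scores in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and the rest receive negative ones, so no separate value model is needed to provide a baseline.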

Spend

Allocate reasoning compute where it matters

STOP learns to prune doomed reasoning paths early, making parallel reasoning more accurate under a fixed compute budget.
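To make the compute argument concrete, here is a minimal sketch of early pruning over parallel reasoning paths. It assumes every path first generates a short prefix, a scorer then ranks the prefixes, and only the top few paths continue. The scorer here is a toy lookup standing in for a learned signal, and `Path`, `prune_paths`, and `tokens_spent` are hypothetical names, not STOP's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Path:
    path_id: int
    prefix: str  # partial reasoning generated so far


def prune_paths(paths: List[Path], score_fn: Callable[[Path], float], keep: int) -> List[Path]:
    """Keep the `keep` highest-scoring paths; the rest stop generating."""
    return sorted(paths, key=score_fn, reverse=True)[:keep]


def tokens_spent(n_paths: int, prefix_len: int, total_len: int, keep: int) -> int:
    """Tokens used when all paths run the prefix and only `keep` paths run to completion."""
    return n_paths * prefix_len + keep * (total_len - prefix_len)


# Toy scores standing in for a learned signal (made-up values).
toy_scores = {0: 0.10, 1: 0.85, 2: 0.40, 3: 0.92, 4: 0.05, 5: 0.33, 6: 0.61, 7: 0.20}
paths = [Path(path_id=i, prefix=f"partial reasoning {i}") for i in range(8)]

survivors = prune_paths(paths, score_fn=lambda p: toy_scores[p.path_id], keep=2)
pruned_cost = tokens_spent(n_paths=8, prefix_len=100, total_len=1000, keep=2)
full_cost = tokens_spent(n_paths=8, prefix_len=100, total_len=1000, keep=8)
```

With these illustrative numbers, pruning six of eight paths after 100 tokens costs 2,600 tokens instead of 8,000. The saved budget could instead be spent on sampling more initial paths, which is one way a fixed budget might yield higher accuracy.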

Ground

Ground reasoning in code and perception

CoRT and Video-R1 extend reasoning feedback to executable computation and temporal multimodal understanding.

Paper Trail

OnePO: Direct One-stage Policy Optimization for SFT-free Domain Adaptation

Directly optimizes policies for domain adaptation without a separate SFT stage.

Publication list

Role

CRPO: Character-centric Group Relative Policy Optimization

Adapts group-relative policy optimization to role-aware reasoning agents.

Paper

Adapt

QFFT: Question-Free Fine-Tuning

Studies efficient and adaptive reasoning fine-tuning for large language models.

Paper

Prune

Cut Your Losses! (STOP)

Learns internal signals that prune low-value reasoning paths early during parallel reasoning.

Paper

Code

CoRT: Code-integrated Reasoning within Thinking

Lets models use executable computation as part of the reasoning process.

Paper

Video

Video-R1: Reinforcing Video Reasoning in MLLMs

Applies R1-style reinforcement learning to temporal video reasoning in multimodal LLMs.

Paper

Project Clusters

Policy optimization

OnePO, CRPO, and QFFT organize RL-style post-training around direct optimization, role fidelity, and efficient adaptation.

Adaptive compute

STOP and UPFT focus on spending less compute without losing reasoning quality, especially for long or parallel reasoning traces.

Tool and code grounding

CoRT connects reasoning to code execution, making intermediate calculations easier to verify and debug.
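One way to picture code-grounded reasoning is a loop that finds code spans in a reasoning trace, executes them, and writes the result back into the trace so later steps can rely on it. The sketch below assumes a made-up `<code>...</code>` / `<output>...</output>` tag convention and hypothetical helper names; it is not CoRT's actual format or pipeline. A plain subprocess is also not a security sandbox, so untrusted model output would need real isolation.

```python
import re
import subprocess
import sys

# Hypothetical delimiter for code embedded in a reasoning trace.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_snippet(code: str, timeout_s: float = 5.0) -> str:
    """Run a Python snippet in a separate interpreter and return its stdout or an error tag."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "[timeout]"
    if proc.returncode != 0:
        return "[error] " + proc.stderr.strip()
    return proc.stdout.strip()


def ground_reasoning(trace: str) -> str:
    """Append an <output> span with the execution result after every <code> span."""

    def attach_output(match: re.Match) -> str:
        code = match.group(1).strip()
        return f"<code>{code}</code><output>{run_snippet(code)}</output>"

    return CODE_PATTERN.sub(attach_output, trace)


trace = "The sum of the first 100 squares is <code>print(sum(i * i for i in range(1, 101)))</code>."
grounded = ground_reasoning(trace)
```

Because each intermediate result comes from an interpreter rather than from the model's own arithmetic, a wrong step shows up as a mismatched or error-tagged output that can be checked directly.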

Multimodal reasoning RL

Video-R1 extends reinforcement learning for reasoning beyond text-only math into temporal video understanding.

Resource Map

STOP / Cut Your Losses

Project page for early pruning in efficient parallel reasoning.

Project page

Video-R1

Code and datasets for R1-style video reasoning.

Repository

CoRT

Code-integrated reasoning resources.

Repository
