LLM Interpretability and Sparse Autoencoders

Mechanistic interpretability studies on sparse autoencoders, controllable unlearning, and fine-tuning circuits in large language models.

Mechanistic interpretability
Sparse Autoencoders, Circuit Analysis, Model Unlearning, Fine-Tuning Mechanisms

This project studies how to make LLM behavior legible and controllable. This line of work connects three threads: sparse autoencoder (SAE) feature analysis, SAE-guided subspace projections for unlearning, and circuit-level studies of what fine-tuning changes inside a model.

Research Storyline

Utility
Test whether interpretability predicts usefulness

The SAE utility work compares feature interpretability scores against measured steering utility and studies the cases where the most interpretable features are not the most effective features for control.
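As a rough illustration of what a pairwise comparison between interpretability and utility can look like, the sketch below computes Kendall's tau-a over all feature pairs: the fraction of pairs ranked the same way by both scores minus the fraction ranked oppositely. The function name, the score lists, and the choice of tau-a are assumptions made for this example; the paper's actual metric and scoring pipeline may differ.

```python
from itertools import combinations


def pairwise_agreement(interp_scores, utility_scores):
    """Kendall's tau-a between two score lists over the same features.

    Returns a value in [-1, 1]: 1 means every pair of features is ordered
    the same way by both scores, -1 means every pair is ordered oppositely.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(interp_scores) != len(utility_scores):
        raise ValueError("score lists must have the same length")
    n = len(interp_scores)
    if n < 2:
        raise ValueError("need at least two features to form a pair")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (interp_scores[i] - interp_scores[j]) * (
            utility_scores[i] - utility_scores[j]
        )
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four SAE features (illustrative values only).
interp = [0.9, 0.8, 0.4, 0.2]
utility = [0.3, 0.7, 0.6, 0.1]
tau = pairwise_agreement(interp, utility)
```

A value near 1 would suggest that interpretability ranks features much as steering utility does; a value near 0 would suggest the two scores carry largely independent information.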

Control
Use interpretable features for model unlearning

SAE-guided subspace projections turn feature-level evidence into constrained parameter updates, aiming for precise and robust removal of unwanted knowledge.
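One way to picture a constrained update of this kind is sketched below: a raw update vector is projected onto the subspace spanned by "forget" feature directions, after any component along "retain" feature directions has been removed from those directions. This is a minimal sketch under assumptions, not the paper's algorithm; the function names, the use of Gram-Schmidt, and the idea that feature directions come from SAE decoder vectors are all illustrative choices.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_onto(vec, basis):
    """Orthogonal projection of `vec` onto the span of an orthonormal basis."""
    out = [0.0] * len(vec)
    for b in basis:
        coeff = dot(vec, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out


def constrained_update(raw_update, forget_dirs, retain_dirs):
    """Keep only the part of `raw_update` that lies in the forget subspace
    and is orthogonal to every retain direction (hypothetical helper)."""
    retain_basis = orthonormal_basis(retain_dirs)
    cleaned = []
    for d in forget_dirs:
        overlap = project_onto(d, retain_basis)
        cleaned.append([di - oi for di, oi in zip(d, overlap)])
    forget_basis = orthonormal_basis(cleaned)
    return project_onto(raw_update, forget_basis)


# Toy 3-dimensional example with made-up directions.
forget_dirs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
retain_dirs = [[0.0, 1.0, 0.0]]
raw_update = [2.0, 3.0, 4.0]
update = constrained_update(raw_update, forget_dirs, retain_dirs)
```

In this toy case the component of the update along the retain direction is dropped while the components along the cleaned forget directions are kept, which is the intuition behind restricting an unlearning step to a feature-defined subspace.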

Circuits
Explain how fine-tuning changes computation

Circuit analysis tracks edge changes, subtask composition, and circuit-aware LoRA allocation during fine-tuning.
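To make "tracking edge changes" and "circuit-aware LoRA allocation" concrete, the sketch below compares per-edge importance scores before and after fine-tuning, sums the absolute change per destination layer, and splits a fixed LoRA rank budget across layers in proportion to that change. Everything here is a hypothetical illustration: the edge representation, the node names, attributing an edge's change to its destination layer, and the proportional rule are assumptions, not the method from the paper.

```python
def edge_changes(before, after):
    """Absolute change in importance score for every edge seen in either
    circuit. Edges missing from one side are treated as score 0.0."""
    edges = set(before) | set(after)
    return {e: abs(after.get(e, 0.0) - before.get(e, 0.0)) for e in edges}


def change_per_layer(changes, layers):
    """Sum edge changes by destination node (a simplifying assumption)."""
    totals = {layer: 0.0 for layer in layers}
    for (_src, dst), delta in changes.items():
        if dst in totals:
            totals[dst] += delta
    return totals


def allocate_ranks(layer_totals, total_rank):
    """Split `total_rank` across layers proportionally to their change,
    using largest-remainder rounding so the ranks sum to `total_rank`."""
    layers = sorted(layer_totals)
    total = sum(layer_totals.values())
    if total == 0:
        # No measured change: fall back to equal shares.
        shares = {l: total_rank / len(layers) for l in layers}
    else:
        shares = {l: total_rank * layer_totals[l] / total for l in layers}
    ranks = {l: int(shares[l]) for l in layers}
    leftover = total_rank - sum(ranks.values())
    by_remainder = sorted(layers, key=lambda l: shares[l] - ranks[l], reverse=True)
    for l in by_remainder[:leftover]:
        ranks[l] += 1
    return ranks


# Made-up edge scores for a three-layer toy circuit.
before = {("emb", "L0"): 0.5, ("L0", "L1"): 0.25, ("L1", "L2"): 0.5}
after = {("emb", "L0"): 0.5, ("L0", "L1"): 1.0, ("L1", "L2"): 0.75, ("L0", "L2"): 1.0}

changes = edge_changes(before, after)
totals = change_per_layer(changes, layers=["L0", "L1", "L2"])
ranks = allocate_ranks(totals, total_rank=16)
```

A layer whose incoming edges did not change receives rank 0 under this rule; a practical variant would likely enforce a minimum rank per layer.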

Key Papers

ICLR 2026
Does Higher Interpretability Imply Better Utility? A Pairwise Analysis on Sparse Autoencoders

Xu Wang, Yan Hu, Benyou Wang, Difan Zou. The paper also received the Outstanding Paper Award at the NeurIPS 2025 ResponsibleFM Workshop.

EMNLP 2025
Model Unlearning via Sparse Autoencoder Subspace Guided Projections

Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou. Studies SAE-guided subspace projection unlearning for controllable and robust knowledge removal.

ICML 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou. Uses circuit analysis to understand fine-tuning dynamics and circuit-aware adaptation.


Why It Matters

  • Interpretability should be validated against downstream utility, not assumed to be a proxy for it.
  • SAE features can support controllable model editing when connected to parameter-space objectives.
  • Circuit-level views make fine-tuning more inspectable and can guide more efficient adaptation methods.