LLM Interpretability and Sparse Autoencoders
Mechanistic interpretability studies on sparse autoencoders, controllable unlearning, and fine-tuning circuits in large language models.
This project studies how to make LLM behavior legible and controllable. It connects three threads: sparse autoencoder (SAE) feature analysis, SAE-guided subspace projections for unlearning, and circuit-level studies of what fine-tuning changes inside a model.
Research Storyline
The SAE utility work compares interpretability scores against actual steering utility and studies when interpretable features are not the most effective control features.
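One simple way to check whether interpretability tracks steering utility is a rank correlation across features. The sketch below is illustrative only and is not taken from the paper: the function names and both score lists are hypothetical, and Spearman correlation is just one reasonable choice of comparison.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-feature scores, made up for illustration.
interpretability = [0.91, 0.85, 0.40, 0.77, 0.30]
steering_utility = [0.20, 0.65, 0.70, 0.10, 0.55]

rho = spearman(interpretability, steering_utility)
print(f"Spearman correlation: {rho:.2f}")
```

A correlation near zero or below would indicate that selecting features by interpretability score alone does not pick the features that steer best.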
SAE-guided subspace projections turn feature-level evidence into constrained parameter updates, aiming for precise and robust removal of unwanted knowledge.
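To make the idea of a constrained update concrete, here is a minimal sketch of one possible construction, assuming the constraint is "do not move along directions tied to features that should be kept". The source does not specify how the projection is built, so `orthonormal_basis`, `project_out`, and the example vectors are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update, directions):
    """Remove from `update` its component inside span(directions)."""
    result = list(update)
    for b in orthonormal_basis(directions):
        c = dot(result, b)
        result = [r - c * bi for r, bi in zip(result, b)]
    return result


# Hypothetical: directions for features to preserve (for example,
# SAE decoder vectors) and a raw unlearning update in the same space.
retain_directions = [[1.0, 1.0, 0.0]]
raw_update = [1.0, 0.0, 0.0]

constrained = project_out(raw_update, retain_directions)
print(constrained)
```

After projection the update is orthogonal to every retained direction, so to first order it leaves those features unchanged while still moving along the remaining directions.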
Circuit analysis tracks edge changes, subtask composition, and circuit-aware LoRA allocation during fine-tuning.
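As an illustration of what circuit-aware allocation could look like, the sketch below splits a fixed LoRA rank budget across layers in proportion to a per-layer edge-change score. This is an assumed scheme, not the paper's method: `allocate_ranks`, the layer names, and the scores are hypothetical, and scores are assumed to be non-negative.

```python
def allocate_ranks(change_scores, total_rank, min_rank=1):
    """Split `total_rank` across layers in proportion to their scores.

    Every layer gets at least `min_rank`; the remainder is shared
    proportionally, with the largest-remainder rule handling rounding.
    """
    names = list(change_scores)
    spare = total_rank - min_rank * len(names)
    if spare < 0:
        raise ValueError("total_rank is too small for the requested min_rank")

    total = sum(change_scores.values())
    if total <= 0:
        # No signal: fall back to an even split.
        ideal = {n: spare / len(names) for n in names}
    else:
        ideal = {n: spare * change_scores[n] / total for n in names}

    ranks = {n: min_rank + int(ideal[n]) for n in names}
    leftover = total_rank - sum(ranks.values())
    by_remainder = sorted(
        names, key=lambda n: ideal[n] - int(ideal[n]), reverse=True
    )
    for n in by_remainder[:leftover]:
        ranks[n] += 1
    return ranks


# Hypothetical edge-change scores per layer, made up for illustration.
scores = {"layer_0": 0.5, "layer_1": 3.0, "layer_2": 1.5}
print(allocate_ranks(scores, total_rank=16))
```

Layers whose circuit edges change most during fine-tuning receive more adapter capacity, while the total rank stays fixed.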
Key Papers
Xu Wang, Yan Hu, Benyou Wang, Difan Zou. Also received the Outstanding Paper Award at the NeurIPS 2025 ResponsibleFM Workshop.
Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou. Studies SAE-guided subspace projection unlearning for controllable and robust knowledge removal.
Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou. Uses circuit analysis to understand fine-tuning dynamics and circuit-aware adaptation.
Why It Matters
- Interpretability should be validated against downstream utility, not treated as a proxy by default.
- SAE features can support controllable model editing when connected to parameter-space objectives.
- Circuit-level views make fine-tuning more inspectable and can guide more efficient adaptation methods.