LLM Interpretability and Sparse Autoencoders

Mechanistic interpretability

Sparse Autoencoders | Circuit Analysis | Model Unlearning | Fine-Tuning Mechanisms

ICLR 2026 Paper | EMNLP 2025 Paper | ICML 2025 Paper

This project studies how to make LLM behavior legible and controllable. It connects three threads: sparse autoencoder (SAE) feature analysis, SAE-guided subspace projections for unlearning, and circuit-level studies of what fine-tuning changes inside a model.

Research Storyline

Utility

Test whether interpretability predicts usefulness

The SAE utility work compares interpretability scores against actual steering utility and studies when interpretable features are not the most effective control features.
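As a rough illustration of what comparing interpretability scores against steering utility can look like, the sketch below computes a Kendall-style pairwise agreement between two score lists. This is an assumption made for illustration, not the paper's metric: the function name `pairwise_agreement` and all scores are made up.

```python
from itertools import combinations


def pairwise_agreement(interp, utility):
    """Kendall-style rank agreement in [-1, 1] between two score lists.

    +1 means every feature pair is ordered the same way by both scores,
    -1 means every pair is ordered oppositely. Tied pairs count as neither.
    """
    if len(interp) != len(utility):
        raise ValueError("score lists must have equal length")
    concordant = discordant = 0
    for i, j in combinations(range(len(interp)), 2):
        sign = (interp[i] - interp[j]) * (utility[i] - utility[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(interp) * (len(interp) - 1) // 2
    return (concordant - discordant) / total_pairs if total_pairs else 0.0


# Made-up scores for four hypothetical SAE features.
interp_scores = [0.9, 0.7, 0.5, 0.2]
utility_scores = [0.3, 0.8, 0.6, 0.1]
tau = pairwise_agreement(interp_scores, utility_scores)
print(f"pairwise agreement: {tau:.3f}")
```

A value well below 1 on real scores would indicate that ranking features by interpretability does not reproduce their ranking by steering utility.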

Control

Use interpretable features for model unlearning

SAE-guided subspace projections turn feature-level evidence into constrained parameter updates, aiming for precise and robust removal of unwanted knowledge.
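One generic way to constrain an update with a subspace is to remove the component of the update that lies inside the span of a set of directions. The sketch below does this with plain Gram-Schmidt. It is not the paper's procedure: treating the directions as SAE feature directions to protect, the names `retain_dirs` and `raw_update`, and the numbers are all assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(directions, tol=1e-10):
    """Gram-Schmidt over the given directions; drops near-dependent vectors."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:
            basis.append([x / norm for x in r])
    return basis


def project_onto(update, basis):
    """Component of `update` that lies inside the span of `basis`."""
    out = [0.0] * len(update)
    for b in basis:
        c = dot(update, b)
        out = [o + c * y for o, y in zip(out, b)]
    return out


def remove_component(update, basis):
    """Project `update` onto the orthogonal complement of span(basis)."""
    inside = project_onto(update, basis)
    return [u - p for u, p in zip(update, inside)]


# Hypothetical directions the update should leave untouched.
retain_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
raw_update = [2.0, -1.0, 3.0]

basis = orthonormal_basis(retain_dirs)
constrained = remove_component(raw_update, basis)
print(constrained)
```

The constrained update has zero dot product with every direction in `retain_dirs`, so a step along it does not move the parameters along those directions.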

Circuits

Explain how fine-tuning changes computation

Circuit analysis tracks edge changes, subtask composition, and circuit-aware LoRA allocation during fine-tuning.
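The idea of tracking edge changes and using them to allocate LoRA capacity can be sketched as a set comparison followed by a proportional rank split. This is a minimal sketch under stated assumptions, not the paper's implementation: the node naming scheme `L<layer>.<component>`, the helper names, the edge sets, and the rank budget are all hypothetical.

```python
from collections import Counter


def edge_changes(before, after):
    """Split circuit edges into added, removed, and kept sets."""
    return {
        "added": after - before,
        "removed": before - after,
        "kept": before & after,
    }


def layer_of(node):
    """Parse the layer index from a hypothetical name like 'L2.attn'."""
    return int(node.split(".")[0][1:])


def allocate_ranks(change_counts, total_rank, min_rank=1):
    """Give each layer min_rank, then split the rest by change count."""
    layers = sorted(change_counts)
    spare = total_rank - min_rank * len(layers)
    if spare < 0:
        raise ValueError("total_rank is too small for min_rank per layer")
    ranks = {layer: min_rank for layer in layers}
    total_changes = sum(change_counts.values())
    if total_changes == 0:
        return ranks
    for layer in layers:
        ranks[layer] += spare * change_counts[layer] // total_changes
    # Hand any rounding leftover to the most-changed layers first.
    leftover = total_rank - sum(ranks.values())
    for layer in sorted(layers, key=lambda l: -change_counts[l])[:leftover]:
        ranks[layer] += 1
    return ranks


# Made-up circuits before and after fine-tuning, as (source, target) edges.
before = {("L0.attn", "L1.mlp"), ("L1.mlp", "L2.attn"), ("L0.attn", "L2.attn")}
after = {
    ("L0.attn", "L1.mlp"), ("L0.attn", "L2.attn"),
    ("L1.attn", "L2.attn"), ("L0.mlp", "L2.attn"), ("L0.mlp", "L1.mlp"),
}

changes = edge_changes(before, after)
changed = changes["added"] | changes["removed"]
counts = Counter(layer_of(target) for _, target in changed)
ranks = allocate_ranks(dict(counts), total_rank=12, min_rank=2)
print(ranks)
```

Counting changes by target layer is one arbitrary choice here. Counting by source layer or weighting edges by importance would fit the same structure.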

Key Papers

ICLR 2026

Does Higher Interpretability Imply Better Utility? A Pairwise Analysis on Sparse Autoencoders

Xu Wang, Yan Hu, Benyou Wang, Difan Zou. Also received the Outstanding Paper Award at the NeurIPS 2025 ResponsibleFM Workshop.


EMNLP 2025

Model Unlearning via Sparse Autoencoder Subspace Guided Projections

Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou. Studies SAE-guided subspace projection unlearning for controllable and robust knowledge removal.


ICML 2025

Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou. Uses circuit analysis to understand fine-tuning dynamics and circuit-aware adaptation.


Why It Matters

Interpretability should be validated against downstream utility, not treated as a proxy by default.
SAE features can support controllable model editing when connected to parameter-space objectives.
Circuit-level views make fine-tuning more inspectable and can guide more efficient adaptation methods.