Conditional Dependence Structure in Sparse Autoencoder Features
ICML 2026 Mechanistic Interpretability Workshop spotlight.
This work studies conditional dependence structure in sparse autoencoder (SAE) features by estimating sparse graphs over SAE activations.
Paper
| OpenReview PDF | |
| Review page | OpenReview forum |
| Related note | Earlier research note |
Summary
Sparse autoencoder features are often redundant or related in ways that are not fully captured by decoder cosine similarity or raw activation correlation. This work asks whether SAE activations contain structured conditional dependence, and whether approximate dependence graphs can reveal coherent groups of features.
My approach uses nodewise LASSO with resampling and null controls to build sparse feature graphs from randomly sampled activations. Initial results identify stable graph structure, including small, linguistically coherent modules, suggesting that dependence structure can be a useful lens on SAE feature organization.
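As a rough illustration of the general technique (not the paper's actual pipeline), the sketch below runs nodewise LASSO on a toy activation matrix, counts how often each edge is selected across random subsamples, and repeats the procedure on a column-permuted copy as a null control. Everything specific here is an assumption made for the example: the helper names (`lasso_cd`, `nodewise_support`, `edge_stability`, `permuted_null`), the penalty `alpha`, the OR rule for symmetrizing edges, the subsample fraction, and the permutation null itself.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate-descent LASSO for (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # residual y - X @ beta, kept up to date incrementally
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            if new != beta[j]:
                resid += X[:, j] * (beta[j] - new)
                beta[j] = new
    return beta

def nodewise_support(A, alpha):
    """Regress each feature on all others; keep edge i-j if either fit selects it."""
    Z = (A - A.mean(axis=0)) / (A.std(axis=0) + 1e-12)
    p = Z.shape[1]
    S = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        S[j, others] = lasso_cd(Z[:, others], Z[:, j], alpha) != 0
    return S | S.T  # OR rule (an assumption; the AND rule is a common alternative)

def edge_stability(A, alpha, n_resamples=20, frac=0.5, seed=0):
    """Fraction of random subsamples in which each edge is selected."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    freq = np.zeros((p, p))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        freq += nodewise_support(A[idx], alpha)
    return freq / n_resamples

def permuted_null(A, seed=0):
    """Shuffle each column independently: marginals kept, dependence destroyed."""
    rng = np.random.default_rng(seed)
    return np.column_stack([rng.permutation(A[:, j]) for j in range(A.shape[1])])

# Toy "activations": 6 nonnegative features, with features 0 and 1 made dependent.
rng = np.random.default_rng(0)
base = rng.normal(size=(400, 6))
base[:, 1] = base[:, 0] + 0.5 * rng.normal(size=400)
acts = np.maximum(base, 0.0)

stab = edge_stability(acts, alpha=0.2)
null_stab = edge_stability(permuted_null(acts, seed=1), alpha=0.2)
print("selection frequency, edge 0-1 (real):", stab[0, 1])
print("selection frequency, edge 0-1 (null):", null_stab[0, 1])
```

In a sketch like this, an edge would be kept only if its selection frequency on the real data clearly exceeds what the permuted copy produces. The threshold and the choice of `alpha` are the main design decisions, and nothing on this page fixes them.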