Conditional Dependence Structure in Sparse Autoencoder Features

ICML 2026 Mechanistic Interpretability Workshop spotlight.
This work studies conditional dependence structure in sparse autoencoder (SAE) features by estimating sparse graphs over SAE activations.

Paper

PDF	OpenReview PDF
Review page	OpenReview forum
Related note	Earlier research note

Summary

Sparse autoencoder features are often redundant or related in ways that are not fully captured by decoder cosine similarity or raw activation correlation. This work asks whether SAE activations contain structured conditional dependence, and whether approximate dependence graphs can reveal coherent groups of features.
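
To make the gap between raw correlation and conditional dependence concrete, here is a minimal sketch on a hypothetical three-feature Gaussian chain (x0 drives x1, x1 drives x2). The chain, its coefficients, and the Gaussian assumption are illustrative only and are not taken from the paper. Real SAE activations are sparse and non-negative, so this is an idealized picture.

```python
import numpy as np

# Hypothetical linear chain x0 -> x1 -> x2 with unit-variance noise.
# B[i, j] is the weight of feature j in the equation for feature i.
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.8, 0.0]])

# x = B x + e  implies  x = (I - B)^{-1} e, so Cov(x) = M M^T with M = (I - B)^{-1}.
M = np.linalg.inv(np.eye(3) - B)
cov = M @ M.T

# Raw (marginal) correlation matrix.
d = np.sqrt(np.diag(cov))
corr = cov / np.outer(d, d)

# Partial correlations come from the precision (inverse covariance) matrix:
# partial_ij = -prec_ij / sqrt(prec_ii * prec_jj) for i != j.
prec = np.linalg.inv(cov)
s = np.sqrt(np.diag(prec))
partial = -prec / np.outer(s, s)
np.fill_diagonal(partial, 1.0)

print("raw correlation (0, 2):    ", round(float(corr[0, 2]), 3))
print("partial correlation (0, 2):", round(float(partial[0, 2]), 3))
```

In this toy chain, features 0 and 2 are clearly correlated, yet their partial correlation given feature 1 is zero up to floating-point error. A dependence graph would therefore link 0 to 1 and 1 to 2 but not 0 to 2, which is structure that a raw correlation matrix alone does not show.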

My approach uses nodewise LASSO, which regresses each feature on all the others and treats nonzero coefficients as candidate edges, together with resampling and null controls to build sparse feature graphs from randomly sampled activations. Initial results identify stable graph structure, including small linguistically coherent modules, suggesting that conditional dependence structure can be a useful lens on SAE feature organization.
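
The general recipe of nodewise LASSO with resampling and a null control can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation. The coordinate-descent solver, the AND rule for symmetrizing edges, the half-sample resampling, the column-permutation null, the penalty `alpha=0.25`, and the 0.9 stability threshold are all choices made here for the example. All function names are hypothetical. The toy data is Gaussian rather than real SAE activations.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=50):
    """Coordinate descent for (1/2n)||y - X b||^2 + alpha * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # residual y - X @ beta, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (beta_j removed).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta

def nodewise_graph(X, alpha):
    """Regress each standardized feature on all others; keep edges selected both ways."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    p = X.shape[1]
    A = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        A[j, others] = lasso_cd(X[:, others], X[:, j], alpha) != 0
    return A & A.T  # AND rule (an assumption; an OR rule is also common)

def edge_frequencies(X, alpha, n_resamples=20, frac=0.5, seed=0):
    """Fraction of random subsamples in which each edge is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((p, p))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        freq += nodewise_graph(X[idx], alpha)
    return freq / n_resamples

def permute_columns(X, seed=0):
    """Null control: shuffle each column independently, keeping marginals only."""
    rng = np.random.default_rng(seed)
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

# Toy data (not SAE activations): features 0 and 1 are coupled, 2..5 are independent.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
X[:, 1] = X[:, 0] + 0.5 * rng.normal(size=600)

freq = edge_frequencies(X, alpha=0.25)
freq_null = edge_frequencies(permute_columns(X), alpha=0.25)

# Keep edges that are stable across resamples and exceed anything seen under the null.
stable = freq > max(0.9, freq_null.max())
print(np.argwhere(np.triu(stable)))
```

Comparing selection frequencies against the permuted data gives a rough sense of how often an edge would appear by chance at the same penalty. In practice the penalty and thresholds would need tuning for the scale and sparsity of real activations.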