Preliminary Results on Building Graphs from SAEs
❗ This is a public draft. Content may change and is not complete.
TLDR:
- I use nodewise LASSO to estimate conditional dependence graphs over SAE features, with resampling and null controls
- Initial experiments produce graphs with small standalone modules that are stable under resampling
- These modules frequently correspond to coherent linguistic features, and are only weakly aligned with cosine similarity
Motivation
In practice, SAE features exhibit redundancy, visible in phenomena like feature splitting, absorption, and duplication. We can look at SAE features through the lens of cosine similarity between feature weights or correlation between activations; these capture some similarity between features, but they don’t model conditional dependence. I believe that if SAE features have redundancy or hierarchy, their activations should show sparse conditional dependence structure. If such structure exists, we should be able to model it with methods that build a precision graph between features.
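For intuition on why nodewise regression targets a precision graph: in the idealized Gaussian case (which, as the caveats below note, SAE activations do not satisfy), zeros in the precision matrix correspond exactly to conditional independences, and the coefficients from regressing one variable on the rest are proportional to a row of the precision matrix:

```latex
% Gaussian x with covariance \Sigma, precision \Omega = \Sigma^{-1}
\Omega_{ij} = 0 \;\Longleftrightarrow\; x_i \perp\!\!\!\perp x_j \mid x_{\setminus\{i,j\}},
\qquad
\beta^{(i)}_j = -\frac{\Omega_{ij}}{\Omega_{ii}}
```

This is why selecting nonzero LASSO coefficients in each nodewise regression (Meinshausen–Bühlmann-style neighborhood selection) approximates the edges of the conditional-dependence graph.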
Methodology
Run nodewise LASSO on activations from random sequences to approximate conditional linear dependence between SAE features, using repeat and null trials to control for dataset bias.
I’ve built a pipeline to test this theory. My approach is as follows:
- Sample activations from a SAE at a given layer on random sequences
- Pre-screen candidate neighbors for each SAE feature
- Calculate correlations between features over activations, convert to Fisher z-scores, and use BH-FDR to control false discovery
- Keep the top candidates per node
- For each SAE feature:
- Perform a parameter sweep to tune the LASSO regularization coefficient
- Run LASSO to identify conditionally dependent neighbors for this feature
- Merge edges from each nodewise sweep, keeping bidirectional edges only
- Repeat under random sampling to test stability - I do 30 resamples + a matched null trial (feature-wise shuffle) for each resample
- Identify edges that appear consistently across resamples and pass BH-FDR vs null trials
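A minimal sketch of the pre-screening step, assuming a dense activation matrix `acts` of shape `(n_samples, n_features)`; the `alpha` level and `top_k` cap here are illustrative placeholders, not the values used in the actual runs:

```python
import math
import numpy as np

def prescreen(acts, alpha=0.1, top_k=50):
    """Fisher-z correlation screen with BH-FDR, returning per-node
    candidate neighbor lists sorted by correlation strength."""
    n, d = acts.shape
    C = np.corrcoef(acts.T)                       # d x d correlation matrix
    np.fill_diagonal(C, 0.0)
    C = np.clip(C, -0.999999, 0.999999)
    z = np.arctanh(C) * math.sqrt(n - 3)          # Fisher z-scores
    # Two-sided normal p-values: p = erfc(|z| / sqrt(2))
    p = np.array([[math.erfc(abs(v) / math.sqrt(2)) for v in row] for row in z])
    # BH-FDR over the upper triangle: largest p_(k) with p_(k) <= alpha*k/m
    iu = np.triu_indices(d, k=1)
    pvals = p[iu]
    m = len(pvals)
    thresh = -1.0
    for rank, idx in enumerate(np.argsort(pvals), start=1):
        if pvals[idx] <= alpha * rank / m:
            thresh = pvals[idx]
    keep = p <= thresh
    candidates = {}
    for i in range(d):
        js = np.flatnonzero(keep[i])
        js = js[np.argsort(-np.abs(C[i, js]))][:top_k]   # strongest first
        candidates[i] = [int(j) for j in js if j != i]
    return candidates
```

For large feature counts the dense correlation matrix would be computed in blocks, but the statistical logic is the same.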
This procedure approximates a conditional-dependence graph, but is not an exact estimator.
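The per-feature regression and the bidirectional merge can be sketched as follows; `LassoCV` stands in for the manual regularization sweep, and the candidate lists are assumed to come from pre-screening:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def nodewise_neighbors(acts, node, candidates, alphas=np.logspace(-3, 0, 10)):
    """Regress one feature on its pre-screened candidates and return
    the candidates that receive nonzero LASSO coefficients."""
    X = acts[:, candidates]
    y = acts[:, node]
    model = LassoCV(alphas=alphas, cv=5).fit(X, y)
    return {candidates[i] for i in np.flatnonzero(model.coef_)}

def merge_bidirectional(neighbor_sets):
    """Keep edge (i, j) only if i selected j AND j selected i."""
    edges = set()
    for i, neigh in neighbor_sets.items():
        for j in neigh:
            if i < j and i in neighbor_sets.get(j, set()):
                edges.add((i, j))
    return edges
```

Requiring both directions is a conservative choice: it discards edges where the selection is an artifact of one regression's regularization path.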
Initial Results
Small scale trials identify linguistically coherent modules that are only weakly aligned with cosine similarity.
I’ve successfully run this approach on a single SAE layer for models from gpt2-small up to gemma2:27b. Here, I’m presenting results from gemma2:27b (Gemmascope, layer_10/width_131k) as the largest model I’ve tested so far. Sequences are randomly sampled from fineweb.
| Item | Value |
|---|---|
| Base Model | gemma2:27b |
| SAE Model | gemmascope, layer_10/width_131k |
| Dataset | fineweb |
| Activation samples per feature | 200 |
| Activation sparsity (mean across 30 trials) | 0.99897 |
| Average features per token (mean across 30 trials) | 133.7 |
For downstream analysis, I keep edges from the repeat trials using BH-FDR to compare the stability of each edge relative to the null trials, with 0.1 as the threshold. I find that this model + dataset pair meets the FDR threshold of 0.1 at k = 2 replicates, with 57,047 retained edges.
| Repeat Trial Threshold k | Observed Edges | Expected Null Edges | Estimated FDR |
|---|---|---|---|
| 1 | 204,609 | 196,273 | 9.59e-01 |
| 2 | 57,047 | 1,981 | 3.47e-02 |
| 3 | 31,429 | 110 | 3.50e-03 |
| 5 | 15,165 | 1 | 6.59e-05 |
| 10 | 5,286 | 0 | 0.00e+00 |
| 20 | 1,389 | 0 | 0.00e+00 |
| 30 | 302 | 0 | 0.00e+00 |
Note that for this preliminary trial I’m using only a single null resampling per trial.
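The Estimated FDR column is consistent with the simple ratio of expected null edges to observed edges at each threshold k, which can be reproduced directly:

```python
def estimated_fdr(observed_edges, expected_null_edges):
    """Empirical FDR estimate at an edge-stability threshold:
    the fraction of observed edges that the null trials predict."""
    if observed_edges == 0:
        return 0.0
    return min(1.0, expected_null_edges / observed_edges)
```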
Importantly, I see that cosine similarity and edge strength in the discovered graph are only weakly correlated, with high variance (the plot below is over the set of stable edges).
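A sketch of that comparison, assuming a decoder weight matrix `W_dec` whose rows are feature directions and a dict mapping stable edges to their strengths (both names are illustrative):

```python
import numpy as np

def cosine_vs_edge_strength(W_dec, edges):
    """Pearson correlation between decoder-direction cosine similarity
    and graph edge strength. `W_dec` is (n_features, d_model); `edges`
    maps (i, j) pairs to edge weights."""
    W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    pairs = list(edges)
    cos = np.array([float(W[i] @ W[j]) for i, j in pairs])
    strength = np.array([edges[p] for p in pairs])
    return float(np.corrcoef(cos, strength)[0, 1])
```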
For qualitative interpretation, I restrict manual inspection to the much smaller subset of stable edges. Within this set of 227 nodes and 302 edges, I end up with one large connected component (98 nodes), a handful of smaller standalone components (5–9 nodes each), and a large number of tiny components. The tiny components appear to mostly be duplicated features with strong cosine similarity, so I’ve filtered those elements out.
Small components
The smaller standalone components (5–9 nodes each) look like linguistically coherent features, mostly grammar- and context-related:
- Component 2 (9 nodes): possessive pronouns + some context.
- Component 3 (9 nodes): “to be” + some descriptive context
- Component 4 (7 nodes): “which/who/that” followed by verbs, plus explanatory context (e.g. we prove that, that maps, that means)
- Component 5 (5 nodes): “has been” either alone or followed by specific words
- Component 6 (5 nodes): negations (not, didn’t) + verb and article context
Large component
The large connected cluster contains communities that can be identified with community detection. Note that for the sake of legibility I don’t label nodes in the following graph of all clusters. My current read is that these communities are not as linguistically “clean” as the standalone components: many have a node that doesn’t fit the rest of the community’s “theme”.
For example:
- Cluster 8: proper names / place names / labels.
- Cluster 4: apostrophes, numbers/dates, citation/legal-style tokens.
- Cluster 7: “the + noun/category” determiner phrase patterns.
- Cluster 2/3: connective / clause / punctuation-ish syntax groupings.
We also get a set of catch-all miscellaneous communities as well as tiny clusters. These catch-all communities (Clusters 0 and 6) cover a variety of features across technical jargon, legal language, code and formatting, and other rare things like specific formal names. Many of the small communities are incoherent, but some show interesting structure; a notable example is a community describing uses of the number ‘2’ as a prefix in different contexts.
Note that these graphs are downstream of the correlation between activations in the pre-screening step. They should be interpreted as a sparse refinement of correlation structure rather than a standalone finding separate from correlation.
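The split into connected components used above can be sketched with a small union-find over the stable edge list; community detection (e.g. greedy modularity) would then run on the subgraph induced by the largest component:

```python
def connected_components(edges):
    """Union-find with path compression over an edge list;
    returns components as sets, largest first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    comps = {}
    for node in list(parent):
        comps.setdefault(find(node), set()).add(node)
    return sorted(comps.values(), key=len, reverse=True)
```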
Caveats and Limitations
There are several limitations to this approach that you should keep in mind:
- This isn’t finding a true precision graph, because SAE activations are not Gaussian
- The output graph is dependent on correlations from the pre-screening step.
- What I’m calling “stable” here means stability under resampling over the same dataset, which doesn’t say anything about underlying feature faithfulness.
- At the current FDR threshold, stability is an extreme constraint (302 edges over 131k nodes), and the standalone clusters seem to mostly be linguistic backbone.
- As implemented right now, this method has hyperparameters that need either more thorough sweeps or methodological changes to remove them.
Next steps
Right now, I see two primary directions this project needs to take:
- Refining some of the steps in the pipeline to (a) confidently use the full ~57k-edge repeat set (k = 2 above) and (b) remove some of the current design’s hyperparameter dependence
- Scaling this to all layers in a given model and figuring out how to connect graphs across layers accounting for the residual stream
Additionally, I’ll also be looking at:
- How different datasets may change the output graphs and how to control for that
- Where this might land in relation to feature splitting and absorption
- What, if any, hierarchical structure between features we might be able to infer from these graphs
Funding Note: This work is funded by a Coefficient Giving TAIS grant.