Preliminary Results on Building Graphs from SAEs

Author: Zach Maas

❗ This is a public draft. Content may change and is not complete.

TLDR:

  • I use nodewise LASSO to estimate conditional dependence graphs over SAE features, with resampling and null controls
  • Initial experiments produce graphs with small standalone modules that are stable under resampling
  • These modules frequently correspond to coherent linguistic features, and are only weakly aligned with cosine similarity

Motivation

In practice, SAE features exhibit redundancy, visible in phenomena like feature absorption and duplication. We can look at SAE features through the lens of cosine similarity between feature weights and correlation between activations. These capture some similarity between features, but don’t model conditional dependence between features. I believe that if SAE features have redundancy or hierarchy, their activations should show sparse conditional dependence structure. If such structure exists, we should be able to model it with methods that build a precision graph between features.

Methodology

Run nodewise LASSO on random activations to approximate linear dependence between SAE features, using repeat and null trials to control against dataset bias.

A diagram of the graph SAE method

I’ve built a pipeline to test this theory. My approach is as follows:

  1. Sample activations from a SAE at a given layer on random sequences
  2. Pre-screen candidate neighbors for each SAE feature
    1. Calculate correlations between features over activations, convert to Fisher z-scores, and use BH-FDR to control false discovery
    2. Keep the top k candidates per node
  3. For each SAE feature:
    1. Perform a parameter sweep to tune the LASSO λ regularization coefficient
    2. Run LASSO to identify conditionally dependent neighbors for this feature
  4. Merge edges from each nodewise sweep, keeping bidirectional edges only
  5. Repeat under random sampling to test stability - I do 30 resamples + a matched null trial (feature-wise shuffle) for each resample
  6. Identify edges that appear consistently across resamples and pass BH-FDR vs null trials

This procedure approximates a conditional-dependence graph, but is not an exact estimator.
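As a concrete illustration, steps 2–4 might look like the sketch below. The function names, defaults, and the assumption that `acts` is an `(n_samples, n_features)` activation matrix are mine, not the actual pipeline’s API; treat this as one possible implementation of the pre-screen + nodewise-LASSO idea, not the definitive one.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def prescreen(acts, k=50, q=0.05):
    """BH-FDR-filtered top-k correlation candidates per feature."""
    n, d = acts.shape
    r = np.corrcoef(acts, rowvar=False)
    np.fill_diagonal(r, 0.0)
    # Fisher z-transform; z ~ N(0, 1/(n-3)) under the null of zero correlation
    z = np.arctanh(np.clip(r, -0.999999, 0.999999)) * np.sqrt(n - 3)
    p = 2 * stats.norm.sf(np.abs(z))
    # Benjamini-Hochberg over the upper triangle of the p-value matrix
    flat = p[np.triu_indices(d, k=1)]
    order = np.argsort(flat)
    m = flat.size
    passed = flat[order] <= q * (np.arange(1, m + 1) / m)
    cutoff = flat[order][passed].max() if passed.any() else -1.0
    mask = p <= cutoff
    # keep at most the k strongest surviving candidates per node
    cands = {}
    for j in range(d):
        idx = np.where(mask[j])[0]
        cands[j] = idx[np.argsort(-np.abs(r[j, idx]))][:k]
    return cands

def nodewise_lasso(acts, cands):
    """Regress each feature on its candidates; merge with the AND rule."""
    d = acts.shape[1]
    support = {}
    for j in range(d):
        idx = cands[j]
        if len(idx) == 0:
            support[j] = set()
            continue
        model = LassoCV(cv=3).fit(acts[:, idx], acts[:, j])
        support[j] = {idx[i] for i in np.flatnonzero(model.coef_)}
    # keep an edge only if each endpoint selects the other (bidirectional)
    return {(i, j) for i in range(d) for j in support[i]
            if i < j and i in support[j]}
```

The AND rule at the end corresponds to step 4 above: an edge survives only if both nodewise regressions select it.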

Initial Results

Small scale trials identify linguistically coherent modules that are only weakly aligned with cosine similarity.

I’ve successfully run this approach on a single SAE layer for models from gpt2-small up to gemma2:27b. Here, I’m presenting results from gemma2:27b (Gemmascope, layer_10/width_131k) as the largest model I’ve tested so far. Sequences are randomly sampled from fineweb.

| Item | Value |
| --- | --- |
| Base Model | gemma2:27b |
| SAE Model | gemmascope, layer_10/width_131k |
| Dataset | fineweb |
| Activation samples per feature | 200 |
| Activation sparsity (mean across 30 trials) | 0.99897 |
| Average features per token (mean across 30 trials) | 133.7 |

For downstream analysis, I keep edges from the repeat trials using BH-FDR to compare the stability of each edge relative to the null trials, with q = 0.1 as the threshold. I find that this model + dataset pair meets the FDR threshold of 0.1 at k* = 2 replicates, with 57,047 retained edges.

| Repeat Trial Threshold k | Observed Edges | Expected Null Edges | Estimated FDR |
| --- | --- | --- | --- |
| 1 | 204,609 | 196,273 | 9.59e-01 |
| 2 | 57,047 | 1,981 | 3.47e-02 |
| 3 | 31,429 | 110 | 3.50e-03 |
| 5 | 15,165 | 1 | 6.59e-05 |
| 10 | 5,286 | 0 | 0.00e+00 |
| 20 | 1,389 | 0 | 0.00e+00 |
| 30 | 302 | 0 | 0.00e+00 |

Survival curve of edges over 30 trials vs null expectation

Note that for this preliminary trial I’m using only a single null resampling per trial.
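The FDR estimates in the table follow directly from the survival counts: the estimate at threshold k is the expected number of null edges surviving ≥ k replicates divided by the number of observed edges surviving ≥ k replicates. A minimal sketch using the table’s numbers:

```python
def estimated_fdr(observed, null_expected):
    """Empirical FDR per threshold: expected null survivors / observed survivors."""
    return {k: null_expected[k] / observed[k] if observed[k] else 0.0
            for k in observed}

# survival counts from the table of repeat-trial thresholds
observed = {1: 204_609, 2: 57_047, 3: 31_429, 5: 15_165, 10: 5_286, 20: 1_389, 30: 302}
null_expected = {1: 196_273, 2: 1_981, 3: 110, 5: 1, 10: 0, 20: 0, 30: 0}

fdr = estimated_fdr(observed, null_expected)
# smallest threshold meeting the q = 0.1 target
k_star = min(k for k, v in fdr.items() if v <= 0.1)  # k* = 2
```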

Importantly, I see that cosine similarity and edge strength in the discovered graph are weakly correlated with high variance (see below; this is the n = 30 set of stable edges).

Graph showing cosine similarity vs LASSO strength between nodes
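This comparison can be computed by taking, for each retained edge, the cosine similarity between the two features’ SAE decoder directions and correlating it with the edge’s LASSO strength. In the sketch below, `W_dec` (a `d_features × d_model` decoder matrix) and `edges` (a dict mapping `(i, j)` pairs to strengths) are assumed inputs, not the pipeline’s actual API:

```python
import numpy as np

def cosine_vs_strength(W_dec, edges):
    """Per-edge decoder cosine similarity vs. LASSO edge strength."""
    W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    pairs = np.array(list(edges.keys()))
    # row-wise dot product of unit decoder vectors = cosine similarity
    cos = np.einsum("ij,ij->i", W[pairs[:, 0]], W[pairs[:, 1]])
    strength = np.array(list(edges.values()))
    r = np.corrcoef(cos, strength)[0, 1]  # Pearson correlation across edges
    return cos, strength, r
```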

For qualitative interpretation here, I restrict manual inspection to the much smaller k = 30 subset of stable edges, which covers a small subset of nodes. Within this set of 227 nodes and 302 edges, I end up with one large connected component (98 nodes), a handful of smaller standalone components (5–10 nodes each), and a large number of tiny components. The tiny components (n < 5) appear to mostly be duplicated features with strong cosine similarity, so I’ve filtered them out.
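The component breakdown and the n < 5 filter can be reproduced with a standard graph library; a minimal sketch with networkx, where `edges` is an assumed iterable of `(i, j)` feature-index pairs:

```python
import networkx as nx

def component_summary(edges, min_size=5):
    """Connected components of the stable-edge graph, dropping tiny ones."""
    G = nx.Graph()
    G.add_edges_from(edges)
    comps = sorted(nx.connected_components(G), key=len, reverse=True)
    # drop tiny (n < min_size) components, which tend to be duplicate features
    return [c for c in comps if len(c) >= min_size]
```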

Small components

The smaller standalone components with n > 4 features look like linguistically coherent features (mostly grammar- and context-related):

  • Component 2 (9 nodes): possessive pronouns + some context. Component 002
  • Component 3 (9 nodes): “to be” + some descriptive context. Component 003
  • Component 4 (7 nodes): “which/who/that” followed by verbs, plus explanatory context (e.g. we prove that, that maps, that means). Component 004
  • Component 5 (5 nodes): “has been”, either alone or followed by specific words. Component 005
  • Component 6 (5 nodes): negations (not, didn’t) + verb and article context. Component 006

Large component

The large connected cluster contains communities that can be identified with community detection. Note that for the sake of legibility I don’t label nodes in the following graph of all clusters. My current read is that these communities are not as linguistically “clean” as the standalone components: many have a node that doesn’t fit the rest of the community’s “theme”.

The Large Connected Cluster

For example:

  • Cluster 8: proper names / place names / labels. Cluster 008 with 5 nodes
  • Cluster 4: apostrophes, numbers/dates, citation/legal-style tokens. Cluster 004 with 11 nodes
  • Cluster 7: “the + noun/category” determiner phrase patterns. Cluster 007 with 5 nodes
  • Cluster 2/3: connective / clause / punctuation-ish syntax groupings. Cluster 002 with 12 nodes Cluster 003 with 11 nodes
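Communities like these can be extracted with, for example, greedy modularity maximization on the large component. The post doesn’t specify which detection algorithm was used, so the sketch below (networkx’s implementation) is one possible choice rather than the actual method:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def detect_communities(edges):
    """Community detection restricted to the largest connected component."""
    G = nx.Graph()
    G.add_edges_from(edges)
    # keep only the giant component, as in the analysis above
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    return [set(c) for c in greedy_modularity_communities(giant)]
```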

In addition, there is a set of catch-all miscellaneous communities, as well as tiny clusters. These catch-all communities (Clusters 0 and 6) cover a variety of features across technical jargon, legal language, code and formatting, and other rare things like specific formal names. Many of the small communities are incoherent, but some show interesting features. A notable example here is a community that describes uses of the number ‘2’ as a prefix in different contexts.

Note that these graphs are downstream of the correlation between activations in the pre-screening step. They should be interpreted as a sparse refinement of correlation structure rather than a standalone finding separate from correlation.

Caveats and Limitations

There are several limitations to this approach to keep in mind:

  • This isn’t finding a true precision graph, because SAE activations are not Gaussian
  • The output graph is dependent on correlations from the pre-screening step.
  • What I’m calling “stable” here means stability under resampling over the same dataset, which doesn’t say anything about underlying feature faithfulness.
  • At the current FDR threshold, k = 30 stability is an extreme constraint (302 edges over 131k nodes), and the standalone clusters seem to mostly be linguistic backbone.
  • As implemented right now, this method has hyperparameters that need either more thorough sweeps or methodological changes to remove them.

Next steps

Right now, I see two primary directions this project needs to take:

  1. Refining some of the steps in the pipeline to (a) confidently use the full ~57k-edge k = 2 repeat set and (b) remove some of the current design’s hyperparameter dependence
  2. Scaling this to all layers in a given model and figuring out how to connect graphs across layers accounting for the residual stream

Additionally, I’ll also be looking at:

  1. How different datasets may change the output graphs and how to control against that
  2. Where this might land in relation to feature splitting and absorption
  3. What, if any, hierarchical structure between features we might be able to infer from these graphs

Funding Note: This work is funded by a Coefficient Giving TAIS grant.