Graph Structure Compresses Contrastive Neuron Steering Targets

Author: Zach Maas

❗This is a working draft. I’m confident in the results and it’s near complete, but there may be minor typos.

Summary:

  • Extending Nous Research’s CNA, I build sparse dependence graphs over contrastive refusal neurons
  • Graph-derived neuron subsets retain most or all of the steering behavior while using fewer neurons
  • Holding the set of neurons fixed, base and instruct models show near-complete graph rewiring
  • Graphs estimated on harmful-only activations sometimes outperform contrastive graphs

Background

I recently read the paper Targeted Neuron Modulation via Contrastive Pair Search, which proposes a steering scheme called Contrastive Neuron Attribution (CNA). CNA selects the top 0.1% of MLP neurons that distinguish contrastive prompts, and ablating this set dramatically reduces refusal behavior. Given work like Language Model Circuits Are Sparse in the Neuron Basis, I wondered whether these neurons are independently useful units for steering or whether they have some internal organization. To test this, I built sparse conditional dependence graphs over CNA-identified neurons and checked whether graph-derived subsets retained steering behavior.

For details on generating the dependence graph with nodewise LASSO, see my previous work. I apply this approach to all models discussed in the CNA paper: Llama and Qwen family models ranging from 0.5B to 72B parameters. Of note, I build graphs from the signed contrast difference when computing dependence, whereas CNA uses the absolute difference to identify neuron sets.
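To make the graph construction concrete, here is a minimal sketch of nodewise LASSO (neighborhood selection) over a signed contrast matrix. It is an illustration under assumptions, not the exact pipeline used for the results: the function names, the regularization strength `alpha`, the "or" symmetrization rule, and the synthetic activations are all hypothetical choices.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate-descent LASSO: minimize (1/2n)||y - Xw||^2 + alpha * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # residual y - Xw, starting from w = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j excluded).
            rho = X[:, j] @ resid / n + col_sq[j] * w[j]
            w_new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w

def nodewise_lasso_edges(D, alpha=0.2, rule="or"):
    """D: (n_prompts, n_neurons) signed contrast matrix. Returns undirected edges."""
    Z = (D - D.mean(axis=0)) / (D.std(axis=0) + 1e-12)  # standardize each neuron
    p = Z.shape[1]
    nonzero = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        # Regress neuron j on all other neurons; nonzero weights are its neighbors.
        w = lasso_cd(Z[:, others], Z[:, j], alpha)
        nonzero[j, others] = w != 0.0
    adj = (nonzero | nonzero.T) if rule == "or" else (nonzero & nonzero.T)
    return {(i, j) for i in range(p) for j in range(i + 1, p) if adj[i, j]}

# Synthetic stand-in for activations on 5 CNA neurons (assumption: real inputs
# would be per-prompt MLP activations for paired harmful and benign prompts).
rng = np.random.default_rng(0)
n = 1000
a, b = rng.normal(size=n), rng.normal(size=n)
harmful = np.column_stack([
    a, a + 0.3 * rng.normal(size=n),   # neurons 0 and 1 co-vary
    b, b + 0.3 * rng.normal(size=n),   # neurons 2 and 3 co-vary
    rng.normal(size=n),                # neuron 4 is independent
])
benign = 0.1 * rng.normal(size=(n, 5))

D = harmful - benign  # signed difference, not the absolute difference
edges = nodewise_lasso_edges(D, alpha=0.2)
print(sorted(edges))
```

Passing `harmful` instead of `D` to the same function gives the harmful-only variant discussed later. The "or" rule keeps an edge if either of the two regressions selects it; the stricter "and" rule requires both.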

Results

Graph-connected neurons frequently preserve steering.

Across multiple Qwen and Llama models, graph-connected subsets frequently preserve most or all of the steering effect of the CNA method while compressing the neuron set. Steering sometimes fails (for example, in Qwen 2.5 1.5B and 72B), but the graph subsets generally produce similar refusal changes using fewer MLP neurons. This suggests that refusal-relevant CNA neurons are not behaving as an entirely unstructured set of intervention targets.

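| Model | CNA # of Neurons | Graph # of Neurons | Refusal Rate - Base | Refusal Rate - CNA | Refusal Rate - Graph |
|---|---|---|---|---|---|
| Llama 3.2 1B | 132 | 100 | 56% | 13% | 16% |
| Llama 3.2 3B | 230 | 188 | 87% | 80% | 80% |
| Llama 3.1 8B | 459 | 177 | 93% | 74% | 78% |
| Llama 3.1 70B | 2,294 | 2,054 | 91% | 63% | 60% |
| Qwen 2.5 1.5B | 251 | 2 | 99% | 78% | 95% |
| Qwen 2.5 3B | 397 | 220 | 89% | 22% | 17% |
| Qwen 2.5 7B | 531 | 274 | 90% | 6% | 4% |
| Qwen 2.5 72B | 2,366 | 414 | 69% | 9% | 37% |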
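As a rough illustration of how a graph-derived subset could be extracted, the sketch below keeps every CNA neuron that touches at least one selected edge. That selection rule is my assumption for the example, and the `(layer, index)` neuron ids and the helper name are hypothetical.

```python
def graph_connected_subset(cna_neurons, edges):
    """Keep CNA neurons that are an endpoint of at least one edge (assumed rule)."""
    touched = {neuron for edge in edges for neuron in edge}
    return [neuron for neuron in cna_neurons if neuron in touched]

# Hypothetical neuron ids as (layer, index) pairs.
cna_neurons = [(10, 5), (10, 9), (11, 2), (12, 7), (12, 8)]
edges = {((10, 5), (11, 2)), ((11, 2), (12, 7))}

subset = graph_connected_subset(cna_neurons, edges)
compression = len(subset) / len(cna_neurons)  # fraction of the CNA set retained
print(subset, f"{compression:.0%}")
```

The retained fraction computed here corresponds to the ratio of the "Graph # of Neurons" column to the "CNA # of Neurons" column.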

Instruction tuning changes graph structure

The CNA paper highlights that ablation of the refusal neuron set does not work effectively in base models, and suggests that fine-tuning creates refusal behavior in these neurons. To assess what structure might be present, I generated graphs using the instruct-derived neuron set applied back to the base model. Holding the neuron set fixed but changing from instruct to base shows almost complete edge turnover. These graphs should be read as conditional dependence structure, not necessarily causal circuits. One possible interpretation here is that fine-tuning changes how refusal-relevant neurons coordinate rather than just changing participation in refusal.

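| Model | Instruct E | Base E | Shared E | Edge Jaccard | Base retention of instruct edges |
|---|---|---|---|---|---|
| Llama 3.2 1B | 145 | 198 | 15 | 0.046 | 10.3% |
| Llama 3.2 3B | 266 | 452 | 22 | 0.032 | 8.3% |
| Llama 3.1 8B | 224 | 165 | 13 | 0.035 | 5.8% |
| Llama 3.1 70B | 3,783 | 9,201 | 86 | 0.0067 | 2.3% |
| Qwen 2.5 1.5B | 1 | 0 | 0 | 0.000 | 0.0% |
| Qwen 2.5 3B | 284 | 269 | 47 | 0.093 | 16.5% |
| Qwen 2.5 7B | 363 | 532 | 65 | 0.078 | 17.9% |
| Qwen 2.5 72B | 468 | 651 | 58 | 0.055 | 12.4% |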

Caveat: Jaccard is a strict binary edge metric, so it can understate relatedness if weights shift, nearby substitute edges appear, or the LASSO solution is unstable. By exact selected-edge identity, though, overlap is very low.
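For reference, the two overlap metrics in the table can be computed as follows. This is a small sketch with hypothetical helper names; the count-based check uses the Llama 3.2 1B row (145 instruct edges, 198 base edges, 15 shared).

```python
def edge_overlap(instruct_edges, base_edges):
    """Return (shared count, edge Jaccard, base retention of instruct edges)."""
    # frozenset makes edges undirected: (a, b) and (b, a) count as the same edge.
    inst = {frozenset(e) for e in instruct_edges}
    base = {frozenset(e) for e in base_edges}
    shared = inst & base
    union = inst | base
    jaccard = len(shared) / len(union) if union else 0.0
    retention = len(shared) / len(inst) if inst else 0.0
    return len(shared), jaccard, retention

def overlap_from_counts(n_instruct, n_base, n_shared):
    """Same metrics computed from edge counts alone."""
    union = n_instruct + n_base - n_shared
    jaccard = n_shared / union if union else 0.0
    retention = n_shared / n_instruct if n_instruct else 0.0
    return jaccard, retention

# Llama 3.2 1B row of the table.
jaccard, retention = overlap_from_counts(145, 198, 15)
print(f"{jaccard:.3f} {retention:.1%}")
```

Both metrics count an edge as shared only if the same neuron pair is selected in both graphs, which is why they are sensitive to the instabilities listed in the caveat.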

Llama 3.1 70B: full-depth CNA graph, base on top and instruct on bottom

Qwen 2.5 72B: full-depth CNA graph, base on top and instruct on bottom

Graphs built on harmful-only activations sometimes outperform contrastive graphs

Surprisingly, graphs estimated using the CNA set and harmful-only activations occasionally outperform graphs estimated from activation differences between harmful and non-harmful data. Both Llama 3.1 8B and Qwen 2.5 72B show this behavior: the Llama 3.1 8B harmful-only graph matches the CNA baseline (74%), and the Qwen 2.5 72B harmful-only graph exceeds it with a 5% refusal rate (versus 9% for CNA). Many other models also show effective refusal reductions with the harmful-only graphs. This may be a function of the structure of activations: the CNA contrast between harmful and non-harmful prompts identifies a useful working set of neurons, and harmful-only activations then co-activate more reliably on refusal-associated structure.

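| Model | No intervention | Harmful/non-harmful contrast graph | Harmful-only graph |
|---|---|---|---|
| Llama 3.2 1B | 56% | 16% | 17% |
| Llama 3.2 3B | 86% | 79% | 79% |
| Llama 3.1 8B | 93% | 78% | 74% |
| Llama 3.1 70B | 91% | 60% | 71% |
| Qwen 2.5 1.5B | 98% | 94% | 98% |
| Qwen 2.5 3B | 90% | 18% | 29% |
| Qwen 2.5 7B | 90% | 4% | 6% |
| Qwen 2.5 72B | 69% | 37% | 5% |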

(Table values are refusal rates; a lower rate means a larger reduction.)

Future Work

I don’t have a convincing mechanistic explanation for why this approach works. My current view is that CNA and similar methods that act directly on the MLPs address the question of “which neurons matter here?”, while graph structure focuses on the organization of those neurons. Retaining behavioral changes on compressed subsets of neurons suggests that certain refusal-relevant neurons may coordinate more closely than others. The change in dependence structure before and after fine-tuning suggests that RL changes this coordination structure substantially. Because of this, I’m relatively optimistic about structured dependence as a useful interpretability primitive for future methods. In the best case, this approach seems like it may be useful for extracting compact behavior-associated bases directly from MLPs for steering and potential downstream RL tasks.

Recent work also seems to be pushing in this direction indirectly. Goodfire’s recent work seems to view sparse bases as local elements on manifolds, and structured dependence can be seen as a different lens on that structure. For subsequent work, I’d like to:

  • Dig further into the differences between base and instruct model structure, since many of the differences in results here seem to be model-family driven.
  • Test whether this approach extends to capability suppression/unlearning in a steering or RL context.
  • Extend this beyond neurons. I’ve had good results with SAEs and MLP neurons, and wonder if other weights or maybe activation oracle latents might be informative.

Appendix

Graphs outperform random permutations of the CNA neuron set

As a control, I assessed whether the graph selection effect was unique or could be captured by random, equal-sized sets of nodes from the CNA set. I ran two sets of nulls: 30 layer-matched random subsets and 30 uniform random subsets, both drawn from the relevant top-CNA pool. Random subsets can achieve refusal reduction similar to the graph set, but the graph set is always close to the maximum reduction achieved by random subsets. Note that the Llama 3.1 70B null runs are ongoing to get to n=60.

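| Model | Graph n | Graph/CNA | Graph refusal reduction | Null set | n | Null mean | Null max | Nulls >= graph |
|---|---|---|---|---|---|---|---|---|
| Llama-3.1-70B | 2,054 | 90% | 31% | layer-matched | 3 | 30.0% | 30.0% | 0/3 |
| Llama-3.1-70B | 2,054 | 90% | 31% | uniform | 3 | 16.7% | 20.0% | 0/3 |
| Llama-3.1-8B | 421 | 92% | 22% | layer-matched | 30 | 18.4% | 25.0% | 6/30 |
| Llama-3.1-8B | 421 | 92% | 22% | uniform | 30 | 17.0% | 24.0% | 4/30 |
| Llama-3.2-1B | 100 | 76% | 40% | layer-matched | 30 | 44.0% | 53.0% | 21/30 |
| Llama-3.2-1B | 100 | 76% | 40% | uniform | 30 | 41.9% | 58.0% | 17/30 |
| Llama-3.2-3B | 188 | 82% | 7% | layer-matched | 30 | 4.8% | 7.0% | 1/30 |
| Llama-3.2-3B | 188 | 82% | 7% | uniform | 30 | 3.7% | 8.0% | 2/30 |
| Qwen2.5-1.5B | 2 | 1% | 4% | layer-matched | 30 | -0.1% | 0.0% | 0/30 |
| Qwen2.5-1.5B | 2 | 1% | 4% | uniform | 30 | -0.1% | 0.0% | 0/30 |
| Qwen2.5-3B | 220 | 55% | 72% | layer-matched | 30 | 50.4% | 74.0% | 2/30 |
| Qwen2.5-3B | 220 | 55% | 72% | uniform | 30 | 39.5% | 71.0% | 0/30 |
| Qwen2.5-72B | 627 | 27% | 64% | layer-matched | 30 | 3.7% | 16.0% | 0/30 |
| Qwen2.5-72B | 627 | 27% | 64% | uniform | 30 | 2.0% | 16.0% | 0/30 |
| Qwen2.5-7B | 274 | 52% | 86% | layer-matched | 30 | 63.3% | 85.0% | 0/30 |
| Qwen2.5-7B | 274 | 52% | 86% | uniform | 30 | 48.6% | 74.0% | 0/30 |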

Layer-matching here means we hold the number of neurons sampled per layer equal to the graph's, and uniform means we sample uniformly at random. In both cases, we sample only from the CNA-identified top 0.1% set of neurons. Note that these nulls are not perfectly size-matched random comparisons, because graph size varies. For most models, the identified graph contains more than 50% of the top-CNA neurons, so a disjoint size-matched random subset is not possible. In these cases, the nulls sample from the same top-CNA pool with overlap allowed, and ask whether the graph does better than typical same-size CNA subsets rather than whether it is uniquely necessary. The exceptions are Qwen2.5-1.5B and Qwen2.5-72B, where nulls are drawn from outside the graph structure. I don’t believe that these null differences substantially change the interpretation of results, but I am running matched trials to validate this as well.
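The two null schemes described above can be sketched as follows. This is an illustrative version that samples from the full CNA pool with overlap allowed; the `(layer, index)` neuron ids, the function names, and the example reduction values are hypothetical.

```python
import random
from collections import Counter

def uniform_null(pool, size, rng):
    """Sample `size` neurons uniformly at random from the CNA pool."""
    return rng.sample(pool, size)

def layer_matched_null(pool, graph_subset, rng):
    """Sample from the CNA pool with the same per-layer counts as the graph subset."""
    per_layer = Counter(layer for layer, _ in graph_subset)
    by_layer = {}
    for neuron in pool:
        by_layer.setdefault(neuron[0], []).append(neuron)
    sample = []
    for layer, k in per_layer.items():
        sample.extend(rng.sample(by_layer[layer], k))
    return sample

def nulls_at_least_graph(null_reductions, graph_reduction):
    """Count null runs whose refusal reduction matches or beats the graph subset."""
    return sum(r >= graph_reduction for r in null_reductions)

# Hypothetical CNA pool: 3 layers with 10 neurons each, ids as (layer, index).
pool = [(layer, i) for layer in range(3) for i in range(10)]
graph_subset = [(0, 1), (0, 2), (1, 5), (2, 0), (2, 3), (2, 9)]

rng = random.Random(0)
uniform = uniform_null(pool, len(graph_subset), rng)
matched = layer_matched_null(pool, graph_subset, rng)
print(Counter(layer for layer, _ in matched))
```

Each sampled subset would then be ablated and scored the same way as the graph subset, and `nulls_at_least_graph` gives the numerator of the "Nulls >= graph" column.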

Additional figures showing graph structure changes after fine tuning

Llama 3.2 1B: full-depth CNA graph, base on top and instruct on bottom

Llama 3.2 3B: full-depth CNA graph, base on top and instruct on bottom

Llama 3.1 8B: full-depth CNA graph, base on top and instruct on bottom

Qwen 2.5 1.5B: full-depth CNA graph, base on top and instruct on bottom

Qwen 2.5 3B: full-depth CNA graph, base on top and instruct on bottom

Qwen 2.5 7B: full-depth CNA graph, base on top and instruct on bottom