❗This is a working draft. I’m confident in the results and it’s near complete, but there may be minor typos.

Summary:

Extending Nous Research’s CNA, I build sparse dependence graphs over contrastive refusal neurons
Graph-derived neuron subsets retain most or all of the steering behavior while using fewer neurons
Holding the set of neurons fixed, base and instruct models show near-complete graph rewiring
Graphs estimated on harmful-only activations sometimes outperform contrastive graphs

Background

I recently read the paper Targeted Neuron Modulation via Contrastive Pair Search, which proposes a steering scheme called Contrastive Neuron Attribution (CNA). CNA selects the top 0.1% of MLP neurons that best distinguish contrastive prompts, and ablating this set dramatically reduces refusal behavior. Given work like Language Model Circuits Are Sparse in the Neuron Basis, I wondered whether these neurons are independently useful units for steering or whether they have some internal organization. To test this, I built sparse conditional dependence graphs over CNA-identified neurons and checked whether graph-derived subsets retained steering behavior.
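To make the selection step concrete, here is a minimal sketch of picking a top fraction of neurons by contrast. It assumes the score is the absolute difference in mean activation between harmful and harmless prompts; the CNA paper's exact scoring may differ, and `top_fraction_neurons` and the toy activations are hypothetical names and data for illustration only.

```python
import math


def top_fraction_neurons(harmful, harmless, fraction=0.001):
    """Return indices of the neurons with the largest absolute mean contrast.

    harmful / harmless: lists of rows, one row of neuron activations per prompt.
    The scoring rule (absolute difference of means) is an assumption.
    """
    n_neurons = len(harmful[0])

    def column_mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    scores = [
        abs(column_mean(harmful, j) - column_mean(harmless, j))
        for j in range(n_neurons)
    ]
    # Keep at least one neuron even when fraction * n_neurons rounds below 1.
    k = max(1, math.ceil(fraction * n_neurons))
    ranked = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy activations: 2 prompts per class, 4 neurons. Neuron 2 separates the classes.
harmful_acts = [[1.0, 0.0, 5.0, 0.0], [1.0, 0.0, 7.0, 0.0]]
harmless_acts = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
selected = top_fraction_neurons(harmful_acts, harmless_acts, fraction=0.25)
print(selected)
```

With real models the columns would range over every MLP neuron in every layer, and `fraction=0.001` would correspond to the 0.1% cutoff described above.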

For details on generating the dependence graph with nodewise LASSO, see my previous work. I apply this approach to all models discussed in the CNA paper: Llama and Qwen family models ranging from 0.5B to 72B. Of note, I estimate dependence from the signed contrast difference, as opposed to the absolute difference that CNA uses to identify neuron sets.
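For readers unfamiliar with nodewise LASSO, the sketch below shows the general idea: regress each neuron on all the others with an L1 penalty, and treat nonzero coefficients as candidate edges. This is an illustration under stated assumptions, not the exact pipeline used here. The coordinate-descent solver, the penalty `alpha`, the "or"/"and" symmetrization rule, and the synthetic data are all assumptions; in practice a library solver such as scikit-learn's `Lasso` would be the more natural choice.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def standardize(columns):
    """Center each neuron's column and scale it to unit variance."""
    result = []
    for col in columns:
        mean = sum(col) / len(col)
        std = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5 or 1.0
        result.append([(v - mean) / std for v in col])
    return result


def lasso_coordinate_descent(predictors, target, alpha, n_iters=100):
    """Minimize (1/2n)*||y - X b||^2 + alpha*||b||_1 over b."""
    n = len(target)
    beta = [0.0] * len(predictors)
    residual = list(target)
    for _ in range(n_iters):
        for j, col in enumerate(predictors):
            scale = sum(c * c for c in col) / n
            if scale == 0.0:
                continue  # a constant column carries no information
            # Correlation of column j with the residual that excludes j's own fit.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            updated = soft_threshold(rho, alpha) / scale
            delta = updated - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = updated
    return beta


def nodewise_lasso_edges(columns, alpha, rule="or"):
    """Regress each neuron on all others; nonzero coefficients propose edges."""
    cols = standardize(columns)
    picks = set()  # (i, j) means neuron j was selected when predicting neuron i
    for i in range(len(cols)):
        others = [j for j in range(len(cols)) if j != i]
        beta = lasso_coordinate_descent([cols[j] for j in others], cols[i], alpha)
        picks.update((i, j) for j, b in zip(others, beta) if b != 0.0)
    # "or" keeps an edge if either regression selects it; "and" requires both.
    return {(min(i, j), max(i, j)) for i, j in picks
            if rule == "or" or (j, i) in picks}


# Synthetic stand-in for signed contrast differences (one value per prompt pair).
rng = random.Random(0)
n_pairs = 400
neuron0 = [rng.gauss(0, 1) for _ in range(n_pairs)]
neuron1 = [v + 0.5 * rng.gauss(0, 1) for v in neuron0]  # depends on neuron0
neuron2 = [rng.gauss(0, 1) for _ in range(n_pairs)]     # independent
neuron3 = [rng.gauss(0, 1) for _ in range(n_pairs)]     # independent
edges = nodewise_lasso_edges([neuron0, neuron1, neuron2, neuron3], alpha=0.3)
print(edges)
```

The penalty controls sparsity: a larger `alpha` zeroes out more coefficients and yields fewer edges, which is why the resulting graph should be read as conditional dependence structure at a chosen regularization level.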

Results

Graph-connected neurons frequently preserve steering.

Across multiple Qwen and Llama models, graph-connected subsets frequently preserve most or all of the steering effect of the CNA method while compressing the neuron set. Graph-based steering sometimes fails (for example, in Qwen 2.5 1.5B and 72B), but it generally produces similar refusal changes using fewer MLP neurons. This suggests that refusal-relevant CNA neurons are not behaving as an entirely unstructured set of intervention targets.

Model	CNA # of Neurons	Graph # of Neurons	Refusal Rate - Base	Refusal Rate - CNA	Refusal Rate - Graph
Llama 3.2 1B	132	100	56%	13%	16%
Llama 3.2 3B	230	188	87%	80%	80%
Llama 3.1 8B	459	177	93%	74%	78%
Llama 3.1 70B	2,294	2,054	91%	63%	60%
Qwen 2.5 1.5B	251	2	99%	78%	95%
Qwen 2.5 3B	397	220	89%	22%	17%
Qwen 2.5 7B	531	274	90%	6%	4%
Qwen 2.5 72B	2,366	414	69%	9%	37%
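As a rough illustration of how a graph-derived subset can be read off an estimated graph, the sketch below keeps every neuron that participates in at least one edge and reports the resulting compression. The selection rule is an assumption on my part, though it is consistent with the Qwen 2.5 1.5B row, where a single edge corresponds to a two-neuron graph set. The `(layer, index)` neuron identifiers and the toy edge list are hypothetical.

```python
def graph_neuron_subset(edges):
    """Return the neurons that appear in at least one edge (non-isolated nodes)."""
    return sorted({node for edge in edges for node in edge})


def compression_ratio(graph_size, cna_size):
    """Fraction of the CNA set retained by the graph-derived subset."""
    return graph_size / cna_size


# Hypothetical CNA set: neurons identified as (layer, index) pairs.
cna_set = [(3, 17), (3, 42), (5, 8), (7, 101), (9, 5)]
# Hypothetical estimated edges between CNA neurons.
toy_edges = [((3, 17), (5, 8)), ((5, 8), (7, 101))]

subset = graph_neuron_subset(toy_edges)
ratio = compression_ratio(len(subset), len(cna_set))
print(subset, ratio)

# The same ratio computed from the Llama 3.1 8B row of the table above.
llama_8b_ratio = compression_ratio(177, 459)
print(round(llama_8b_ratio, 2))
```

Ablating only `subset` instead of the full `cna_set` is then the intervention whose refusal rate appears in the "Graph" column.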

Instruction tuning changes graph structure

The CNA paper highlights that ablation of the refusal neuron set does not work effectively in base models, and suggests that fine-tuning creates refusal behavior in these neurons. To assess what structure might be present, I generated graphs using the instruct-derived neuron set applied back to the base model. Holding the neuron set fixed but changing from instruct to base shows almost complete edge turnover. These graphs should be read as conditional dependence structure, not necessarily causal circuits. One possible interpretation here is that fine-tuning changes how refusal-relevant neurons coordinate rather than just changing participation in refusal.

Model	Instruct edges (E)	Base edges (E)	Shared edges	Edge Jaccard	Base retention of instruct edges
Llama 3.2 1B	145	198	15	0.046	10.3%
Llama 3.2 3B	266	452	22	0.032	8.3%
Llama 3.1 8B	224	165	13	0.035	5.8%
Llama 3.1 70B	3,783	9,201	86	0.0067	2.3%
Qwen 2.5 1.5B	1	0	0	0.000	0.0%
Qwen 2.5 3B	284	269	47	0.093	16.5%
Qwen 2.5 7B	363	532	65	0.078	17.9%
Qwen 2.5 72B	468	651	58	0.055	12.4%

Caveat: Jaccard is a strict binary edge metric, so it can understate relatedness if weights shift, nearby substitute edges appear, or the LASSO solution is unstable. Even so, measured by exact selected edge identity, overlap is very low.
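For reference, the two overlap columns can be reproduced from the edge counts alone: edge Jaccard is shared / (instruct + base - shared), and base retention is shared / instruct. The sketch below computes both from either edge sets or raw counts; the function names and the toy edge sets are hypothetical, while the counts in the final lines come from the table above.

```python
def edge_overlap(instruct_edges, base_edges):
    """Compare two undirected edge sets by exact edge identity."""
    instruct = {frozenset(edge) for edge in instruct_edges}
    base = {frozenset(edge) for edge in base_edges}
    return overlap_from_counts(len(instruct), len(base), len(instruct & base))


def overlap_from_counts(n_instruct, n_base, n_shared):
    """Return (jaccard, retention) from edge counts."""
    union = n_instruct + n_base - n_shared
    jaccard = n_shared / union if union else 0.0
    retention = n_shared / n_instruct if n_instruct else 0.0
    return jaccard, retention


# Toy example: edge direction does not matter, so (2, 1) matches (1, 2).
toy_jaccard, toy_retention = edge_overlap(
    instruct_edges=[(1, 2), (2, 3), (3, 4)],
    base_edges=[(2, 1), (4, 5)],
)
print(toy_jaccard, toy_retention)

# Llama 3.2 1B row: 145 instruct edges, 198 base edges, 15 shared.
jaccard_1b, retention_1b = overlap_from_counts(145, 198, 15)
print(round(jaccard_1b, 3), round(100 * retention_1b, 1))
```

Because both metrics count only exact matches, an edge that moves to a neighboring neuron counts as fully lost, which is the strictness noted in the caveat.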

Llama 3.1 70B: full-depth CNA graph, base on top and instruct on bottom

Qwen 2.5 72B: full-depth CNA graph, base on top and instruct on bottom

Graphs built on harmful-only activations sometimes outperform contrastive graphs

Surprisingly, graphs estimated using the CNA set and harmful-only activations occasionally outperform graphs estimated from activation differences between harmful and non-harmful data. Both Llama 3.1 8B and Qwen 2.5 72B show this behavior: Llama 3.1 8B matches the CNA baseline at a 74% refusal rate, and Qwen 2.5 72B exceeds it with a 5% refusal rate. Many models also show effective refusal reductions on the harmful-only graphs. This may be a function of the structure of the activations: the CNA contrast between harmful and non-harmful prompts identifies a useful working set of neurons, and harmful-only activations then more reliably co-activate on refusal-associated structure.

model	no intervention	harmful/non-harmful contrast graph	harmful-only graph
Llama 3.2 1B	56%	16%	17%
Llama 3.2 3B	86%	79%	79%
Llama 3.1 8B	93%	78%	74%
Llama 3.1 70B	91%	60%	71%
Qwen 2.5 1.5B	98%	94%	98%
Qwen 2.5 3B	90%	18%	29%
Qwen 2.5 7B	90%	4%	6%
Qwen 2.5 72B	69%	37%	5%

(table is formatted as refusal rate / reduction)
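The difference between the two graph inputs compared above can be sketched as follows. This assumes harmful and non-harmful prompts are paired row by row, so that the contrast input is a per-pair signed difference and the harmful-only input is the harmful activations used directly. The pairing scheme, function names, and toy values are assumptions for illustration.

```python
def signed_contrast(harmful, harmless):
    """Per-pair signed activation difference (rows = prompt pairs, cols = neurons)."""
    return [
        [h - b for h, b in zip(harmful_row, harmless_row)]
        for harmful_row, harmless_row in zip(harmful, harmless)
    ]


def harmful_only(harmful):
    """Harmful-prompt activations, copied unchanged as the graph input."""
    return [list(row) for row in harmful]


def to_columns(rows):
    """Transpose rows into one list per neuron, the layout a nodewise regression expects."""
    return [list(col) for col in zip(*rows)]


# Toy activations for 2 prompt pairs over 2 CNA neurons.
harmful_acts = [[1.0, 2.0], [3.0, 0.5]]
harmless_acts = [[0.5, 2.5], [1.0, 0.5]]

contrast_input = to_columns(signed_contrast(harmful_acts, harmless_acts))
harmful_input = to_columns(harmful_only(harmful_acts))
print(contrast_input)
print(harmful_input)
```

Either matrix can then be passed to the same graph estimator, so the comparison in the table isolates the choice of input rather than the choice of neurons.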

Future Work

I don’t have a convincing mechanistic explanation for why this approach works. My current view is that CNA and similar methods that act directly on the MLPs address the question of “which neurons matter here?”, while graph structure focuses on the organization of those neurons. Retaining behavioral changes on compressed subsets of neurons suggests that certain refusal-relevant neurons may coordinate more closely than others. The change in dependence structure before and after fine-tuning suggests that RL changes this coordination structure substantially. Because of this, I’m relatively optimistic about structured dependence as a useful interpretability primitive for future methods. In the best case, this approach seems like it may be useful for extracting compact behavior-associated bases directly from MLPs for steering and potential downstream RL tasks.

Recent work also seems to be pushing in this direction indirectly. Goodfire’s recent work seems to view sparse bases as local elements on manifolds, and structured dependence can be seen as a different lens on that structure. For subsequent work, I’d like to:

Dig further into the differences between base and instruct model structure, since many of the differences in results here seem to be model-family driven.
Test whether this approach extends to capability suppression/unlearning in a steering or RL context.
Extend this beyond neurons - I’ve had good results with SAEs and MLP neurons, and wonder if other weights or maybe activation oracle latents might be informative.

Appendix

Graphs outperform random permutations of the CNA neuron set

As a control, I assessed whether the graph selection effect was unique or could be reproduced by equal-sized random sets of nodes from the CNA set. I ran two sets of nulls: 30 layer-matched random subsets and 30 uniform random subsets, both drawn from the same top-CNA pool. Random subsets can achieve refusal reductions similar to the graph set, but the graph set is always close to the maximum reduction achieved by the random subsets. Note that Llama 3.1 70B null runs are ongoing to get to n=60.

model	graph n	graph/CNA	graph refusal reduction	null set	n	null mean	null max	nulls >= graph
Llama-3.1-70B	2,054	90%	31%	layer-matched	3	30.0%	30.0%	0/3
Llama-3.1-70B	2,054	90%	31%	uniform	3	16.7%	20.0%	0/3
Llama-3.1-8B	421	92%	22%	layer-matched	30	18.4%	25.0%	6/30
Llama-3.1-8B	421	92%	22%	uniform	30	17.0%	24.0%	4/30
Llama-3.2-1B	100	76%	40%	layer-matched	30	44.0%	53.0%	21/30
Llama-3.2-1B	100	76%	40%	uniform	30	41.9%	58.0%	17/30
Llama-3.2-3B	188	82%	7%	layer-matched	30	4.8%	7.0%	1/30
Llama-3.2-3B	188	82%	7%	uniform	30	3.7%	8.0%	2/30
Qwen2.5-1.5B	2	1%	4%	layer-matched	30	-0.1%	0.0%	0/30
Qwen2.5-1.5B	2	1%	4%	uniform	30	-0.1%	0.0%	0/30
Qwen2.5-3B	220	55%	72%	layer-matched	30	50.4%	74.0%	2/30
Qwen2.5-3B	220	55%	72%	uniform	30	39.5%	71.0%	0/30
Qwen2.5-72B	627	27%	64%	layer-matched	30	3.7%	16.0%	0/30
Qwen2.5-72B	627	27%	64%	uniform	30	2.0%	16.0%	0/30
Qwen2.5-7B	274	52%	86%	layer-matched	30	63.3%	85.0%	0/30
Qwen2.5-7B	274	52%	86%	uniform	30	48.6%	74.0%	0/30

Layer-matching here means we hold the number of neurons sampled per layer equal to the graph's, and uniform means we sample uniformly at random. In both cases, we sample only from the CNA-identified top 0.1% set of neurons. Note that these nulls vary in how cleanly they serve as size-matched random comparisons because graph size varies across models. For most models, the identified graph contains more than 50% of the top-CNA neurons, so a disjoint size-matched random subset is not possible. In these cases, the nulls sample from the same top-CNA pool with overlap allowed, and ask whether the graph does better than typical same-size CNA subsets rather than whether it is uniquely necessary. The exceptions are Qwen2.5-1.5B and Qwen2.5-72B, where nulls are drawn from outside the graph structure. I don’t believe that these null differences substantially change the interpretation of the results, but I am running matched trials to validate this as well.
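The two null constructions described above can be sketched as follows, assuming neurons are identified by `(layer, index)` pairs and that overlap with the graph set is allowed. The function names, the toy pool, and the reduction values in the final lines are hypothetical; only the sampling logic is meant to mirror the description.

```python
import random
from collections import Counter


def uniform_null(pool, size, rng):
    """Sample `size` neurons uniformly at random from the CNA pool."""
    return rng.sample(pool, size)


def layer_matched_null(pool, graph_set, rng):
    """Sample from the pool with the same per-layer counts as the graph set."""
    per_layer = Counter(layer for layer, _ in graph_set)
    by_layer = {}
    for neuron in pool:
        by_layer.setdefault(neuron[0], []).append(neuron)
    subset = []
    for layer, count in per_layer.items():
        subset.extend(rng.sample(by_layer[layer], count))
    return subset


def nulls_at_least_graph(null_reductions, graph_reduction):
    """Count null runs whose refusal reduction matches or beats the graph set."""
    return sum(r >= graph_reduction for r in null_reductions)


# Hypothetical CNA pool: 3 layers with 10 selected neurons each.
pool = [(layer, idx) for layer in range(3) for idx in range(10)]
graph_set = [(0, 1), (0, 2), (1, 5), (2, 7), (2, 8), (2, 9)]

rng = random.Random(0)
matched = layer_matched_null(pool, graph_set, rng)
uniform = uniform_null(pool, len(graph_set), rng)
print(matched)
print(uniform)

# Made-up reductions, only to show how the "nulls >= graph" column is counted.
count = nulls_at_least_graph([0.18, 0.25, 0.22, 0.10], graph_reduction=0.22)
print(count)
```

Each sampled subset would then be ablated and scored exactly like the graph set, giving the null mean, null max, and "nulls >= graph" columns in the table.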

Additional figures showing graph structure changes after fine tuning

Llama 3.2 1B: full-depth CNA graph, base on top and instruct on bottom

Llama 3.2 3B: full-depth CNA graph, base on top and instruct on bottom

Llama 3.1 8B: full-depth CNA graph, base on top and instruct on bottom

Qwen 2.5 1.5B: full-depth CNA graph, base on top and instruct on bottom

Qwen 2.5 3B: full-depth CNA graph, base on top and instruct on bottom

Qwen 2.5 7B: full-depth CNA graph, base on top and instruct on bottom

Graph Structure Compresses Contrastive Neuron Steering Targets