TL;DR
- Novel approach to understanding how neural networks organize knowledge internally.
- Uses graph theory to find hidden connections between learned features.
- Reveals semantic relationships that linear methods miss (e.g., "catastrophic events" ↔ "Fox News").
The Problem
Modern language models learn thousands of internal "features" that activate on different concepts. Current methods assume these features work independently, but that assumption is often wrong: the features interact in complex ways, and those interactions reveal how the model actually thinks.
The Approach
Instead of analyzing features in isolation, I model their co-activation patterns as a graph:
- Nodes: Individual learned features (e.g., "mentions of disasters", "political commentary")
- Edges: Statistical dependencies between features
- Graph structure: Reveals semantic clusters and surprising connections
Key insight: The graph topology tells us which concepts the model has learned to associate, even when those associations aren't obvious from the features themselves.
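To make the structure concrete, here is a toy NetworkX version of the feature graph. The feature labels and edge weights are invented for illustration (loosely echoing the example below), not outputs of the method.

```python
import networkx as nx

# Toy feature graph: nodes are SAE features (hypothetical labels),
# edges carry a dependency strength estimated from co-activation statistics.
G = nx.Graph()
G.add_nodes_from([
    (0, {"label": "mentions of disasters"}),
    (1, {"label": "political commentary"}),
    (2, {"label": "Fox News channel mentions"}),
])
G.add_weighted_edges_from([
    (0, 2, 0.42),  # invented dependency strengths
    (1, 2, 0.17),
])

# Graph structure: a feature's neighbors show what the model associates it with.
for nbr in G.neighbors(0):
    print(G.nodes[nbr]["label"], G[0][nbr]["weight"])
```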
What It Finds
Example discovered connection: catastrophic events/overwhelming situations ↔ Fox News channel mentions
This suggests the model has learned to associate disaster-related content with specific news sources, a relationship that wouldn't show up in traditional feature analysis.
Technical Approach
- Built on Sparse Autoencoders (SAEs) trained on transformer activations
- Graph construction via precision matrix estimation, which captures conditional dependencies between features (see the sketch after this list)
- Subgraph clustering to identify semantic communities
- Validation through controlled activation experiments
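A minimal sketch of that pipeline, assuming SAE activations are available as a (samples × features) matrix. The synthetic chain-structured data, the GraphicalLasso regularization strength, and the edge threshold are placeholders, not the settings used in the actual experiments.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community
from sklearn.covariance import GraphicalLasso

# Placeholder for SAE feature activations (samples x features). The chain
# dependence here is synthetic, just so the example produces edges; in the
# real pipeline these are activations collected from the transformer.
rng = np.random.default_rng(0)
n_samples, n_features = 5_000, 100
activations = rng.normal(size=(n_samples, n_features))
for j in range(1, n_features):
    activations[:, j] += 0.5 * activations[:, j - 1]

# Sparse precision (inverse covariance) matrix: nonzero off-diagonal entries
# are conditional dependencies between features, i.e. edges in the graph.
precision = GraphicalLasso(alpha=0.05).fit(activations).precision_

# Build the feature graph from the off-diagonal nonzeros.
G = nx.Graph()
G.add_nodes_from(range(n_features))
rows, cols = np.nonzero(np.triu(np.abs(precision) > 1e-6, k=1))
G.add_weighted_edges_from(
    (i, j, abs(precision[i, j])) for i, j in zip(rows.tolist(), cols.tolist())
)

# Subgraph clustering: modularity-based communities as candidate semantic clusters.
clusters = community.greedy_modularity_communities(G, weight="weight")
print(f"{G.number_of_edges()} edges, {len(clusters)} communities")
```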
Computational Challenges & Solutions
- Challenge: Precision matrix estimation scales as O(n³) in the number of features
- Approach: Subgraph decomposition + merging strategies (sketched after this list)
- Alternative: Lighter approximation methods for edge detection
- Target: Scale to 100k+ features while maintaining interpretability
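One way to avoid the cubic cost, as a rough sketch: partition the features into blocks, estimate a sparse precision matrix per block, and union the resulting edges. The block size, regularization strength, and the omission of cross-block edges (which the merging step would have to recover) are all simplifications of the approach described above.

```python
import numpy as np
import networkx as nx
from sklearn.covariance import GraphicalLasso

def blockwise_feature_graph(activations: np.ndarray, block_size: int = 1_000,
                            alpha: float = 0.05) -> nx.Graph:
    """Approximate the feature graph by running GraphicalLasso per block of features.

    Each block costs O(block_size**3) rather than O(n_features**3). Cross-block
    edges are dropped in this sketch; a merging step would recover them, e.g. by
    screening high-correlation pairs across blocks and re-estimating locally.
    """
    n_features = activations.shape[1]
    G = nx.Graph()
    G.add_nodes_from(range(n_features))
    for start in range(0, n_features, block_size):
        idx = np.arange(start, min(start + block_size, n_features))
        precision = GraphicalLasso(alpha=alpha).fit(activations[:, idx]).precision_
        rows, cols = np.nonzero(np.triu(np.abs(precision) > 1e-6, k=1))
        G.add_weighted_edges_from(
            (int(idx[i]), int(idx[j]), abs(precision[i, j]))
            for i, j in zip(rows, cols)
        )
    return G
```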
Why This Matters
- Interpretability: Understand how models organize knowledge beyond individual features
- Safety: Identify unexpected concept associations that could cause problems, and give us another angle to probe deceptive behavior
- Model improvement: Graph structure could inform better architectures
Status
- Core method tested on smaller models; scaling work in progress
- Awaiting compute funding (OpenPhil application pending) for large-scale experiments
- Exploring computational optimizations for scaling
Tech Stack
- PyTorch, HuggingFace Transformers
- NetworkX, scikit-learn
- Custom CUDA kernels for graph operations (planned)
What I Learned
- Graph methods reveal structure that linear approaches don't see
- Computational efficiency matters more than theoretical elegance for real applications
- The intersection of interpretability and graph theory is promising