TL;DR
- Novel approach to understanding how neural networks organize knowledge internally.
- Uses graph theory to find hidden connections between learned features.
- Reveals semantic relationships that linear methods miss (e.g., "catastrophic events" ↔ "Fox News").
The Problem
Modern language models learn thousands of internal "features" that activate on different concepts. Current methods assume these features work independently, but that assumption is often wrong: the features interact in complex ways, and those interactions reveal how the model actually thinks.
The Approach
Instead of analyzing features in isolation, I model their co-activation patterns as a graph:
- Nodes: Individual learned features (e.g., "mentions of disasters", "political commentary")
- Edges: Statistical dependencies between features
- Graph structure: Reveals semantic clusters and surprising connections
Key insight: The graph topology tells us which concepts the model has learned to associate, even when those associations aren't obvious from the features themselves.
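To make the structure concrete, here is a toy NetworkX version of the feature graph. The feature labels and edge weights are invented for illustration (loosely echoing the example below), not outputs of the method.

```python
import networkx as nx

# Toy feature graph: nodes are SAE features (hypothetical labels),
# edges carry a dependency strength estimated from co-activation statistics.
G = nx.Graph()
G.add_nodes_from([
    (0, {"label": "mentions of disasters"}),
    (1, {"label": "political commentary"}),
    (2, {"label": "Fox News channel mentions"}),
])
G.add_weighted_edges_from([
    (0, 2, 0.42),  # invented dependency strengths
    (1, 2, 0.17),
])

# Graph structure: a feature's neighbors show what the model associates it with.
for nbr in G.neighbors(0):
    print(G.nodes[nbr]["label"], G[0][nbr]["weight"])
```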
What It Finds
Example discovered connection: catastrophic events/overwhelming situations ↔ Fox News channel mentions
This suggests the model has learned to associate disaster-related content with specific news sources, a relationship that wouldn't show up in traditional feature analysis.
Technical Approach
- Built on Sparse Autoencoders (SAEs) trained on transformer activations
- Graph construction via precision matrix estimation, which captures conditional dependencies between features (see the sketch after this list)
- Subgraph clustering to identify semantic communities
- Validation through controlled activation experiments
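A minimal sketch of that pipeline, assuming SAE activations are available as a (samples × features) matrix. The synthetic chain-structured data, the GraphicalLasso regularization strength, and the edge threshold are placeholders, not the settings used in the actual experiments.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community
from sklearn.covariance import GraphicalLasso

# Placeholder for SAE feature activations (samples x features). The chain
# dependence here is synthetic, just so the example produces edges; in the
# real pipeline these are activations collected from the transformer.
rng = np.random.default_rng(0)
n_samples, n_features = 5_000, 100
activations = rng.normal(size=(n_samples, n_features))
for j in range(1, n_features):
    activations[:, j] += 0.5 * activations[:, j - 1]

# Sparse precision (inverse covariance) matrix: nonzero off-diagonal entries
# are conditional dependencies between features, i.e. edges in the graph.
precision = GraphicalLasso(alpha=0.05).fit(activations).precision_

# Build the feature graph from the off-diagonal nonzeros.
G = nx.Graph()
G.add_nodes_from(range(n_features))
rows, cols = np.nonzero(np.triu(np.abs(precision) > 1e-6, k=1))
G.add_weighted_edges_from(
    (i, j, abs(precision[i, j])) for i, j in zip(rows.tolist(), cols.tolist())
)

# Subgraph clustering: modularity-based communities as candidate semantic clusters.
clusters = community.greedy_modularity_communities(G, weight="weight")
print(f"{G.number_of_edges()} edges, {len(clusters)} communities")
```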
Computational Challenges & Solutions
- Challenge: Precision matrix estimation scales as O(n³) in the number of features
- Approach: Subgraph decomposition + merging strategies (sketched after this list)
- Alternative: Lighter approximation methods for edge detection
- Target: Scale to 100k+ features while maintaining interpretability
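One way to avoid the cubic cost, as a rough sketch: partition the features into blocks, estimate a sparse precision matrix per block, and union the resulting edges. The block size, regularization strength, and the omission of cross-block edges (which the merging step would have to recover) are all simplifications of the approach described above.

```python
import numpy as np
import networkx as nx
from sklearn.covariance import GraphicalLasso

def blockwise_feature_graph(activations: np.ndarray, block_size: int = 1_000,
                            alpha: float = 0.05) -> nx.Graph:
    """Approximate the feature graph by running GraphicalLasso per block of features.

    Each block costs O(block_size**3) rather than O(n_features**3). Cross-block
    edges are dropped in this sketch; a merging step would recover them, e.g. by
    screening high-correlation pairs across blocks and re-estimating locally.
    """
    n_features = activations.shape[1]
    G = nx.Graph()
    G.add_nodes_from(range(n_features))
    for start in range(0, n_features, block_size):
        idx = np.arange(start, min(start + block_size, n_features))
        precision = GraphicalLasso(alpha=alpha).fit(activations[:, idx]).precision_
        rows, cols = np.nonzero(np.triu(np.abs(precision) > 1e-6, k=1))
        G.add_weighted_edges_from(
            (int(idx[i]), int(idx[j]), abs(precision[i, j]))
            for i, j in zip(rows, cols)
        )
    return G
```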
Why This Matters
- Interpretability: Understand how models organize knowledge beyond individual features
- Safety: Identify unexpected concept associations that could cause problems, and give us another angle to probe deceptive behavior
- Model improvement: Graph structure could inform better architectures
Status
- Core method tested on smaller models; scaling work in progress
- Awaiting compute funding (OpenPhil application pending) for large-scale experiments
- Exploring computational optimizations for scaling
Tech Stack
- PyTorch, HuggingFace Transformers
- NetworkX, scikit-learn
- Custom CUDA kernels for graph operations (planned)
What I Learned
- Graph methods reveal structure that linear approaches don't see
- Computational efficiency matters more than theoretical elegance for real applications
- The intersection of interpretability and graph theory is promising