Graphical Activation Probes

TL;DR

Links

The Problem

Modern language models learn thousands of internal "features" that activate on different concepts. Current interpretability methods largely treat these features as independent, but that assumption often fails: features interact in complex ways, and those interactions reveal how the model actually thinks.

The Approach

Instead of analyzing features in isolation, I model their co-activation patterns as a graph: features are nodes, and edges connect features that tend to activate together on the same inputs.
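
A minimal sketch of that construction, assuming feature activations are available as a tokens × features matrix (e.g. from a sparse autoencoder run over a corpus); the names and the activation threshold here are illustrative, not the project's actual pipeline:

```python
import numpy as np

def coactivation_graph(activations: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Return a (features x features) matrix of co-activation counts.

    activations: (n_tokens, n_features) array of feature activations.
    A feature counts as "active" on a token when its value exceeds `threshold`.
    """
    active = (activations > threshold).astype(np.float32)  # binarize: which features fire on which tokens
    counts = active.T @ active                              # count tokens where each pair fires together
    np.fill_diagonal(counts, 0.0)                            # drop self-edges
    return counts

# Toy usage: 5 tokens, 3 features.
acts = np.array([
    [1.2, 0.0, 0.3],
    [0.9, 0.0, 0.0],
    [0.0, 2.1, 1.5],
    [0.0, 1.8, 0.0],
    [0.7, 0.0, 0.4],
])
print(coactivation_graph(acts))
```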

Key insight: The graph topology tells us which concepts the model has learned to associate, even when those associations aren't obvious from the features themselves.
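
One way to read that topology, sketched here with networkx (an assumption; the post doesn't name its tooling): rank the strongest edges and cluster features into communities of mutually co-activating features, which is where unexpected concept associations tend to surface.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def strongest_edges(weights: np.ndarray, top_k: int = 10):
    """Return the top_k feature pairs with the highest co-activation weight."""
    g = nx.from_numpy_array(weights)
    edges = sorted(g.edges(data="weight"), key=lambda e: e[2], reverse=True)
    return edges[:top_k]

def feature_communities(weights: np.ndarray):
    """Group features into communities of mutually co-activating features."""
    g = nx.from_numpy_array(weights)
    return [sorted(c) for c in greedy_modularity_communities(g, weight="weight")]

# Toy usage with a hand-made 4-feature weight matrix.
w = np.array([
    [0, 5, 0, 0],
    [5, 0, 1, 0],
    [0, 1, 0, 4],
    [0, 0, 4, 0],
], dtype=float)
print(strongest_edges(w, top_k=2))
print(feature_communities(w))
```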

What It Finds

Example discovered connection: catastrophic events/overwhelming situations ↔ Fox News channel mentions

This suggests the model has learned to associate disaster-related content with specific news sources, a relationship that wouldn't show up in traditional feature analysis.
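
To make "wouldn't show up in traditional feature analysis" concrete, one illustrative scoring rule (my assumption, not necessarily the project's) is pointwise mutual information: an edge is interesting when two features co-activate far more often than their individual activation rates would predict.

```python
import numpy as np

def pmi_edges(activations: np.ndarray, threshold: float = 0.0, eps: float = 1e-9) -> np.ndarray:
    """Return a (features x features) matrix of PMI scores for feature pairs."""
    active = (activations > threshold).astype(np.float64)
    n = active.shape[0]
    p_joint = (active.T @ active) / n            # P(feature i AND feature j active)
    p_marginal = active.mean(axis=0)             # P(feature i active)
    expected = np.outer(p_marginal, p_marginal)  # baseline if features were independent
    pmi = np.log((p_joint + eps) / (expected + eps))
    np.fill_diagonal(pmi, 0.0)
    return pmi
```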

Technical Approach

Computational Challenges & Solutions

Why This Matters

Status

Tech Stack

What I Learned

Posted: 2025-08-06
