Research Notes: Training Sequence to Read Models

Written: 2024-12-03 | Last updated: 2025-09-29

Confidence: I've reviewed this several times, but I may go back and update details or add clarification.

One key question when working with genomic sequencing data is determining which sequence features drive biological behavior. The basic idea is this: in many sequencing experiments, we have some sort of known underlying genome that a cell has, plus a readout of some other biological process from a sequencing experiment. Our goal is to figure out how the sequence drives that experimental readout. A contemporary approach to tackling this question is what I will call sequence to read models, which are typically neural networks constructed as follows:

Training: Genomic Sequence → Model → Reads (Experimental Data)
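
To make that diagram concrete, here is a minimal PyTorch sketch of the training direction: a one-hot encoded window goes in, and the model is trained to reproduce per-position, per-strand read counts. The window size, architecture, and Poisson loss here are illustrative placeholders rather than the exact setup I used.

    import torch
    import torch.nn as nn

    class TinyS2R(nn.Module):
        """Toy sequence-to-read model: one-hot DNA in, per-position read counts out."""
        def __init__(self, channels=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv1d(4, channels, kernel_size=25, padding=12),
                nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=25, padding=12),
                nn.ReLU(),
            )
            # Two output tracks, e.g. plus- and minus-strand coverage.
            self.head = nn.Conv1d(channels, 2, kernel_size=1)

        def forward(self, x):                 # x: (batch, 4, window) one-hot sequence
            return self.head(self.body(x))    # (batch, 2, window) predicted log-rates

    model = TinyS2R()
    seq = nn.functional.one_hot(torch.randint(0, 4, (8, 4096)), 4).permute(0, 2, 1).float()
    reads = torch.rand(8, 2, 4096) * 10       # stand-in for observed stranded coverage

    # Counts are non-negative and spiky, so a Poisson-style loss is a common default.
    loss = nn.PoissonNLLLoss(log_input=True)(model(seq), reads)
    loss.backward()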

Once the model is trained, some interpretability technique is then applied to it:

Training: Genomic Sequence → Model → Reads (Experimental Data)
Interpret: Genomic Sequence ← Model ← Reads (Experimental Data)

Running the model backwards with attribution approaches can be done in a few ways:
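
One common member of that family, shown purely as a generic illustration rather than the specific set of approaches I used, is plain input-gradient (gradient x input) saliency. A minimal sketch, reusing the toy model from the earlier sketch:

    import torch

    # Assumes `model` and a one-hot `seq` tensor shaped like the earlier sketch.
    seq = seq.clone().requires_grad_(True)

    pred = model(seq)                        # (batch, 2, window)
    score = pred[:, :, 1800:2300].sum()      # hypothetical region of interest
    score.backward()

    # Gradient x input collapses the one-hot channels into a per-base attribution.
    attribution = (seq.grad * seq).sum(dim=1)   # (batch, window)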

Here, I'll cover my thoughts on training these sorts of models based on the work that I did at the end of my PhD, where I developed modeling approaches for transcriptional sequencing data.

If you're looking for a more academic treatment of this field, there is excellent work by Zeitlinger and Sokolova. I think the canonical paper here is from Avsec. Much of this literature calls these models Sequence to Function (S2F), but I prefer to stay grounded in the actual reads that an experiment produces, as opposed to a more abstract notion of function. Hence, S2R.

The Basic Model

The basic notion behind an S2R model is to take genomic sequence data, run it through a model, and train it to predict some output assay. As with many ML tasks, this architecture essentially boils down to four distinct components:

  1. Data representation
  2. A spatially aware model
  3. Appropriate targets (loss functions) for describing the problem
  4. Interpretability tools

Data Representation

The first thing you'll have to decide when building this model is what your input data is going to look like, both in terms of the size of the inputs and other representational details.

With genomic sequencing models:

Data structure guides design:
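
As a baseline for those representational choices, the most common starting point in this field is one-hot encoding each DNA window. A minimal sketch (my generic illustration, with ambiguous bases such as 'N' left as all-zero columns):

    import numpy as np

    def one_hot_encode(seq: str) -> np.ndarray:
        """One-hot encode a DNA string into a (4, length) array (rows = A, C, G, T)."""
        lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
        out = np.zeros((4, len(seq)), dtype=np.float32)
        for i, base in enumerate(seq.upper()):
            if base in lookup:                # anything else (e.g. 'N') stays all-zero
                out[lookup[base], i] = 1.0
        return out

    window = one_hot_encode("ACGTN" * 20)     # toy 100 bp window
    print(window.shape)                       # (4, 100)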

Spatial Awareness

For nascent assays like GRO-seq/PRO-seq, most of the interesting signal is local: promoter-proximal pausing, bidirectional initiation, TF-associated motifs, and other short-range features. You don’t need megabase contexts to capture that. Windows on the order of a few kb around transcription sites work well, and the model’s “sense of space” should focus on sub-kb to kb-scale structure.
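
For concreteness, a sketch of pulling fixed kb-scale windows centered on transcription sites; the dict-of-strings genome and the 2 kb half-width are placeholder choices for illustration:

    def extract_window(genome: dict, chrom: str, site: int, half_width: int = 2048) -> str:
        """Pull a fixed-size window centered on a transcription site.

        `genome` is assumed to be a dict mapping chromosome name -> sequence string
        (however you load it); windows running off a chromosome end are skipped.
        """
        start, end = site - half_width, site + half_width
        if start < 0 or end > len(genome[chrom]):
            return ""
        return genome[chrom][start:end]

    # Toy example: a 4 kb window around a hypothetical site on a fake chromosome.
    genome = {"chrT": "ACGT" * 5000}          # 20 kb of placeholder sequence
    window = extract_window(genome, "chrT", site=10000)
    print(len(window))                        # 4096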

Architecturally:

Because I used a BERT-style setup, masking strategy was the big thing to tweak:
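
As a generic reference point, here is what the textbook BERT-style masking recipe looks like applied to one-hot DNA; the 15% rate and simple zero-out masking are standard defaults, not necessarily what I settled on:

    import torch

    def mask_sequence(onehot: torch.Tensor, mask_rate: float = 0.15):
        """BERT-style masking for a (batch, 4, length) one-hot sequence tensor.

        Masked positions are zeroed out (a simple stand-in for a [MASK] token).
        Returns the corrupted input plus a boolean mask of which positions the
        reconstruction loss should be computed on.
        """
        batch, _, length = onehot.shape
        mask = torch.rand(batch, length) < mask_rate          # True = masked position
        corrupted = onehot.clone()
        corrupted[mask.unsqueeze(1).expand_as(onehot)] = 0.0
        return corrupted, mask

    x = torch.nn.functional.one_hot(torch.randint(0, 4, (2, 4096)), 4).permute(0, 2, 1).float()
    corrupted, mask = mask_sequence(x)
    print(mask.float().mean())                # ~0.15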

Appropriate Targets

Nascent reads (particularly -seq vs -cap) are stranded and spiky. You generally care about the shape (profile across positions, per strand) and the scale (how much total signal).

What worked best in practice:
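
One common way to encode that shape/scale split, sketched here as a generic decomposition from this literature rather than my exact recipe, is a multinomial-style loss over each strand's profile plus a regression loss on log total counts:

    import torch
    import torch.nn.functional as F

    def profile_scale_loss(profile_logits, pred_log_total, obs_counts, scale_weight=1.0):
        """Shape + scale loss for stranded count tracks.

        profile_logits: (batch, 2, length) per-strand profile logits
        pred_log_total: (batch, 2) predicted log total counts per strand
        obs_counts:     (batch, 2, length) observed read counts
        """
        # Shape: cross-entropy between the observed count distribution and the
        # predicted profile, computed independently per strand.
        log_profile = F.log_softmax(profile_logits, dim=-1)
        obs_dist = obs_counts / obs_counts.sum(dim=-1, keepdim=True).clamp(min=1.0)
        shape_loss = -(obs_dist * log_profile).sum(dim=-1).mean()

        # Scale: MSE on log1p of the total counts per strand.
        obs_log_total = torch.log1p(obs_counts.sum(dim=-1))
        scale_loss = F.mse_loss(pred_log_total, obs_log_total)

        return shape_loss + scale_weight * scale_loss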

Training the Model

This setup is very doable on a single GPU for kb-scale windows:
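
For a rough sense of scale, a bare-bones loop like the following fits comfortably on one GPU; every hyperparameter here (batch size, learning rate, epoch count) is a placeholder rather than a tuned value:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder dataset: 1,000 windows of one-hot sequence plus stranded coverage.
    seqs = torch.nn.functional.one_hot(torch.randint(0, 4, (1000, 4096)), 4).permute(0, 2, 1).float()
    reads = torch.rand(1000, 2, 4096)
    loader = DataLoader(TensorDataset(seqs, reads), batch_size=32, shuffle=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyS2R().to(device)              # the toy model from the earlier sketch
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(10):                   # placeholder epoch count
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = torch.nn.PoissonNLLLoss(log_input=True)(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()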

A quirk I consistently saw with transformer S2R models: a two-phase loss curve. Early drop, a long slow plateau, then a second improvement phase. My read is that the model first nails some core aspects of reconstruction, then eventually adjusts to capture more subtle regulatory context. I didn’t see the same dynamics with CNNs.

Multi-modal encoder details:

Interpretability Tools

With nascent data, you care about small-scale features: broadly, transcriptional regulatory elements and TF binding motifs, plus sequence biases. The usual interpretability stack holds up, with a couple of tweaks:

One practical constraint: internal embeddings from a reasonably wide model can be huge, and you'll likely have a lot of windows. t-SNE/UMAP on large embeddings is slow. This is where a sparse autoencoder (SAE) comes in handy.

The SAE Approach

To get compact, interpretable features out of giant embeddings, I train a sparse autoencoder (linear → ReLU → linear with L1, standard layout) on one chosen layer’s embeddings or end-to-end. The goal is a dictionary of more abstract features that the model uses across regions.
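
In code, that layout is just the following; the embedding and dictionary sizes are placeholders:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Linear -> ReLU -> linear, with an L1 penalty on the hidden code."""
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            code = torch.relu(self.encoder(x))
            return self.decoder(code), code

    def sae_loss(x, recon, code, l1_weight=1e-3):
        # Reconstruction error plus sparsity pressure on the dictionary activations.
        return nn.functional.mse_loss(recon, x) + l1_weight * code.abs().mean()

    sae = SparseAutoencoder(d_model=768, d_hidden=4096)   # placeholder sizes
    emb = torch.randn(256, 768)                           # a batch of layer embeddings
    recon, code = sae(emb)
    loss = sae_loss(emb, recon, code)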

Training notes:

For multi-cell-type training, a nice emergent property is that many SAE neurons end up cell-type-enriched. That gives you a clean way to map:
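
For example, you can score each SAE neuron by how concentrated its activation is in one cell type versus the rest; the arrays and cell-type labels below are hypothetical stand-ins for real experiment metadata:

    import numpy as np

    # Hypothetical inputs: SAE codes for many windows, plus each window's cell type.
    codes = np.abs(np.random.randn(10000, 4096))          # (windows, sae_neurons)
    cell_types = np.random.choice(["K562", "HCT116", "MCF7"], size=10000)

    def cell_type_enrichment(codes, cell_types):
        """Mean activation per cell type, normalized so each neuron's scores sum to 1."""
        labels = sorted(set(cell_types))
        means = np.stack([codes[cell_types == c].mean(axis=0) for c in labels])
        return labels, means / (means.sum(axis=0, keepdims=True) + 1e-9)

    labels, enrichment = cell_type_enrichment(codes, cell_types)
    # A neuron whose enrichment is ~1.0 for one row is effectively cell-type-specific.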

Issues With Doing This

A few caveats specific to nascent assays and small, transcription-centered windows:

Models I've Tested

The goal of this work was to handle heterogeneity and expose cell-type-specific features in nascent transcription, not just to maximize a single ChIP-like objective.

What didn’t pan out reliably:

What did work:

Training was straightforward on a single A100 for all of the architectures tested, with good convergence on the order of 1M+ regions seen (duplicated across epochs; a typical experiment has 70-100k regions, so this works out to roughly 10-15 epochs).

Open Questions
