Confidence: I've reviewed this several times, but I may go back and update details or add clarification.
One key question when working with genomic sequencing data is determining which sequence features drive biological behavior. The basic idea is this: in many sequencing experiments, we have some known underlying genome for a cell, plus a readout of some other biological process from a sequencing experiment. Our goal is to figure out how the sequence drives that experimental readout. A contemporary approach to tackling this question is what I will call sequence to read models, which are typically neural networks constructed as follows:
Training: Genomic Sequence → Model → Reads (Experimental Data)
Using that model, some interpretability technique is then applied to the model:
Training: Genomic Sequence → Model → Reads (Experimental Data)
Interpret: Genomic Sequence ← Model ← Reads (Experimental Data)
Using the model backwards with attribution approaches is done in a few ways:
- Attributions based on experimental data
- Attributions on perturbations of experimental data
- Variations based on known underlying biology for validation
- Random variations to probe novel behavior learned by the model
- Clustering of attributions (either of the above) to learn novel motifs (common patterns) in the sequence
Here, I'll cover my thoughts on training these sorts of models based on the work that I did at the end of my PhD, where I developed modeling approaches for transcriptional sequencing data.
If you're looking for a more academic treatment of this field, there is excellent work by Zeitlinger and Sokolova. I think the canonical paper here is from Avsec. Much of this literature calls these models Sequence to Function (S2F), but I prefer to stay grounded in the actual reads that an experiment produces as opposed to a more abstract notion of function. Hence, S2R.
The Basic Model
The basic notion behind an S2R model is to take some sort of genomic sequence data, run it through a model, and train it to predict some sort of output assay. As with many ML tasks, this sort of architecture essentially boils down to 4 distinct components:
- Data representation
- A spatially aware model
- Appropriate targets (loss functions) for describing the problem
- Interpretability tools
Data Representation
The first thing that you'll have to decide when you're building this model is what your data input is going to look like, both in terms of size of inputs as well as other representational details.
With genomic sequencing models:
- Data is usually 1-dimensional - the genome has positions on each chromosome that are sequential to each other, and commonly used sequencing protocols measure a count of some event (say, how many times we saw RNA matching that position in RNA-seq).
- Base pairs are a categorical track of the underlying genome (A/T/C/G)
- Protocols (usually) yield a numerical count for each track.
Data structure guides design:
- How big are the features you're trying to describe in your genomic data, and how much do long range details matter?
- On the smaller scale for something like transcriptional regulation, a few thousand base pairs might capture the behavior you're interested in. This is where much of the S2R research so far has happened.
- On a larger scale with something like 3D regulatory interactions, you'll need somewhere in the range of 10k-10m features, and models become larger and more difficult to train.
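To make the representation concrete, here is a minimal sketch of how one window might be encoded. The 6-channel layout (4 one-hot base channels plus two stranded count channels) and the log1p scaling are illustrative choices on my part, not the only option:

```python
import numpy as np

BASES = "ACGT"
BASE_IDX = {b: i for i, b in enumerate(BASES)}

def encode_window(seq, counts_plus, counts_minus):
    """Encode one genomic window as model input.

    seq: string of A/C/G/T (unknown bases become an all-zero column)
    counts_plus/counts_minus: per-position read counts, one per strand
    Returns a (6, L) float array: 4 one-hot base channels + 2 count channels.
    """
    L = len(seq)
    x = np.zeros((6, L), dtype=np.float32)
    for pos, base in enumerate(seq):
        i = BASE_IDX.get(base)
        if i is not None:
            x[i, pos] = 1.0
    # log1p keeps spiky nascent counts on a trainable scale
    x[4] = np.log1p(np.asarray(counts_plus, dtype=np.float32))
    x[5] = np.log1p(np.asarray(counts_minus, dtype=np.float32))
    return x
```

Whether you fuse the count channels into the input like this or keep them in a separate encoder is an architectural choice I return to below.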
Spatial Awareness
For nascent assays like GRO/PRO, most of the interesting signal is local: promoter-proximal pausing, bidirectional initiation, TF associated motifs, and other short-range features. You don’t need megabase contexts to capture that. Windows on the order of a few kb around transcription sites work well, and the model’s “sense of space” should focus on sub-kb to kb-scale structure.
Architecturally:
- Convs handle motif-scale structure nicely and are compute-friendly at this window size. They’re a solid baseline.
- In my runs, a transformer did better once I fused sequence and read channels. It captured motif–context interactions and strand-asymmetric patterns without me hard-coding them. Positional encodings matter: I prefer RoPE over classic sinusoids here; it behaved better in attributions (less periodic artifact leakage) and scales more gracefully if you extend context later.
Because I used a BERT-style setup, masking strategy was the big thing to tweak:
- Random masking beats sequential masking for this domain. The genome isn’t a language with “next token” semantics. A mix of token and short span masking (15–40%) worked well for letting the model learn both local motifs and slightly longer context like motif spacing and the pausing signature downstream of the TSS.
- If you corrupt bases, sample replacements from a realistic background (genome-wide or dataset-matched) rather than uniform A/C/G/T. Otherwise the model learns to detect corruption noise.
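A minimal sketch of the span-masking-with-background idea, assuming the sequence is already integer-encoded (0-3). The span-length cap and the exact sampling loop are illustrative, not tuned values:

```python
import numpy as np

def mask_spans(seq_idx, mask_frac=0.25, max_span=8, background=None, rng=None):
    """BERT-style corruption for an integer-encoded base sequence (0..3).

    Masks a mix of single tokens and short spans until roughly mask_frac
    of positions are selected; replacements are drawn from `background`,
    a length-4 probability vector (e.g. genome-wide base frequencies),
    rather than uniform A/C/G/T.
    Returns (corrupted_seq, mask) where mask marks positions to reconstruct.
    """
    rng = rng or np.random.default_rng()
    L = len(seq_idx)
    if background is None:
        background = np.full(4, 0.25)
    mask = np.zeros(L, dtype=bool)
    target = int(mask_frac * L)
    while mask.sum() < target:
        span = rng.integers(1, max_span + 1)       # span of 1 = token masking
        start = rng.integers(0, max(L - span, 1))
        mask[start:start + span] = True
    corrupted = seq_idx.copy()
    corrupted[mask] = rng.choice(4, size=mask.sum(), p=background)
    return corrupted, mask
```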
Appropriate Targets
Nascent reads (particularly -seq vs -cap) are stranded and spiky. You generally care about the shape (profile across positions, per strand) and the scale (how much total signal).
What worked best in practice:
- Dual-head objectives: predict a normalized per-position profile (per strand) for experimental data, plus a separate KL divergence term for base pair composition. Concretely, a softmax over positions per strand works well for shape at the cost of losing information on relative scale, while pure RMS loss is more robust to scale but less transferable across samples.
- Generalization across labs/samples: consider training on normalized depth (e.g., per-sample scaling or TPM-like) at the cost of absolute quantitation.
- Balancing multi-task: Hand-tuned λ for profile vs scale works, but I’ve had more stable training with gradient-based balancing (e.g. GradNorm-style). Then your λ parameter acts like a “relative priority” knob rather than a magic constant tied to the architecture.
- Be careful with auxiliary labels like cell type: if you feed them in and don’t mask aggressively, the model can just route information through them instead of learning the sequence–assay mapping.
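As a sketch of the shape/scale split (the heads described above also include a KL term on base composition, which I omit here), the profile head can be scored with a cross-entropy against the normalized observed profile while a separate head regresses log total signal:

```python
import numpy as np

def profile_loss(pred_logits, obs_counts, eps=1e-8):
    """Shape loss for one strand: cross-entropy between the observed
    count profile (normalized to a distribution over positions) and a
    softmax over predicted per-position logits. Scale-free by design."""
    z = pred_logits - pred_logits.max()
    log_p = z - np.log(np.exp(z).sum())        # stable log-softmax
    q = obs_counts / (obs_counts.sum() + eps)  # observed profile as distribution
    return -(q * log_p).sum()

def scale_loss(pred_log_total, obs_counts):
    """Scale loss: squared error on log1p of total signal."""
    return (pred_log_total - np.log1p(obs_counts.sum())) ** 2

def dual_head_loss(pred_logits, pred_log_total, obs_counts, lam=1.0):
    """Combined objective; lam trades off profile shape vs total scale."""
    return profile_loss(pred_logits, obs_counts) + lam * scale_loss(pred_log_total, obs_counts)
```

The fixed `lam` here is the hand-tuned λ mentioned above; a GradNorm-style balancer would replace it with a learned weighting.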
Training the Model
This setup is very doable on a single GPU for kb-scale windows:
- Precompute and cache fixed windows centered on candidate regions, with both sequence and experimental profiles (per strand). Use a chunked, memory-mapped format (HDF5, etc.) to keep I/O from becoming your limiter.
- Shuffle across chromosomes and cell types so minibatches have enough variety of data.
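A dependency-free sketch of the caching idea, using a raw np.memmap in place of HDF5; the file layout and helper names are my own for illustration:

```python
import numpy as np

def cache_windows(path, windows):
    """Write precomputed (C, L) windows to one memory-mapped file so
    training reads are cheap seeks instead of re-extracting from BAMs."""
    arr = np.stack(windows).astype(np.float32)
    mm = np.memmap(path, dtype=np.float32, mode="w+", shape=arr.shape)
    mm[:] = arr
    mm.flush()
    return arr.shape  # keep the shape alongside the file to reopen it

def iter_minibatches(path, shape, batch_size, rng=None):
    """Yield shuffled minibatches from the cache. Permuting the global
    index mixes chromosomes and cell types within each batch."""
    rng = rng or np.random.default_rng()
    mm = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
    order = rng.permutation(shape[0])
    for start in range(0, shape[0], batch_size):
        yield mm[order[start:start + batch_size]].copy()
```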
A quirk I consistently saw with transformer S2R models: a two-phase loss curve. Early drop, a long slow plateau, then a second improvement phase. My read is that the model first nails some core aspects of reconstruction, then eventually adjusts to capture more subtle regulatory context. I didn’t see the same dynamics with CNNs.
Multi-modal encoder details:
- I used separate encoders for sequence and read channels (lightweight conv for reads, small conv or linear for sequence), fused them, then ran a shared transformer with BERT-style masking. That outperformed a single “everything in one token stream” baseline in terms of both loss and interpretability. You can't use traditional dictionary style embeddings easily with downstream interpretability tools.
- For cell type, a simple one-hot embedded via a linear layer and injected at the input (or after the first attention block) appeared to be useful. Random masking helps prevent trivial label passthrough, but more work is needed to validate that.
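Here is a toy version of that layout in PyTorch. Widths, kernel sizes, and the additive fusion are illustrative, and positional encoding (I used RoPE) is omitted for brevity:

```python
import torch
import torch.nn as nn

class DualEncoderS2R(nn.Module):
    """Sketch of the dual-encoder layout described above: separate light
    encoders for sequence and read channels, fused by addition, then a
    shared transformer trunk. Sizes are illustrative, not tuned values."""
    def __init__(self, d_model=64, n_layers=2, n_celltypes=0):
        super().__init__()
        self.seq_enc = nn.Conv1d(4, d_model, kernel_size=7, padding=3)
        self.read_enc = nn.Conv1d(2, d_model, kernel_size=7, padding=3)
        self.cell_emb = nn.Linear(n_celltypes, d_model) if n_celltypes else None
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=128,
                                           batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)  # reconstruct both read strands

    def forward(self, seq, reads, cell_onehot=None):
        # seq: (B, 4, L) one-hot; reads: (B, 2, L) per-strand counts
        x = self.seq_enc(seq) + self.read_enc(reads)   # fuse modalities
        x = x.transpose(1, 2)                          # (B, L, d_model)
        if self.cell_emb is not None and cell_onehot is not None:
            # inject cell type at the input, broadcast across positions
            x = x + self.cell_emb(cell_onehot).unsqueeze(1)
        return self.head(self.trunk(x))                # (B, L, 2)
```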
Interpretability Tools
With nascent data, you care about small scale features - broadly transcriptional regulatory elements and TF binding motifs, plus sequence biases. The usual interpretability stack holds up, with a couple tweaks:
- Gradient-based attribution (Integrated Gradients, DeepSHAP) per strand lets you localize which bases drive the profile shape. Aggregation helps to pull out apparently meaningful patterns.
- Clustering matters: Cluster either the attribution maps or internal embeddings first, then run motif discovery on each cluster. That’s how you surface the long tail of context-specific rules.
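A bare-bones version of per-strand attribution using plain integrated gradients (in practice I'd reach for a library implementation such as Captum's). The all-zero baseline and the scalar target (total predicted signal on one strand) are assumptions made for the sketch:

```python
import torch

def integrated_gradients(model, seq, reads, strand=0, steps=32):
    """Integrated gradients on the one-hot sequence input for one strand's
    predicted total signal. `model` is any S2R model taking (seq, reads)
    and returning (B, L, 2) per-position, per-strand predictions."""
    seq = seq.detach()
    baseline = torch.zeros_like(seq)  # all-zero sequence as reference
    total = torch.zeros_like(seq)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # interpolate along the straight path from baseline to input
        x = (baseline + alpha * (seq - baseline)).requires_grad_(True)
        out = model(x, reads)              # (B, L, 2)
        out[..., strand].sum().backward()  # scalar target: strand total
        total += x.grad
    # average path gradient times the input difference
    return (seq - baseline) * total / steps
```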
One practical constraint: internal embeddings from a reasonably wide model can be huge, and you'll likely have a lot of windows. t-SNE/UMAP on large embeddings is slow. This is where using an SAE (sparse autoencoder) comes in handy.
The SAE Approach
To get compact, interpretable features out of giant embeddings, I train a sparse autoencoder (linear → ReLU → linear with L1, standard layout) on one chosen layer’s embeddings or end-to-end. The goal is a dictionary of more abstract features that the model uses across regions.
Training notes:
- You want sparsity without feature collapse. Sweep sizes (e.g., 256 → 16k neurons) and sparsity strengths. Properly reinitialize dead neurons and consider a target sparsity schedule to try to minimize feature splitting.
- In practice, per-neuron activation distributions are roughly bell-shaped for a lot of neurons (or half-bell with e.g. ReLU). I calculate a z-score per neuron, pick “top examples” at, say, +3σ, and run attributions + TF-MoDISco on those subsets. If a neuron is heavy-tailed, switch to quantile thresholds.
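A minimal version of that SAE layout plus the z-score selection rule; the sizes, L1 weight, and helper names are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """linear -> ReLU -> linear with an L1 penalty on the hidden code,
    matching the standard layout described above."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        code = torch.relu(self.enc(x))
        return self.dec(code), code

def sae_loss(x, recon, code, l1=1e-3):
    """Reconstruction error plus L1 sparsity pressure on the code."""
    return torch.mean((x - recon) ** 2) + l1 * code.abs().mean()

def top_examples(codes, neuron, z=3.0):
    """Indices of examples activating `neuron` above +z sigma: the
    selection rule used before running attributions / TF-MoDISco."""
    a = codes[:, neuron]
    return torch.nonzero(a > a.mean() + z * a.std()).flatten()
```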
For multi-cell-type training, a nice emergent property is that many SAE neurons end up cell-type-enriched. That gives you a clean way to map:
- Neurons capturing general promoter logic (core promoter motifs, GC/CpG patterns).
- Neurons capturing context-specific TF motifs or composite motifs enriched in one cell type. Cell-type specificity is mostly speculative at this point - I can go through and manually validate a subset of TFs as appearing to be cell-type specific, but wet-lab validation is needed for the much larger set of speculative cell-type specific TFs discovered.
Issues With Doing This
A few caveats specific to nascent assays and small transcription centered windows:
- Positional artifacts vs biology: sinusoidal positional encodings can leak periodic structure that TF-MoDISco will happily “discover.” RoPE helped, but better validation is needed to make sure TF-MoDISco isn't learning data biases instead of biology.
- Sequence background bias: Transcription-centered windows have biased base composition. They’re enriched for local motifs and for the local GC composition of transcriptionally active regions. Controlling against this in some way is important.
- Literature confirmation bias: many TFs discovered make a lot of sense - breast cancer markers in breast cancer cells. Many are not, so we should treat motif matches as hypotheses for validation, not conclusions.
Models I've Tested
The goal of this work was to handle heterogeneity and expose cell-type-specific features in nascent transcription, not just maximize a single ChIP-like objective.
What didn’t pan out reliably:
- BPNet-style CNNs (implicit/explicit cell type): sometimes good, but across diverse inputs the runs were less reproducible for me. Loss curves had odd non-stationarities; adding explicit cell type seemed to destabilize things further. Could be interaction with repeated windows across samples; either way, not robust enough for my taste.
- Siamese/contrastive sequence↔read models: conceptually great (CLIP-like), and when they converge, the embeddings separate by cell type beautifully. But convergence was fickle—only a small fraction of runs hit a good basin. Batch composition, negative sampling, and temperature are touchy. Worth revisiting, but not the most recruiter-friendly story.
What did work:
- Multi-modal dual-encoder + BERT-style masked autoencoder. Separate encoders for sequence and reads, fuse, then mask-and-reconstruct. Masking 15–40% (mix of tokens/spans) was a good default.
- Optional explicit cell-type conditioning (one-hot through a small projection added at input or after the first attention block). I didn’t supervise on cell type in most of my work, and this needs more validation but looks good in initial tests.
Training was straightforward on a single A100 for all of the architectures tested, with good convergence on the order of 1m+ regions seen (duplicated across epochs; a typical experiment has 70-100k regions).
Open Questions
- Disentangling model priors from biology. Can we standardize a control suite (alternate positional encodings, shuffled baselines, synthetic promoters with known logic) for nascent-S2R models?
- Separating background composition from function: Bias correction helps at the motif discovery phase, but we could use better nulls or controls at the model training level.
- Lab/paper specific effects: Experiments from different labs tend to have signatures related to the lab and/or person preparing the samples. How can we keep models from learning that?
- Improving on motif discovery: TF-MoDISco works well but tends to discover long, evidently meaningless patterns in addition to known motifs. Some other approach tuned to transformer architectures might help?
- Masking for more robustness: Can we design masking schemes that target features we don't want to learn (CpG islands, GC bias, sequence composition) to pressure the model towards interpretable features we care about?