
Deep Learning Models of Regulatory DNA: A Comparison of Model Design Choices

February 20, 2026, 11:00 - 12:00

Fraunhofer HHI, Lanolinfabrik building, Salzufer 15/16, 10587 Berlin

Prof. Anshul Kundaje

Abstract:

Chromatin state and gene expression are tightly regulated by proteins that interpret sequence syntax encoded in regulatory DNA. Genetic variants influencing traits and diseases often disrupt this syntax. Several deep learning models have been developed to decipher regulatory DNA and identify functional variants. Most models use supervised learning to map sequences to cell-specific regulatory activity measured by genome-wide molecular profiling experiments. The general trend in model design is towards larger, multi-task, supervised models with expansive receptive fields. Further, emerging self-supervised DNA language models (DNALMs) promise foundational representations for probing and fine-tuning on limited datasets. However, rigorous evaluations of these models against lightweight alternatives on biologically relevant tasks have been lacking. In this talk, I will demonstrate that lightweight deep learning models with a common architectural backbone can (1) accurately predict diverse types of regulatory and transcriptional profiles, (2) detect and correct experimental biases, (3) robustly reveal underlying causal sequence syntax and its pleiotropy across biochemical and cellular contexts, (4) encode biophysical parameters, and (5) predict effects of sequence perturbations, including common and rare genetic variants. Our models are competitive with larger supervised models and significantly outperform fine-tuned self-supervised DNALMs on diverse downstream tasks. Additionally, we show that all popular multi-task, supervised models learn spurious non-causal predictive features that can impair counterfactual prediction, interpretation, and sequence design. Finally, we systematically interpret ~5,000 lightweight models trained on bulk and single-cell datasets spanning diverse fetal and adult contexts to expose the remarkable complexity and context-specificity of regulatory lexicons, syntax, and variation encoded in the human genome.

© Anshul Kundaje

BIO:

Anshul Kundaje is Associate Professor of Genetics and Computer Science at Stanford University. The Kundaje lab develops machine learning models of gene regulation to decipher the genetic and molecular basis of disease. The lab has pioneered deep learning models and interpretation frameworks to decode the functional language encoded in DNA, RNA, and proteins. Dr. Kundaje has led computational efforts of large genomics consortia, including the ENCODE Project and the Roadmap Epigenomics Project. He is a recipient of the NIH Director's New Innovator Award, the Alfred Sloan Fellowship, and the HUGO Chen Award of Excellence.