Summary of "Variant to Function (V2F) Symposium: Johannes Linder (2025)"
Summary of Johannes Linder’s Talk at the Variant to Function (V2F) Symposium (2025)
Johannes Linder presented the development and application of machine learning models that predict gene regulatory function directly from DNA sequence, with a focus on interpreting genetic variation and its effects on gene regulation. The work was carried out in David Kelley’s group at Calico Life Sciences.
Main Ideas and Concepts
- Goal: Use deep learning models to predict various gene regulatory functions from DNA sequence to better understand the impact of genetic variants, especially non-coding variants, on gene expression and regulation.
- Challenge: The human genome’s regulatory code is complex, involving multiple layers such as chromatin accessibility, transcriptional activation, splicing, polyadenylation, and mRNA stability. Variants can have cell- and tissue-specific effects that are difficult to interpret.
- Approach: Train high-capacity parametric machine learning models on large-scale molecular data to predict genome-wide sequencing coverage profiles (e.g., RNA-seq coverage) from sequence alone, enabling simulation of variant effects.
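The "simulation" idea in the approach above can be sketched as follows: encode the reference and alternate sequences, run both through the same predictor, and compare the predicted coverage. This is a minimal illustration only. `toy_model` is a hypothetical stand-in (per-bin G+C counts) and not the trained network; `one_hot` and `apply_snv` are helper names chosen here.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) array; unknown bases such as N stay all-zero."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def apply_snv(seq: str, pos: int, ref: str, alt: str) -> str:
    """Return the sequence carrying the alternate allele at 0-based position `pos`."""
    if seq[pos].upper() != ref.upper():
        raise ValueError(f"reference mismatch at {pos}: expected {ref}, found {seq[pos]}")
    return seq[:pos] + alt + seq[pos + 1:]

def toy_model(x: np.ndarray, bin_size: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a trained network: G+C count per bin as 'coverage'."""
    gc = x[:, 1] + x[:, 2]
    return gc.reshape(-1, bin_size).sum(axis=1)

ref_seq = "ACGTACGTAAAATTTT"
alt_seq = apply_snv(ref_seq, pos=9, ref="A", alt="G")

# Variant effect = predicted coverage with the alternate allele minus the reference.
delta = toy_model(one_hot(alt_seq)) - toy_model(one_hot(ref_seq))
print(delta)
```

With a real model the same ref/alt comparison applies; only the predictor and the input length change.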
Methodology and Model Details
Borzoi Model
- Purpose: Predicts RNA-seq coverage at fine resolution (32 bp bins) from long genomic input sequences (~500 kb).
- Architecture:
  - Input: long DNA sequence
  - Convolutional layers extract regulatory motifs
  - Subsampling layers reduce sequence length
  - Self-attention layers capture long-range interactions
  - Upsampling layers with skip connections reconstruct fine-resolution coverage
- Training Data: Human and mouse data including RNA-seq, epigenomic data, and CAGE data.
- Output: Multiple one-dimensional coverage profiles representing gene expression and regulatory activity.
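To make the flow of resolutions through such an architecture concrete, here is a toy NumPy forward pass with the same kinds of stages: convolution, subsampling, self-attention, and upsampling with a skip connection. It is a sketch under strong assumptions. Layer counts, channel sizes, the random weights, and the 64 bp input are invented for illustration and do not reflect the actual model's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_relu(x, w):
    """'Same'-padded 1D convolution plus ReLU. x: (length, c_in); w: (kernel, c_in, c_out), odd kernel."""
    k, length = w.shape[0], x.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    windows = np.stack([xp[i:i + length] for i in range(k)], axis=1)  # (length, kernel, c_in)
    return np.maximum(np.einsum("lkc,kcd->ld", windows, w), 0.0)

def max_pool2(x):
    """Subsample: halve the length by max-pooling adjacent positions."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1]).max(axis=1)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over all positions, with a residual connection."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return x + weights @ v

def forward(x, p):
    h1 = conv_relu(x, p["conv1"])                        # (L, C): one position per base
    h2 = conv_relu(max_pool2(h1), p["conv2"])            # (L/2, C)
    h4 = conv_relu(max_pool2(h2), p["conv3"])            # (L/4, C)
    h4 = self_attention(h4, p["wq"], p["wk"], p["wv"])   # long-range mixing at coarse resolution
    u2 = np.repeat(h4, 2, axis=0) + h2                   # upsample, then add the skip connection
    return np.log1p(np.exp(u2 @ p["head"]))              # softplus keeps coverage non-negative

C, TRACKS, L = 8, 3, 64  # illustrative sizes only
params = {
    "conv1": rng.normal(scale=0.3, size=(5, 4, C)),
    "conv2": rng.normal(scale=0.3, size=(5, C, C)),
    "conv3": rng.normal(scale=0.3, size=(5, C, C)),
    "wq": rng.normal(scale=0.3, size=(C, C)),
    "wk": rng.normal(scale=0.3, size=(C, C)),
    "wv": rng.normal(scale=0.3, size=(C, C)),
    "head": rng.normal(scale=0.3, size=(C, TRACKS)),
}
x = np.eye(4)[rng.integers(0, 4, size=L)]  # random one-hot DNA
coverage = forward(x, params)
print(coverage.shape)  # (32, 3): L/2 output bins, one column per track
```

The skip connection lets the output combine fine-grained local features (`h2`) with context gathered by attention at the coarser scale.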
Interpretation Techniques
- Gradient backpropagation to compute sensitivity maps showing how sequence changes affect predicted RNA coverage.
- De novo motif discovery (using TF-MoDISco) on gradient scores recapitulates known tissue-specific transcription factor binding motifs.
- In silico saturation mutagenesis simulates effects of all possible mutations near a variant.
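In silico saturation mutagenesis can be sketched as below: every position is mutated to each of the three alternative bases and the change in a scalar model output is recorded. The scoring function here is a hypothetical stand-in (best match to a made-up GATA-like motif), used only so the example is self-contained.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASE_INDEX[base]] = 1.0
    return x

# Hypothetical motif weights for "GATA" (rows: motif positions; columns: A, C, G, T).
MOTIF = one_hot("GATA")

def toy_score(x):
    """Stand-in for a model's scalar output: best motif match over all windows."""
    k = MOTIF.shape[0]
    return max(float((x[i:i + k] * MOTIF).sum()) for i in range(x.shape[0] - k + 1))

def saturation_mutagenesis(seq, score_fn):
    """Return a (length, 4) array of score changes; reference bases are left at 0."""
    x_ref = one_hot(seq)
    ref_score = score_fn(x_ref)
    effects = np.zeros((len(seq), 4))
    for pos in range(len(seq)):
        for base in range(4):
            if x_ref[pos, base] == 1.0:
                continue  # skip the reference allele
            x_alt = x_ref.copy()
            x_alt[pos] = 0.0
            x_alt[pos, base] = 1.0
            effects[pos, base] = score_fn(x_alt) - ref_score
    return effects

effects = saturation_mutagenesis("CCGATACC", toy_score)
# Most damaging substitution per position; the motif footprint stands out.
print(effects.min(axis=1))
```

With a deep model the loop is the same, but each `score_fn` call is a forward pass, so the mutated sequences are usually batched.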
Applications
- Predict effects of fine-mapped causal eQTLs (expression quantitative trait loci) and polyadenylation QTLs.
- Distinguish causal variants from neutral variants better than previous models.
- Model changes in RNA isoform usage and polyadenylation patterns, not just total gene expression.
- Provide mechanistic insights into how variants alter regulatory motifs and gene expression.
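One way to turn predicted coverage into variant scores for these applications is sketched below: a log fold change of summed coverage over a gene's bins for expression, and a change in distal-site usage for polyadenylation. These particular statistics, the pseudocount, and the bin indices are assumptions chosen for illustration, not necessarily the statistics used in the talk.

```python
import numpy as np

def expression_log2_fold_change(ref_cov, alt_cov, gene_bins, pseudocount=1.0):
    """log2 ratio (alt over ref) of summed coverage across the bins overlapping a gene."""
    ref_sum = ref_cov[gene_bins].sum() + pseudocount
    alt_sum = alt_cov[gene_bins].sum() + pseudocount
    return float(np.log2(alt_sum / ref_sum))

def distal_usage(cov, proximal_bins, distal_bins, pseudocount=1.0):
    """Fraction of 3'-end coverage attributed to the distal polyadenylation site."""
    proximal = cov[proximal_bins].sum() + pseudocount
    distal = cov[distal_bins].sum() + pseudocount
    return float(distal / (proximal + distal))

# Made-up predicted coverage for the reference and alternate alleles (8 bins).
ref_cov = np.array([0.0, 7.0, 8.0, 0.0, 3.0, 0.0, 3.0, 0.0])
alt_cov = np.array([0.0, 3.0, 4.0, 0.0, 7.0, 0.0, 1.0, 0.0])

lfc = expression_log2_fold_change(ref_cov, alt_cov, gene_bins=[1, 2])
usage_shift = distal_usage(alt_cov, [4], [6]) - distal_usage(ref_cov, [4], [6])
print(lfc, usage_shift)
```

Scoring a ratio between sites, instead of total coverage, is what lets a coverage model speak to isoform and polyadenylation changes.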
Borzoi Prime Model
- Extension: Predicts cell type-specific RNA-seq coverage profiles from single-cell RNA-seq data.
- Training Data: Pseudobulk coverage profiles from multiple single-cell datasets (human and mouse).
- Resolution: Predicts expression for ~850 distinct human cell clusters.
- Capabilities: Enables cell type-specific variant effect predictions.
- Validation: Concordance with fine-mapped cell type-specific eQTL datasets.
- Interpretation: Reveals cell type-specific regulatory motifs driving expression.
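The validation step can be illustrated with two simple concordance measures between predicted variant effects and observed eQTL effect sizes for one cell type. The numbers below are invented, and the choice of sign concordance and Pearson correlation is an assumption made for this sketch.

```python
import numpy as np

def sign_concordance(predicted, observed):
    """Fraction of variants whose predicted direction matches the observed eQTL direction."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    keep = (predicted != 0) & (observed != 0)  # direction is undefined at exactly zero
    return float(np.mean(np.sign(predicted[keep]) == np.sign(observed[keep])))

def pearson_r(a, b):
    """Pearson correlation between two effect-size vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# Made-up effects for five fine-mapped variants in a single cell type.
predicted_effects = [0.8, -0.3, 0.1, -0.6, 0.2]
observed_effects = [1.1, -0.4, -0.2, -0.9, 0.5]

concordance = sign_concordance(predicted_effects, observed_effects)
correlation = pearson_r(predicted_effects, observed_effects)
print(concordance, correlation)
```

Repeating this per cell type, with predictions taken from the matching output track, gives a cell type-resolved view of agreement.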
Key Lessons and Insights
- Predicting raw RNA-seq coverage profiles from sequence enables learning of multiple regulatory layers simultaneously.
- Machine learning models can serve as “simulation engines” to predict the impact of any genetic variant in isolation.
- Interpretation of model predictions reveals biologically meaningful regulatory motifs and mechanisms.
- Incorporating multiple data types (RNA-seq, epigenomics, CAGE) improves model performance.
- Cell type-specific modeling is critical for understanding variants with context-dependent effects.
- Ensemble modeling and increasing training data diversity help address uncertainty and improve predictions.
- Current models perform best on variants with large effect sizes; subtle effects remain challenging.
- Future improvements may come from integrating perturbation data (e.g., transcription factor knockouts) and leveraging more unique sequences, including from related species.
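The ensemble point above can be made concrete with a short sketch: score each variant with several independently trained replicas, then summarize by the mean, the spread, and how often replicas agree on direction. The replica scores are made up, and the agreement statistic is an assumption of this example.

```python
import numpy as np

def ensemble_summary(replicate_scores):
    """Summarize variant scores from several model replicas.

    replicate_scores: array-like of shape (n_models, n_variants).
    Returns the mean score, the sample standard deviation, and the fraction
    of replicas whose sign agrees with the sign of the mean.
    """
    scores = np.asarray(replicate_scores, dtype=float)
    mean = scores.mean(axis=0)
    spread = scores.std(axis=0, ddof=1)
    sign_agreement = np.mean(np.sign(scores) == np.sign(mean), axis=0)
    return mean, spread, sign_agreement

# Four hypothetical replicas scoring two variants: one large effect, one subtle.
scores = [
    [0.9, 0.05],
    [1.1, -0.02],
    [1.0, 0.03],
    [1.0, -0.04],
]
mean, spread, sign_agreement = ensemble_summary(scores)
print(mean, spread, sign_agreement)
```

In this toy case the large-effect variant is scored consistently, while the subtle one flips sign across replicas, mirroring the observation that small effects are harder to call.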
Discussion Highlights
- Models align well with large-effect variants validated by CRISPR but struggle with subtle effect variants.
- Multi-modal training (e.g., combining RNA-seq and ATAC-seq) boosts performance.
- Metadata on cell types and experimental conditions is important but may be partially imputed by AI models in the future.
- Comparison with newer models like DeepMind’s AlphaGenome is pending due to access restrictions.
- Integrating short-context models (e.g., models trained on MPRA data) with long-context models like Borzoi could be beneficial.
- Single-cell data and endogenous perturbation assays represent promising avenues for further model refinement.
Summary of Methodology / Instructions
Training a Gene Regulatory Prediction Model (e.g., Borzoi)
- Collect large-scale sequencing coverage datasets (RNA-seq, epigenomics, CAGE) across tissues/species.
- Prepare large genomic sequence inputs (~500 kb).
- Design a deep neural network with convolutional layers, subsampling, self-attention, and upsampling.
- Train to predict one-dimensional coverage profiles at fine resolution.
- Use alternating batches of human and mouse data for training.
- Validate on held-out genomic regions.
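Two of the steps above, holding out genomic regions and alternating human and mouse batches, can be sketched with the standard library alone. Holding out whole chromosomes is an assumption made here for simplicity; the region tuples, batch size, and function names are illustrative.

```python
import itertools
import random

def split_by_chromosome(regions, held_out):
    """Split (chrom, start, end) regions into training and held-out sets."""
    train = [r for r in regions if r[0] not in held_out]
    test = [r for r in regions if r[0] in held_out]
    return train, test

def batches(regions, batch_size, seed):
    """Yield shuffled batches forever, reshuffling at the start of each epoch."""
    if len(regions) < batch_size:
        raise ValueError("need at least batch_size regions")
    rng = random.Random(seed)
    while True:
        shuffled = list(regions)
        rng.shuffle(shuffled)
        for i in range(0, len(shuffled) - batch_size + 1, batch_size):
            yield shuffled[i:i + batch_size]

def alternating_batches(human, mouse, batch_size=2, seed=0):
    """Alternate one human batch and one mouse batch, tagging each with its species."""
    human_stream = batches(human, batch_size, seed)
    mouse_stream = batches(mouse, batch_size, seed + 1)
    while True:
        yield "human", next(human_stream)
        yield "mouse", next(mouse_stream)

human_regions = [("chr1", 0, 524288), ("chr1", 524288, 1048576),
                 ("chr2", 0, 524288), ("chr8", 0, 524288)]
mouse_regions = [("chr1", 0, 524288), ("chr3", 0, 524288),
                 ("chr4", 0, 524288), ("chr8", 0, 524288)]

human_train, human_test = split_by_chromosome(human_regions, held_out={"chr8"})
mouse_train, mouse_test = split_by_chromosome(mouse_regions, held_out={"chr8"})

stream = alternating_batches(human_train, mouse_train, batch_size=2)
first_four = list(itertools.islice(stream, 4))
print([species for species, _ in first_four])
```

Tagging each batch with its species lets a training loop route it to the matching set of output tracks while sharing the rest of the network.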
Interpreting Model Predictions
- Compute gradients of predicted coverage with respect to input sequence to identify important motifs.
- Perform de novo motif discovery on gradient maps.
- Conduct in silico saturation mutagenesis to assess variant effects.
- Cluster tissue- or cell type-specific motifs to understand regulatory grammar.
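The gradient step can be illustrated with a model small enough to differentiate by hand. The sketch below uses a hypothetical softplus-of-linear predictor with random weights, computes the gradient of its output with respect to the one-hot input, and reduces it to a per-position gradient-times-input attribution. A real network would obtain the gradient through automatic differentiation instead of the closed form used here.

```python
import numpy as np

rng = np.random.default_rng(0)
LENGTH = 12
W = rng.normal(size=(LENGTH, 4))  # hypothetical weights standing in for a trained model

def predict(x):
    """Toy differentiable predictor: softplus of a linear score of the one-hot input."""
    return float(np.log1p(np.exp((W * x).sum())))

def input_gradient(x):
    """Analytic gradient of predict() with respect to every input entry."""
    z = (W * x).sum()
    return W / (1.0 + np.exp(-z))  # d softplus(z)/dz = sigmoid(z); dz/dx = W

x = np.eye(4)[rng.integers(0, 4, size=LENGTH)]  # random one-hot sequence
grad = input_gradient(x)                        # (LENGTH, 4) sensitivity map
attribution = (grad * x).sum(axis=1)            # gradient x input: one score per position

print(np.round(attribution, 3))
```

Attribution tracks like this one, computed over many sequences, are the kind of input that motif discovery tools then cluster into recurring patterns.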
Extending to Cell Type-Specific Predictions
- Curate single-cell RNA-seq datasets and generate pseudobulk coverage profiles for cell clusters.
- Train a similar model (Borzoi Prime) on these data to predict cell type-specific expression.
- Use model outputs to interpret cell type-specific variant effects.
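Pseudobulk generation, the first step above, amounts to summing per-cell coverage within each cluster. A minimal sketch follows; the tiny coverage matrix and the cluster labels are made up, and real pipelines would typically also normalize for sequencing depth.

```python
import numpy as np

def pseudobulk(cell_coverage, cluster_labels):
    """Sum per-cell coverage (n_cells, n_bins) within each cluster.

    Returns the sorted cluster names and an (n_clusters, n_bins) array.
    """
    cell_coverage = np.asarray(cell_coverage, dtype=float)
    labels = np.asarray(cluster_labels)
    names = sorted(set(labels.tolist()))
    profiles = np.stack([cell_coverage[labels == name].sum(axis=0) for name in names])
    return names, profiles

# Four cells, three genomic bins, two hypothetical clusters.
cell_coverage = [
    [1, 0, 2],
    [0, 1, 1],
    [3, 0, 0],
    [1, 1, 0],
]
cluster_labels = ["T cell", "B cell", "T cell", "B cell"]

names, profiles = pseudobulk(cell_coverage, cluster_labels)
print(names)
print(profiles)
```

Each row of `profiles` then serves as one training target track, so the number of clusters sets the number of cell type-specific outputs.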
Speakers / Sources Featured
- Johannes Linder – Machine learning scientist at Calico Life Sciences, primary speaker.
- David Kelley – Principal Investigator at Calico Life Sciences, mentor and collaborator.
- Jesse – Mentioned as previous speaker (contextual reference).
- Millina – Questioner during Q&A.
- Brad – Questioner during Q&A.
- Liz – Mentioned in Q&A context.
Other Referenced Models and Groups
- DeepMind (Enformer model, AlphaGenome)
- TF-MoDISco (motif discovery tool)
- Public consortia: ENCODE, GTEx, FANTOM consortium
- Single-cell datasets: Tabula Sapiens, Tabula Muris, adult brain atlas, OneK1K PBMC dataset
This summary captures the core content, methodology, and discussion points from Johannes Linder’s presentation on predicting gene regulatory function from DNA sequence using deep learning.