Summary & Conclusions

Key findings from comprehensive benchmarking of perturbation prediction models

A systematic comparison of 12 models and 3 baselines reveals that model performance depends strongly on perturbation effect size and evaluation perspective.

Figure 5: Summary of model performance across all tasks and evaluation metrics

Key Findings

🎯 Model Ranking is Not Universal

Model performance ranking shifts dramatically based on perturbation effect size, dataset properties, and evaluation metric. No single model dominates across all scenarios.

📊 Effect Size Determines Metric Success

Small perturbation effects: Easier to predict with direct expression metrics (R², Pearson correlation)
Large perturbation effects: Better captured by delta-based metrics and differential expression recovery
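
One way to see why the two metric families can disagree is to score the same prediction both ways. The sketch below is illustrative only: the helper names (`pearson`, `delta`, `delta_direction_accuracy`) and the toy expression values are assumptions made for this example, not the benchmark's implementation. It shows a small-effect case where expression-level correlation is high even though every predicted change points the wrong way.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def delta(expr, control):
    """Per-gene change relative to the unperturbed control profile."""
    return [e - c for e, c in zip(expr, control)]

def delta_direction_accuracy(pred, true, control):
    """Fraction of genes whose predicted change has the same sign as the
    observed change (a zero change is grouped with negative changes here)."""
    dp, dt = delta(pred, control), delta(true, control)
    hits = sum((p > 0) == (t > 0) for p, t in zip(dp, dt))
    return hits / len(dt)

# Hypothetical mean expression over five genes (made-up numbers).
control = [1.0, 5.0, 10.0, 2.0, 8.0]
true = [1.2, 4.9, 10.3, 2.1, 7.8]   # small observed perturbation effect
pred = [0.9, 5.1, 9.8, 1.9, 8.2]    # close to control, but wrong direction

expr_r = pearson(pred, true)                                   # direct expression metric
delta_r = pearson(delta(pred, control), delta(true, control))  # delta-based metric
direction = delta_direction_accuracy(pred, true, control)
print(f"expression r={expr_r:.3f}, delta r={delta_r:.3f}, direction acc={direction:.2f}")
```

In this toy case the baseline expression levels dominate the direct metric, so a prediction that barely moves from control still scores well; the delta-based metrics remove that baseline and expose the directional errors.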

🏆 Task-Specific Model Advantages

Task 1 & 2:

PerturbNet dominates differential expression recovery (56% overlap accuracy) due to distribution-aware modeling
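
For orientation, a DE overlap score can be computed as the share of top-ranked changed genes that a prediction and the observation have in common. The sketch below assumes a simple top-k ranking by absolute change; the benchmark's actual DE test and cutoff may differ, and the function names and gene values are hypothetical.

```python
def top_k_genes(delta_by_gene, k):
    """Return the k gene names with the largest absolute change."""
    ranked = sorted(delta_by_gene, key=lambda g: abs(delta_by_gene[g]), reverse=True)
    return set(ranked[:k])

def de_overlap(pred_delta, true_delta, k):
    """Fraction of the observed top-k DE genes recovered in the predicted top-k."""
    shared = top_k_genes(pred_delta, k) & top_k_genes(true_delta, k)
    return len(shared) / k

# Hypothetical per-gene changes relative to control (made-up numbers).
true_delta = {"G1": 2.0, "G2": -1.5, "G3": 0.1, "G4": 0.05, "G5": -0.9}
pred_delta = {"G1": 1.8, "G2": -0.2, "G3": 0.0, "G4": 1.1, "G5": -1.0}

overlap = de_overlap(pred_delta, true_delta, k=3)
print(f"DE overlap at k=3: {overlap:.2f}")
```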

Task 3:

scGEN performs best for cell transfer, but no method achieves satisfactory DE prediction across cell types

Foundation Models:

Show variance compression: they underestimate response heterogeneity despite large-scale pretraining
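
Variance compression can be checked by comparing, gene by gene, how much predicted expression varies across cells against how much observed expression varies. The sketch below is one possible diagnostic under stated assumptions, not the benchmark's own procedure: `variance_ratio` is a hypothetical helper, cells are rows and genes are columns, and the values are made up. Ratios well below 1 would indicate that predictions are less heterogeneous than the data.

```python
from statistics import pvariance

def variance_ratio(pred_cells, true_cells):
    """Per-gene ratio of predicted to observed variance across cells.

    Both inputs are lists of cells, each cell a list of gene values.
    Assumes every gene has non-zero observed variance.
    """
    ratios = []
    # zip(*cells) transposes cells-by-genes into genes-by-cells.
    for pred_gene, true_gene in zip(zip(*pred_cells), zip(*true_cells)):
        ratios.append(pvariance(pred_gene) / pvariance(true_gene))
    return ratios

# Hypothetical expression for three cells and two genes (made-up numbers).
true_cells = [[1.0, 4.0], [3.0, 8.0], [5.0, 6.0]]
pred_cells = [[2.0, 5.0], [3.0, 7.0], [4.0, 6.0]]  # same means, narrower spread

ratios = variance_ratio(pred_cells, true_cells)
print(ratios)
```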

🔬 Dataset Properties Shape Difficulty

Performance variability arises primarily from dataset and perturbation identity rather than model architecture. Cell-type transfer success depends on biological similarity and conserved response programs.

Model Recommendations by Use Case

| Scenario | Recommended Model | Reason | Performance |
|---|---|---|---|
| Differential Expression Recovery | PerturbNet | Distribution-aware architecture captures full perturbation effects | 56% DE overlap accuracy |
| General Performance (R²) | GEARS | Gene regulatory network integration | R² = 0.866 (Task 1) |
| Delta Direction Accuracy | Biolord / scFoundation | Effective at capturing perturbation directions | 83.9% / 89.8% accuracy |
| Cell Transfer | scGEN | Best cross-cell-type generalization | R² = 0.772, 88.1% delta accuracy |
| Combinatorial Perturbations | scGPT / GEARS | Handle interaction effects effectively | R² = 0.784 / 0.783 |

Limitations & Future Directions

Current Limitations

  • No real ground truth: Single-cell sequencing is destructive, so the same cell cannot be profiled both before and after perturbation
  • Limited temporal dynamics: Only pre/post time points
  • Transcriptomics focus: Missing genomic, epigenomic, proteomic data
  • Batch effects: Varied technologies across datasets
  • Cell transfer challenge: No method achieves satisfactory cross-cell-type DE prediction

Future Opportunities

  • Multi-modal integration: DNA, RNA, protein representations
  • Temporal modeling: Time-series perturbation responses
  • Spatial perturbations: Tissue-level modeling
  • Better evaluation metrics: More appropriate for virtual cell tasks
  • Standardized protocols: Reduce technical confounders

Resources & Publication

Access the source code, datasets, and publication

💻 Source Code

Complete codebase with all models, datasets, and evaluation scripts

View on GitHub

📄 Research Paper

Read the complete methodology and findings in our bioRxiv preprint

Read on bioRxiv