Geneformer

Geneformer is a foundation model for single-cell biology, introduced by Christina Theodoris and colleagues in the 2023 Nature paper “Transfer learning enables predictions in network biology.” It applies the transformer architecture, the same family that powers language models, to the gene-expression profiles of individual cells.

The model was pretrained on a corpus of roughly 30 million single-cell transcriptomes. Training was self-supervised: with no labels, the model learned how genes are switched on and off together across many cell types and conditions. The authors found that this pretraining encoded the structure of gene regulatory networks in the model’s attention weights, so Geneformer picked up the hierarchy of those networks without ever being given it explicitly.
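
As a rough illustration of what self-supervised pretraining on expression profiles can look like, the Python sketch below follows the general recipe described in the Geneformer paper: each cell is turned into a list of genes ranked by expression relative to a corpus-wide typical level, and some genes are then hidden so a model can be trained to predict them. This is a toy sketch, not the authors' code. The function names, the example counts and medians, and the default values (a 2,048-gene limit and a 15% masking rate) are assumptions chosen for illustration.

```python
import random


def rank_value_encode(expression, gene_medians, max_len=2048):
    """Return a cell's detected genes, ordered by scaled expression.

    expression: dict of gene name -> raw count in this cell (toy data).
    gene_medians: dict of gene name -> typical count across the corpus
        (hypothetical values here).
    """
    scaled = {
        gene: count / gene_medians[gene]
        for gene, count in expression.items()
        if count > 0 and gene in gene_medians
    }
    # Highest scaled expression first; ties broken by name for determinism.
    ranked = sorted(scaled, key=lambda g: (-scaled[g], g))
    return ranked[:max_len]


def mask_genes(tokens, mask_fraction=0.15, mask_token="<mask>", seed=0):
    """Hide a fraction of genes and return (masked list, position -> gene)."""
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_fraction)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # what the model would have to predict
        masked[pos] = mask_token
    return masked, targets


# Toy cell: two highly expressed housekeeping genes and two heart genes.
cell = {"ACTB": 120, "GAPDH": 80, "TTN": 6, "MYH7": 4, "NPPA": 0}
medians = {"ACTB": 100.0, "GAPDH": 90.0, "TTN": 2.0, "MYH7": 1.0, "NPPA": 1.0}

ranked = rank_value_encode(cell, medians)
print(ranked)  # ['MYH7', 'TTN', 'ACTB', 'GAPDH']

masked, targets = mask_genes(ranked)
print(masked, targets)
```

In this toy cell the heart genes rise to the top of the ranking even though their raw counts are low, because they are expressed well above their usual level, while ubiquitous genes such as ACTB drop down. Predicting the hidden genes from the rest of the list is the kind of task that forces a model to learn which genes tend to go together.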

The point of all that pretraining is transfer learning. Many biology questions come with only a handful of relevant samples, far too few to train a model from scratch. Geneformer can be fine-tuned on such small datasets and still make accurate predictions. In the original paper, the authors used it to nominate candidate therapeutic targets for cardiomyopathy, a heart-muscle disease.

For a general reader, Geneformer is best understood as part of the move toward foundation models in medicine and biology: a large model is pretrained once on vast amounts of data, then cheaply adapted to specific problems where data is scarce. This mirrors the way pretrained language models transformed work with text.