Evo: sequence modeling and design from molecular to genome scale

“Sequence modeling and design from molecular to genome scale with Evo,” from the Arc Institute and Stanford, was published in Science in November 2024. Evo is a foundation model for DNA: it reads and writes genetic sequences the way a large language model reads and writes text, but its alphabet is the four nucleotide letters and its training corpus is millions of prokaryotic and bacteriophage genomes.

Evo has 7 billion parameters and a context window of 131 kilobases at single-nucleotide resolution, long enough to span whole genes and the regulatory regions around them. It is built on the StripedHyena architecture, a hybrid that interleaves attention with gated-convolution (Hyena) layers. The work was led by a team including Eric Nguyen, Michael Poli, Brian Hie, and Patrick Hsu. Because DNA encodes RNA and protein, a single model trained on raw sequence can learn patterns that span all three molecular layers.
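To make "single-nucleotide resolution" concrete: each base is one token, so a 131-kilobase window holds roughly 131,000 tokens. The sketch below is only an illustration. The four-letter vocabulary, the function names, and the exact window size are assumptions taken from the figures above, not Evo's actual tokenizer or API.

```python
# Hypothetical vocabulary for illustration; not Evo's actual tokenizer.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

# Assumption: "131 kilobases" is read as 131,000 single-nucleotide tokens.
CONTEXT_LENGTH = 131_000


def tokenize(seq):
    """Map each nucleotide to one integer token (one token per base)."""
    return [VOCAB[base] for base in seq.upper()]


def fits_in_context(seq, context_length=CONTEXT_LENGTH):
    """Return True if the whole sequence fits in a single context window."""
    return len(seq) <= context_length


if __name__ == "__main__":
    gene = "ATGGCGTAA"
    print(tokenize(gene))          # one integer per base
    print(fits_in_context(gene))   # a short gene easily fits
```

Because there is no k-mer or subword grouping in this scheme, a single-base substitution changes exactly one token, which is what lets a model of this kind score individual mutations.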

The paper showed Evo doing two kinds of work. First, it predicts how small DNA mutations affect an organism’s fitness, performing zero-shot function prediction (that is, without training on labeled mutation data) competitive with specialized protein and RNA models. Second, it generates new functional sequences, including CRISPR-Cas systems and mobile genetic elements that the authors took into the laboratory to test.
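A common way to do zero-shot mutation scoring with a sequence likelihood model is to compare the log-likelihood of the mutated sequence with that of the original. A drop suggests the mutation makes the sequence look less like what the model saw in training. The sketch below shows that idea with a toy first-order Markov model standing in for a large model. The training sequences, function names, and scoring details are illustrative assumptions, not the paper's procedure.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Fit a toy first-order model P(next base | previous base) with smoothing."""
    counts = {p: {n: pseudocount for n in ALPHABET} for p in ALPHABET}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    return {
        p: {n: c / sum(row.values()) for n, c in row.items()}
        for p, row in counts.items()
    }


def log_likelihood(model, seq):
    """Sum of log P(next | prev) over the sequence; the first base is not scored."""
    return sum(math.log(model[p][n]) for p, n in zip(seq, seq[1:]))


def mutation_score(model, wild_type, position, new_base):
    """Log-likelihood of the mutant minus that of the wild type (higher is better)."""
    mutant = wild_type[:position] + new_base + wild_type[position + 1:]
    return log_likelihood(model, mutant) - log_likelihood(model, wild_type)


if __name__ == "__main__":
    # Toy "genomes" for illustration only.
    model = train_markov(["ACGTACGTACGTACGT"] * 5)
    wild_type = "ACGTACGT"
    # Substituting G with A at index 2 breaks the pattern the toy model learned.
    print(mutation_score(model, wild_type, 2, "A"))
```

No labeled fitness data is used anywhere: the score comes entirely from a model trained on unlabeled sequence, which is the sense in which the prediction is zero-shot.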

For a general reader, Evo is a milestone in treating the genome as a language that can be modeled and even authored. The same recipe that produced text generators is being pointed at the code of life, with the long-term hope of designing biology to specification rather than discovering it by trial and error.
