ProtGPT2: a deep unsupervised language model for protein design

“ProtGPT2 is a deep unsupervised language model for protein design,” by Noelia Ferruz, Steffen Schmidt, and Birte Höcker, was published in Nature Communications in 2022. It applied the GPT-2 language-model architecture to proteins, treating amino-acid sequences as a language and learning to generate new ones.

Trained on tens of millions of natural protein sequences without any labels, ProtGPT2 produces de novo sequences that statistically resemble real proteins: they show natural amino-acid frequencies, and disorder predictions indicate that about 88 percent of generated proteins are globular, in line with natural ones. At the same time, database searches show that the sequences are only distantly related to anything known, meaning the model explores genuinely new regions of protein space rather than copying existing proteins.
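To make the idea of "natural amino-acid frequencies" concrete, here is a minimal Python sketch of how one might compare the residue composition of a generated set against a natural reference set. It is an illustration only, not the analysis pipeline used in the paper: the function names aa_frequencies and total_variation are hypothetical, the distance measure is one possible choice among several, and the example sequences are made up.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequences):
    """Return the relative frequency of each standard amino acid
    across all given sequences (non-standard letters are ignored)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def total_variation(p, q):
    """Total variation distance between two composition profiles:
    0.0 means identical composition, 1.0 means no overlap at all."""
    return 0.5 * sum(abs(p[aa] - q[aa]) for aa in AMINO_ACIDS)


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    natural = ["MKTAYIAKQR", "MSLLTEVETY"]
    generated = ["MKTLYIAEQR", "MSALTEVKTY"]
    distance = total_variation(aa_frequencies(natural),
                               aa_frequencies(generated))
    print(f"composition distance: {distance:.3f}")
```

A small distance would suggest that the generated set uses residues in roughly the same proportions as the reference set; it says nothing on its own about whether the sequences fold.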

ProtGPT2 sits alongside structure-based design tools such as RFdiffusion but takes a different route. Instead of designing a three-dimensional backbone and finding a sequence to fit it, it generates sequences directly from learned language patterns, much as a text model writes plausible sentences.
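The "generate directly from learned patterns" idea can be sketched with a deliberately tiny stand-in for a language model. The Python below fits a bigram model (which residue tends to follow which) on a few made-up sequences and then samples new ones token by token, restricting each step to the top-k most likely continuations. This is a toy under stated assumptions: ProtGPT2 itself is a GPT-2-style Transformer rather than a bigram counter, its tokens are not necessarily single residues, and the names fit_bigram and sample_sequence are hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers


def fit_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = [START, *seq, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_sequence(counts, top_k=3, max_len=50, rng=None):
    """Sample one sequence autoregressively: at each step, keep only the
    top_k most frequent continuations and draw one in proportion to count."""
    rng = rng or random.Random()
    out, prev = [], START
    while len(out) < max_len:
        candidates = counts[prev].most_common(top_k)
        if not candidates:  # no known continuation; stop early
            break
        tokens, weights = zip(*candidates)
        nxt = rng.choices(tokens, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


if __name__ == "__main__":
    # Made-up training sequences, for illustration only.
    training = ["MKTAYIAKQR", "MSLLTEVETY", "MKKLLTAVEY"]
    model = fit_bigram(training)
    print(sample_sequence(model, top_k=3, rng=random.Random(0)))
```

The loop has the same shape as sampling from a large model: predict a distribution over the next token given what has been written so far, draw from it, append, and repeat until an end marker or a length limit is reached. A real model conditions on the whole preceding sequence, not just the previous token.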

For a general reader, ProtGPT2 is an early demonstration that the language-model paradigm transfers cleanly to biological sequences. It helped open the way for the larger protein and genome foundation models that followed, and it reframed protein engineering as a generative task.
