Gemma 2: Improving Open Language Models at a Practical Size

“Gemma 2: Improving Open Language Models at a Practical Size” is a technical report that Google DeepMind posted to arXiv on July 31, 2024. It introduced the second generation of the Gemma open-model family, in sizes from 2 billion to 27 billion parameters, and targets strong performance at sizes that are practical to run rather than at the largest possible scale.

The paper combines several architectural choices, including interleaved local and global attention layers to handle context efficiently and grouped-query attention to speed up inference. Its most distinctive technique is knowledge distillation: the smaller 2B and 9B models are trained to imitate the output distribution of a larger teacher model rather than learning purely from next-token prediction on raw text. The authors report that the resulting models are competitive with models two to three times larger.
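To make the difference between the two training objectives concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's training code: the function names, the four-token vocabulary, and the logit values are all made up for the example, and the distillation loss shown is the common formulation (cross-entropy of the student against the teacher's full next-token distribution).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def next_token_loss(student_logits, target_id):
    """Standard objective: negative log-probability of the one observed token."""
    return -log_softmax(student_logits)[target_id]


def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's whole distribution."""
    teacher_probs = softmax(teacher_logits)
    student_log_probs = log_softmax(student_logits)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


# Hypothetical logits for a 4-token vocabulary at a single position.
teacher_logits = [2.0, 1.0, 0.1, -1.0]
student_logits = [1.5, 0.5, 0.5, -0.5]

print("next-token loss:  ", next_token_loss(student_logits, target_id=0))
print("distillation loss:", distillation_loss(student_logits, teacher_logits))
```

The next-token loss only tells the student which single token appeared in the text, whereas the distillation loss gives it a target probability for every token in the vocabulary. That richer signal per position is the usual explanation for why distillation can help small models.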

Gemma 2 is a clear demonstration of the “practical size” philosophy that has come to define much of the open-weight ecosystem: rather than chasing parameter counts, it squeezes more capability into models that fit on a single accelerator. For organizations, small models trained by distillation mean lower serving cost and easier deployment without giving up much quality.
