Mixtral 8x7B uses only about 13B of its 47B parameters per token

fact December 11, 2023

Mistral AI’s Mixtral 8x7B is a sparse mixture-of-experts (MoE) model. It contains about 46.7 billion parameters in total, but in each layer a router selects only 2 of the 8 expert feed-forward blocks for each token, so roughly 12.9 billion parameters take part in processing any given token. Per-token compute is therefore close to that of a 12.9B dense model, although all 46.7 billion parameters must still be held in memory. Mistral reported that Mixtral outperforms the dense Llama 2 70B on most benchmarks while running inference about six times faster.
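The two figures can be reproduced with a rough parameter count. The sketch below is illustrative only: the hyperparameters (hidden size, feed-forward size, layer count, head counts, vocabulary size) are assumptions taken from the publicly released Mixtral 8x7B configuration, not from this page, and the helper name `count_params` is made up for the example.

```python
# Assumed hyperparameters, taken from the publicly released Mixtral 8x7B
# config rather than from this page. Treat them as illustrative.
HIDDEN = 4096            # model (embedding) dimension
FFN = 14336              # inner dimension of each expert feed-forward block
LAYERS = 32
HEADS = 32               # query heads
KV_HEADS = 8             # key/value heads (grouped-query attention)
HEAD_DIM = 128
VOCAB = 32000
EXPERTS = 8
EXPERTS_PER_TOKEN = 2


def count_params(experts_used: int) -> int:
    """Count parameters when `experts_used` experts per layer are included."""
    expert = 3 * HIDDEN * FFN  # three weight matrices per expert
    # Query and output projections, plus smaller key and value projections.
    attention = 2 * HIDDEN * HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM
    router = HIDDEN * EXPERTS  # one score per expert
    norms = 2 * HIDDEN         # two normalization layers per block
    per_layer = experts_used * expert + attention + router + norms
    embeddings = 2 * VOCAB * HIDDEN  # input embedding + separate output head
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


total = count_params(EXPERTS)             # every expert counted
active = count_params(EXPERTS_PER_TOKEN)  # only the 2 routed experts counted

print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Under these assumptions the count also shows why the total is 46.7B rather than 8 x 7B = 56B: only the feed-forward experts are replicated eight times, while the attention layers and embeddings are shared by every token.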

Sources

PRIMARY https://mistral.ai/news/mixtral-of-experts/

Last verified June 7, 2026
