Mixtral 8x7B uses only about 13B of its 47B parameters per token

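Mistral AI’s Mixtral 8x7B is a sparse mixture-of-experts (MoE) model. It contains about 46.7 billion parameters in total, but in each layer a router sends every token to only 2 of the 8 expert feed-forward blocks, so roughly 12.9 billion parameters are active for any given token. The total is well below 8 × 7B = 56B because only the feed-forward blocks are replicated per expert; the attention layers and embeddings are shared by all experts. As a result, Mixtral’s per-token compute is close to that of a 13B dense model, although all 46.7 billion parameters must still be held in memory. Mistral reported that Mixtral matches or outperforms the dense Llama 2 70B on most benchmarks with about six times faster inference.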
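
The 46.7B and 12.9B figures can be approximately reproduced with a back-of-the-envelope count. The sketch below is illustrative only: the hyperparameters (hidden size 4096, 32 layers, feed-forward width 14336, grouped-query attention with 8 key/value heads, 32,000-token vocabulary, untied input and output embeddings) are assumptions taken from the publicly released model configuration, not from this article, and the helper name count_params is hypothetical.

```python
# Rough parameter count for a Mixtral-style sparse MoE transformer.
# All hyperparameters below are ASSUMED from the public model configuration.
DIM = 4096          # hidden size
N_LAYERS = 32       # transformer layers
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # key/value heads (grouped-query attention)
HEAD_DIM = 128      # per-head dimension
FFN_HIDDEN = 14336  # feed-forward width inside each expert
VOCAB = 32000       # vocabulary size
N_EXPERTS = 8       # experts per layer
TOP_K = 2           # experts the router selects per token


def count_params(experts_per_layer: int) -> int:
    """Count parameters when `experts_per_layer` experts are included per layer."""
    # Attention: query and output projections are DIM x DIM,
    # key and value projections are DIM x (N_KV_HEADS * HEAD_DIM).
    attn = 2 * DIM * N_HEADS * HEAD_DIM + 2 * DIM * N_KV_HEADS * HEAD_DIM
    # Each expert is a gated feed-forward block with three weight matrices.
    expert = 3 * DIM * FFN_HIDDEN
    # The router scores all experts even though only TOP_K are executed.
    router = DIM * N_EXPERTS
    # Two normalization weight vectors per layer.
    norms = 2 * DIM
    per_layer = attn + experts_per_layer * expert + router + norms
    # Input embedding plus a separate (untied) output projection, and a final norm.
    shared = 2 * VOCAB * DIM + DIM
    return N_LAYERS * per_layer + shared


total = count_params(N_EXPERTS)   # every expert stored in memory
active = count_params(TOP_K)      # only the routed experts used per token

print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Under these assumptions the expert blocks account for almost all of the gap between the two numbers, which is why selecting 2 of 8 experts cuts the active count to a little over a quarter of the total rather than exactly one quarter.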

Last verified June 7, 2026