“DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model” was submitted to arXiv on May 7, 2024 by the Chinese lab DeepSeek. The model has 236 billion total parameters but activates only 21 billion for any given token. It uses a mixture-of-experts design (DeepSeekMoE) in which only a small subset of expert sub-networks runs for each token, so most of the network stays idle on each step and per-token computation drops.
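To make the sparse-activation arithmetic concrete, here is a minimal sketch. The two parameter counts come from the text above; the `top_k_route` helper and its scores are hypothetical and show only generic top-k gating, not DeepSeekMoE's actual routing scheme.

```python
import math

# Figures from the text above, in billions of parameters.
TOTAL_PARAMS_B = 236
ACTIVE_PARAMS_B = 21

# Share of the network that does work for any single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"active per token: {active_fraction:.1%}")  # 8.9%


def top_k_route(scores, k):
    """Generic top-k gating (hypothetical helper, not DeepSeekMoE itself).

    Picks the k experts with the highest router scores and normalizes
    their scores with a softmax so the returned weights sum to 1.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Made-up router scores for four experts; only two are run for this token.
routed = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print([i for i, _ in routed])  # [1, 3]
```

The experts that are not selected do no work for that token, which is why total parameter count and per-token compute can diverge so sharply.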
The paper’s headline architectural contribution is Multi-head Latent Attention (MLA), which compresses the key-value (KV) cache, the per-token store of attention keys and values that grows with the length of the conversation, into a small latent vector. Compared with their earlier 67B dense model, the authors reported a 93.3 percent reduction in KV cache size, a 42.5 percent reduction in training cost, and up to 5.76 times the maximum generation throughput, all while remaining among the strongest open-source models of its time.
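The effect of caching a compressed latent instead of full keys and values can be sketched with simple per-token arithmetic. Every dimension below (layer count, head count, head size, latent size) is a hypothetical placeholder rather than DeepSeek-V2's real configuration, and the sketch ignores any extra per-token state a real MLA implementation may keep, so it is not meant to reproduce the reported figure.

```python
def mha_kv_bytes_per_token(n_layers, n_heads, head_dim, bytes_per_value=2):
    """Standard multi-head attention: one key and one value vector
    per head, per layer, for every cached token."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_value


def latent_kv_bytes_per_token(n_layers, latent_dim, bytes_per_value=2):
    """Latent-style cache: a single compressed vector per layer per token."""
    return n_layers * latent_dim * bytes_per_value


# Hypothetical dimensions, chosen only for illustration.
N_LAYERS, N_HEADS, HEAD_DIM, LATENT_DIM = 32, 32, 128, 512

baseline_bytes = mha_kv_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM)
latent_bytes = latent_kv_bytes_per_token(N_LAYERS, LATENT_DIM)
toy_reduction = 1 - latent_bytes / baseline_bytes

print(baseline_bytes, latent_bytes)  # 524288 32768
print(f"toy reduction: {toy_reduction:.2%}")  # 93.75%

# The reported 93.3 percent reduction means 6.7 percent of the cache
# remains, i.e. roughly a 15x smaller cache.
REPORTED_REDUCTION = 0.933
reported_shrink_factor = 1 / (1 - REPORTED_REDUCTION)
print(f"reported shrink factor: {reported_shrink_factor:.1f}x")
```

Because the cache is paid for every token of context, a smaller per-token footprint lets the same memory hold longer conversations or larger batches, which is one plausible route to the throughput gain the authors report.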
DeepSeek-V2 established the efficiency recipe, MoE plus MLA, that the lab carried into DeepSeek-V3 and its R1 reasoning model. Its significance is economic: by shrinking two major inference costs, active parameters and KV-cache memory, DeepSeek showed that competitive large models could be served far more cheaply than the prevailing dense designs, which put pressure on industry-wide pricing assumptions.