DeepSeek-V3 Technical Report

The DeepSeek-V3 Technical Report was submitted to arXiv on December 27, 2024 by the Chinese lab DeepSeek. DeepSeek-V3 is a mixture-of-experts model with 671 billion total parameters, of which only 37 billion are activated for each token. It carries forward the Multi-head Latent Attention and DeepSeekMoE architectures from DeepSeek-V2, and adds two things: an auxiliary-loss-free strategy for balancing load across experts, and a multi-token prediction training objective.
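
To make the sparse-activation and load-balancing ideas concrete, here is a minimal sketch in plain Python of top-k expert routing in which a per-expert bias is nudged after each batch to even out load, rather than adding a balancing term to the training loss. It is a toy illustration and not the report's implementation: the expert count, the number of active experts, the step size, and the random scores are all assumed values, and the names `route`, `update_bias`, and `simulate` are hypothetical.

```python
import random

NUM_EXPERTS = 8   # assumed toy value, not the model's real expert count
TOP_K = 2         # assumed number of experts activated per token
STEP = 0.01       # assumed bias update step size


def route(scores, bias, k=TOP_K):
    """Return the indices of the k experts with the highest (score + bias)."""
    ranked = sorted(
        range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True
    )
    return ranked[:k]


def update_bias(bias, load, step=STEP):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    return [
        b - step if n > mean_load else b + step if n < mean_load else b
        for b, n in zip(bias, load)
    ]


def simulate(num_batches=200, tokens_per_batch=256, seed=0):
    """Route random tokens batch by batch; return final bias and last batch load."""
    rng = random.Random(seed)
    # Skewed preferences: without correction, low-index experts would dominate.
    skew = [0.3 - 0.05 * e for e in range(NUM_EXPERTS)]
    bias = [0.0] * NUM_EXPERTS
    load = [0] * NUM_EXPERTS
    for _ in range(num_batches):
        load = [0] * NUM_EXPERTS
        for _ in range(tokens_per_batch):
            scores = [rng.random() + skew[e] for e in range(NUM_EXPERTS)]
            for e in route(scores, bias):
                load[e] += 1
        bias = update_bias(bias, load)
    return bias, load
```

Calling `simulate()` returns the final biases and the per-expert token counts for the last batch. Only `TOP_K` of the `NUM_EXPERTS` experts process any given token, which is the same principle that lets a model with 671 billion parameters activate only 37 billion per token. In this sketch the bias influences only which experts are selected.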

The report’s most discussed claim is cost. The model was pre-trained on 14.8 trillion tokens, and the authors reported that its full training required only about 2.788 million H800 GPU hours, a strikingly small budget for a model of this capability. They also said training was stable throughout, with no irrecoverable loss spikes and no rollbacks. On benchmarks, the report states that DeepSeek-V3 outperformed other open-source models and reached performance comparable to leading closed-source systems.
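
The headline figures lend themselves to a quick back-of-the-envelope check. The sketch below uses only the numbers quoted above, plus an hourly GPU rental price that is a free parameter: the 2.00 dollar default is an assumption for illustration, and the resulting dollar figure scales linearly with it. The function names are hypothetical.

```python
TOTAL_PARAMS = 671e9    # total parameters
ACTIVE_PARAMS = 37e9    # parameters activated per token
TRAIN_TOKENS = 14.8e12  # training tokens
GPU_HOURS = 2.788e6     # reported H800 GPU hours


def active_fraction():
    """Share of the model's parameters used for any single token."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


def tokens_per_gpu_hour():
    """Rough average throughput: tokens divided by the whole GPU-hour budget."""
    return TRAIN_TOKENS / GPU_HOURS


def rental_cost(price_per_gpu_hour=2.0):
    """Compute cost in dollars at an ASSUMED rental price per GPU hour."""
    return GPU_HOURS * price_per_gpu_hour


if __name__ == "__main__":
    print(f"active fraction:     {active_fraction():.1%}")
    print(f"tokens per GPU hour: {tokens_per_gpu_hour():,.0f}")
    print(f"cost at $2.00/hour:  ${rental_cost():,.0f}")
```

Under these inputs, roughly 5.5% of the parameters are active per token, and the assumed 2.00 dollar rate puts the compute bill at about 5.6 million dollars. That figure covers GPU time only, not research, data, or staffing.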

The combination of frontier-level quality, open weights, and a low reported training cost made DeepSeek-V3 a focal point in debates about how expensive top-tier AI really has to be. For business readers, it reinforced that the cost frontier of building and serving capable models was falling fast, and that strong open alternatives to proprietary APIs were arriving from outside the largest US labs.
