BloombergGPT was trained on about 708 billion tokens, roughly half of them financial

fact March 30, 2023

BloombergGPT, the 50-billion-parameter finance model Bloomberg described in a March 2023 paper, was trained on a combined corpus of roughly 708 billion tokens. About 363 billion came from Bloomberg’s own financial data sources, which the authors call perhaps the largest domain-specific dataset yet assembled, and about 345 billion came from general-purpose public datasets. The near-even split was deliberate: the financial half gave the model domain expertise while the general half preserved broad language ability, so it could do well on financial tasks “without sacrificing performance on general LLM benchmarks.”
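The "near-even split" can be checked with quick arithmetic. The sketch below is only illustrative: it uses the rounded token counts quoted above (in billions), and the names `FINANCIAL_TOKENS_B`, `GENERAL_TOKENS_B`, and `corpus_split` are hypothetical, chosen for this example and not taken from the paper.

```python
# Token counts in billions, as rounded in the text above (assumed inputs).
FINANCIAL_TOKENS_B = 363
GENERAL_TOKENS_B = 345


def corpus_split(financial_b, general_b):
    """Return the total token count and each part's share of that total."""
    total = financial_b + general_b
    return total, financial_b / total, general_b / total


total_b, financial_share, general_share = corpus_split(
    FINANCIAL_TOKENS_B, GENERAL_TOKENS_B
)

print(f"total: {total_b}B tokens")
print(f"financial: {financial_share:.1%}, general: {general_share:.1%}")
# prints:
# total: 708B tokens
# financial: 51.3%, general: 48.7%
```

On these rounded figures the financial portion is slightly more than half of the corpus, which is consistent with describing the split as near-even.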

Sources

PRIMARY https://arxiv.org/abs/2303.17564

Last verified June 7, 2026
