In March 2023 Bloomberg released a paper describing BloombergGPT, a 50-billion-parameter language model purpose-built for finance. It was one of the first large language models trained from scratch for a single industry rather than adapted from a general model. The authors (Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann) submitted the paper to arXiv on March 30, 2023.
The model was trained on a combined corpus of about 708 billion tokens. Roughly 363 billion of those tokens came from Bloomberg’s own financial data sources (news, filings, press releases, web content, and proprietary data assembled over decades), a collection the paper calls perhaps the largest domain-specific dataset yet built. The remaining 345 billion tokens were drawn from general-purpose public datasets, so the model would retain broad language ability.
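To make the balance between the two sources concrete, the short sketch below works out each source's share of the corpus from the token counts quoted above. The constant and function names are hypothetical, chosen only for illustration; the two token counts are the only inputs taken from the text.

```python
# Token counts as quoted in the text, in billions.
FINANCIAL_TOKENS_B = 363  # Bloomberg's own financial data
GENERAL_TOKENS_B = 345    # general-purpose public datasets


def corpus_mix(financial_b, general_b):
    """Return (total, financial share, general share) for a two-part corpus."""
    total = financial_b + general_b
    return total, financial_b / total, general_b / total


total, fin_share, gen_share = corpus_mix(FINANCIAL_TOKENS_B, GENERAL_TOKENS_B)
print(f"total: {total}B tokens")                                  # total: 708B tokens
print(f"financial: {fin_share:.1%}, general: {gen_share:.1%}")    # financial: 51.3%, general: 48.7%
```

On these figures the corpus is split almost evenly, with financial text holding a slight majority.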
The central claim was that mixing financial and general text produced a model that, in the authors’ words, “outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks.” Bloomberg evaluated it on standard public benchmarks, open financial benchmarks, and its own internal financial tasks.
BloombergGPT became an early reference point for the idea that a company sitting on a large proprietary corpus could train a competitive in-house model rather than rely on a general-purpose one. It arrived just as general models like GPT-4 were demonstrating that scale and broad data could often match or beat narrow, hand-built systems. The trade-off between domain-specific and general models remained an open question across regulated industries.