BloombergGPT, the 50-billion-parameter finance model Bloomberg described in a March 2023 paper, was trained on a combined corpus of roughly 708 billion tokens. About 363 billion (roughly 51%) came from Bloomberg’s own financial data sources, which the authors call perhaps the largest domain-specific dataset yet assembled, and about 345 billion (roughly 49%) came from general-purpose public datasets. The near-even split was deliberate: the financial half gave the model domain expertise while the general half preserved broad language ability, so that it could perform strongly on financial tasks “without sacrificing performance on general LLM benchmarks.”
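
The arithmetic behind the "near-even split" can be checked with a short sketch. It uses only the two rounded token counts quoted above (in billions). The function and constant names are illustrative and do not come from the paper.

```python
# Rounded token counts, in billions, as quoted in the paragraph above.
FINANCIAL_TOKENS_B = 363  # Bloomberg's own financial data sources
GENERAL_TOKENS_B = 345    # general-purpose public datasets


def corpus_shares(financial_b, general_b):
    """Return (total, financial share, general share) for two token counts."""
    total = financial_b + general_b
    return total, financial_b / total, general_b / total


total_b, fin_share, gen_share = corpus_shares(FINANCIAL_TOKENS_B, GENERAL_TOKENS_B)

print(f"total: {total_b}B tokens")
print(f"financial: {fin_share:.1%}, general: {gen_share:.1%}")
```

With these inputs the total is 708 billion tokens, split roughly 51.3% financial and 48.7% general. Because the inputs are rounded, the percentages are approximate.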