Falcon is the family of open-weight large language models built by the Technology Innovation Institute (TII) in Abu Dhabi. TII presented Falcon-40B as one of the first home-grown open-source LLMs released with weights for both research and commercial use, and it briefly topped open-model leaderboards in 2023. The line later scaled to Falcon-180B, a 180-billion-parameter model trained on 3.5 trillion tokens and released under a royalty-free license based on Apache 2.0.
What set Falcon apart was its data. Rather than relying on curated books and academic corpora, TII trained on the RefinedWeb dataset, described in a June 2023 paper as filtered and deduplicated Common Crawl web data "and web data only." The paper argued this approach could match or beat models trained on curated sources, challenging a common assumption that high-quality data had to be hand-selected. TII extracted about five trillion tokens for training and released a 600-billion-token slice of RefinedWeb publicly. The Falcon family has since expanded into multimodal and language-specific variants.
Why business readers should care: Falcon proved that frontier-grade open models could come from a state research institute outside the established AI hubs, and its RefinedWeb work reshaped thinking about how much careful data curation large models actually require.