StarCoder

StarCoder is an open code generation model produced by BigCode, an open scientific collaboration co-led by Hugging Face and ServiceNow, and described in “StarCoder: may the source be with you!” (arXiv, May 2023, by Raymond Li and 66 co-authors). StarCoder and its base model StarCoderBase are 15.5-billion-parameter models with an 8K-token context window, infilling capability (generating code that fits between a given prefix and suffix), and fast large-batch inference enabled by multi-query attention.
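To make the infilling capability concrete, here is a minimal sketch of how a fill-in-the-middle prompt might be assembled. It assumes the sentinel tokens `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` and the prefix-suffix-middle ordering; check the model's tokenizer before relying on either. The helper name `build_fim_prompt` is hypothetical and not part of any library.

```python
# Assumed sentinel tokens for fill-in-the-middle prompting; confirm them
# against the tokenizer's special tokens before use.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: arrange code before and after a gap so the
    model is asked to generate the missing middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# The model would be expected to continue this prompt with the function body.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

The text the model generates after the final sentinel is the proposed infill, which the caller splices between the original prefix and suffix.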

StarCoderBase was trained on one trillion tokens from The Stack, a large corpus of permissively licensed GitHub repositories that came with inspection tools and an opt-out process so developers could remove their code. StarCoder is StarCoderBase fine-tuned on 35 billion additional Python tokens. The paper reports that StarCoderBase outperformed every open code model supporting multiple programming languages and matched or outperformed OpenAI’s code-cushman-001, with StarCoder reaching 40 percent pass@1 on HumanEval when prompted.
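For readers unfamiliar with the pass@1 figure, the sketch below shows the unbiased pass@k estimator commonly used with HumanEval: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k samples is correct. The function name and the sample counts are illustrative assumptions, not numbers from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 80 of 200 samples pass for a single problem.
print(pass_at_k(n=200, c=80, k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass. A benchmark score is the mean of this per-problem value over all problems.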

Beyond raw capability, StarCoder is notable for taking data provenance and consent seriously, an area many earlier code models ignored. For businesses weighing the legal and reputational risk of code models trained on scraped repositories, StarCoder’s emphasis on licensing, attribution, and opt-out set an early standard for responsible data practices in code AI.

Sources

Last verified June 7, 2026