DS-1000 is a benchmark for data-science code generation, introduced in “DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation” (arXiv, November 2022, by Yuhang Lai and colleagues). It contains 1,000 problems spanning seven popular Python libraries, including NumPy and Pandas, with the problems sourced from real StackOverflow questions so they reflect the kind of work practitioners actually do.
The benchmark is designed to be both realistic and reliable. Its evaluation is multi-criteria: each solution is run against test cases to check functional correctness, and it is also checked against constraints on which APIs it may use, so a solution that produces the right output the wrong way can still be rejected. The authors report that, among the Codex-002 predictions their evaluation accepted, only 1.8 percent were actually incorrect. To keep models from simply recalling memorized training data, the problems were slightly modified from their original StackOverflow form. At release, the best system evaluated, Codex-002, reached just 43.3 percent accuracy.
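To make the idea of a two-part check concrete, here is a minimal sketch in plain Python. It is not the DS-1000 harness: the function names (`passes_tests`, `violates_constraints`, `accept`), the convention that a solution stores its answer in a variable called `result`, and the choice of "no explicit loops" as the constraint are all assumptions made for illustration.

```python
import ast


def violates_constraints(source, banned_nodes=(ast.For, ast.While)):
    """Return True if the source contains any banned syntax node.

    Hypothetical constraint: explicit `for`/`while` statements are not allowed.
    """
    tree = ast.parse(source)
    return any(isinstance(node, banned_nodes) for node in ast.walk(tree))


def passes_tests(source, test_cases, result_name="result"):
    """Run the candidate once per test case and compare its result.

    Each test case is (inputs, expected); `inputs` is a dict of variables
    made visible to the candidate code before it runs.
    """
    for inputs, expected in test_cases:
        namespace = dict(inputs)
        try:
            exec(source, namespace)  # illustration only: no sandboxing here
        except Exception:
            return False
        if namespace.get(result_name) != expected:
            return False
    return True


def accept(source, test_cases):
    """Accept only if the code is functionally correct AND respects the constraint."""
    return passes_tests(source, test_cases) and not violates_constraints(source)


test_cases = [({"xs": [1, 2, 3]}, 6), ({"xs": []}, 0)]
builtin_solution = "result = sum(xs)"
looped_solution = "result = 0\nfor x in xs:\n    result += x"

print(accept(builtin_solution, test_cases))        # True
print(passes_tests(looped_solution, test_cases))   # True: output is right
print(accept(looped_solution, test_cases))         # False: uses a banned loop
```

The looped solution shows why the second criterion matters: it passes every test case, yet it is rejected because it reaches the answer in a way the (hypothetical) problem statement disallows. A real harness would also need to isolate the executed code, which this sketch deliberately omits.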
DS-1000 mattered because earlier code benchmarks like HumanEval tested small self-contained functions, while real data work involves libraries, dataframes, and array operations. For businesses that rely on analytics and data engineering, DS-1000 gave a more honest measure of whether a code model can handle the everyday tasks of a working data scientist.