RULER is a long-context benchmark introduced by NVIDIA researchers in an April 2024 paper, whose subtitle asks “what’s the real context size of your long-context language models?” The authors argue that the popular needle-in-a-haystack test, which checks only whether a model can find one planted fact in a long document, is too easy and overstates how much context a model can truly use. RULER instead generates synthetic tasks with adjustable sequence length and difficulty, including retrieving multiple needles, tracing variables through long chains, aggregating information, and answering multi-hop questions.
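To make the idea of a configurable synthetic task concrete, here is a minimal sketch of a multi-needle retrieval generator. It is not RULER's actual code: the function names, the filler text, the needle wording, and the substring-based scoring are all assumptions chosen for illustration, and length is controlled by a filler-sentence count rather than a real token count.

```python
import random

# Hypothetical filler text; any repetitive, digit-free prose would do.
FILLER = "The grass is green. The sky is blue. The sun is yellow."


def make_multi_needle_task(num_filler_sentences, num_needles, seed=0):
    """Hide `num_needles` key/value facts at random positions in filler text.

    Returns (prompt, needles), where `needles` maps each key to its value.
    Raising `num_filler_sentences` lengthens the context; raising
    `num_needles` makes the retrieval harder.
    """
    rng = random.Random(seed)
    sentences = [FILLER] * num_filler_sentences

    # Draw distinct keys, each paired with a random 7-digit value.
    needles = {}
    while len(needles) < num_needles:
        key = f"key-{rng.randrange(10**6):06d}"
        needles[key] = f"{rng.randrange(10**7):07d}"

    # Insert each needle sentence at a random position in the haystack.
    for key, value in needles.items():
        position = rng.randrange(len(sentences) + 1)
        sentences.insert(position, f"The special magic number for {key} is {value}.")

    question = "What are the special magic numbers for " + ", ".join(needles) + "?"
    return " ".join(sentences) + "\n" + question, needles


def score(response, needles):
    """Fraction of needle values that appear verbatim in the response."""
    found = sum(1 for value in needles.values() if value in response)
    return found / len(needles)


prompt, needles = make_multi_needle_task(num_filler_sentences=200, num_needles=4)
```

Sweeping `num_filler_sentences` while holding `num_needles` fixed gives a length curve for one difficulty level; sweeping both gives the kind of length-by-difficulty grid the paragraph above describes.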
The headline finding is sobering. The team evaluated 17 long-context models and found that, although all of them claim context windows of 32,000 tokens or more, only about half could maintain satisfactory performance at 32,000 tokens once the tasks went beyond simple retrieval. Performance dropped sharply as context length grew, revealing a large gap between advertised context size and effective context size. Because the tasks are synthetic and configurable, the benchmark can be scaled up to test new claims of ever-longer context.
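One way to turn "effective context size" into a number is to take the longest tested length at which a model's score still clears a chosen bar. The sketch below assumes that reading. The function name, the requirement that every shorter length also pass, the threshold value, and the example scores are all illustrative assumptions, not figures from the paper.

```python
def effective_context_length(scores_by_length, threshold):
    """Return the longest tested length whose score, and every shorter
    tested length's score, is at least `threshold`.

    `scores_by_length` maps context length in tokens to an accuracy in [0, 1].
    Returns None if even the shortest tested length falls below the bar.
    """
    effective = None
    for length in sorted(scores_by_length):
        if scores_by_length[length] < threshold:
            break  # stop at the first length that fails the bar
        effective = length
    return effective


# Made-up scores for a hypothetical model, for illustration only.
example_scores = {4_000: 0.95, 8_000: 0.93, 16_000: 0.88, 32_000: 0.71, 64_000: 0.52}
print(effective_context_length(example_scores, threshold=0.85))  # prints 16000
```

In this invented example the model would be advertised at 64,000 tokens but would have an effective length of 16,000 under the chosen threshold, which is the kind of gap the benchmark is designed to expose.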
This matters directly for anyone building on long-context models, such as systems that feed in entire contracts, codebases, or transcripts. RULER shows that a large advertised context window does not guarantee the model actually reasons over all of it, so the claimed limit should be tested, not trusted.