SGLang: Efficient Execution of Structured Language Model Programs

SGLang, introduced by Lianmin Zheng and colleagues in December 2023, is both a programming language and a runtime for applications that make many language model calls: programs that combine multi-step prompting, control flow, parallel calls, and structured inputs and outputs. Such programs were hard to run efficiently because each call was treated independently, so work that overlapped across calls was repeated.

SGLang’s frontend gives developers primitives for generation and parallelism, which simplifies how complex LLM workflows are expressed. Its runtime adds two key optimizations. RadixAttention reuses the key-value cache across calls that share a common prefix, so repeated context, such as a long system prompt or shared few-shot examples, is not recomputed for every call. Compressed finite-state machines accelerate structured decoding, for example generating JSON faster by skipping over fixed parts of the format. Together these yielded up to 6.4 times higher throughput than existing inference systems across a range of models and tasks.
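The prefix-reuse idea can be illustrated with a small sketch. This is not SGLang's implementation: it is a toy token-level trie standing in for a radix tree (which would additionally compress chains of single-child nodes), and it counts tokens instead of storing key-value tensors. The names `PrefixCache`, `match_and_insert`, and `tokens_to_compute`, and the token ids, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # One child per token id. A real radix tree would merge runs of
    # single-child nodes into one edge; this toy trie does not.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy stand-in for a KV cache indexed by token prefix (hypothetical)."""

    def __init__(self):
        self.root = Node()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached,
        then record the remaining tokens as cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Cache miss: from here on every node is new, so later
                # lookups on this path also miss.
                child = Node()
                node.children[tok] = child
            else:
                matched += 1
            node = child
        return matched


def tokens_to_compute(prompts):
    """Return (total prompt tokens, tokens that still need computing)."""
    cache = PrefixCache()
    total = 0
    computed = 0
    for prompt in prompts:
        reused = cache.match_and_insert(prompt)
        total += len(prompt)
        computed += len(prompt) - reused
    return total, computed


# Four calls sharing a 100-token system prompt, each with one distinct
# trailing token. All token ids are made up.
system_prompt = list(range(100))
prompts = [system_prompt + [1000 + i] for i in range(4)]
print(tokens_to_compute(prompts))  # (404, 104)
```

In this made-up workload, only the first call pays for the shared 100 tokens; the other three each compute a single new token, so 104 of the 404 prompt tokens are processed instead of all of them.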

SGLang became a widely used open-source serving framework alongside vLLM and TensorRT-LLM, especially for agentic and structured workloads.

For a business, SGLang shows how much serving cost hides in redundant computation: reusing shared context across many calls can multiply the effective capacity of the same hardware.
