Orca: A Distributed Serving System for Transformer-Based Generative Models

Orca, presented at OSDI 2022 by Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun of Seoul National University and FriendliAI, introduced the scheduling idea now widely known as continuous batching, which underpins modern LLM serving. Generative models produce text one token at a time, so a single request takes many model iterations, and the requests in a batch need different numbers of iterations. Traditional serving systems batch at the granularity of whole requests: a batch runs until its slowest request finishes, so requests that finish early cannot be returned until then, newly arrived requests cannot start, and GPU capacity is wasted on slots that have nothing left to generate.
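To make the waste concrete, here is a minimal sketch in Python of how much of a request-level batch does useful work. It is a simplified model, not code from the paper: the function name and the example batch are hypothetical, and it assumes every slot costs the same amount of compute in every iteration.

```python
def request_level_utilization(output_lengths):
    """Fraction of slot-iterations that produce a needed token when the
    whole batch keeps running until its longest request finishes."""
    if not output_lengths:
        raise ValueError("batch must contain at least one request")
    iterations = max(output_lengths)          # batch ends with the slowest request
    useful = sum(output_lengths)              # tokens the requests actually need
    total = iterations * len(output_lengths)  # slots computed regardless
    return useful / total


# Hypothetical batch: four requests needing 5, 10, 25, and 40 output tokens.
print(f"{request_level_utilization([5, 10, 25, 40]):.0%}")  # 50%
```

In this example half of the slot-iterations are spent on requests that have already finished, and the shortest request waits 35 extra iterations before it can be returned.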

Orca’s iteration-level scheduling instead schedules at the granularity of a single model iteration: after each iteration the scheduler can return finished requests and admit newly arrived ones, keeping the hardware busy. To combine batching with this finer scheduling inside a transformer, Orca added selective batching, which batches most operations across all requests but runs the attention operation separately for each request, because requests in the same batch have processed different numbers of tokens and their attention inputs cannot be stacked into one regular tensor. Evaluated on a GPT-3 175B model, Orca achieved 36.9 times the throughput of NVIDIA FasterTransformer at the same level of latency.
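Both ideas in the paragraph above can be sketched in a few lines of Python. This is an illustrative toy, not Orca's implementation: all function names are hypothetical, the scheduler assumes first-come-first-served admission, equal cost per iteration, and no memory limits, and the "model" works on scalar token values, with a multiplication standing in for token-wise operations such as linear layers.

```python
from collections import deque
import math

# --- Iteration-level scheduling (toy model) ---

def request_level_iterations(lengths, max_batch):
    """Iterations needed when each batch runs until its longest request ends."""
    total = 0
    for start in range(0, len(lengths), max_batch):
        total += max(lengths[start:start + max_batch])
    return total

def iteration_level_iterations(lengths, max_batch):
    """Iterations needed when free slots are refilled after every model step."""
    waiting = deque(lengths)  # tokens left to generate, per queued request
    running = []              # tokens left to generate, per in-flight request
    iterations = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One model iteration: every running request emits one token.
        running = [left - 1 for left in running]
        # Finished requests leave immediately, freeing their slots.
        running = [left for left in running if left > 0]
        iterations += 1
    return iterations

# --- Selective batching (toy model on scalar token values) ---

def attention_one_request(values):
    """Toy attention for the newest token of ONE request: softmax over that
    request's own tokens, then a weighted sum of them."""
    query = values[-1]
    scores = [query * key for key in values]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def selective_batch_step(requests, scale=0.5):
    """requests: non-empty token-value lists of differing lengths."""
    # Token-wise op: applied once over all tokens of all requests, flattened.
    flat = [scale * x for tokens in requests for x in tokens]
    # Attention: split the flat batch back per request and process each alone.
    outputs, start = [], 0
    for tokens in requests:
        end = start + len(tokens)
        outputs.append(attention_one_request(flat[start:end]))
        start = end
    return outputs


lengths = [2, 40, 3, 5, 4, 6]  # hypothetical output lengths, in tokens
print(request_level_iterations(lengths, max_batch=2))    # 51
print(iteration_level_iterations(lengths, max_batch=2))  # 40
print(selective_batch_step([[1.0, 2.0], [3.0]]))
```

With two slots, the five short requests in this hypothetical workload pass through one slot while the 40-token request occupies the other, so the iteration-level schedule finishes in 40 iterations instead of 51. A production scheduler would also have to account for the memory each request's attention keys and values occupy, which this sketch ignores.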

These ideas were adopted by serving systems like vLLM and TensorRT-LLM and became standard practice.

For a business, Orca's ideas are a major reason serving large models in production became much cheaper: the same GPUs can handle far more concurrent users when batch slots are refilled as soon as a request finishes instead of sitting idle until the slowest request in the batch completes.
