Ring Attention with Blockwise Transformers for Near-Infinite Context

Ring Attention, published by Hao Liu, Matei Zaharia, and Pieter Abbeel in October 2023, is a systems technique for training and running transformers on extremely long sequences by spreading the work across many devices. The core limit on context length is memory: the attention computation for a long sequence does not fit on a single accelerator. Ring Attention partitions the sequence into blocks and distributes them across devices arranged conceptually in a ring.
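To make the memory limit and the partitioning concrete, here is a rough back-of-the-envelope sketch. The function names are hypothetical. It assumes 16-bit values (2 bytes per element) and a sequence length that divides evenly across devices. It counts only the score matrix that a naive attention implementation would materialize for one head. Memory-efficient kernels avoid storing that matrix, so treat the number as an illustration of the scaling problem rather than a measurement of any real system.

```python
def naive_score_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes for one seq_len x seq_len attention score matrix (one head, one sequence).

    Assumption: 2 bytes per element, i.e. a 16-bit float format.
    """
    return seq_len * seq_len * bytes_per_element


def split_into_blocks(seq_len, num_devices):
    """Return one contiguous (start, end) token range per device.

    Assumption: seq_len divides evenly by num_devices.
    """
    if seq_len % num_devices != 0:
        raise ValueError("seq_len must be divisible by num_devices")
    block = seq_len // num_devices
    return [(d * block, (d + 1) * block) for d in range(num_devices)]


if __name__ == "__main__":
    # Quadratic growth: 10x more tokens means 100x more score-matrix memory.
    for n in (10_000, 100_000, 1_000_000):
        print(n, "tokens ->", naive_score_matrix_bytes(n) / 1e9, "GB")
    # Each device owns one contiguous slice of the sequence.
    print(split_into_blocks(1_000_000, 8)[:2])
```

Under these assumptions a single million-token score matrix would take about 2 TB, far beyond one accelerator, while each of 8 devices would own a 125,000-token slice.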

The clever part is the communication pattern. Each device keeps the queries for its own chunk and computes blockwise attention and feedforward over it. While it works on the key-value block it currently holds, it sends that block to the next device in the ring and receives a new one from the previous device. As long as each block is large enough that computing on it takes at least as long as transferring it, this communication is fully overlapped with computation and adds essentially no extra time. Because each device only ever holds part of the sequence, the maximum context grows in proportion to the number of devices. The attention it computes is exact, not an approximation. The method can reach context lengths device-count times longer than prior memory-efficient transformers, scaling to millions of tokens.
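The ring pattern can be sketched as a single-process simulation. This is a minimal illustration under stated assumptions, not the authors' implementation. It uses plain Python lists, one attention head, no causal mask, and no feedforward layer. The "devices" are just list indices, so it does not show the overlap of communication with computation. The names `ring_attention` and `full_attention` are hypothetical. Each simulated device keeps running softmax statistics (a maximum, a normalizer, and a weighted sum), so it can fold in one key-value block at a time and still produce exact attention.

```python
import math


def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v over the whole sequence."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate ring attention: device i keeps q_blocks[i]; KV blocks rotate."""
    n = len(q_blocks)
    d = len(q_blocks[0][0])
    dv = len(v_blocks[0][0])
    # Running softmax state per query row: max score, normalizer, weighted sum.
    state = [[{"m": -math.inf, "z": 0.0, "acc": [0.0] * dv} for _ in qb]
             for qb in q_blocks]
    kv = list(zip(k_blocks, v_blocks))  # kv[i] = block currently held by device i
    for _ in range(n):  # after n steps every device has seen every KV block
        for dev in range(n):
            k_blk, v_blk = kv[dev]
            for qi, st in zip(q_blocks[dev], state[dev]):
                for kj, vj in zip(k_blk, v_blk):
                    s = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                    m_new = max(st["m"], s)
                    scale = math.exp(st["m"] - m_new)  # rescale earlier partial sums
                    w = math.exp(s - m_new)
                    st["z"] = st["z"] * scale + w
                    st["acc"] = [a * scale + w * x for a, x in zip(st["acc"], vj)]
                    st["m"] = m_new
        # Ring step: device i receives the block device i-1 was holding.
        kv = [kv[(dev - 1) % n] for dev in range(n)]
    return [[[a / st["z"] for a in st["acc"]] for st in dev_state]
            for dev_state in state]
```

Splitting an 8-token sequence into 4 blocks of 2 and concatenating the per-device outputs gives the same result as `full_attention` on the whole sequence, up to floating-point rounding. In a real multi-device system the inner loops would be batched matrix operations and the ring step would be an asynchronous send and receive issued while the current block is being processed.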

Ring Attention is widely regarded as one of the building blocks behind the very long context windows later announced by frontier labs.

For a business, it is part of the engineering that turned million-token context from a research aspiration into a feature you can actually buy, by letting model providers scale context across their hardware fleet.
