Retentive Network: A Successor to Transformer for Large Language Models

“Retentive Network: A Successor to Transformer for Large Language Models” was submitted to arXiv on July 17, 2023 by Yutao Sun, Li Dong, and colleagues at Microsoft Research. It proposed RetNet as a foundation architecture that combines training parallelism, low-cost inference, and strong performance, three goals the authors framed as an “impossible triangle” for prior designs.

The heart of RetNet is a retention mechanism that replaces attention and can be computed in three mathematically equivalent ways. A parallel form enables efficient training on modern accelerators, like a Transformer. A recurrent form enables inference with constant, O(1) memory cost per step regardless of sequence length, like a recurrent network. A chunkwise recurrent form blends the two for efficient handling of very long sequences by processing them in blocks. Having all three views of the same computation is what lets RetNet train fast and still deploy cheaply.
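The equivalence of the three forms can be checked numerically. The following is a minimal single-head sketch in Python with NumPy, not the authors' implementation: it assumes a single scalar decay rate (called gamma here) and leaves out other components of the full model such as positional rotation, normalization, gating, and multiple heads. The function names and test dimensions are illustrative assumptions.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V, where D combines causal mask and decay."""
    n = Q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    # D[i, j] = gamma**(i - j) when i >= j, else 0
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: one fixed-size state S, updated once per token."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))  # state size does not depend on sequence length
    out = np.zeros((Q.shape[0], d_v))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def retention_chunkwise(Q, K, V, gamma, chunk):
    """Chunkwise form: parallel inside each chunk, recurrent across chunks."""
    n, d_v = Q.shape[0], V.shape[1]
    R = np.zeros((K.shape[1], d_v))  # state carried from earlier chunks
    out = np.zeros((n, d_v))
    for start in range(0, n, chunk):
        q = Q[start:start + chunk]
        k = K[start:start + chunk]
        v = V[start:start + chunk]
        b = q.shape[0]  # the last chunk may be shorter
        j = np.arange(b)
        inner = retention_parallel(q, k, v, gamma)
        # earlier chunks contribute through R, decayed by position in chunk
        cross = (q @ R) * (gamma ** (j + 1))[:, None]
        out[start:start + b] = inner + cross
        # fold this chunk into the carried state
        R = (gamma ** b) * R + k.T @ (v * (gamma ** (b - 1 - j))[:, None])
    return out

# Illustrative check on random inputs (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
par = retention_parallel(Q, K, V, 0.9)
rec = retention_recurrent(Q, K, V, 0.9)
chk = retention_chunkwise(Q, K, V, 0.9, chunk=4)
assert np.allclose(par, rec) and np.allclose(par, chk)
```

In the recurrent function, the state S keeps the same shape however many tokens have been processed, which is where the constant per-step memory comes from. The parallel function instead builds a full position-by-position matrix, which suits training on accelerators but grows with sequence length.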

The authors reported that RetNet achieved competitive language modeling results while offering favorable inference cost, memory use, and latency compared to Transformers, especially as model size and sequence length grew.

RetNet belongs to the same family as RWKV and the state space models, all seeking an architecture that avoids the cost of Transformer attention, which is quadratic in sequence length during training and requires a key-value cache that grows with every generated token at inference. For organizations serving large language models at scale, where inference cost dominates, the promise of constant per-token memory is directly tied to the economics of deployment.

Last verified June 7, 2026