RWKV: Reinventing RNNs for the Transformer Era

RWKV, presented in “RWKV: Reinventing RNNs for the Transformer Era” and submitted to arXiv on May 22, 2023 by Bo Peng, Eric Alcaide, Quentin Anthony, and many community collaborators, set out to combine the best traits of two architectures. Transformers train efficiently in parallel but are expensive at inference because self-attention compute and memory scale quadratically with sequence length. Recurrent networks are cheap at inference, with cost that grows only linearly in sequence length, but have historically been hard to parallelize during training and to scale up. RWKV aims to get both: parallel training and cheap recurrent inference.

The architecture uses a linear-attention-style mechanism that can be expressed two ways. During training it runs as a parallelizable computation, like a Transformer, so it scales on modern hardware. During inference it can be rewritten as a recurrent network, processing one token at a time with constant memory and compute per step regardless of how long the sequence has grown. This sidesteps the Transformer’s quadratic cost and the ever-growing key-value cache.
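The equivalence between the two forms can be illustrated with a toy sketch. The Python below is a simplified, single-channel stand-in for this kind of decayed weighted average, not the paper's exact formulation: it omits details such as numerical stabilization, and the names wkv_parallel, wkv_recurrent, and decay are illustrative assumptions. The first function recomputes each output from the whole prefix, so every position is independent and could be computed in parallel. The second carries only two running sums from step to step, yet both return the same values.

```python
import math


def wkv_parallel(keys, values, decay):
    """Training-style view: each output is computed directly from the full prefix.

    Output t is a weighted average of values[0..t], where the weight of
    position i is exp(keys[i]) discounted by exp(-decay) per step of distance.
    """
    outputs = []
    for t in range(len(keys)):
        weights = [math.exp(-(t - i) * decay + keys[i]) for i in range(t + 1)]
        numerator = sum(w * v for w, v in zip(weights, values))
        outputs.append(numerator / sum(weights))
    return outputs


def wkv_recurrent(keys, values, decay):
    """Inference-style view: one token at a time with a fixed-size state."""
    numerator, denominator = 0.0, 0.0  # the entire recurrent state
    outputs = []
    for k, v in zip(keys, values):
        # Decay the accumulated history, then fold in the current token.
        numerator = math.exp(-decay) * numerator + math.exp(k) * v
        denominator = math.exp(-decay) * denominator + math.exp(k)
        outputs.append(numerator / denominator)
    return outputs


# Hypothetical toy inputs chosen only for illustration.
keys = [0.1, -0.3, 0.5, 0.0]
values = [1.0, 2.0, -1.0, 0.5]
decay = 0.5

parallel_out = wkv_parallel(keys, values, decay)
recurrent_out = wkv_recurrent(keys, values, decay)
print(all(abs(a - b) < 1e-9 for a, b in zip(parallel_out, recurrent_out)))
```

Because the recurrent form keeps only a running numerator and denominator, its state does not grow with the number of tokens processed, which is the property the inference mode relies on.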

The authors scaled RWKV up to 14 billion parameters, the largest dense recurrent network trained at that time, and reported performance comparable to similarly sized Transformers. Notably, RWKV emerged largely from open, community-driven development rather than a single corporate lab.

For deployment, RWKV’s constant per-token inference cost is attractive: it makes long-context and on-device generation cheaper and more predictable. The paper was an early and prominent demonstration that recurrent designs could compete with Transformers at large scale.
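A rough back-of-the-envelope comparison shows why this matters for memory. The sketch below assumes a hypothetical model shape (32 layers, hidden size 4096, 16-bit values), standard multi-head attention that caches one key vector and one value vector per layer for every token, and a recurrent model that keeps a small fixed number of vectors per layer. The figure of five state vectors per layer is an illustrative assumption, not a number taken from the paper.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_value=2):
    """Memory for a Transformer key-value cache: grows linearly with seq_len."""
    # One key and one value vector (hence the factor 2) per layer per token.
    return 2 * n_layers * seq_len * d_model * bytes_per_value


def recurrent_state_bytes(n_layers, d_model, vectors_per_layer=5, bytes_per_value=2):
    """Memory for a fixed-size recurrent state: independent of sequence length."""
    # vectors_per_layer is a hypothetical count chosen for illustration.
    return n_layers * vectors_per_layer * d_model * bytes_per_value


MIB = 1024 * 1024
state_mib = recurrent_state_bytes(32, 4096) / MIB

for seq_len in (1024, 8192, 65536):
    cache_mib = kv_cache_bytes(32, 4096, seq_len) / MIB
    print(f"{seq_len:>6} tokens: KV cache {cache_mib:.2f} MiB, recurrent state {state_mib:.2f} MiB")
```

Under these assumptions the cache grows in proportion to context length while the recurrent state stays the same size, which is what makes memory use predictable on constrained devices.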
