“Hyena Hierarchy: Towards Larger Convolutional Language Models,” submitted to arXiv on February 21, 2023 by Michael Poli, Stefano Massaroli, and colleagues including Tri Dao and Christopher Ré, proposed a way to build language models without the attention mechanism that defines Transformers. The motivation is attention’s quadratic cost in sequence length, which limits how much context a model can afford to process.
Hyena replaces the attention operator with a short recurrence that alternates two ingredients: long convolutions, whose filters are generated implicitly by a small neural network rather than stored as explicit parameters, and data-controlled gating, an element-wise multiplication by projections of the input that lets the model modulate the signal. Because the long convolutions can be evaluated with the fast Fourier transform, the resulting operator has subquadratic cost in sequence length yet can still mix information across very long ranges, similar in spirit to the state space models it builds on.
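To make the recurrence concrete, here is a minimal single-channel sketch in plain Python. It is an illustration under simplifying assumptions, not the authors' implementation: the value and gate sequences are passed in directly rather than computed as learned projections of the input, the filter network is a tiny sine-activated MLP with random (untrained) weights and an exponential decay window, and the names `implicit_filter`, `causal_conv_fft`, and `hyena_operator` are hypothetical.

```python
import cmath
import math
import random


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def causal_conv_fft(h, u):
    """Causal convolution y[t] = sum_{s<=t} h[s] * u[t-s] in O(L log L) time."""
    L = len(u)
    n = 1
    while n < 2 * L:  # zero-pad so the circular convolution does not wrap around
        n *= 2
    H = fft([complex(x) for x in h] + [0j] * (n - L))
    U = fft([complex(x) for x in u] + [0j] * (n - L))
    y = fft([a * b for a, b in zip(H, U)], invert=True)
    return [y[t].real / n for t in range(L)]  # divide by n to normalize the inverse


def implicit_filter(L, rng, hidden=8, decay=0.05):
    """Length-L filter generated from position t by a tiny MLP.

    The parameter count (3 * hidden) does not depend on L; the layer sizes,
    sine activation, and decay rate are illustrative assumptions.
    """
    w1 = [rng.gauss(0, 1) for _ in range(hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(hidden)]
    w2 = [rng.gauss(0, 1) / hidden for _ in range(hidden)]
    h = []
    for t in range(L):
        feats = [math.sin(w * (t / L) + b) for w, b in zip(w1, b1)]
        h.append(math.exp(-decay * t) * sum(a * f for a, f in zip(w2, feats)))
    return h


def hyena_operator(v, gates, filters):
    """Alternate a long convolution with element-wise gating, once per (gate, filter) pair."""
    z = list(v)
    for x, h in zip(gates, filters):
        conv = causal_conv_fft(h, z)                # long convolution with an implicit filter
        z = [xi * ci for xi, ci in zip(x, conv)]    # data-controlled gating
    return z


rng = random.Random(0)
L = 16
v = [rng.gauss(0, 1) for _ in range(L)]                          # stand-in for the value projection
gates = [[rng.gauss(0, 1) for _ in range(L)] for _ in range(2)]  # stand-ins for the gate projections
filters = [implicit_filter(L, rng) for _ in range(2)]
y = hyena_operator(v, gates, filters)
```

Each step costs O(L log L) because of the FFT, and the number of filter parameters stays fixed as L grows, which is what lets the filter span the whole sequence without the quadratic cost of attention.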
The paper reports that Hyena matched Transformer quality on language modeling benchmarks while using about 20 percent less training compute at a sequence length of 2K tokens. The operator was also much faster on long inputs: roughly twice as fast as highly optimized attention at 8K tokens and about 100 times faster at 64K tokens.
Hyena is part of the broader search for efficient alternatives to attention. For applications that need very long context, such as analyzing whole documents, codebases, or genomic sequences, architectures like Hyena promise to make long-range modeling far cheaper than standard Transformers allow.