This paper by Amir Gholami, Michael Mahoney, Kurt Keutzer, and colleagues at UC Berkeley, published in IEEE Micro in 2024, makes a quantitative case that the main bottleneck for modern AI is no longer raw compute but memory. The authors track how the relevant hardware metrics have scaled over roughly two decades and find a widening gap: peak server arithmetic throughput has grown about 3 times every two years, while memory bandwidth has grown only about 1.6 times and interconnect bandwidth about 1.4 times in each two-year interval.
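To see how quickly these per-two-year rates diverge, the short sketch below compounds them over a 20-year span. The rates are the ones quoted above; the exact 20-year horizon and the names `compound_growth`, `RATES`, and `YEARS` are assumptions made for illustration, not taken from the paper.

```python
def compound_growth(rate_per_two_years: float, years: float) -> float:
    """Total growth factor after `years`, given a growth factor per two-year interval."""
    return rate_per_two_years ** (years / 2)


# Growth factors per two years, as quoted in the summary above.
RATES = {
    "peak compute": 3.0,
    "memory bandwidth": 1.6,
    "interconnect bandwidth": 1.4,
}

# Assumption: "roughly two decades" is treated as exactly 20 years.
YEARS = 20

growth = {name: compound_growth(rate, YEARS) for name, rate in RATES.items()}

for name, factor in growth.items():
    print(f"{name:>24}: ~{factor:,.0f}x over {YEARS} years")

# How far compute has pulled ahead of memory bandwidth over the same span.
gap = growth["peak compute"] / growth["memory bandwidth"]
print(f"compute vs. memory bandwidth gap: ~{gap:,.0f}x")
```

Under these assumptions, compute grows by roughly 59,000 times, memory bandwidth by roughly 110 times, and interconnect bandwidth by roughly 29 times, so a modest-looking difference in the two-year rate compounds into a gap of several hundred times.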
The consequence is what the authors call a memory wall. As models and their context windows grow, the work of training and especially serving them increasingly waits on data being moved to and from memory rather than on the arithmetic units themselves. This is particularly severe for decoder-style Transformers used in text generation, where each new token requires re-reading large amounts of model weights and cached state. The paper analyzes both encoder and decoder architectures to show where the constraint bites hardest.
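A back-of-the-envelope roofline estimate shows why token-by-token generation tends to be memory-bound. This is a simplified sketch, not the paper's own analysis: the accelerator figures, the 7-billion-parameter model size, and the names `Accelerator` and `decode_step` are all hypothetical, and the model counts only weight traffic, ignoring cached state and activations.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    """Hypothetical accelerator; the numbers used below are illustrative only."""
    peak_flops: float     # peak arithmetic throughput, FLOP/s
    mem_bandwidth: float  # memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) above which the chip is compute-bound."""
        return self.peak_flops / self.mem_bandwidth


def decode_step(params: float, batch: int, bytes_per_param: float, hw: Accelerator):
    """Estimate one generation step, counting weight traffic only.

    Simplifying assumptions: each weight contributes one multiply-add
    (2 FLOPs) per sequence in the batch, and the weights are read from
    memory once per step and shared across the whole batch.
    """
    flops = 2.0 * params * batch
    bytes_moved = params * bytes_per_param
    intensity = flops / bytes_moved            # FLOP per byte
    compute_time = flops / hw.peak_flops       # seconds if purely compute-limited
    memory_time = bytes_moved / hw.mem_bandwidth  # seconds if purely memory-limited
    return intensity, compute_time, memory_time


# Hypothetical hardware: 1 PFLOP/s peak compute, 2 TB/s memory bandwidth.
hw = Accelerator(peak_flops=1e15, mem_bandwidth=2e12)

# Hypothetical model: 7e9 parameters stored at 2 bytes each, batch size 1.
intensity, compute_time, memory_time = decode_step(
    params=7e9, batch=1, bytes_per_param=2.0, hw=hw
)

print(f"ridge point:          {hw.ridge_point:.0f} FLOP/byte")
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"compute-limited time: {compute_time * 1e3:.3f} ms")
print(f"memory-limited time:  {memory_time * 1e3:.3f} ms")
```

With these made-up numbers, a single-sequence step performs about 1 FLOP per byte read against a ridge point of 500 FLOP/byte, so streaming the weights (about 7 ms) dominates the arithmetic (about 0.014 ms). Raising `batch` or lowering `bytes_per_param` increases the intensity, which is the intuition behind batching and quantization as responses to the memory wall.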
The authors argue that closing the gap will require rethinking model architectures, training methods, and deployment strategies rather than simply waiting for faster chips. For a general reader, the paper explains why so much AI hardware engineering now centers on high-bandwidth memory, faster interconnects, and techniques like quantization: the industry is fighting a memory problem, not just a compute problem.