A systolic array is a hardware design in which small, identical processing cells, arranged in a grid, each do a tiny piece of arithmetic and then pass their results to neighboring cells in a rhythmic, clocked flow. The name, coined by H. T. Kung and Charles Leiserson in a 1978 paper, borrows from physiology: data pulses through the array the way blood is pumped through the body by the systole of the heart. Each datum is read from memory once and does useful work in many cells as it flows through the array; only the final results are written back.
That property is exactly what matrix multiplication wants, and matrix multiplication is what neural networks spend almost all their time doing. In a systolic matrix engine the weights are loaded into the grid and held in place while activations stream in from one edge; partial sums accumulate as the data marches across, and finished results emerge from the far side. Because operands are reused in place rather than repeatedly fetched, the design avoids the energy and time cost of shuttling numbers between registers and memory - the dominant overhead in ordinary processors.
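The dataflow described above can be sketched as a small cycle-by-cycle simulation. This is an illustrative toy, not a model of any particular chip: the function name `systolic_matmul`, the choice to feed activations from the left edge and let partial sums move downward, and the one-cycle stagger between rows are all assumptions made for the example.

```python
def systolic_matmul(X, W):
    """Compute X @ W by simulating a weight-stationary systolic grid.

    X is an M x K list of activation rows, W is a K x N list of weight rows.
    Cell (k, n) permanently holds W[k][n] (assumed layout for this sketch).
    """
    M, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation each cell passed right last cycle
    p_reg = [[0] * N for _ in range(K)]  # partial sum each cell passed down last cycle
    Y = [[0] * N for _ in range(M)]

    # The last result leaves the grid at cycle (M-1) + (K-1) + (N-1).
    for t in range(M + K + N - 2):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Left edge: row k receives X[m][k] at cycle m + k (staggered feed).
                    m = t - k
                    a_in = X[m][k] if 0 <= m < M else 0
                else:
                    a_in = a_reg[k][n - 1]  # activation from the left neighbor
                p_in = p_reg[k - 1][n] if k > 0 else 0  # partial sum from above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * W[k][n]  # one multiply-accumulate
        a_reg, p_reg = new_a, new_p

        # Finished sums emerge from the bottom row, one diagonal per cycle.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = p_reg[K - 1][n]
    return Y


X = [[1, 2], [3, 4], [5, 6]]      # 3 x 2 activations
W = [[7, 8, 9], [10, 11, 12]]     # 2 x 3 weights, held in the grid
print(systolic_matmul(X, W))      # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
```

Note that each activation is read from `X` exactly once, at the left edge, and is then reused by every cell in its row; the weights are never re-read at all. That reuse is the property the paragraph above describes.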
The clearest modern example is Google’s Tensor Processing Unit. The 2017 paper describing the first TPU centers on a matrix multiply unit of 65,536 eight-bit multiply-accumulate cells - a 256 by 256 systolic array - delivering a peak of 92 trillion operations per second. On neural-network inference, the paper reports roughly 15 to 30 times the speed and 30 to 80 times the performance per watt of the contemporary CPU and GPU it was compared against. The idea, dormant in textbooks for decades, found its killer application in deep learning.
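The headline figures can be checked with a few lines of arithmetic. Two inputs here are not stated in the text above and are labeled as such in the code: the 700 MHz clock rate, which comes from the same TPU paper, and the common convention of counting one multiply-accumulate as two operations.

```python
rows, cols = 256, 256
macs = rows * cols                 # multiply-accumulate cells in the grid

ops_per_mac = 2                    # assumption: one multiply plus one add per cell per cycle
clock_hz = 700_000_000             # assumption: 700 MHz clock, as given in the TPU paper

peak_ops_per_second = macs * ops_per_mac * clock_hz
peak_tops = peak_ops_per_second / 1e12

print(macs, round(peak_tops, 2))   # prints: 65536 91.75
```

The result, about 91.75 trillion operations per second, rounds to the quoted 92. It is a peak figure that assumes every cell does useful work on every cycle.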
Why business readers should care: the systolic array is the architectural reason custom AI chips can beat general-purpose processors on cost and power for the matrix math that dominates AI. The idea dates to 1978, and it quietly underpins a large share of today’s AI training and inference hardware.