In-Datacenter Performance Analysis of a Tensor Processing Unit

This paper, presented at the 44th International Symposium on Computer Architecture (ISCA) in June 2017 and authored by Norman Jouppi, David Patterson, and a large team at Google, was the first detailed public description of the Tensor Processing Unit. The TPU is a custom application-specific integrated circuit (ASIC) that Google had been running in its datacenters since 2015 to accelerate the inference phase of neural networks, the step where a trained model produces predictions on live traffic.

The chip’s core was a matrix-multiply unit built from 65,536 8-bit multiply-accumulate cells arranged as a systolic array. It delivered a peak of 92 trillion operations per second and was backed by 28 mebibytes of software-managed on-chip memory. The authors measured the TPU against a contemporary server CPU (Intel Haswell) and GPU (Nvidia K80) on production workloads and reported it running roughly 15 to 30 times faster while delivering 30 to 80 times better performance per watt. They argued that its deterministic, single-threaded execution model met strict 99th-percentile latency targets better than the time-varying optimizations of general-purpose processors, such as caches, out-of-order execution, multithreading, and prefetching.
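The peak throughput figure can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not anything taken from the paper's methodology. It assumes the 700 MHz clock rate reported in the paper (not stated in this summary) and the usual convention that one multiply-accumulate counts as two operations. The function and constant names are illustrative.

```python
# Back-of-the-envelope check of the peak throughput figure.
MAC_CELLS = 65_536     # 8-bit multiply-accumulate cells (from the text)
OPS_PER_MAC = 2        # convention: one multiply plus one add per MAC
CLOCK_HZ = 700e6       # assumption: clock rate reported in the paper


def peak_ops_per_second(mac_cells, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak rate if every cell completes one MAC on every clock cycle."""
    return mac_cells * ops_per_mac * clock_hz


peak_tops = peak_ops_per_second(MAC_CELLS, CLOCK_HZ) / 1e12
print(f"{peak_tops:.2f} trillion operations per second")
```

Under these assumptions the result is about 91.75 trillion operations per second, which rounds to the quoted 92. This is a ceiling that assumes every cell is busy on every cycle, so sustained throughput on real workloads would be lower.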

The paper opens with the claim that major gains in cost, energy, and performance must now come from domain-specific hardware rather than general-purpose scaling, a thesis that has shaped the AI accelerator industry ever since. For a business reader, it documents the moment a major cloud provider showed that purpose-built silicon, not commodity chips, could underpin large-scale machine learning, a result that was followed by a wave of custom AI chip programs across the industry.
