NVIDIA Volta and the V100: Tensor Cores Arrive

In May 2017, at its GTC conference, NVIDIA introduced the Volta architecture and its flagship data-center GPU, the Tesla V100. The defining feature was the Tensor Core, a new kind of execution unit built specifically to accelerate the dense matrix multiplications at the heart of deep learning. Each Tensor Core performs a small matrix multiply-accumulate (D = A × B + C on 4×4 matrices) in mixed precision: the inputs A and B are 16-bit floating point (FP16), while the products are summed in 32-bit floating point (FP32). Because the whole operation is a single hardware step rather than a sequence of individual multiplies and adds, throughput for neural network training and inference rises sharply compared with the general-purpose floating-point units used before.
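To make the mixed-precision idea concrete, the following is a minimal sketch in plain Python that emulates one multiply-accumulate on a 4×4 tile, the tile size NVIDIA describes for Volta. The helper names (`to_fp16`, `to_fp32`, `mma_4x4`) and the `acc_round` parameter are illustrative assumptions, not part of any NVIDIA API, and the sketch models only the number formats, not the hardware's internal summation order or its speed.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision (FP32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mma_4x4(a, b, c, acc_round=to_fp32):
    """Emulate D = A x B + C on 4x4 tiles (hypothetical helper).

    A and B are rounded to FP16 before multiplying. The running sum is
    rounded with acc_round after every addition: to_fp32 mimics the
    mixed-precision scheme, to_fp16 shows what a pure 16-bit sum would do.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = acc_round(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exactly representable in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = acc_round(acc + prod)
            d[i][j] = acc
    return d

ones = [[1.0] * 4 for _ in range(4)]
big = [[4096.0] * 4 for _ in range(4)]

d_mixed = mma_4x4(ones, ones, big)                     # FP32 accumulation
d_half = mma_4x4(ones, ones, big, acc_round=to_fp16)   # FP16 accumulation

print(d_mixed[0][0])  # 4100.0
print(d_half[0][0])   # 4096.0
```

Near 4096 the gap between neighboring FP16 values is 4, so each added 1.0 is rounded away when the sum itself is held in FP16. Keeping the sum in FP32 preserves all four contributions, which is the reason for accumulating at the higher precision.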

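The V100 packed 21 billion transistors and paired its Tensor Cores with high-bandwidth HBM2 memory and the NVLink interconnect for fast GPU-to-GPU communication. NVIDIA claimed up to 12 times the peak deep learning training throughput of its Pascal-generation predecessor, the P100, and the chip became the workhorse behind a generation of landmark models. The same year, mixed precision training research from NVIDIA and Baidu showed how to train with 16-bit arithmetic while matching the accuracy of 32-bit training, chiefly by keeping a 32-bit master copy of the weights and scaling the loss so that small gradients do not underflow. That turned the hardware feature into a practical recipe.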
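The mixed precision recipe is usually described as having two main ingredients: loss scaling and an FP32 master copy of the weights. The sketch below illustrates why each one matters, using plain Python to emulate FP16 and FP32 rounding. The gradient value, the loss scale of 1024, the update size, and all function names are assumptions chosen for illustration; a real training framework applies the scale to the loss before backpropagation rather than to a single gradient directly.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision (FP32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

LOSS_SCALE = 1024.0  # illustrative value, not a recommended setting

def scaled_fp16_gradient(true_grad: float, loss_scale: float) -> float:
    """Store the scaled gradient in FP16, then unscale it in FP32."""
    stored = to_fp16(true_grad * loss_scale)
    return to_fp32(stored / loss_scale)

# 1. Loss scaling: a tiny gradient underflows in FP16 unless it is scaled up.
tiny_grad = 1e-8
naive = to_fp16(tiny_grad)                               # flushed to zero
recovered = scaled_fp16_gradient(tiny_grad, LOSS_SCALE)  # close to 1e-8

# 2. Master weights: small updates vanish when the weight lives in FP16.
def sgd_steps(weight: float, update: float, steps: int, store) -> float:
    """Apply `steps` identical updates, rounding the weight with `store`."""
    for _ in range(steps):
        weight = store(weight - update)
    return weight

w_fp16 = sgd_steps(1.0, 1e-4, 10, to_fp16)    # every update is rounded away
w_master = sgd_steps(1.0, 1e-4, 10, to_fp32)  # updates accumulate

print(naive, recovered)
print(w_fp16, w_master)
```

In this toy setup the unscaled gradient becomes exactly zero in FP16, while the scaled one survives with under one percent error. Likewise, the FP16 weight never moves from 1.0, whereas the FP32 master copy ends up near 0.999 after ten updates.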

Volta marked a turning point in how AI accelerators are designed: rather than treating deep learning as just another workload for graphics chips, NVIDIA began carving out dedicated silicon for it. For a business reader, the V100 and its Tensor Cores are where the GPU stopped being a repurposed graphics card and became a purpose-built AI engine, a design philosophy every later NVIDIA data-center chip has extended.