CUDA

CUDA (Compute Unified Device Architecture) is NVIDIA’s platform and programming model for running general-purpose computation on its GPUs. Introduced in 2007, it gave programmers a way to write parallel code in a familiar C/C++ dialect and have that code execute across the hundreds, and later thousands, of small cores on a graphics chip. Before CUDA, using a GPU for non-graphics work meant disguising the computation as a rendering operation, typically by packing data into textures and running shader programs over it; CUDA exposed the hardware directly as a parallel processor.

The core abstraction is the kernel: a function that runs on the GPU in many parallel instances at once. NVIDIA’s CUDA C++ Programming Guide describes the model in terms of threads grouped into blocks, blocks grouped into a grid, and a memory hierarchy that the programmer manages explicitly (per-thread registers and local memory, per-block shared memory, and device-wide global memory). As the guide puts it, CUDA C++ “extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular functions.” A small set of language extensions (such as the __global__ function qualifier and the <<<...>>> execution-configuration syntax used to launch a kernel) is enough to move work onto the device.
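The kernel model described above can be sketched with an element-wise vector addition. This is a minimal illustration rather than code from NVIDIA’s guide: the names vector_add, add_on_device, and check are hypothetical, the block size of 256 threads is an arbitrary choice, and the sketch assumes both inputs are non-empty and of equal length.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Kernel: every launched thread runs this function once.
// Each thread computes its global index from its block and thread
// coordinates and handles at most one element.
__global__ void vector_add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain threads past the end
        out[i] = a[i] + b[i];
    }
}

// Hypothetical helper: abort with a message if a runtime call failed.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical host wrapper. Assumes a and b are non-empty and equal in length.
std::vector<float> add_on_device(const std::vector<float>& a,
                                 const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> out(n);

    // Explicit memory management: allocate device (global) memory
    // and copy the inputs from host to device.
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    check(cudaMalloc(&d_a, bytes), "cudaMalloc a");
    check(cudaMalloc(&d_b, bytes), "cudaMalloc b");
    check(cudaMalloc(&d_out, bytes), "cudaMalloc out");
    check(cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice), "copy b");

    // Launch a one-dimensional grid with enough blocks to cover n elements.
    const int threads_per_block = 256;
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    vector_add<<<blocks, threads_per_block>>>(d_a, d_b, d_out, n);
    check(cudaGetLastError(), "kernel launch");

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost), "copy out");

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}

int main() {
    const int n = 1000;
    std::vector<float> a(n), b(n);
    for (int i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }
    std::vector<float> out = add_on_device(a, b);
    std::printf("out[10] = %.1f\n", out[10]);  // 10 + 20, so 30.0 on a working GPU
    return 0;
}
```

Compiled with nvcc, the launch creates 4 blocks of 256 threads for the 1000 elements, so the final 24 threads fall outside the array and are filtered out by the bounds check. Rounding the block count up and guarding inside the kernel is a common pattern, though other launch shapes would work equally well here.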

CUDA’s execution style is single-instruction, multiple-thread (SIMT), a close relative of SIMD: many threads run the same code on different data. This maps naturally onto the wide, throughput-oriented design the GPU already used for shading pixels. The same parallel hardware that NVIDIA’s GPU Gems 2 described as delivering “hundreds of gigaflops” while processing “groups of hundreds of pixels at a time in single-instruction, multiple-data (SIMD) fashion” could now be aimed at linear algebra, simulation, and, eventually, neural networks.
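A small sketch may make the "same code, different data" idea concrete. In the kernel below every thread executes identical instructions, but each picks its own elements by index, and a data-dependent branch decides whether a given thread writes. NVIDIA hardware schedules threads in groups called warps; when threads in one warp disagree on a branch, the warp runs each taken path with the non-participating threads masked off. The names clamp_negatives, relu_on_device, and check are hypothetical, the launch shape of 4 blocks of 128 threads is arbitrary, and the input is assumed to be non-empty.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Every thread runs the same loop and the same branch; only the data differs.
// The grid-stride loop lets a fixed number of threads cover any n.
__global__ void clamp_negatives(float* data, int n) {
    const int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        // Data-dependent branch: threads in one warp may take different paths.
        if (data[i] < 0.0f) {
            data[i] = 0.0f;
        }
    }
}

// Hypothetical helper: abort with a message if a runtime call failed.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical host wrapper. Assumes a non-empty input.
std::vector<float> relu_on_device(std::vector<float> host) {
    const int n = static_cast<int>(host.size());
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* d_data = nullptr;
    check(cudaMalloc(&d_data, bytes), "cudaMalloc");
    check(cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice), "copy in");

    // Fixed launch shape; the grid-stride loop handles inputs of any size.
    clamp_negatives<<<4, 128>>>(d_data, n);
    check(cudaGetLastError(), "kernel launch");

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost), "copy out");
    cudaFree(d_data);
    return host;
}

int main() {
    std::vector<float> values = {-4.0f, -3.0f, -2.0f, -1.0f, 0.0f, 1.0f, 2.0f, 3.0f};
    std::vector<float> result = relu_on_device(values);
    for (float v : result) {
        std::printf("%.1f ", v);  // negatives become 0.0, others are unchanged
    }
    std::printf("\n");
    return 0;
}
```

The operation itself is the rectifier used in many neural networks, chosen here only because it needs a branch. In practice such a short branch costs little, but longer divergent paths within a warp are serialized, which is one reason GPU code tends to favor uniform control flow.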

The strategic importance of CUDA lies less in any single feature than in the ecosystem built on top of it. NVIDIA layered domain libraries onto the platform (cuBLAS for dense linear algebra, cuFFT for FFTs, cuDNN for deep-learning primitives, and more), and the major machine-learning frameworks, including TensorFlow and PyTorch, came to depend on those libraries. Because that whole stack assumed CUDA, the platform became a software moat: competitors had to match not just the chip but the years of accumulated libraries, tooling, and documentation.

CUDA also illustrates how a vendor-specific platform can outcompete an open standard on developer experience. The cross-vendor alternative, OpenCL, followed with its 1.0 specification in late 2008, but CUDA’s tight integration with NVIDIA hardware, its mature toolchain, and its first-mover library ecosystem kept most high-performance and AI workloads on the proprietary path. By the time deep learning became the dominant driver of GPU demand, CUDA was the default substrate on which that revolution ran.

Sources

Related