The Vector Processor

A vector processor is a machine whose instructions operate not on single numbers but on whole vectors — ordered arrays of data — at a stroke. A single vector instruction can add two arrays of sixty-four elements together, where a conventional scalar processor would need a loop executing sixty-four separate add instructions. By expressing an entire array operation in one instruction, the hardware amortizes the cost of instruction fetch and decode and keeps deeply pipelined arithmetic units fed without the overhead of per-element loop control.
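The contrast in instruction count can be sketched with a toy model. The function names `scalar_add` and `vector_add` and the `issued` counter are illustrative assumptions, not part of any real instruction set, and the model counts only add instructions, ignoring loads, stores, and loop control.

```python
def scalar_add(a, b):
    """Add element by element, counting one add instruction per element."""
    out = []
    issued = 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1  # one scalar add issued for this element
    return out, issued


def vector_add(a, b):
    """Model a single vector add: one instruction covers the whole array."""
    out = [x + y for x, y in zip(a, b)]
    issued = 1  # the whole array operation is expressed as one instruction
    return out, issued


a = list(range(64))
b = [10] * 64
s_out, s_issued = scalar_add(a, b)
v_out, v_issued = vector_add(a, b)
print(s_issued, v_issued)  # 64 1
```

Both functions compute the same sums. The only difference in this model is how many instructions the hardware would have to fetch and decode to get there.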

The vector processor reached its defining form in the Cray supercomputers. Richard Russell’s paper “The CRAY-1 Computer System,” published in Communications of the ACM in January 1978, describes the architecture: eight vector registers, each holding sixty-four 64-bit elements, with a vector length register to set how many elements an operation processes and a vector mask register to select individual elements. A vector instruction streams its operands through a pipelined functional unit, producing a steady flow of results once the pipeline fills.
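How a vector length register and a mask register might be used can be sketched as follows. This is a toy Python model under stated assumptions, not the Cray-1 instruction set: `MAX_VL`, `vadd`, `vmerge`, and `strip_mined_add` are hypothetical names. The loop that cuts a long array into register-sized pieces is the technique usually called strip mining, which the text above does not name but which follows from having a fixed register length and a settable vector length.

```python
MAX_VL = 64  # elements per vector register, per the description above


def vadd(va, vb, vl):
    """Model one vector add that processes only the first vl elements."""
    assert 0 < vl <= MAX_VL
    return [va[i] + vb[i] for i in range(vl)]


def vmerge(va, vb, mask, vl):
    """Model a mask-controlled select: va[i] where mask[i] is 1, else vb[i]."""
    return [va[i] if mask[i] else vb[i] for i in range(vl)]


def strip_mined_add(a, b):
    """Add arrays of any length by cutting them into register-sized strips."""
    out = []
    vector_instructions = 0
    for start in range(0, len(a), MAX_VL):
        # The value a vector length register would hold for this strip.
        vl = min(MAX_VL, len(a) - start)
        out.extend(vadd(a[start:start + vl], b[start:start + vl], vl))
        vector_instructions += 1
    return out, vector_instructions


a = list(range(150))
b = [1] * 150
total, issued = strip_mined_add(a, b)
print(issued)  # 3 (strips of 64, 64, and 22 elements)

mask = [1, 0, 1, 0]
print(vmerge([1, 2, 3, 4], [10, 20, 30, 40], mask, 4))  # [1, 20, 3, 40]
```

The last strip is shorter than a full register, which is exactly the case the vector length setting exists to handle.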

A signature technique of the design was what Cray called “chaining.” Rather than waiting for one vector operation to finish writing its results before the next began, the machine could feed each result from one functional unit directly into another as soon as it was produced, so that, for example, a multiply and a dependent add could overlap. Chaining kept several functional units busy at once on the same stream of vector data and was central to the Cray-1’s high arithmetic rates. The combination of vector registers, pipelined units, and chaining made the Cray-1 the first commercially successful vector supercomputer.
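The benefit of chaining can be illustrated with a rough cycle-count model. This is a simplified sketch, assuming each pipelined unit accepts one element per cycle and delivers its first result after a fixed latency. The latency values and the function names `unchained_cycles` and `chained_cycles` are illustrative assumptions, not measured figures for any machine.

```python
def unchained_cycles(n, lat_first, lat_second):
    """Second operation starts only after the first has produced all n results."""
    first_done = lat_first + n
    return first_done + lat_second + n


def chained_cycles(n, lat_first, lat_second):
    """Second operation starts as soon as the first result emerges."""
    second_start = lat_first
    return second_start + lat_second + n


N = 64           # elements in one vector
MUL_LATENCY = 7  # illustrative pipeline depth for the multiply unit
ADD_LATENCY = 6  # illustrative pipeline depth for the add unit

print(unchained_cycles(N, MUL_LATENCY, ADD_LATENCY))  # 141
print(chained_cycles(N, MUL_LATENCY, ADD_LATENCY))    # 77
```

In this model the chained pair finishes in little more than the time of a single vector operation, because both units are streaming elements during the same cycles.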

Vector processing fits problems with abundant regular, array-shaped work: scientific and engineering simulation, weather modeling, fluid dynamics, and other numerically intensive computing, which is why vector machines dominated supercomputing for years. Where the data is uniform and the same operation applies across all of it, a vector processor turns a long scalar loop into a handful of vector instructions and runs them through hardware purpose-built to stream arrays.

The vector idea is the direct ancestor of much of today’s parallel hardware. The SIMD extensions in mainstream CPUs, which apply one instruction across a register packed with several elements, are vector processing scaled down to fit inside a general-purpose chip. Graphics processors take the same throughput-oriented principle further, running enormous numbers of identical operations over large data sets. The arrangement Cray built for supercomputers in the 1970s reappears, in modernized form, in the vector units and GPUs that handle data-parallel work now.

What the vector processor established is a way of thinking about computation as operating on aggregates rather than on individual values. That shift, from “do this to one number, then loop” to “do this to the whole array,” is one of the enduring organizing ideas of high-performance computing, and it can be traced from the Cray machines to the SIMD and GPU hardware that carries today’s data-parallel workloads.
