Superscalar Execution

A superscalar processor issues more than one instruction per clock cycle. A basic scalar pipeline, no matter how deep, has a ceiling of one completed instruction per cycle, because it has a single path through fetch, decode, and execute. A superscalar design widens that path: it fetches and decodes several instructions at once and dispatches them to multiple execution units that operate in parallel, for example, two integer units, a load/store unit, and a floating-point unit. When the instructions in a window do not depend on one another, the machine can complete two, four, or more of them in a single cycle, up to the number it can issue at once.
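The effect of widening the issue path can be sketched with a toy model. The Python below is a minimal illustration, not a description of any real core: the function name `in_order_cycles`, the `deps` encoding, and the one-cycle latency for every instruction are assumptions made for the example.

```python
def in_order_cycles(deps, width):
    """Count cycles needed to issue a program in order on a `width`-wide machine.

    deps[i] is the set of earlier instruction indices whose results
    instruction i reads. Assumption: every instruction takes one cycle,
    so a result is usable in the cycle after its producer issues.
    """
    issue_cycle = {}  # instruction index -> cycle in which it issued
    cycle = 0
    i = 0
    while i < len(deps):
        cycle += 1
        slots = width
        # Issue in program order until the slots run out or an operand is not ready.
        while i < len(deps) and slots > 0:
            ready = all(issue_cycle[d] < cycle for d in deps[i])
            if not ready:
                break
            issue_cycle[i] = cycle
            slots -= 1
            i += 1
    return cycle


independent = [set(), set(), set(), set()]  # four unrelated instructions
chain = [set(), {0}, {1}, {2}]              # each instruction needs the previous result

scalar_independent = in_order_cycles(independent, width=1)  # 4 cycles
wide_independent = in_order_cycles(independent, width=4)    # 1 cycle
wide_chain = in_order_cycles(chain, width=4)                # 4 cycles
```

In this model the four-wide machine finishes the independent instructions in one cycle, but gains nothing on the dependent chain, which still takes one cycle per instruction.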

This extraction of parallelism from an ordinary, sequential instruction stream is called instruction-level parallelism, and it is a central subject of Hennessy and Patterson’s “Computer Architecture: A Quantitative Approach.” The textbook defines the issue width as the maximum number of instructions a processor can dispatch per cycle and analyzes how far real programs let a machine approach that maximum. In practice, data dependencies, limited execution resources, and branches all keep the achieved parallelism below the theoretical width.
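One way to see why dependencies cap the achievable parallelism is to compare the number of instructions with the length of the longest dependency chain. The sketch below assumes unit latency, unlimited execution units, and perfect branch prediction, so it gives an upper bound rather than a realistic figure. The function name `dataflow_ilp` and the `deps` encoding are hypothetical, chosen only for this example.

```python
def dataflow_ilp(deps):
    """Upper bound on instructions per cycle for a dependency graph.

    deps[i] is the set of earlier instruction indices that instruction i
    reads from. Assumption: unit latency and no resource limits, so the
    program cannot finish faster than its longest dependency chain.
    """
    depth = []  # depth[i] = length of the longest chain ending at instruction i
    for srcs in deps:
        depth.append(1 + max((depth[d] for d in srcs), default=0))
    critical_path = max(depth, default=0)
    return len(deps) / critical_path if critical_path else 0.0


# Six instructions whose longest chain (0 or 1 -> 2 -> 3) is three long.
example = [set(), set(), {0, 1}, {2}, set(), {4}]
limit = dataflow_ilp(example)  # 6 instructions / 3 levels = 2.0
```

Under these assumptions, a machine wider than two issue slots could not run this fragment any faster, whatever its nominal issue width.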

The roots of superscalar issue reach back to the multiple-functional-unit machines of the 1960s. The IBM System/360 Model 91, described in R.M. Tomasulo’s 1967 paper “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” already had several arithmetic units that could be busy at once, and Tomasulo’s scheme for keeping them fed with operands is a direct ancestor of the hardware in modern superscalar cores. The general idea of issuing multiple instructions per cycle from a single stream became mainstream in commercial microprocessors in the early 1990s.

Superscalar execution is closely tied to two other techniques. To keep multiple execution units busy, the processor usually needs out-of-order execution, so that an instruction whose operands are ready can run even if an earlier instruction is stalled. And it needs accurate branch prediction, because a wide machine that flushes on every branch wastes a large amount of work. A superscalar core that also executes out of order and speculates past branches is the dominant design for high-performance general-purpose CPUs.
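The benefit of out-of-order execution described above can be illustrated with a small issue model. This is a sketch under stated assumptions: the function `cycles_to_issue`, the `(latency, deps)` encoding, the unlimited instruction window, and the specific latencies are all invented for the example, and branches are not modeled.

```python
def cycles_to_issue(program, width, out_of_order):
    """Return the cycle in which the last instruction issues.

    program[i] is (latency, deps): the result of instruction i becomes
    usable `latency` cycles after it issues, and deps is the set of
    earlier instruction indices it reads. Assumption: the out-of-order
    machine can look at every pending instruction (unlimited window).
    """
    usable_at = {}  # instruction index -> first cycle its result can be read
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for i in pending:  # oldest first
            if len(issued) == width:
                break
            _, deps = program[i]
            ready = all(d in usable_at and usable_at[d] <= cycle for d in deps)
            if ready:
                issued.append(i)
            elif not out_of_order:
                break  # in-order issue: a stalled instruction blocks younger ones
        for i in issued:
            usable_at[i] = cycle + program[i][0]
            pending.remove(i)
    return cycle


program = [
    (3, set()),  # 0: slow operation, e.g. a load (latency is an assumption)
    (1, {0}),    # 1: needs the result of 0
    (1, set()),  # 2: independent
    (1, set()),  # 3: independent
    (1, {2}),    # 4: needs 2
    (1, {3}),    # 5: needs 3
]

in_order = cycles_to_issue(program, width=2, out_of_order=False)     # 6
out_of_order = cycles_to_issue(program, width=2, out_of_order=True)  # 4
```

In the in-order run, instruction 1 waits for the slow instruction 0 and holds up everything behind it. The out-of-order run issues the independent instructions during that wait, so the same two-wide machine finishes issuing two cycles sooner in this model.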

The practical limit of superscalar width is diminishing returns: doubling the issue width does not double performance, because most code does not contain enough independent instructions to fill the extra slots. This limit is one of the reasons the industry turned toward multiple cores, putting several modestly superscalar processors on a chip rather than building one ever-wider processor.
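The diminishing returns can be made concrete with a toy width sweep. The sketch below assumes an idealized scheduler with unit latency and an unlimited window, and a synthetic program made of three interleaved dependency chains. The name `ideal_ipc` and the program itself are hypothetical and not drawn from any measured workload.

```python
def ideal_ipc(deps, width):
    """Instructions per cycle for an idealized `width`-wide out-of-order issue.

    deps[i] is the set of earlier instruction indices that instruction i
    reads. Assumption: unit latency, so a result is usable in the cycle
    after its producer issues.
    """
    issued_at = {}
    pending = list(range(len(deps)))
    cycle = 0
    while pending:
        cycle += 1
        # Oldest-first pick among instructions whose producers issued earlier.
        chosen = [
            i for i in pending
            if all(issued_at.get(d, cycle) < cycle for d in deps[i])
        ][:width]
        for i in chosen:
            issued_at[i] = cycle
            pending.remove(i)
    return len(deps) / cycle if cycle else 0.0


# Twelve instructions; instruction i depends on instruction i - 3,
# which forms three independent chains of four instructions each.
deps = [({i - 3} if i >= 3 else set()) for i in range(12)]
ipc_by_width = {w: ideal_ipc(deps, w) for w in (1, 2, 4, 8)}
# ipc_by_width == {1: 1.0, 2: 2.0, 4: 3.0, 8: 3.0}
```

Going from width 2 to width 4 raises throughput by only half in this example, and going from 4 to 8 adds nothing, because the program never has more than three independent instructions available at once.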
