The Language Processing Unit (LPU) is an AI accelerator designed by Groq specifically to serve large language models at high speed. Where a GPU is a general-purpose parallel processor adapted for AI, the LPU is purpose-built for inference, the step in which a trained model generates output, and for the small set of linear-algebra operations, chiefly matrix multiplication, that dominate that workload.
The LPU’s design rests on two ideas. The first is keeping model data in fast on-chip SRAM rather than in slower external high-bandwidth memory (HBM): Groq reports that its on-chip SRAM offers around 80 terabytes per second of bandwidth, roughly ten times that of the off-chip HBM used by GPUs. The second is deterministic, compiler-scheduled execution. The LPU removes the reactive hardware that general-purpose processors rely on, such as caches, branch predictors, and reorder buffers, and instead has the compiler decide in advance exactly when every operation and data movement happens. Groq describes this as a programmable assembly line, with data moving along conveyor belts between function units, so timing is predictable down to the clock cycle.
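To make the idea of compiler-scheduled execution concrete, here is a toy sketch in Python. It is not Groq's compiler or instruction set. The operation names, function units, and cycle counts are all invented for illustration, and each unit is assumed to handle one operation at a time. It shows only the principle: when every latency is fixed and known in advance, a scheduler can assign each operation an exact start cycle before anything runs, and the total run time is known ahead of execution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # hypothetical function unit that executes the op
    latency: int       # fixed cycle count, assumed known at compile time
    deps: tuple = ()   # names of ops whose results this op consumes


def compile_schedule(ops):
    """Assign every op a start cycle ahead of time.

    Ops must be listed in dependency order. An op starts as soon as its
    inputs are ready and its function unit is free. Each unit is assumed
    to be busy for the full latency of the op it is running.
    """
    finish = {}      # op name -> cycle at which its result is available
    unit_free = {}   # unit name -> first cycle at which the unit is idle
    schedule = []
    for op in ops:
        inputs_ready = max((finish[d] for d in op.deps), default=0)
        start = max(inputs_ready, unit_free.get(op.unit, 0))
        finish[op.name] = start + op.latency
        unit_free[op.unit] = finish[op.name]
        schedule.append((start, op.unit, op.name))
    return schedule, max(finish.values())


# Illustrative program: one layer-like step with made-up latencies.
PROGRAM = [
    Op("load_weights", "memory", 2),
    Op("load_activations", "memory", 1),
    Op("matmul", "matrix", 4, deps=("load_weights", "load_activations")),
    Op("add_bias", "vector", 1, deps=("matmul",)),
    Op("store", "memory", 1, deps=("add_bias",)),
]

schedule, total_cycles = compile_schedule(PROGRAM)
for start, unit, name in schedule:
    print(f"cycle {start:2d}  {unit:<6}  {name}")
print("total cycles:", total_cycles)
```

Nothing in the schedule depends on run-time events such as cache misses or branch mispredictions, so the same program always yields the same schedule and the same total cycle count. Hardware with caches and dynamic reordering cannot offer that guarantee, because its timing depends on conditions that are known only at run time.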
The result is very low and very consistent latency for generating tokens, the kind of responsiveness that matters for interactive applications. For a business reader, the LPU is a notable example of the bet that the next wave of AI economics will be won at inference rather than training, and that purpose-built inference silicon can challenge the GPU’s dominance on speed and cost per query.