GGUF Format

GGUF is a binary file format for storing machine learning models so they can be loaded and run for inference. Its specification, hosted in the ggml repository, defines it as a format for storing models for inference with GGML and executors built on GGML. It was introduced in 2023 as the distribution format for quantized large language model weights, the kind of file a user downloads to run a model locally on their own machine rather than calling a remote service.

The format is explicitly a successor. The specification states that GGUF supersedes the earlier GGML, GGMF, and GGJT formats, and that it was designed to be unambiguous by containing all the information needed to load a model in a single file. The older formats had practical shortcomings: they did not reliably identify which model architecture a file belonged to, and adding new hyperparameters tended to break compatibility with existing files. GGUF was built to fix those problems: a file is self-describing, and the format is extensible, so new metadata can be added without invalidating files that already exist.

Mechanically, a GGUF file packs both metadata and tensor data into one container designed for fast loading and easy reading. The metadata is stored as a flexible set of key-value pairs that record the architecture, hyperparameters, tokenizer information, and other details a runtime needs, while the tensor data holds the actual weights. Crucially for local use, those weights are typically quantized, that is, stored at reduced numeric precision such as low-bit integers, which is what shrinks a model enough to fit in the memory of a consumer device.
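
As a rough illustration of the container layout described above, the following Python sketch reads only the fixed-size prefix of a GGUF file: the 4-byte magic, the format version, and the counts of tensors and metadata key-value pairs. It assumes the little-endian layout with 64-bit counts used from version 2 of the format onward. The function name read_gguf_header is made up for this example, and the sketch does not parse the key-value pairs or tensor descriptors that follow the header.

```python
import struct

GGUF_MAGIC = b"GGUF"   # the four ASCII bytes that open every GGUF file
HEADER_SIZE = 24       # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (KV count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first bytes of a file.

    Assumption: little-endian encoding and the version >= 2 layout,
    where both counts are unsigned 64-bit integers.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic bytes")

    # "<IQQ" = little-endian uint32, uint64, uint64, with no padding.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For a file on disk, the first 24 bytes read in binary mode would be passed to this function. The metadata key-value section starts immediately after those bytes.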

The format exists primarily because of one engine: it is the format consumed by llama.cpp and by the broader family of ggml-based tools. By standardizing how a quantized model is serialized, GGUF decoupled the act of producing and converting a model from the act of running it. A model can be converted from a training-framework format into GGUF once, at a chosen quantization level, and then distributed as a single portable file that any compatible runtime can open.
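
To make the idea of a quantization level more concrete, here is a simplified Python sketch of block-wise symmetric 8-bit quantization: a block of floating-point weights is replaced by one scale factor plus one small integer per weight. This illustrates the general technique only. It is not the exact encoding of any particular GGUF tensor type, and the function names are invented for the example.

```python
def quantize_block(values):
    """Quantize one block of floats to integers in [-127, 127] plus a shared scale."""
    amax = max(abs(v) for v in values)
    # Choose the scale so the largest magnitude maps to 127; avoid dividing by zero.
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the shared scale and the stored integers."""
    return [scale * q for q in quants]
```

Each reconstructed value differs from the original by at most about half the scale, which is the precision given up in exchange for storing one byte per weight instead of two or four. Lower-bit schemes follow the same pattern with a smaller integer range and therefore a larger error.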

In practice, GGUF became the lingua franca of the local-inference community. Model registries host the same model as GGUF files at several quantization levels, letting a user pick the trade-off between size and quality that fits their hardware, then download and run a large language model entirely offline. As a piece of format design, its contribution is the same as that of well-built container formats elsewhere in software: a stable, self-describing wrapper that lets producers and consumers evolve independently while sharing one file on disk.
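
The size side of that trade-off can be estimated with simple arithmetic: weight storage is roughly the parameter count times the average number of bits per weight, divided by eight. The sketch below uses illustrative numbers (a 7-billion-parameter model, 16 bits versus 4.5 bits per weight). They are assumptions for the example, not measurements of any real file, and actual GGUF files also carry metadata and may mix precisions across tensors.

```python
def approx_weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Rough size of the weights alone, ignoring metadata and alignment padding."""
    return n_params * bits_per_weight / 8


# Illustrative figures only: a hypothetical 7-billion-parameter model.
full_precision = approx_weight_bytes(7_000_000_000, 16)    # 14.0e9 bytes, about 14 GB
quantized = approx_weight_bytes(7_000_000_000, 4.5)        # about 3.9e9 bytes
```

On these assumptions the quantized file is a bit more than a quarter of the 16-bit size, which is the difference between exceeding and fitting within the memory of a typical consumer machine.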
