NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is open-source software for serving trained machine-learning models in production. Where a training framework’s job ends once a model is trained, Triton’s job begins: it takes finished models and exposes them behind a stable network interface so applications can send inputs and get predictions back, efficiently and at scale.

Its defining feature is that it is framework-agnostic. A single Triton instance can serve models from TensorRT, PyTorch, TensorFlow, ONNX, OpenVINO, and others side by side, on NVIDIA GPUs or on x86 and ARM CPUs, across cloud, data center, edge, and embedded deployments. To raise throughput, it offers dynamic batching, which automatically groups requests that arrive close together into a single batch the hardware can process more efficiently. It also supports concurrent execution of multiple models, sequence batching for stateful models, and model ensembles that chain several models behind a single request. These capabilities address a practical reality: for many deployed models, inference rather than training accounts for most of the lifetime compute cost.
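
The grouping behind dynamic batching can be illustrated with a small Python sketch. It is a toy model of the idea, not Triton's scheduler or API: the names Request, group_into_batches, max_batch_size, and max_queue_delay_us are hypothetical and chosen for illustration, and the batching rule (close a batch when it is full, or when the next request arrives too long after the batch's first request) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """A hypothetical inference request with an arrival timestamp."""
    request_id: int
    arrival_us: int  # arrival time in microseconds (illustrative values)


def group_into_batches(
    requests: List[Request],
    max_batch_size: int,
    max_queue_delay_us: int,
) -> List[List[Request]]:
    """Toy dynamic batcher (an assumption-laden sketch, not Triton code).

    Requests are processed in arrival order. The open batch is closed
    when it already holds max_batch_size requests, or when the next
    request arrives more than max_queue_delay_us after the first
    request in the open batch.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_us):
        if current:
            waited = req.arrival_us - current[0].arrival_us
            if len(current) >= max_batch_size or waited > max_queue_delay_us:
                batches.append(current)
                current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


def batch_ids(batches: List[List[Request]]) -> List[List[int]]:
    """Helper that reduces each batch to its request ids for inspection."""
    return [[r.request_id for r in batch] for batch in batches]


if __name__ == "__main__":
    arrivals = [0, 40, 90, 500, 520, 2000]
    reqs = [Request(i, t) for i, t in enumerate(arrivals)]
    print(batch_ids(group_into_batches(reqs, max_batch_size=4, max_queue_delay_us=100)))
```

In this sketch, with a batch limit of 4 and a 100-microsecond window, the requests arriving at 0, 40, and 90 microseconds share one batch, while the request at 500 microseconds starts a new one. The trade-off it illustrates is the one a real serving layer has to tune: a longer wait window yields larger, more efficient batches at the cost of added latency for the earliest request in each batch.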

Why a business reader should care: Triton is a common answer to the question of how to put trained models into production cost-effectively. One serving layer can handle many models from many frameworks on shared hardware.
