Experiment tracking is the practice of systematically recording everything that defines a single machine learning training run: the hyperparameters it used, the version of the code, the dataset it was trained on, the metrics it produced, and the artifacts it created such as model weights or evaluation plots. Each run is captured as a tracked entity so that, later, any result can be reproduced, explained, and compared against alternatives.
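To make the list above concrete, here is a minimal sketch of what a single tracked run might hold. It uses only the Python standard library, and every name in it (`RunRecord`, `log_metric`, `log_artifact`, the field names, and the sample values) is a hypothetical illustration, not the data model of any particular tracking tool.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple
import uuid


@dataclass
class RunRecord:
    """Hypothetical container for everything that defines one training run."""

    params: Dict[str, Any]   # hyperparameters used by the run
    code_version: str        # e.g. a version-control commit hash
    dataset_id: str          # identifier of the training data
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Metric name -> list of (step, value) pairs recorded during training.
    metrics: Dict[str, List[Tuple[int, float]]] = field(default_factory=dict)
    # Paths of files the run produced (weights, plots, and so on).
    artifacts: List[str] = field(default_factory=list)

    def log_metric(self, name: str, value: float, step: int) -> None:
        """Append one metric observation at the given training step."""
        self.metrics.setdefault(name, []).append((step, value))

    def log_artifact(self, path: str) -> None:
        """Record the path of a file produced by the run."""
        self.artifacts.append(path)


# Illustrative values only; none of these come from a real experiment.
run = RunRecord(
    params={"learning_rate": 0.01, "batch_size": 32},
    code_version="abc1234",
    dataset_id="train-v1",
)
run.log_metric("val_loss", 0.52, step=1)
run.log_metric("val_loss", 0.41, step=2)
run.log_artifact("model.pt")
```

A real tracking system would persist this record to a file store or database; the sketch keeps it in memory to show only the shape of the data.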
The practice exists because machine learning development is inherently empirical. A data scientist may run hundreds of variations, changing a learning rate here, a feature set there, a model architecture elsewhere, and the only way to know which change helped is to compare runs against a common record. Without tracking, that record lives in scattered notebooks, file names, and human memory, and the work becomes effectively impossible to reproduce. MLflow’s tracking documentation describes its component as an API and UI “for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results.”
A tracked run is, in effect, the machine learning equivalent of a commit. Where version control records what the source code was at a point in time, experiment tracking records what the whole experiment was: code plus data plus configuration plus outcome. Runs are typically grouped into experiments so that a family of related attempts can be charted side by side, and the best run can be promoted toward production with clear provenance.
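The grouping and comparison described above can be sketched in a few lines. This is an illustration under stated assumptions: the run summaries, the experiment names, the `val_loss` metric, and the helper functions `group_by_experiment` and `best_run` are all hypothetical, and "best" is assumed to mean the lowest validation loss.

```python
from collections import defaultdict

# Hypothetical run summaries; the values are made up for illustration.
runs = [
    {"run_id": "r1", "experiment": "lr-sweep", "params": {"learning_rate": 0.1}, "val_loss": 0.48},
    {"run_id": "r2", "experiment": "lr-sweep", "params": {"learning_rate": 0.01}, "val_loss": 0.39},
    {"run_id": "r3", "experiment": "lr-sweep", "params": {"learning_rate": 0.001}, "val_loss": 0.44},
    {"run_id": "r4", "experiment": "arch-sweep", "params": {"layers": 4}, "val_loss": 0.41},
]


def group_by_experiment(all_runs):
    """Collect runs under the experiment each one belongs to."""
    grouped = defaultdict(list)
    for run in all_runs:
        grouped[run["experiment"]].append(run)
    return dict(grouped)


def best_run(experiment_runs, metric="val_loss"):
    """Return the run with the lowest value of the chosen metric."""
    return min(experiment_runs, key=lambda run: run[metric])


experiments = group_by_experiment(runs)
best = best_run(experiments["lr-sweep"])
print(best["run_id"], best["params"])
```

Because every run carries its own parameters alongside its metric, the winning run explains itself: the comparison returns the configuration that produced the result, not just the number.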
Two tools did much to make this practice mainstream. MLflow Tracking provided an open-source, framework-agnostic way to log runs and visualize them. Weights & Biases offered a hosted dashboard whose documentation describes logging hyperparameters into a run’s configuration, logging metrics during the training loop, and uploading run outputs as artifacts. Both turned a previously informal habit into reliable infrastructure.
Experiment tracking is a foundational element of MLOps. It supplies the reproducibility and lineage that the rest of the discipline depends on: a model promoted to serving can be traced back to the exact run, data, and parameters that produced it, which is what makes a machine learning system auditable rather than mysterious.
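The traceability described above amounts to a lookup from a served model back to the run that produced it. The following is a minimal sketch assuming two in-memory stores; `RUNS`, `MODEL_REGISTRY`, `trace_lineage`, the model name `churn-model`, and all values are hypothetical and stand in for whatever tracking store and model registry a team actually uses.

```python
# Hypothetical tracking store: run id -> what the run recorded.
RUNS = {
    "r2": {
        "code_version": "abc1234",
        "dataset_id": "train-v1",
        "params": {"learning_rate": 0.01},
    },
}

# Hypothetical model registry: (model name, version) -> producing run id.
MODEL_REGISTRY = {
    ("churn-model", 3): "r2",
}


def trace_lineage(model_name, version):
    """Resolve a served model version to the run, code, data, and parameters behind it."""
    run_id = MODEL_REGISTRY[(model_name, version)]
    run = RUNS[run_id]
    return {"model": model_name, "version": version, "run_id": run_id, **run}


lineage = trace_lineage("churn-model", 3)
print(lineage)
```

The audit question "what produced this model?" is answerable only because the registry entry stores a run identifier and the run itself was recorded in full at training time.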