“Horovod: fast and easy distributed deep learning in TensorFlow” was published on arXiv in February 2018 by Alexander Sergeev and Mike Del Balso of Uber. It tackled two problems with distributed training in early TensorFlow: spreading a model across many GPUs required substantial, error-prone changes to the model’s code, and scaling efficiency often degraded badly as more workers were added.
Horovod’s approach was to synchronize gradients with ring-allreduce, a communication pattern in which the workers form a logical ring and each one sends gradient chunks to one neighbor while receiving from the other, so that the amount of data each worker transfers stays nearly constant as the cluster grows. The library, whose source was based on Baidu’s earlier tensorflow-allreduce work, required only a few lines of modification to a single-GPU script to turn it into a multi-GPU, multi-host job. It later added support for Keras, PyTorch, and Apache MXNet. Horovod was open-sourced under Apache 2.0 and, in December 2018, joined the Linux Foundation’s LF Deep Learning Foundation, since renamed the LF AI & Data Foundation.
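The ring pattern can be illustrated with a small single-process simulation. The sketch below is not Horovod’s implementation, which runs across real processes and GPUs; it only mimics the two phases usually described for ring-allreduce (a scatter-reduce pass followed by an allgather pass) using plain Python lists. The function names `split_chunks` and `ring_allreduce` are hypothetical, and the sketch sums the gradients rather than averaging them.

```python
from typing import List, Sequence


def split_chunks(vec: Sequence[float], n: int) -> List[List[float]]:
    """Split vec into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(vec), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(list(vec[start:start + size]))
        start += size
    return chunks


def ring_allreduce(worker_grads: Sequence[Sequence[float]]) -> List[List[float]]:
    """Simulate ring-allreduce (sum) over equally sized gradient vectors.

    Illustrative only: all "workers" live in one process, and a send is
    modelled as copying a chunk to the next worker in the ring.
    """
    n = len(worker_grads)
    # chunks[w][c] is worker w's current copy of chunk c
    chunks = [split_chunks(g, n) for g in worker_grads]

    # Phase 1, scatter-reduce: after n-1 steps, worker w holds the
    # fully summed chunk (w + 1) % n.
    for step in range(n - 1):
        # Snapshot what every worker sends so the step is simultaneous.
        outgoing = [((w - step) % n, list(chunks[w][(w - step) % n]))
                    for w in range(n)]
        for w, (c, payload) in enumerate(outgoing):
            dst = (w + 1) % n  # right-hand neighbor in the ring
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, allgather: pass the finished chunks around the ring so that
    # every worker ends up with every summed chunk.
    for step in range(n - 1):
        outgoing = [((w + 1 - step) % n, list(chunks[w][(w + 1 - step) % n]))
                    for w in range(n)]
        for w, (c, payload) in enumerate(outgoing):
            dst = (w + 1) % n
            chunks[dst][c] = payload  # overwrite with the completed sum

    # Reassemble each worker's chunks into one flat vector.
    return [[x for chunk in chunks[w] for x in chunk] for w in range(n)]


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, result in enumerate(ring_allreduce(grads)):
        print(rank, result)  # every rank prints [111, 222, 333, 444]
```

With N workers, each one sends 2(N - 1) chunks of roughly 1/N of the gradient, so the volume sent per worker approaches twice the gradient size regardless of N. Dividing each summed value by N would give the averaged gradient that data-parallel training typically applies.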
Horovod popularized the idea that data-parallel scaling should be a small, framework-agnostic add-on rather than a rewrite, and ring-allreduce became a widely used building block for distributed training.
For a business reader, Horovod is an example of how making a hard scaling technique easy to adopt can matter as much as the technique itself.