Highway Networks

“Highway Networks” was submitted to arXiv on May 3, 2015 by Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber of IDSIA in Switzerland, about seven months before the ResNet paper that would popularize the same core idea. It addressed a problem that was holding back the whole field: as networks grew past a few dozen layers, plain stacks of layers became harder to train, not easier, because the gradient signal degraded as it propagated back through the layers.

The fix borrowed directly from the LSTM, introduced by Hochreiter and Schmidhuber in 1997. Each layer is equipped with a learned transform gate T and a carry gate C that decide, per unit, how much of the layer’s transformation to apply versus how much of the input to carry through unchanged. In the paper the two are tied, C = 1 - T, so a single learned gate controls both, and the layer output is y = H(x) * T(x) + x * (1 - T(x)), where H is the usual nonlinear transformation. When the carry gate opens, the input flows straight to the next layer along an “information highway,” so a deep network can pass signal through layers that have learned to do nothing. The authors reported that networks with tens and even hundreds of layers could be trained directly with ordinary stochastic gradient descent, which plain deep networks of that era could not reliably manage.
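The gating described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the tied form with the carry gate equal to one minus the transform gate, a tanh transformation, and hypothetical names (`highway_layer`, `W_H`, `W_T`, `b_H`, `b_T`). The gate biases of -20 and +20 are deliberately exaggerated to force the two extreme behaviours; the paper suggests only mildly negative initial gate biases such as -1 or -3 to start the network near carry behaviour.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))."""
    H = np.tanh(x @ W_H + b_H)   # candidate transformation of the input
    T = sigmoid(x @ W_T + b_T)   # transform gate, one value in (0, 1) per unit
    return H * T + x * (1.0 - T) # carry gate is 1 - T

rng = np.random.default_rng(0)
d = 8                                   # layer width (input and output must match)
x = rng.standard_normal((4, d))         # batch of 4 illustrative inputs
W_H = rng.standard_normal((d, d)) * 0.1
W_T = rng.standard_normal((d, d)) * 0.1
b_H = np.zeros(d)

# Strongly negative gate bias: T is close to 0, so the layer carries x through.
y_carry = highway_layer(x, W_H, b_H, W_T, np.full(d, -20.0))

# Strongly positive gate bias: T is close to 1, so the layer behaves like a plain layer.
y_transform = highway_layer(x, W_H, b_H, W_T, np.full(d, 20.0))

print(np.max(np.abs(y_carry - x)))
print(np.max(np.abs(y_transform - np.tanh(x @ W_H + b_H))))
```

Because the gate is computed per unit and per input, a trained layer can sit anywhere between these two extremes, carrying some units and transforming others.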

Highway Networks were quickly overshadowed by ResNet, which appeared in December 2015 and achieved a similar effect with a simpler, ungated additive shortcut: no learned gates, just add the input back to the layer’s output. ResNet won the ILSVRC 2015 ImageNet classification challenge and became the standard. But the conceptual move was the same: give the network an easy path for information and gradients to skip layers. The skip connection in nearly every large model trained since, including the Transformer, traces to this line of work.
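The effect of the additive shortcut can be seen in a toy comparison. The sketch below is an illustration under assumed settings (50 layers, width 16, small random weights, tanh units, hypothetical function names) and not a reproduction of either paper's experiments. It pushes the same input through a plain stack and through a stack where each layer adds its input back, then compares how much signal survives. Only the forward pass is shown; the backward pass multiplies the same kind of per-layer factors, so the comparison gives a rough intuition for gradients as well.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 50
x0 = rng.standard_normal(d)

# Hypothetical small random weights, one matrix per layer, shared by both stacks.
weights = [rng.standard_normal((d, d)) * 0.01 for _ in range(depth)]

def plain_stack(x, weights):
    """Plain deep network: each layer replaces its input."""
    for W in weights:
        x = np.tanh(W @ x)
    return x

def residual_stack(x, weights):
    """Ungated shortcut: each layer adds its transformation to its input."""
    for W in weights:
        x = x + np.tanh(W @ x)
    return x

input_norm = np.linalg.norm(x0)
plain_norm = np.linalg.norm(plain_stack(x0, weights))
residual_norm = np.linalg.norm(residual_stack(x0, weights))

print(f"input norm:    {input_norm:.3e}")
print(f"plain norm:    {plain_norm:.3e}")
print(f"residual norm: {residual_norm:.3e}")
```

With these settings the plain stack shrinks the signal toward zero, while the shortcut version keeps it at roughly the scale of the input. Real networks use careful initialization and normalization, so the gap in practice is less extreme than in this contrived setup.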

The lesson is one the field has relearned repeatedly: depth is not free, and the architectures that scale are the ones that keep gradients flowing cleanly. The gating proved more machinery than necessary, but the underlying insight was correct and durable.
