Layer Normalization

“Layer Normalization” was submitted to arXiv on July 21, 2016 by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton at the University of Toronto. It proposed an alternative to batch normalization, the 2015 technique that had sped up training of deep networks by normalizing each feature across the examples in a mini-batch.

Batch normalization has an awkward dependency: its statistics are computed across the batch, so its behavior changes with batch size, it has to substitute running averages at test time, and it is awkward to apply to recurrent networks, where the same weights process sequences of different lengths. Layer normalization sidesteps this by computing the mean and variance across all the units in a layer for a single training example, independently of any other example in the batch. That makes the computation identical at training and test time, removes the batch-size dependency entirely, and lets the same operation be applied at every time step of a recurrent network.
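The difference between the two sets of statistics can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function names, the `eps` value, and the toy inputs are assumptions, and the learned gain and bias that normally follow the normalization are left out.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one example using the mean and variance of its own units."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # population variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature using statistics taken across the batch."""
    n = len(batch)
    columns = list(zip(*batch))  # one tuple per feature
    means = [sum(col) / n for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)
    ]
    return [
        [(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
        for row in batch
    ]

# Toy data (assumed): the same example placed in two different batches.
example = [1.0, 2.0, 3.0, 4.0]
batch_a = [example, [0.0, 0.0, 0.0, 0.0]]
batch_b = [example, [10.0, -5.0, 7.0, 2.0]]

# Layer normalization ignores the other rows, so both results match.
ln_a = layer_norm(batch_a[0])
ln_b = layer_norm(batch_b[0])

# Batch normalization mixes in the other rows, so the results differ.
bn_a = batch_norm(batch_a)[0]
bn_b = batch_norm(batch_b)[0]

print(ln_a == ln_b)  # layer norm: independent of batch-mates
print(bn_a == bn_b)  # batch norm: depends on batch-mates
```

In this sketch the layer-normalized output for `example` is the same in both batches, while the batch-normalized output shifts with whatever else the batch contains.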

The technique found its lasting home inside the Transformer, which appeared the following year and placed layer normalization at every sub-layer. Nearly every large language model since has used it or a close variant; RMSNorm, a cheaper simplification that skips the mean subtraction, is now common in models such as Llama. Where in the stack to put the normalization (before or after each sub-layer, “pre-norm” versus “post-norm”) became one of the small but consequential design choices in scaling Transformers stably to many layers.
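The two placements and the RMSNorm variant can be sketched as follows. This is a hedged illustration, not any particular model's code: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sub-layer, the function names and `eps` value are assumptions, and learned gains are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def post_norm_block(x, sublayer, norm=layer_norm):
    """Post-norm: normalize after the residual add, norm(x + sublayer(x))."""
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer, norm=layer_norm):
    """Pre-norm: normalize only the sub-layer input, x + sublayer(norm(x))."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or a feed-forward layer.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, toy_sublayer)  # output itself is normalized
pre = pre_norm_block(x, toy_sublayer)    # residual path carries x unchanged
rms = rms_norm(x)                        # rescaled, but not re-centered
```

In the pre-norm form the residual path passes `x` through untouched, which is the property usually credited with making deep stacks easier to train; in the post-norm form every block's output is itself normalized.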

Layer normalization is a good example of an unglamorous component that turned out to be load-bearing: it rarely makes headlines, but training large models at today’s scale without some form of it would be far harder.
