“Gaussian Error Linear Units (GELUs)” was submitted to arXiv on June 27, 2016 by Dan Hendrycks and Kevin Gimpel. It introduced a new activation function - the small nonlinear operation applied to each neuron’s output that lets a network represent more than just linear relationships - that went on to become a default choice in Transformer language models.
The dominant activation at the time was the ReLU, which simply passes positive values through and sets negatives to zero: a hard on-off gate based only on the sign of the input. GELU instead weights each input by the probability that a standard normal value falls below it, written as GELU(x) = x · Φ(x), where Φ is the Gaussian cumulative distribution function. The effect is a smooth curve that behaves like ReLU for inputs far from zero (large positive inputs pass through almost unchanged, large negative inputs are pushed almost to zero) but bends gently near zero, letting small negative inputs through partially rather than killing them outright. The authors framed it as a probabilistic gate that scales an input by how large it is, not merely by its sign, and reported improvements over ReLU and ELU across vision, language, and speech tasks.
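As a minimal sketch of that definition, the snippet below computes GELU as x times the standard normal CDF using only Python's standard library, next to ReLU for comparison. It also includes the tanh-based approximation given in the paper. The function names `relu`, `gelu`, and `gelu_tanh` are chosen here for illustration and are not taken from any particular library.

```python
import math


def relu(x: float) -> float:
    """Hard gate: pass positive inputs through, zero out the rest."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of GELU described in the paper."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    # Compare the three functions at a few points around zero.
    for x in (-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0):
        print(
            f"x={x:+.1f}  relu={relu(x):.4f}  "
            f"gelu={gelu(x):+.4f}  tanh_approx={gelu_tanh(x):+.4f}"
        )
```

Running it shows the behavior described above: for a small negative input such as -0.1, ReLU returns exactly zero while GELU returns a small negative value, and for inputs far from zero the two curves nearly coincide. The tanh form closely tracks the exact one and avoids evaluating the error function directly.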
GELU’s significance is mostly downstream: BERT and the GPT family adopted it, and through them it became the standard activation inside the feed-forward blocks of most Transformers. Later models sometimes use related gated variants such as SwiGLU, but GELU remains one of the small, widely copied design decisions that the modern architecture inherited and rarely questioned.
The paper is a reminder that even a choice as mundane as which nonlinearity to use, multiplied across billions of parameters and trillions of operations, is worth getting right - and that good defaults, once established, propagate for years.