Softmax is the mathematical function that converts a list of raw scores into a set of probabilities that are all positive and add up to one. Each score is exponentiated and then divided by the sum of all the exponentiated scores, so the largest input becomes the largest probability while smaller inputs still get a share. It is the standard final step in a classifier and in a language model: when a model predicts the next word, softmax turns its internal scores over the whole vocabulary into a probability for each possible token.
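The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular library's implementation: the function name `softmax` and the example scores are assumptions chosen for the demonstration, and subtracting the largest score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to one."""
    # Shifting by the maximum leaves the output unchanged but keeps
    # math.exp() from overflowing on very large scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    # Each exponentiated score is divided by the sum of all of them.
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate classes or tokens.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The highest score receives the highest probability, the lower scores still receive a nonzero share, and the values sum to one.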
The function as used in neural networks is generally traced to John S. Bridle’s 1989 paper “Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters,” presented at NIPS, which described modifying a network’s output layer so that it produces a valid probability distribution, using the normalized exponential now universally called softmax. The same function appears inside the attention mechanism of Transformers, where it converts attention scores into weights that sum to one. Its “temperature” parameter, which divides the scores before they are exponentiated, controls how sharply a language model favors its top choice when generating text.
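The effect of temperature can be illustrated with a small Python sketch. It assumes the common convention of dividing each score by the temperature before applying softmax; the function name `softmax_with_temperature` and the example scores are hypothetical, chosen only for illustration.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over scores that are first divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Assumed convention: scale the scores, then apply ordinary softmax.
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical raw scores
sharp = softmax_with_temperature(scores, temperature=0.5)
neutral = softmax_with_temperature(scores, temperature=1.0)
flat = softmax_with_temperature(scores, temperature=2.0)

for name, dist in [("T=0.5", sharp), ("T=1.0", neutral), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

Under this convention, a temperature below one concentrates probability on the top choice, while a temperature above one spreads it more evenly across the alternatives.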
Why business readers should care: Softmax is the quiet step that lets an AI express confidence as probabilities, which is what makes outputs rankable and tunable. The temperature setting that controls how “creative” or “deterministic” a chatbot is works by rescaling the scores fed into this function.