“Approximation by Superpositions of a Sigmoidal Function,” by George Cybenko, was published in the journal Mathematics of Control, Signals, and Systems in 1989 (volume 2, issue 4, pages 303 to 314). It is among the most widely cited statements of what is now called the universal approximation theorem, a result that supplies the basic theoretical justification for using neural networks as general-purpose function fitters.
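Cybenko proved that a feedforward network with a single hidden layer of neurons, each applying a sigmoidal (S-shaped) nonlinearity to a weighted sum of the inputs plus a bias, with the hidden outputs then combined in a weighted sum, can approximate any continuous function on a closed, bounded region (he works on the unit hypercube) uniformly to any desired accuracy, given enough hidden units. Here "sigmoidal" means a continuous function that tends to 0 for very negative inputs and to 1 for very positive ones. In other words, even the simplest nontrivial network architecture is, in principle, expressive enough to represent essentially any continuous input-output relationship on such a region. Around the same time, Kurt Hornik, Maxwell Stinchcombe, and Halbert White, and independently Ken-ichi Funahashi, established closely related results under broader conditions, cementing the conclusion.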
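To make the shape of such a network concrete, here is a minimal sketch in Python of a one-hidden-layer sigmoidal network on the interval [0, 1], computing a weighted sum of terms of the form alpha * sigmoid(w * x + b). The parameters are chosen by hand using a simple "staircase" construction, with one steep sigmoid per grid cell. This is an illustration only, not Cybenko's argument, which is non-constructive. The target function, the function names, and the sharpness constant are all assumptions made for the example.

```python
import math


def sigmoid(t):
    """Logistic sigmoid, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def build_staircase_network(f, n_cells, sharpness=10.0):
    """Hand-pick (w, b, alpha) triples for a one-hidden-layer network on [0, 1].

    Illustrative construction (an assumption of this sketch, not the paper's
    proof): one steep sigmoid per grid cell, each adding the change in f
    across that cell.
    """
    h = 1.0 / n_cells
    k = sharpness / h  # steeper sigmoids as the grid gets finer
    # Unit 0 has zero input weight and a large bias, so its output is ~1;
    # it supplies the constant offset f(0).
    units = [(0.0, 50.0, f(0.0))]
    for j in range(1, n_cells + 1):
        left, right = (j - 1) * h, j * h
        midpoint = 0.5 * (left + right)
        # This sigmoid switches from ~0 to ~1 around the cell midpoint.
        units.append((k, -k * midpoint, f(right) - f(left)))
    return units


def network(units, x):
    """Evaluate the network: a weighted sum of sigmoids of affine functions of x."""
    return sum(alpha * sigmoid(w * x + b) for w, b, alpha in units)


def max_error(f, units, n_test=2000):
    """Largest absolute error over an evenly spaced test grid on [0, 1]."""
    return max(
        abs(network(units, i / n_test) - f(i / n_test)) for i in range(n_test + 1)
    )


def target(x):
    """Example continuous target (hypothetical choice for this sketch)."""
    return math.cos(2.0 * math.pi * x)


errors = {
    n: max_error(target, build_staircase_network(target, n)) for n in (10, 50, 200)
}
for n, err in errors.items():
    print(f"{n + 1:4d} hidden units: max error {err:.4f}")
```

Running the sketch prints the worst-case error for three network sizes; the error shrinks as hidden units are added, which is the behavior the theorem guarantees is possible. No training is involved here: the weights are written down directly, which is only feasible because the input is one-dimensional and the target is known everywhere.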
It is important to read the theorem carefully. It guarantees that a good approximation exists; it says nothing about how many neurons you might need, whether training will actually find the right weights, or how well the result will generalize to new data. Those harder questions are what other lines of theory try to address: VC dimension and related capacity measures for generalization, the neural tangent kernel for training dynamics, and double descent for the behavior of heavily overparameterized models.
For a general reader, the universal approximation theorem provides the basic reassurance behind neural networks: no expressive ceiling stops a sufficiently large network from capturing a continuous pattern. The remaining difficulties are practical ones of network size, training, and generalization, not of raw representational power.