Parallel Networks that Learn to Pronounce English Text (NETtalk)

“Parallel Networks that Learn to Pronounce English Text” by Terrence J. Sejnowski and Charles R. Rosenberg was published in 1987 in the journal Complex Systems, volume 1, pages 145 to 168. The system it described, called NETtalk, became one of the most vivid early demonstrations that a neural network could learn a complicated human skill from examples rather than from hand-written rules.

NETtalk read English text and produced the phonemes needed to speak it aloud. A sliding window of seven letters fed an input layer of 203 units, 29 for each of the seven letter positions. These connected through a single hidden layer of 80 units to 26 output units that encoded the sound and stress for the letter in the middle of the window. The roughly 18,000 connection weights were trained with backpropagation against a corpus of transcribed speech, and the phoneme codes drove a separate speech synthesizer so people could hear the network improve.
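
The layer sizes above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code. It assumes each of the seven window positions is one-hot encoded over 29 symbols (203 / 7), and the particular three non-letter symbols chosen here are illustrative. The names encode_window and count_weights are hypothetical helpers introduced for this example.

```python
import string

# Assumption: 29 symbols per window position, i.e. 26 letters plus three
# extra marks. The choice of space, period and comma is illustrative only.
ALPHABET = string.ascii_lowercase + " .,"
WINDOW = 7
N_INPUT = WINDOW * len(ALPHABET)  # 7 * 29 = 203
N_HIDDEN = 80
N_OUTPUT = 26


def encode_window(text, center):
    """One-hot encode the seven characters centred on text[center]."""
    half = WINDOW // 2
    vec = [0] * N_INPUT
    for slot, pos in enumerate(range(center - half, center + half + 1)):
        # Positions off either end of the text are treated as spaces.
        ch = text[pos].lower() if 0 <= pos < len(text) else " "
        if ch not in ALPHABET:
            ch = " "
        vec[slot * len(ALPHABET) + ALPHABET.index(ch)] = 1
    return vec


def count_weights(n_in, n_hidden, n_out, biases=True):
    """Count connections in a fully connected net with one hidden layer."""
    total = n_in * n_hidden + n_hidden * n_out
    if biases:
        total += n_hidden + n_out
    return total


if __name__ == "__main__":
    print(N_INPUT)                                      # input units
    print(count_weights(N_INPUT, N_HIDDEN, N_OUTPUT))   # weights incl. biases
```

Under these assumptions the network has 18,320 connections between layers, or 18,426 if each hidden and output unit also carries a bias, which is consistent with the figure of roughly 18,000 weights.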

What made the demonstration memorable was the listening. Early in training the network babbled; as it practiced it moved through stages that sounded loosely like a child learning to read, eventually pronouncing both its training text and new words it had never seen. Sejnowski circulated recordings of this progression, and they did more than any equation to convince a wide audience that connectionist networks could learn structure on their own.

NETtalk was not a practical text-to-speech product, and rule-based systems of the day were more accurate. Its importance was as a proof of principle during the resurgence of neural networks in the mid-1980s, showing that a single network with hidden units could absorb the messy, exception-filled mapping from English spelling to sound directly from data.