The Road to the Transformer
How sixty years of neural network research led to ChatGPT
Every overnight success is decades in the making. The chatbot that reached an estimated hundred million users in two months sits on top of a research lineage that runs through a 1958 Navy press conference, a buried Harvard dissertation, a Czech researcher's word-vector trick, and one 2017 paper with the most confident title in computer science. This trail walks that lineage stop by stop - each one a primary-sourced entry in the library.
-
The perceptron: a machine that learns from examples
Frank Rosenblatt's perceptron was an early trainable neural network that adjusted its own connections to classify patterns.
It starts with a machine that learns. The perceptron could only draw straight lines through data, but the core idea - adjust the connections, not the program - is the same one running inside every model today.
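To make "adjust the connections, not the program" concrete, here is a minimal sketch of a perceptron-style update rule in plain Python. The function names, the toy AND and XOR datasets, and the learning rate are illustrative assumptions, not a reconstruction of Rosenblatt's hardware or notation.

```python
def predict(w, b, x):
    """Fire (1) if the weighted sum crosses the threshold, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """samples: list of ((x1, x2), label) pairs with label in {0, 1}."""
    w = [0.0, 0.0]  # the "connections"
    b = 0.0         # bias, i.e. a movable threshold
    for _ in range(epochs):
        for x, label in samples:
            err = label - predict(w, b, x)  # -1, 0, or +1
            # Nudge the weights toward the right answer; the code never changes.
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b


# Hypothetical toy datasets: AND is linearly separable, XOR is not.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def accuracy(samples):
    """Train on the samples, then count how many it classifies correctly."""
    w, b = train_perceptron(samples)
    return sum(predict(w, b, x) == label for x, label in samples)
```

Calling `accuracy(AND_DATA)` gets all four points right, while `accuracy(XOR_DATA)` always misses at least one: no single straight line separates XOR, which is the limit the next stop on the trail addresses.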
-
Werbos describes backpropagation in his Harvard PhD thesis
Paul Werbos's 1974 Harvard dissertation set out the method later known as backpropagation for training multilayer networks.
A single-layer perceptron can only separate patterns with a straight line, and nobody had a practical way to train networks with hidden layers that could do more. The answer sat in a Harvard dissertation for over a decade before the field took notice.
-
Rumelhart, Hinton, and Williams popularize backpropagation in Nature
The 1986 Nature paper made backpropagation the standard way to train multilayer neural networks with hidden units.
When Rumelhart, Hinton, and Williams put backpropagation in Nature, training multilayer networks became something any lab could do. The tool was ready; the data and compute were not.
-
word2vec makes word embeddings practical
Mikolov and colleagues introduced word2vec, an efficient way to learn word vectors that capture meaning through geometry.
Fast forward through a winter and a revival: word2vec showed that meaning itself could live in geometry - king minus man plus woman lands near queen. Language was now something you could do math on.
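The king-and-queen arithmetic can be sketched in a few lines. The vectors below are hand-made toy values chosen so the analogy works exactly; they are an assumption for illustration and were not learned by word2vec, whose real embeddings have hundreds of dimensions and only land near the answer.

```python
import math

# Hypothetical 4-dimensional vectors: [royalty, male, female, fruit].
TOY_VECTORS = {
    "king":   [1.0, 1.0, 0.0, 0.0],
    "queen":  [1.0, 0.0, 1.0, 0.0],
    "man":    [0.0, 1.0, 0.0, 0.0],
    "woman":  [0.0, 0.0, 1.0, 0.0],
    "prince": [0.7, 1.0, 0.0, 0.0],
    "apple":  [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))
```

With these toy vectors, `analogy("king", "man", "woman")` returns `"queen"`. Excluding the three query words from the candidates is a common convention in analogy evaluation and is assumed here.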
-
Sequence to sequence learning
Sutskever, Vinyals, and Le showed neural networks could map input sequences to output sequences, enabling end-to-end translation.
If words are vectors, sentences can be journeys between them. Seq2seq made whole-sequence-to-whole-sequence learning work, but it squeezed every input sentence through a single fixed-length vector - a bottleneck.
-
Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, Cho, and Bengio's 2014 paper introduced attention to neural machine translation, letting a translation model look over the whole sentence instead of one fixed summary.
Attention removed the bottleneck: let the model look back at any part of the input whenever it needs to. This one mechanism is the seed of everything that follows.
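A rough sketch of that "look back" step: score every input position against what the decoder currently needs, turn the scores into weights, and take a weighted average. This version uses dot-product scoring for brevity, which is an assumption made here; the 2014 paper scored positions with a small learned network (additive attention). The function and variable names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return (weights, context) for one decoder step.

    query: list of floats describing what the decoder is looking for.
    encoder_states: one list of floats per input position.
    """
    # Score each input position by its dot product with the query.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context is the weighted average of all input positions, not one fixed summary.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical three-position input; the query resembles the second position.
weights, context = attend([0.0, 4.0, 0.0],
                          [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

Here the second position receives most of the weight, so the context vector is dominated by it. A different query at the next step would shift the weights elsewhere.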
-
The Transformer is introduced
Google researchers publish 'Attention Is All You Need', introducing the Transformer architecture that underpins modern AI.
Then Google asked the radical question: what if attention is not an add-on but the whole architecture? Drop the recurrence entirely. Eight authors, one paper, and the blueprint for modern AI.
-
OpenAI's first GPT
OpenAI's first GPT showed that pre-training a Transformer on large amounts of unlabeled text and then fine-tuning it set a new bar across many language tasks.
OpenAI's bet was different from everyone else's: do not build a translation model or a parser - just pre-train a Transformer to predict text, then adapt it. The G, P, and T - generative, pre-trained, Transformer - each mark a deliberate choice.
-
BERT brings deep bidirectional pre-training to language
Google researchers release BERT, a pre-trained bidirectional Transformer that set a new state of the art on eleven NLP tasks.
Google answered with BERT, reading text in both directions at once. For a while the field split into two camps - BERT for understanding, GPT for generating.
-
OpenAI's GPT-2 shows language models can learn tasks unsupervised
OpenAI's 1.5-billion-parameter GPT-2, trained only to predict text, performed many language tasks with no task-specific training.
GPT-2 settled an argument nobody knew they were having: a model trained only to predict the next word starts doing translation, summarization, and question answering on its own, if only at a rudimentary level.
-
Scaling laws give AI a predictable recipe for bigger models
OpenAI researchers showed language model performance follows smooth power laws in model size, data, and compute.
Why stop at 1.5 billion parameters? The scaling-laws paper turned 'bigger is better' from a hunch into an equation - and made the next step a calculated investment rather than a gamble.
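What "an equation" means here can be sketched with a power law of the form loss = (N_c / N) ** alpha, where N is the parameter count. The constants below are illustrative assumptions, not the paper's fitted values; the point is that a power law is a straight line on log-log axes, so a fit on small models extrapolates to large ones.

```python
import math


def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Loss as a power law in model size. n_c and alpha are made-up constants."""
    return (n_c / n_params) ** alpha


def fit_power_law(sizes, losses):
    """Least-squares line through (log N, log loss); returns (alpha, n_c)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    alpha = -slope                        # log loss = -alpha * log N + alpha * log n_c
    n_c = math.exp(intercept / alpha)
    return alpha, n_c


# Fit on four small hypothetical models, then extrapolate to larger ones.
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [power_law_loss(n) for n in small_sizes]
alpha_hat, n_c_hat = fit_power_law(small_sizes, small_losses)
predicted_gpt2_scale = power_law_loss(1.5e9, n_c_hat, alpha_hat)
predicted_gpt3_scale = power_law_loss(1.75e11, n_c_hat, alpha_hat)
```

Because the synthetic data is noise-free, the fit recovers the assumed constants almost exactly; real training runs scatter around the line, and the fitted exponent depends on the dataset and setup.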
-
GPT-3 makes few-shot learning work at massive scale
OpenAI's 175-billion-parameter GPT-3 performed many tasks from a few text examples alone, with no fine-tuning or weight updates.
GPT-3 was the equation cashed in: 175 billion parameters, and a new behavior nobody explicitly built - show it a few examples in the prompt and it picks up the task.
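"Show it a few examples in the prompt" is literal: the examples are just text placed ahead of the question, and no weights change. A minimal sketch of assembling such a prompt follows; the `Input:` / `Output:` labels and the helper name are assumptions for illustration, not a format prescribed by the GPT-3 paper, and no model is called.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt from (input, output) example pairs."""
    lines = []
    if instruction:
        lines.append(instruction)
        lines.append("")
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # Leave the final output blank: the model's continuation is the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical English-to-French task with two demonstrations.
prompt = build_few_shot_prompt(
    [("cheese", "fromage"), ("house", "maison")],
    "book",
    instruction="Translate English to French.",
)
```

The resulting string would be sent to the model as-is; a model that has picked up the pattern continues with the translation of the last input.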
-
InstructGPT brings RLHF to GPT-3
OpenAI's InstructGPT used reinforcement learning from human feedback to make GPT-3 follow instructions, the direct precursor of ChatGPT.
Raw GPT-3 completed text; it did not listen. InstructGPT used human feedback to turn a text predictor into something that follows instructions - the final ingredient.
-
OpenAI launches ChatGPT
OpenAI releases ChatGPT, a conversational AI that reaches mass adoption and ignites the modern generative-AI boom.
Put the instruction-tuned model behind a chat box and the sixty-year lineage becomes a product anyone can use. The rest is the world we live in now.