The Road to the Transformer

How sixty years of neural network research led to ChatGPT

14 stops, 1958 - 2022

Every overnight success is decades in the making. The chatbot that reached a hundred million users in two months sits on top of a research lineage that runs through a 1958 Navy press conference, a buried Harvard dissertation, a word-vector trick from a Czech researcher at Google, and one 2017 paper with the most confident title in computer science. This trail walks that lineage stop by stop - each one a primary-sourced entry in the library.

milestone November 1958

The perceptron: a machine that learns from examples

Frank Rosenblatt's perceptron was an early trainable neural network that adjusted its own connections to classify patterns.

It starts with a machine that learns. The perceptron could only draw straight lines through data, but the core idea - adjust the connections, not the program - is the same one running inside every model today.
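
As a rough illustration of "adjust the connections, not the program", here is a minimal sketch of the classic perceptron learning rule. The function names, the toy AND and XOR data, and the learning-rate and epoch settings are assumptions chosen for this example, not details of Rosenblatt's original hardware.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule on (inputs, label) pairs with labels 0 or 1."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in samples:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation > 0 else 0
            error = label - prediction  # 0 when right, +1 or -1 when wrong
            # Nudge each connection only when the prediction was wrong.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# AND can be separated by a straight line; XOR cannot.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_bias = train_perceptron(AND_DATA)
and_predictions = [predict(and_weights, and_bias, x) for x, _ in AND_DATA]

xor_weights, xor_bias = train_perceptron(XOR_DATA)
xor_predictions = [predict(xor_weights, xor_bias, x) for x, _ in XOR_DATA]
```

The same update rule that learns AND never gets all four XOR cases right, because no single straight line separates them. That is the limit the next stops work around.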

milestone August 1974

Werbos describes backpropagation in his Harvard PhD thesis

Paul Werbos's 1974 Harvard dissertation set out the method later known as backpropagation for training multilayer networks.

The perceptron's fatal limit was that a single layer can only draw straight lines, and nobody knew how to train networks with hidden layers that could do more. The answer sat in a Harvard dissertation for over a decade before the field took notice.

milestone October 9, 1986

Rumelhart, Hinton, and Williams popularize backpropagation in Nature

The 1986 Nature paper made backpropagation the standard way to train multilayer neural networks with hidden units.

When Rumelhart, Hinton, and Williams put backpropagation in Nature, training multilayer networks became something any lab could do. The tool was ready; the data and compute were not.
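
To make the idea concrete, here is a minimal sketch of backpropagation on the smallest network that has a hidden unit: one input, one tanh hidden unit, one linear output, and a squared-error loss. The architecture, the variable names, and the numbers are assumptions picked for illustration; the sketch only shows the chain rule passing the error backward and checks the result against a finite-difference estimate.

```python
import math


def forward(w1, w2, x):
    hidden = math.tanh(w1 * x)  # hidden unit
    output = w2 * hidden        # linear output unit
    return hidden, output


def loss(w1, w2, x, target):
    _, output = forward(w1, w2, x)
    return 0.5 * (output - target) ** 2


def backprop(w1, w2, x, target):
    """Gradients of the loss w.r.t. both weights, via the chain rule."""
    hidden, output = forward(w1, w2, x)
    d_output = output - target                    # dL/d(output)
    grad_w2 = d_output * hidden                   # dL/dw2
    d_hidden = d_output * w2                      # error sent back to the hidden unit
    grad_w1 = d_hidden * (1.0 - hidden ** 2) * x  # through tanh, then to w1
    return grad_w1, grad_w2


def numerical_gradient(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2


W1, W2, X, TARGET = 0.5, -1.5, 2.0, 1.0  # arbitrary illustrative values
analytic = backprop(W1, W2, X, TARGET)
numeric = numerical_gradient(W1, W2, X, TARGET)
```

The two gradient estimates agree to within numerical error. The practical point is that backpropagation gets every weight's gradient from one backward pass, instead of one perturbation per weight.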

milestone January 16, 2013

word2vec makes word embeddings practical

Mikolov and colleagues introduced word2vec, an efficient way to learn word vectors that capture meaning through geometry.

Fast forward through a winter and a revival: word2vec showed that meaning itself could live in geometry - king minus man plus woman lands near queen. Language was now something you could do math on.
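
The arithmetic can be sketched with a few hand-made vectors. The three-dimensional values below are invented for this example (roughly "royalty", "male", "female") and are not real word2vec embeddings, which are learned from text and have hundreds of dimensions. The helper names are also assumptions.

```python
import math

# Toy vectors, made up for illustration only.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.05, 0.1, 0.1],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


nearest = analogy("king", "man", "woman", TOY_VECTORS)
```

With these toy values, `nearest` is "queen". Leaving the three query words out of the candidate list is the usual convention when analogies like this are reported.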

milestone September 10, 2014

Sequence to sequence learning

Sutskever, Vinyals, and Le showed neural networks could map input sequences to output sequences, enabling end-to-end translation.

If words are vectors, sentences can be journeys between them. Seq2seq made whole-sequence-to-whole-sequence learning work, but it squeezed every sentence through one fixed-size vector, a bottleneck that hurt most on long sentences.

paper September 2014

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho, and Bengio's 2014 paper introduced attention, letting a translation model look over the whole source sentence instead of relying on one fixed summary.

Attention removed the bottleneck: let the model look back at any part of the input whenever it needs to. This one mechanism is the seed of everything that follows.
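
The mechanism can be sketched in a few lines: score every input position against what the decoder currently needs, turn the scores into weights with a softmax, and take the weighted average. This sketch uses a plain dot product as the score for brevity; the Bahdanau paper itself scores with a small learned network (additive attention). The vectors and function names are assumptions for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return attention weights and the weighted-average context vector."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# One vector per input word (toy values), and one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]
weights, context = attend(query, encoder_states)
```

Here the second and third input positions match the query best, so they receive most of the weight. The context vector is recomputed for every output step, which is what removes the single fixed summary.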

milestone June 12, 2017

The Transformer is introduced

Google researchers published 'Attention Is All You Need', introducing the Transformer architecture that underpins modern AI.

Then Google asked the radical question: what if attention is not an add-on but the whole architecture? Drop the recurrence entirely. Eight authors, one paper, and the blueprint for modern AI.

milestone June 11, 2018

OpenAI's first GPT

OpenAI's first GPT showed that pre-training a Transformer on large amounts of unlabeled text and then fine-tuning it set a new bar across many language tasks.

OpenAI's bet was different from everyone else's: do not build a translation model or a parser - just pre-train a Transformer to predict text, then adapt it. The G, P, and T each mark a deliberate choice.

milestone October 11, 2018

BERT brings deep bidirectional pre-training to language

Google researchers released BERT, a pre-trained bidirectional Transformer that set a new state of the art on eleven NLP tasks.

Google answered with BERT, reading text in both directions at once. For a while the field split into two camps - BERT for understanding, GPT for generating.

milestone February 14, 2019

OpenAI's GPT-2 shows language models can learn tasks unsupervised

OpenAI's 1.5-billion-parameter GPT-2, trained only to predict text, performed many language tasks with no task-specific training.

GPT-2 settled an argument nobody knew they were having: a model trained only to predict the next word starts doing translation, summarization, and question answering on its own.

milestone January 23, 2020

Scaling laws give AI a predictable recipe for bigger models

OpenAI researchers showed language model performance follows smooth power laws in model size, data, and compute.

Why stop at 1.5 billion parameters? The scaling-laws paper turned 'bigger is better' from a hunch into an equation - and made the next step a calculated investment rather than a gamble.
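
What "a hunch into an equation" means in practice can be sketched as fitting a power law, loss = a * N^(-alpha), to small runs and extrapolating to a larger one. The data below is synthetic, generated from made-up constants (a = 10, alpha = 0.08), and is not taken from the paper. The function name is an assumption too.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares line in log-log space: log(loss) = log(a) - alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


# Synthetic "small model" results from illustrative constants.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** -TRUE_ALPHA for n in sizes]

a, alpha = fit_power_law(sizes, losses)
predicted_loss = a * 1e11 ** -alpha  # extrapolate to a model 100x larger
```

A power law is a straight line on log-log axes, so a handful of cheap runs is enough to estimate the exponent and project the loss of a much larger model before paying to train it.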

milestone May 28, 2020

GPT-3 makes few-shot learning work at massive scale

OpenAI's 175-billion-parameter GPT-3 performed many tasks from a few text examples alone, with no fine-tuning or weight updates.

GPT-3 was the equation cashed in: 175 billion parameters, and a new behavior nobody explicitly built - show it a few examples in the prompt and it picks up the task.
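
A few-shot prompt is just text: worked examples followed by an unfinished one for the model to complete. The sketch below only builds such a prompt. The "English:" / "French:" layout and the function name are assumptions for illustration, and no model is called.

```python
def build_few_shot_prompt(examples, query):
    """Lay out solved examples, then the new input with its answer left blank."""
    lines = []
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")  # blank line between examples
    lines.append(f"English: {query}")
    lines.append("French:")  # the model is expected to continue from here
    return "\n".join(lines)


EXAMPLES = [("cheese", "fromage"), ("cat", "chat")]
prompt = build_few_shot_prompt(EXAMPLES, "dog")
```

Nothing about the model changes between tasks. The examples sit in the input, and the task is picked up from context rather than from weight updates.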

milestone January 27, 2022

InstructGPT brings RLHF to GPT-3

OpenAI's InstructGPT used reinforcement learning from human feedback to make GPT-3 follow instructions, the direct precursor of ChatGPT.

Raw GPT-3 completed text; it did not listen. InstructGPT used human feedback to turn a text predictor into something that follows instructions - the final ingredient.

milestone November 30, 2022

OpenAI launches ChatGPT

OpenAI released ChatGPT, a conversational AI that reached mass adoption and ignited the modern generative-AI boom.

Put the instruction-tuned model behind a chat box and the sixty-year lineage becomes a product anyone can use. The rest is the world we live in now.
