Article 6: Transformers and the LLM Era - The Story of AI

In June of 2017, a group of eight researchers at Google published a paper with a title that was also a quiet act of rebellion. Most scientific papers have titles like “A Probabilistic Framework for Sequence Transduction.” This one was called, simply, “Attention Is All You Need.”

It sounds almost flippant. It was not. It was a thesis statement. For years, as we saw last time, the machines that handled language had read words the slow way - one at a time, in order, dragging a memory along behind them. Researchers had added a clever patch called attention, a way for the model to glance back and weigh which earlier words mattered most. And these eight researchers made a daring bet: that you could rip out all the slow, sequential machinery, keep only the attention, and build a language model out of that alone. They called the new design the Transformer.
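For readers who want to see the glance-back idea in miniature, here is a minimal sketch in Python. It is a simplified illustration of scaled dot-product attention for a single query, not the full Transformer from the paper. The function names attend and softmax, and the tiny two-number word vectors, are invented for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weigh every earlier word by how well its key matches the query."""
    scale = math.sqrt(len(query))
    # One score per earlier word: a scaled dot product of query and key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is a blend of the values, mixed according to the weights.
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended


# Three hypothetical "earlier words", each with a key and a value vector.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, blended = attend(query, keys, values)
print(weights)
print(blended)
```

Nothing in that calculation depends on reading the words in order. Every earlier word is scored at once, which is a large part of why the design suits hardware built to do many things in parallel.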

I want to be clear about how big this is, because it is easy to let it slide past. Nearly every artificial intelligence system you have heard of - the chatbots, the image generators, the coding assistants, all of it - is a descendant of that one 2017 paper. This is the chapter where the machines learned to read, and write, and where the field discovered a recipe so simple and so powerful that it would reorganize the entire world economy around the purchase of computer chips. It runs from 2017 to 2021. And its central, almost unbelievable discovery can be summed up in one word: bigger.

Here is what made the Transformer special, beyond its speed. It paired perfectly with an idea called pre-training. Instead of building a separate machine for each task - one for translation, one for answering questions - you do something much lazier and much more profound. You take one enormous Transformer, and you feed it a staggering amount of ordinary text - books, websites, the rambling content of the open internet - and you give it one mind-numbingly simple job: predict the next word. Over and over, billions of times. Just guess what comes next. And in the process of getting good at that one boring task, the model is forced to absorb grammar, and facts, and reasoning, and the texture of how humans think on the page. It learns about the world by learning to finish our sentences.
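The job of predicting the next word can be shown with a toy that is nothing like a Transformer but has the same goal. The sketch below, in Python, simply counts which word follows which in a scrap of text and then guesses the most frequent follower. The function names train and predict_next and the three-sentence corpus are made up for illustration. A real model replaces the counting with billions of learned connections.

```python
from collections import Counter, defaultdict


def train(text):
    """Count, for every word, which words come right after it."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Guess the follower seen most often in training, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# A deliberately tiny, invented training text.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept ."
model = train(corpus)

print(predict_next(model, "the"))
print(predict_next(model, "dog"))
```

Even this crude counter picks up a sliver of the pattern in its text. The bet of pre-training is that a vastly larger model, doing the same job on a vastly larger text, picks up vastly more.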

In 2018, Google’s version of this, called BERT, set new records on a broad range of language-understanding benchmarks. A few months before that, a young lab - OpenAI, the one founded as a counterweight just a few years earlier - had released its own version, which it named, with no particular fanfare at the time, GPT. And then OpenAI made the bet that would define the era. They asked a question nobody quite knew the answer to: what happens if we just make it bigger? Not smarter in some clever new way. Just bigger. More text, more chips, more connections.

The answer, when it came, was unsettling. In 2019, OpenAI built a model called GPT-2, and it could write paragraphs so coherent that the company did something almost unheard of - it announced that it was too dangerous to release in full, for fear of what people might do with a machine that could generate convincing fake text at scale. Critics rolled their eyes; some called it a publicity stunt. But it was the first public sign that something strange was happening - that simply scaling the thing up was producing capabilities nobody had explicitly built in.

Then, in 2020, came GPT-3, and it was the moment the ground shifted. It was more than a hundred times larger than GPT-2: 175 billion parameters to GPT-2’s 1.5 billion. And it could do something genuinely new: you could give it a task it had never been specifically trained on - translate this, write a poem in this style, answer this riddle - with just a couple of examples typed into the prompt, and it would simply do it. No retraining. No new programming. You described what you wanted, in plain English, and the machine obliged. This is the moment the phrase large language model began to enter the wider vocabulary.
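To make "a couple of examples typed into the prompt" concrete, here is a minimal sketch in Python of how such a prompt might be assembled. The helper build_few_shot_prompt and the Input/Output layout are assumptions made for this illustration, not a format prescribed by OpenAI, and the sketch only builds the text. It does not call any model.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Lay out an instruction, a few worked examples, and one open question."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final entry is left unfinished; the model's job is to complete it.
    lines.append(f"Input: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("good morning", "bonjour")],
    "thank you",
)
print(prompt)
```

The whole "program" is a piece of text. Because the model was trained only to continue text, handing it a pattern with the last slot left blank is enough to point it at the task.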

And why did simply making it bigger keep working? That same year, OpenAI’s researchers published one of the most consequential findings in the field’s history: the scaling laws. They showed that a model’s performance improved in a smooth, predictable curve as you increased three things - its size, its data, and its computing power. Predictable. That word changed everything. Because if bigger reliably means better, then progress is no longer a mystery to be solved by genius. It becomes a purchase order. It becomes a question of money and chips and will. That realization set off the largest infrastructure buildout in the history of computing. It also raised the stakes so high that a group of senior researchers, led by a man named Dario Amodei, left OpenAI over questions of safety and founded their own lab, called Anthropic, devoted to making these increasingly powerful systems safe.
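What "smooth and predictable" means can be sketched with a power law, the general shape the scaling-law work reported. In the Python below, the function predicted_loss, the reference size, and the exponent are placeholders chosen for illustration, not the fitted values from the research. A lower loss means better predictions.

```python
def predicted_loss(size, reference_size=1e13, exponent=0.08):
    """Illustrative power law: the error shrinks steadily as the model grows."""
    return (reference_size / size) ** exponent


# Four hypothetical models, each ten times larger than the last.
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [predicted_loss(n) for n in sizes]

# How much of the error survives each tenfold increase in size.
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]

for size, loss in zip(sizes, losses):
    print(f"{size:.0e} parameters -> predicted loss {loss:.3f}")
print(ratios)
```

Under these assumed numbers, every tenfold increase trims the error by the same fraction, leaving about 0.83 of what was there before. That sameness is the point: if the curve holds, you can estimate what the next, larger model will do before you pay to build it.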

It was not all chatbots. That same scaling magic was pointed at science, and in 2020, DeepMind’s AlphaFold cracked a problem that had stumped biologists for fifty years - predicting the intricate folded shape of proteins - with an accuracy that rivaled years of painstaking laboratory work. It was a glimpse of these tools as instruments of genuine discovery, not just clever conversationalists.

But not everyone was cheering. As the models swelled, a group of researchers, including Emily Bender and Timnit Gebru, published a sharp warning. These giant language models, they argued, were expensive, environmentally costly, impossible to fully understand, and - crucially - prone to absorbing and amplifying every bias buried in the internet they were trained on. They had a memorable phrase for what these systems really were, underneath the fluency: stochastic parrots. Machines that could produce dazzlingly plausible language without any understanding of what they were saying. The question they raised - is this real comprehension, or just a very sophisticated echo? - would only grow louder. And the fact that the dispute cost Gebru her job at Google turned it into a flashpoint about who is even allowed to question these systems from the inside.

So by the end of 2021, all the pieces were in place. The Transformer gave the field its engine. Pre-training gave it a way to swallow the whole of human writing. The scaling laws gave it a roadmap that read, simply: keep going. And GPT-3 had shown the world a machine that could write and reason from a plain-language request.

But it was still, in a deep sense, a wild thing. Powerful, strange, unpredictable. It would wander off, ignore you, make things up with total confidence. It lived behind a programmer’s interface, accessible to a few thousand specialists. It was a research artifact, not a product. What was missing was a thin final layer - something to tame it, to teach it manners, to make it follow instructions and feel safe enough to hand to your grandmother.

In the last weeks of 2022, that layer arrived, wrapped in a simple white text box. And in two months, a hundred million people would walk through it.

That is where the final chapter begins.

