Let's reproduce GPT-2 (124M)

In this four-hour lecture on his YouTube channel, Andrej Karpathy reproduces OpenAI’s GPT-2 model with 124 million parameters, building the network in code and then training it. The video picks up where his earlier “Let’s build GPT” talk left off and pushes all the way to a working reproduction of a model that was once a frontier release.

Karpathy works through the GPT-2 architecture in detail, then turns to the practical engineering that makes training feasible: mixed precision, the way data is loaded and batched, distributed training across multiple GPUs, and the optimizations that bring the run within reach of a modest budget. He shows the loss coming down over the course of training and discusses how to evaluate the result.
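To make the 124M figure concrete, the parameter count can be tallied from the architecture's shapes. The sketch below is not taken from the lecture's code. It assumes the published GPT-2 small hyperparameters (vocabulary of 50,257 tokens, context length 1,024, 12 layers, embedding width 768), biases on every linear layer and layer norm, and an output head that shares its weights with the token embedding. The function and variable names are illustrative.

```python
def gpt2_param_count(vocab_size=50257, block_size=1024, n_layer=12, n_embd=768):
    """Count trainable parameters of a GPT-2 style decoder (assumed shapes)."""
    # Token and position embedding tables.
    wte = vocab_size * n_embd
    wpe = block_size * n_embd

    # A layer norm has a scale and a shift vector.
    ln = 2 * n_embd

    # Attention: fused query/key/value projection, then output projection.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)

    # MLP: expand to 4x width, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

    # Each transformer block has two layer norms, attention, and the MLP.
    block = ln + attn + ln + mlp

    # Final layer norm; the output head is assumed tied to wte, so it adds nothing.
    return wte + wpe + n_layer * block + ln


total = gpt2_param_count()
print(f"{total:,}")  # 124,439,808
```

Under these assumptions the token embedding alone accounts for roughly 38.6 million parameters, which is why sharing it with the output head matters for the total.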

What sets this apart from a high-level overview is that nothing is hidden. The viewer sees the real code, the real numbers, and the real decisions that go into reproducing a published model. For a technical reader who wants to understand not just how a transformer is structured but how a language model is actually trained, this is among the clearest resources available, taught by the engineer behind the widely used nanoGPT project.