Let's Build the GPT Tokenizer

This is Andrej Karpathy’s lecture on building a GPT tokenizer from scratch, published on his own YouTube channel in February 2024 and running about two hours and thirteen minutes. Karpathy, a founding member of OpenAI and former director of AI at Tesla, treats tokenization as a separate stage of the language-model pipeline with its own training set and training algorithm, distinct from the neural network itself.

He implements Byte Pair Encoding (BPE) step by step: the algorithm starts from the raw UTF-8 bytes of the training text and repeatedly merges the most frequent adjacent pair into a new token. He then shows how the encode and decode functions turn text into the integer sequences a model actually sees, and back again. Along the way he explains why tokenization quietly causes many well-known LLM weaknesses: trouble with arithmetic, difficulty reversing strings, odd behavior on non-English text, and the tendency of a model to handle one data format better than another.
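To make the merge loop concrete, here is a minimal sketch of byte-level BPE with matching encode and decode functions. It is an illustration written for this summary, not the code from the lecture: the function names (`get_pair_counts`, `merge`, `train`, `encode`, `decode`) are hypothetical, and `encode` simply replays the learned merges in training order, which is a simplification that assumes a small vocabulary.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges from `text`, starting from raw bytes."""
    ids = list(text.encode("utf-8"))  # token ids 0..255 are the raw bytes
    merges = {}  # (id, id) -> new id, kept in the order the merges were learned
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Turn text into token ids by replaying the merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Turn token ids back into text by expanding each id to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    corpus = "aaabdaaabac"
    merges = train(corpus, num_merges=3)
    tokens = encode(corpus, merges)
    print(merges)
    print(tokens)
    print(decode(tokens, merges))
```

Because every merged token is just the concatenation of the bytes of its two parts, decoding the output of `encode` returns the original string, while the token sequence itself is shorter than the byte sequence wherever the learned pairs occur.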

This is a hands-on talk for people who want to understand what happens before the transformer ever runs. For a business reader, it demystifies a layer that is invisible in product demos but directly shapes cost, multilingual quality, and the strange edge cases that make large language models stumble.

Sources

Related