Medusa, published by Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao in January 2024, speeds up language model inference by attacking its fundamental bottleneck: models generate text one token per forward pass, so producing a long answer means many sequential, latency-bound steps.
Medusa adds several extra decoding heads to a model, each trained to predict the token at a different position beyond the next one. At each step the heads propose multiple candidate continuations at once, and a tree-based attention mechanism verifies them in parallel, accepting the longest run that matches what the base model would have produced. Because several tokens can be confirmed in a single pass, the number of decoding steps drops sharply. Medusa-1, which fine-tunes only the extra heads while freezing the backbone, delivers a speedup of over 2.2 times with no loss in output quality, and Medusa-2, which fine-tunes the heads together with the backbone, reaches 2.3 to 3.6 times.
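The accept-the-longest-matching-run idea can be illustrated with a small toy sketch. This is not the authors' implementation: the names `longest_accepted_prefix`, `medusa_style_step`, and `toy_base_next` are hypothetical, the "base model" is a stand-in function that simply counts upward, and verification is simulated with repeated calls rather than the single tree-attention forward pass the real method uses. It also assumes greedy decoding, where a candidate token is accepted only if it equals the base model's own choice.

```python
from typing import Callable, List, Sequence

Token = int
NextFn = Callable[[Sequence[Token]], Token]


def toy_base_next(context: Sequence[Token]) -> Token:
    """Hypothetical stand-in for the base model: greedily predicts last token + 1 (mod 10)."""
    return (context[-1] + 1) % 10


def longest_accepted_prefix(
    context: Sequence[Token], candidate: Sequence[Token], base_next: NextFn
) -> List[Token]:
    """Return the leading tokens of `candidate` that the base model would have produced itself."""
    accepted: List[Token] = []
    for tok in candidate:
        # Stop at the first token that differs from the base model's greedy choice.
        if base_next(list(context) + accepted) != tok:
            break
        accepted.append(tok)
    return accepted


def medusa_style_step(
    context: Sequence[Token], candidates: Sequence[Sequence[Token]], base_next: NextFn
) -> List[Token]:
    """One simulated decoding step: keep the longest accepted run, then add one base-model token."""
    best = max(
        (longest_accepted_prefix(context, c, base_next) for c in candidates),
        key=len,
        default=[],
    )
    # The base model always contributes one token, so every step makes progress
    # even when every candidate is rejected.
    bonus = base_next(list(context) + best)
    return best + [bonus]


# Two candidate continuations, as the extra heads might propose them (made-up values).
new_tokens = medusa_style_step([1, 2, 3], [[4, 5, 9], [4, 6, 7]], toy_base_next)
print(new_tokens)
```

In this toy run the first candidate matches the base model for two tokens before diverging, so one step yields three tokens instead of one. The result is identical to what plain greedy decoding would have produced, only reached in fewer steps.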
Medusa is a simpler, self-contained cousin of speculative decoding that does not require a separate draft model.
For a business, faster decoding means lower latency and lower serving cost for the same model, which directly affects how responsive and affordable an AI product feels to its users.