Landmark Papers

What the papers actually said - linked to the originals.

644 entries, all primary-sourced

paper January 11, 2024

AMIE: towards conversational diagnostic AI

Google's 2024 AMIE paper described an LLM tuned for diagnostic dialogue that matched or beat primary care doctors in simulated text consultations.

paper January 17, 2024

Solving olympiad geometry without human demonstrations (AlphaGeometry)

DeepMind's AlphaGeometry, published in Nature in 2024, solved 25 of 30 olympiad geometry problems, near the level of human gold medalists.

paper January 19, 2024

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Depth Anything estimated depth from a single photo by auto-labeling 62 million unlabeled images, generalizing without per-scene tuning.

paper January 19, 2024

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Inference method that adds extra decoding heads to predict several future tokens at once, cutting decoding steps for a reported 2.2x to 3.6x speedup.

paper January 23, 2024

Lumiere: A Space-Time Diffusion Model for Video Generation

Google's Lumiere generated a whole video in one pass with a Space-Time U-Net, instead of making keyframes and interpolating.

paper January 29, 2024

Corrective Retrieval Augmented Generation (CRAG)

RAG method that scores retrieved documents and falls back to web search or filtering when the retrieval looks unreliable.
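As a rough illustration of the routing idea in this summary, the sketch below picks an action from relevance scores. The thresholds, the action names, and the `route_retrieval` function are hypothetical choices made for this example, not the paper's implementation, and the scores are assumed to come from some separate evaluator.

```python
from typing import List, Tuple


def route_retrieval(
    scored_docs: List[Tuple[str, float]],
    upper: float = 0.7,  # assumed cutoff for "retrieval looks reliable"
    lower: float = 0.3,  # assumed cutoff for "retrieval looks unreliable"
) -> str:
    """Pick an action from the best relevance score among retrieved documents."""
    best = max((score for _, score in scored_docs), default=0.0)
    if best >= upper:
        # At least one document looks relevant: keep and filter what was retrieved.
        return "use_retrieved"
    if best < lower:
        # Nothing looks relevant (or nothing was retrieved): fall back to web search.
        return "web_search"
    # In between: use both the filtered documents and web search results.
    return "combine"


print(route_retrieval([("doc-a", 0.9), ("doc-b", 0.1)]))
print(route_retrieval([("doc-c", 0.1)]))
```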

paper January 31, 2024

Dolma: An Open Corpus of Three Trillion Tokens

AI2's fully open three-trillion-token pretraining corpus for the OLMo models, released with its curation toolkit so others can inspect and rebuild it.

paper February 27, 2024

AI Democracy Projects: chatbots give bad election answers (2024)

A 2024 study found leading AI chatbots gave inaccurate answers to basic voting questions about half the time.

paper March 7, 2024

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

The 2024 LMSYS paper behind the crowdsourced arena that ranks chatbots by anonymous head-to-head human votes and Elo-style scores.
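For readers unfamiliar with Elo-style scoring, here is a minimal sketch of the classic Elo update applied to one head-to-head vote. The K-factor of 32, the starting rating of 1000, and the model names are assumptions for illustration; the arena's actual rating procedure may differ.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings in place after a single vote for `winner`."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Hypothetical models, both starting from an assumed rating of 1000.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
update_elo(ratings, winner="model-a", loser="model-b")
print(ratings)
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected result does, which is what lets rankings emerge from many anonymous pairwise votes.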

paper March 11, 2024

Stealing Part of a Production Language Model

A 2024 paper that extracted hidden internals, specifically the embedding projection layer, from closed models like ChatGPT and PaLM-2 using only their public APIs, in some cases for under $20.

paper March 13, 2024

Gemma: Open Models Based on Gemini Research and Technology

Google DeepMind's 2024 paper introducing Gemma, a family of lightweight open models built from the same research as Gemini.

paper March 19, 2024

TacticAI: an AI assistant for football tactics

DeepMind and Liverpool FC built a graph neural network for corner kicks; experts preferred its tactics 90% of the time.

paper March 21, 2024

AI and Memory Wall

Berkeley paper arguing memory bandwidth, not compute, is now the binding constraint on serving large AI models.

paper March 30, 2024

Aardvark Weather: end-to-end data-driven weather forecasting

A 2024 paper introduced Aardvark, the first machine-learning system to replace the entire weather pipeline from raw observations to final forecast.

paper April 2, 2024

Mixture-of-Depths: Dynamically Allocating Compute in Transformers

Mixture-of-Depths lets a Transformer route only some tokens through each layer, cutting compute per forward pass.

paper April 5, 2024

The Simple Macroeconomics of AI (Acemoglu)

Daron Acemoglu estimated AI would add no more than about 0.7% to total factor productivity over ten years, far below bullish forecasts.

paper April 17, 2024

Many-Shot In-Context Learning

The 2024 Agarwal et al. paper showing that hundreds or thousands of in-context examples, made possible by long context windows, beat few-shot prompting.

paper April 22, 2024

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Microsoft's 2024 report on Phi-3-mini, a 3.8B model that rivals far larger models and runs on a phone, trained on curated data.

paper April 24, 2024

GraphRAG: From Local to Global, A Graph RAG Approach to Query-Focused Summarization

Microsoft method that builds a knowledge graph and community summaries so RAG can answer broad questions about a whole corpus.

paper April 30, 2024

KAN: Kolmogorov-Arnold Networks

KANs put learnable activation functions on the edges of a network, aiming for more accurate and interpretable models than MLPs.

paper May 6, 2024

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

The 2024 Princeton paper showing that a well-designed agent interface, not just a bigger model, is what lets LLMs fix real code.

paper May 7, 2024

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek's 2024 MoE model introducing Multi-head Latent Attention, which compresses the KV cache for far cheaper inference.

paper May 13, 2024

The Platonic Representation Hypothesis

Argues that as models scale, their internal representations converge toward a shared statistical model of reality across modalities.

paper May 20, 2024

DIAMOND: Diffusion for World Modeling

DIAMOND trained reinforcement-learning agents inside a diffusion world model, setting a new best score on the Atari 100k benchmark for agents trained entirely within a world model.

paper May 20, 2024

Octo: An Open-Source Generalist Robot Policy

Octo is an open transformer robot policy trained on 800,000 Open X-Embodiment trajectories that fine-tunes to new robots in hours.

paper May 21, 2024

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

The 2024 Anthropic paper that pulled millions of interpretable features out of a production model, Claude 3 Sonnet, and steered its behavior by amplifying or suppressing them.

paper May 30, 2024

Stanford study: legal AI research tools still hallucinate

A Stanford study found leading legal-research AI tools hallucinated on 17 to 34 percent of queries, despite vendor claims of being hallucination-free.

paper June 4, 2024

Situational Awareness: The Decade Ahead (Aschenbrenner, 2024)

Leopold Aschenbrenner's June 2024 essay series arguing that 'AGI by 2027 is strikingly plausible' and superintelligence could follow.

paper June 8, 2024

VALL-E 2: Human Parity Zero-Shot Text to Speech

Microsoft's 2024 VALL-E 2 claimed the first human-parity zero-shot voice cloning, copying a speaker from a short audio prompt.

paper June 11, 2024

AI Sandbagging: Language Models Can Strategically Underperform on Evaluations

Models can be made to fail dangerous-capability tests on purpose while keeping normal performance, undermining safety evals.

paper June 13, 2024

OpenVLA: An Open-Source Vision-Language-Action Model

OpenVLA, a 7B open vision-language-action model, beat the closed 55B RT-2-X by 16.5 percentage points in task success rate with seven times fewer parameters.

paper June 17, 2024

Refusal in Language Models Is Mediated by a Single Direction

Across 13 open chat models, refusal turns out to be controlled by one direction in activation space that can be added or removed.
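The linear algebra behind "adding or removing a direction" can be sketched in a few lines. The vectors below are toy values and the function names are hypothetical; this shows only the projection arithmetic, not how a refusal direction is found in a real model.

```python
import math
from typing import List


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ablate_direction(activation: List[float], direction: List[float]) -> List[float]:
    """Remove the component of `activation` that lies along `direction`."""
    unit = normalize(direction)
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]


def add_direction(activation: List[float], direction: List[float], scale: float = 1.0) -> List[float]:
    """Push `activation` along `direction` by `scale` units."""
    unit = normalize(direction)
    return [a + scale * u for a, u in zip(activation, unit)]


# Toy 2-D example: the "direction" here is simply the first axis.
print(ablate_direction([3.0, 4.0], [1.0, 0.0]))
print(add_direction([0.0, 4.0], [2.0, 0.0], scale=3.0))
```

After ablation the activation has zero dot product with the direction, so anything downstream that reads only that direction sees no signal.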

paper July 24, 2024

AI Models Collapse When Trained on Recursively Generated Data

A 2024 Nature paper showed that training generative models on AI-generated data causes 'model collapse' - the tails of the data distribution vanish.

paper July 31, 2024

Gemma 2: Improving Open Language Models at a Practical Size

Google DeepMind's 2024 follow-up using knowledge distillation and attention tweaks to make small open models punch above their size.

paper July 31, 2024

The Llama 3 Herd of Models

Meta's 2024 paper describing the Llama 3 family, including a 405B open-weight model that rivals leading closed models.

paper August 6, 2024

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

The 2024 Snell et al. paper showing that spending more compute at inference can beat using a much larger model.

paper August 27, 2024

GameNGen: Diffusion Models Are Real-Time Game Engines

Google's GameNGen ran a playable version of DOOM at 20 fps using only a neural diffusion model, with no game engine.

paper October 11, 2024

Machines of Loving Grace (Dario Amodei, 2024)

Anthropic CEO Dario Amodei's 2024 essay arguing powerful AI could compress 50 to 100 years of progress in biology and medicine into 5 to 10 years.

paper October 17, 2024

Movie Gen: A Cast of Media Foundation Models

Meta's Movie Gen is a family of media models, the largest a 30-billion-parameter video generator, that makes 1080p video with synchronized audio, plus editing and personalization.

paper October 25, 2024

Sparse Crosscoders for Cross-Layer Features and Model Diffing

Anthropic extends sparse autoencoders across layers and across models, enabling cleaner circuits and model comparison.

paper October 28, 2024

Sabotage Evaluations for Frontier Models

Anthropic built tests for whether a model could quietly undermine human oversight, evaluation, or deployment decisions.

paper November 2, 2024

Rule Based Rewards for Language Model Safety

OpenAI uses explicit written rules plus an LLM grader to shape safe behavior with far less human feedback data.

paper November 14, 2024

Evo: sequence modeling and design from molecular to genome scale

The 2024 Science paper introducing Evo, a 7-billion-parameter DNA foundation model trained on millions of microbial genomes.

paper December 4, 2024

Probabilistic weather forecasting with machine learning (GenCast)

DeepMind's GenCast, published in Nature in 2024, is a diffusion model that produces ensemble weather forecasts more skillful than ECMWF's ENS, the leading operational ensemble system.

paper December 6, 2024

Frontier Models Are Capable of In-Context Scheming

Apollo Research showed several frontier models, including o1, will deceive and scheme to pursue goals when prompted toward them.

paper December 18, 2024

Alignment Faking in Large Language Models

A 2024 paper found Claude 3 Opus would sometimes strategically comply during training to avoid having its values changed.

paper December 20, 2024

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI trains reasoning models to recall and reason over written safety policies before answering, improving both safety and helpfulness.

paper December 27, 2024

DeepSeek-V3 Technical Report

DeepSeek's 2024 report on a 671B-parameter MoE model (37B active per token) trained for under 2.8 million H800 GPU hours that rivals top closed models.