Landmark Papers

What the papers actually said, linked to the originals.

644 entries, all primary-sourced

paper January 16, 2025

A generative model for inorganic materials design (MatterGen)

Microsoft's MatterGen, published in Nature in 2025, is a diffusion model that generates new crystal structures to order; the team then validated one of its designs experimentally.

paper February 3, 2025

Constitutional Classifiers: Defending Against Universal Jailbreaks

Anthropic wraps a model in input and output classifiers trained on synthetic data, cutting jailbreak success below 5 percent.

paper February 11, 2025

BBC study: AI assistants distort news content

BBC research found that over half of AI-assistant answers about news had significant issues, including factual errors and altered quotes.

paper February 24, 2025

Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

Fine-tuning a model on one narrow bad task (writing insecure code) made it broadly malicious across unrelated questions.

paper March 14, 2025

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

OpenAI found that pressuring a model's chain of thought to look clean teaches it to hide cheating rather than stop it.

paper March 27, 2025

Circuit Tracing: Revealing Computational Graphs in Language Models

Anthropic's method paper introducing cross-layer transcoders and attribution graphs to trace a model's computation.

paper March 27, 2025

On the Biology of a Large Language Model

Anthropic traced the internal circuits of Claude 3.5 Haiku, showing planning, multi-step reasoning, and unfaithful explanations.

paper April 15, 2025

AI as Normal Technology

Narayanan and Kapoor's 2025 essay arguing AI should be treated like past general-purpose technologies, not as looming superintelligence.

paper April 29, 2025

The Leaderboard Illusion

A 2025 critique showing private testing and unequal sampling can bias Chatbot Arena rankings.

paper June 7, 2025

The Illusion of Thinking

Apple researchers show that reasoning models' accuracy on puzzles collapses to zero once problem complexity passes a threshold.

paper June 17, 2025

Reuters Institute Digital News Report 2025 on AI and news

The Reuters Institute's 2025 survey found 7% of people use AI chatbots for news weekly, rising to 15% of under-25s, amid broad public scepticism.

paper August 1, 2025

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Anthropic identifies activation directions for traits like sycophancy or malice, then uses them to monitor and steer model character.

paper October 21, 2025

EBU/BBC study: AI assistants misrepresent news 45% of the time

A 22-broadcaster international study found 45% of AI-assistant news answers had a significant issue, consistent across languages and territories.

paper May 7, 2026

Natural Language Autoencoders: Turning Claude's Thoughts Into Text

Anthropic method that translates a model's internal activations into readable text, exposing thoughts the model never says out loud.

paper May 8, 2026

Teaching Claude Why: Reducing Agentic Misalignment

Anthropic finds that teaching a model the principles behind ethical behavior, not just examples, drove blackmail rates from up to 96 percent to near zero.

paper May 28, 2026

Gram: Assessing Sabotage Propensities via Automated Alignment Auditing

Google DeepMind framework that tests Gemini for sabotage across simulated deployments; most failures trace to role-play, not deliberate misalignment.

paper May 28, 2026

Realistic Honeypot Evaluations for Scheming Propensity

DeepMind plants scheming traps in its own codebases; Gemini does not scheme unprompted, but explicit agency prompts can elicit sabotage attempts.

paper June 1, 2026

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA's open Cosmos 3 unifies language, image, video, audio, and action in one architecture as a backbone for robots and embodied agents.

paper June 2, 2026

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Ultralytics YOLO26 drops non-maximum suppression (NMS) and Distribution Focal Loss, giving end-to-end real-time detection across five model scales.

paper June 4, 2026

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

Relabeling an LLM's own wrong claim as coming from an external source raises its correction rate by 23 to 93 percentage points.