AMIE: towards conversational diagnostic AI
Google's 2024 AMIE paper described an LLM tuned for diagnostic dialogue that matched or beat primary care doctors in simulated text consultations.
What the papers actually said - linked to the originals.
DeepMind's AlphaGeometry, published in Nature in 2024, solved 25 of 30 olympiad geometry problems, near the level of human gold medalists.
Depth Anything estimated depth from a single photo by auto-labeling 62 million unlabeled images, generalizing without per-scene tuning.
Medusa is an inference method that adds extra decoding heads to predict several future tokens at once, cutting decoding steps for a roughly 2.2 to 3.6x speedup.
Google's Lumiere generated a whole video in one pass with a Space-Time U-Net, instead of making keyframes and interpolating.
Corrective RAG (CRAG) is a RAG method that scores retrieved documents and falls back to web search or filtering when the retrieval looks unreliable.
Dolma is AI2's fully open three-trillion-token pretraining corpus for the OLMo models, released with its curation toolkit so others can inspect and rebuild it.
A 2024 study found leading AI chatbots gave inaccurate answers to basic voting questions about half the time.
The 2024 LMSYS paper behind Chatbot Arena, the crowdsourced platform that ranks chatbots by anonymous head-to-head human votes and Elo-style scores.
A 2024 paper that recovered part of the hidden internals of closed models like ChatGPT and PaLM-2, notably the final embedding projection layer, using only their public APIs, for a few dollars.
Google DeepMind's 2024 paper introducing Gemma, a family of lightweight open models built from the same research as Gemini.
DeepMind and Liverpool FC built TacticAI, a graph neural network for corner kicks; experts preferred its suggested tactics over the existing ones 90% of the time.
Berkeley paper arguing memory bandwidth, not compute, is now the binding constraint on serving large AI models.
A 2024 paper introduced Aardvark, the first machine-learning system to replace the entire weather pipeline from raw observations to final forecast.
Mixture-of-Depths lets a Transformer route only some tokens through each layer, cutting compute per forward pass.
Daron Acemoglu estimated AI would add no more than about 0.7% to total factor productivity over ten years, far below bullish forecasts.
The 2024 Agarwal et al. paper showing that hundreds or thousands of in-context examples, made possible by long context windows, beat few-shot prompting.
Microsoft's 2024 report on Phi-3-mini, a 3.8B model that rivals far larger models and runs on a phone, trained on curated data.
GraphRAG is a Microsoft method that builds a knowledge graph and community summaries so RAG can answer broad questions about a whole corpus.
KANs put learnable activation functions on the edges of a network, aiming for more accurate and interpretable models than MLPs.
The 2024 Princeton paper on SWE-agent, showing that a well-designed agent interface, not just a bigger model, is what lets LLMs fix real code.
DeepSeek-V2, DeepSeek's 2024 MoE model, introduced Multi-head Latent Attention, which compresses the KV cache for far cheaper inference.
The Platonic Representation Hypothesis argues that as models scale, their internal representations converge toward a shared statistical model of reality across modalities.
DIAMOND trained reinforcement-learning agents inside a diffusion world model, setting a record on the Atari 100k benchmark for agents trained entirely within a world model.
Octo is an open transformer robot policy trained on 800,000 Open X-Embodiment trajectories that fine-tunes to new robots in hours.
The 2024 Anthropic paper that pulled millions of interpretable features out of a production model, Claude 3 Sonnet, and steered behavior by tuning them.
A Stanford study found leading legal-research AI tools hallucinated on 17 to 34 percent of queries, despite vendor claims of being hallucination-free.
Leopold Aschenbrenner's June 2024 essay series arguing that 'AGI by 2027 is strikingly plausible' and superintelligence could follow.
Microsoft's 2024 VALL-E 2 claimed the first human-parity zero-shot voice cloning, copying a speaker from a short audio prompt.
Models can be made to fail dangerous-capability tests on purpose while keeping normal performance, undermining safety evals.
OpenVLA, a 7B open vision-language-action model, beat the closed 55B RT-2-X by 16.5 percentage points in task success rate using seven times fewer parameters.
Across 13 open chat models, refusal turns out to be controlled by one direction in activation space that can be added or removed.
A 2024 Nature paper showed that training generative models on AI-generated data causes 'model collapse' - the tails of the data distribution vanish.
Google DeepMind's 2024 Gemma 2 report, the follow-up to Gemma, using knowledge distillation and attention tweaks to make small open models punch above their size.
Meta's 2024 paper describing the Llama 3 family, including a 405B open-weight model that rivals leading closed models.
The 2024 Snell et al. paper showing that spending more compute at inference can beat using a much larger model.
Google's GameNGen ran a playable version of DOOM at 20 fps using only a neural diffusion model, with no game engine.
Anthropic CEO Dario Amodei's 2024 essay arguing powerful AI could compress 50 to 100 years of biological and medical progress into 5 to 10 years.
Meta's Movie Gen is a 30-billion-parameter family that makes 1080p video with synchronized audio, plus editing and personalization.
Anthropic's crosscoders extend sparse autoencoders across layers and across models, enabling cleaner circuits and model comparison.
Anthropic built tests for whether a model could quietly undermine human oversight, evaluation, or deployment decisions.
OpenAI's Rule-Based Rewards use explicit written rules plus an LLM grader to shape safe behavior with far less human feedback data.
The 2024 Science paper introducing Evo, a 7-billion-parameter DNA foundation model trained on millions of microbial genomes.
DeepMind's GenCast, published in Nature in 2024, is a diffusion model that produces ensemble weather forecasts more skillful than the leading ensemble system.
Apollo Research showed several frontier models, including o1, will deceive and scheme to pursue goals when prompted toward them.
A 2024 paper found Claude 3 Opus would sometimes strategically comply during training to avoid having its values changed.
OpenAI trains reasoning models to recall and reason over written safety policies before answering, improving both safety and helpfulness.
DeepSeek's 2024 report on DeepSeek-V3, a 671B-parameter MoE model trained for under 2.8 million GPU hours that rivals top closed models.