ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
The 2019 Microsoft paper introducing ZeRO, the memory-partitioning technique behind DeepSpeed that made training models with 100B-plus parameters feasible.
What the papers actually said - linked to the originals.
Stuart Russell's 2019 book argues that AI should be rebuilt to be uncertain about human preferences rather than to optimize a fixed objective.
The 2019 Google paper introducing T5, which casts every NLP task as text-in, text-out, and the C4 web corpus used to train it.
An encoder-decoder model pretrained by corrupting text and learning to reconstruct it, strong at summarization and generation.
Francois Chollet's 2019 paper defines intelligence as skill-acquisition efficiency and introduces the ARC benchmark.
Facebook's 2019 paper showed one Transformer trained on 100 languages can beat multilingual BERT without hurting per-language quality.
Meta's Demucs split a song into vocals, drums, bass, and other by working on the raw waveform instead of spectrograms.
StyleGAN2 removed the blob-like artifacts of the original StyleGAN and set a new bar for photorealistic face synthesis.
The 2019 Dreamer paper learned a world model from images and trained behaviors by imagining rollouts in its latent space.
The 2019 NeurIPS paper describing PyTorch's design, arguing that an imperative, Pythonic framework can be both easy to use and fast on GPUs.
The 2019 OpenAI paper showing that as models grow, test error can fall, rise, and then fall again, a pattern that departs from the classic bias-variance tradeoff.
Introduced SISA, a training design that lets a model efficiently forget specific data without retraining from scratch.
Lim and colleagues' attention-based forecasting model that handles mixed inputs and stays interpretable for multi-horizon time-series prediction.
Google Magenta's DDSP put classic synthesizer building blocks inside a neural network, enabling pitch and timbre control.
The 2020 Nature paper describing the first AlphaFold, which won CASP13 by predicting protein shapes with deep-learned distance potentials.
The 2020 OpenAI paper that found language-model loss falls as a smooth power law in model size, data, and compute over seven orders of magnitude.
Google paper that pre-trains a language model together with a document retriever, learning to look things up instead of memorizing.
MIT's 2020 Cell paper used a neural network to screen molecules and discover halicin, a structurally novel antibiotic effective against resistant bacteria.
Google's 2020 paper that used evolutionary search to rediscover ML algorithms like neural nets and backpropagation from raw math operations.
The 2020 NeRF paper fit a small neural network to a handful of photos and rendered photorealistic novel views of a 3D scene.
Replaces masked-word prediction with detecting fake tokens, learning far more efficiently than BERT.
Showed that learned dense embeddings can beat keyword search like BM25 at finding relevant passages.
Transformer that replaces full attention with a sliding window plus global attention, so cost scales linearly with document length.
A retrieval model that keeps BERT's accuracy but precomputes document representations for fast search.
Google's 2020 Conformer combined convolution and self-attention to set new accuracy records on the LibriSpeech speech benchmark.
The 2020 Lewis et al. paper that coined RAG, combining a seq2seq model with a learned retriever over a Wikipedia vector index.
DETR treated object detection as direct set prediction with a transformer, dropping the hand-tuned anchors and NMS earlier detectors needed.
The 2020 OpenAI paper introducing GPT-3, a 175-billion-parameter model that performed new tasks from a few examples in its prompt, with no retraining.
Separates word content and position into distinct vectors and was first to top the human baseline on SuperGLUE.
The 2020 CQL paper made offline RL reliable by learning a Q-function that lower-bounds true value, curbing overestimation.
Microsoft's 2020 FastSpeech 2 dropped the teacher-student trick and conditioned on pitch, energy, and duration for better fast TTS.
A program synthesis system that learns its own library of concepts while solving problems, guided by neural search.
An attention network for 3D point clouds and graphs whose predictions stay consistent under rotation and translation.
The 2020 DDPM paper made diffusion models work for high-quality image generation, setting a new FID record on CIFAR-10 and seeding the diffusion era.
Meta's wav2vec 2.0 learned speech from raw unlabeled audio, then matched strong systems with only minutes of labeled data.
A modern continuous Hopfield network that stores exponentially many patterns and whose update rule equals Transformer attention.
Google sparse-attention transformer combining window, random, and global attention to handle sequences up to 8x longer.
The 2020 OpenAI paper applying RLHF to text summarization, showing models tuned on human preferences beat much larger supervised models.
Used a transformer to prove formal theorems in Metamath, the first deep-learning proofs adopted by a math community.
COMET is a 2020 neural metric that scores machine translations by how well they match human judgments, beating older word-overlap metrics.
Performers approximate softmax attention in linear time and memory using random features, with provable accuracy.
The 2020 DreamerV2 paper introduced the first model-based agent to reach human-level Atari performance using a learned world model.
HiFi-GAN turned spectrograms into 22kHz audio about 168 times faster than real time on a GPU at near-human quality.
The 2020 paper that learned to solve whole families of PDEs by parameterizing the solver in Fourier space, far faster than classical solvers.
A neural controller drove the ANYmal robot over mud, snow, rubble, and vegetation it never saw in training, using only proprioception (its own joint and body-motion sensing) rather than cameras.
Facebook's M2M-100 was the first single model to translate directly between any pair of 100 languages without routing through English.
The 2020 Google paper that applied the Transformer directly to image patches, showing ViT can match or beat CNNs given enough pretraining data.
Google's mT5 extended the T5 text-to-text recipe to 101 languages, becoming a widely used open multilingual model.