Gaussian Error Linear Units (GELUs)
The 2016 Hendrycks-Gimpel paper introducing the GELU activation function, the smooth nonlinearity used in BERT, GPT, and most modern Transformers.
What the papers actually said - linked to the originals.
Introduced DP-SGD, a way to train neural networks with a provable privacy guarantee by clipping and adding noise to gradients.
node2vec learns node embeddings using biased random walks that flexibly balance local and global graph structure.
Word embeddings built from character n-grams, giving vectors to rare and unseen words and capturing word morphology.
The 2016 Ba-Kiros-Hinton paper introducing layer normalization, the per-example normalization that became standard in Transformers.
A study showing that word embeddings learned from news text encoded gender stereotypes, plus a method to reduce them.
The 2016 back-translation paper showed that translating target-language monolingual text back into the source language yields cheap synthetic training pairs that boost neural translation.
The 2016 Carlini and Wagner paper introducing the C&W attacks, which broke defensive distillation and set the bar for evaluating defenses.
The 2016 paper that connected every layer to every later layer, improving gradient flow and feature reuse while cutting parameter counts.
Kipf and Welling's GCN, a simple and scalable way to run convolution-like layers directly on graph data.
A 2016 paper showing an attacker can copy a paid machine learning model with near-perfect fidelity just by querying its prediction API.
DeepMind's 2016 WaveNet generated raw audio sample by sample, jumping past the old vocoders and reshaping text-to-speech.
Marblestone, Wayne, and Kording argued the brain optimizes cost functions, and that deep learning offers a framework for understanding it.
SRGAN, the 2016 paper that used a GAN and a perceptual loss to hallucinate photo-realistic texture at 4x upscaling.
The paper proving that, in general, a risk score cannot satisfy several natural fairness conditions at once.
A 2016 paper trained a CNN on 54,306 leaf images to identify 26 crop diseases, reaching 99.35% accuracy in the lab.
The paper that introduced equalized odds and equal opportunity, two of the most widely used group-fairness criteria for classifiers.
Grad-CAM, the 2016 Selvaraju et al. paper that uses gradients to highlight which image regions a convolutional network relied on for a prediction.
A 2016 paper showing an attacker can tell whether a specific record was in a model's training set using only black-box query access.
Introduced PATE, a privacy method where many teacher models trained on private data vote, with noise, to teach a public student model.
A 2016 paper showing a single fixed perturbation can fool a vision classifier on most natural images at once, not just one chosen image.
Zoph and Le's 2016 paper that used a reinforcement-learning controller to automatically design neural network architectures.
Lillicrap and colleagues showed that fixed random feedback weights can carry learning signals nearly as well as backpropagation, easing a biological objection.
The 2016 paper that added cardinality - the number of parallel transformation paths - as a new dimension for scaling convolutional networks.
pix2pix used a conditional GAN to turn sketches, maps, and labels into photo-like images with a single general-purpose method.
OpenPose detected the 2D body poses of everyone in an image in real time, using Part Affinity Fields to link joints to people.
A 2016 paper showing that an AI kept uncertain about its true objective has an incentive to let humans switch it off rather than resist.
PointNet was the first network to learn directly on raw 3D point clouds, respecting that points have no inherent order.
The 2016 paper that built a multi-scale feature pyramid inside a network, letting detectors find small and large objects at little extra cost.
Google's 2016 JAMA study trained a CNN on 128,175 retinal photos to detect diabetic retinopathy with over 90 percent sensitivity and specificity.
SampleRNN generated raw audio one sample at a time with stacked recurrent networks, an early rival to WaveNet's approach.
The 2017 Shazeer-led paper that made mixture-of-experts practical, routing each input to a few specialized sub-networks to reach 137 billion parameters.
WGAN reformulated GAN training around the Wasserstein distance, giving more stable training and a meaningful loss curve.
The 2017 PCGML survey defines procedural content generation via machine learning: generating game content with models trained on existing content such as game levels.
The 2017 paper behind FAISS, the open-source library that made nearest-neighbor search over a billion vectors practical by running it on GPUs.
Acemoglu and Restrepo found each additional industrial robot per thousand workers measurably lowered US local employment and wages.
OpenAI's 2017 paper trained a robot vision model on randomized non-realistic simulation and transferred it to reality at 1.5 cm accuracy.
Mask R-CNN added a mask branch to Faster R-CNN, predicting a pixel-level outline for every detected object in one pass.
CycleGAN translated images between domains without paired examples, using a cycle-consistency loss to preserve content.
Gilmer and colleagues unified many graph networks into the Message Passing Neural Network framework for molecular prediction.
Amazon's DeepAR, a recurrent network that learns from many related time series at once and outputs probability distributions, not just point forecasts.
Google's ISCA 2017 paper revealing the first TPU, a custom inference chip running in its datacenters since 2015.
The 2017 Google paper using depthwise separable convolutions to build small, fast vision models that run on phones and embedded devices.
Peters et al. add pretrained bidirectional language-model embeddings to taggers, a direct precursor to ELMo.
The 2017 Lundberg-Lee paper that grounds feature-importance explanations in game theory using Shapley values.
The 2017 paper arguing that adaptive optimizers like Adam can generalize worse than plain SGD despite training faster.
GraphSAGE learns to sample and aggregate neighbor features so embeddings generalize to nodes unseen during training.
The 2017 Google paper that introduced the Transformer architecture, the foundation of virtually all modern large language models.