“Segment Anything,” posted to arXiv in April 2023 by Alexander Kirillov and colleagues at Meta AI, applied the foundation-model playbook from language to image segmentation: build one large, general model that can be prompted to do many tasks without retraining. The project delivered a new task, a model, and a dataset together.
The Segment Anything Model, or SAM, is “promptable.” Given an image and a prompt (a click, a box, or a rough mask), it outputs a mask for the indicated object, and it does this zero-shot on images and object categories it was never explicitly trained on. When a prompt is ambiguous, such as a click that could mean either a shirt or the person wearing it, SAM can return several candidate masks. Architecturally it pairs a heavy image encoder (a Vision Transformer) with a lightweight prompt encoder and mask decoder, so the expensive image embedding is computed once per image and then queried cheaply for many prompts. To train it, Meta built SA-1B, the largest segmentation dataset at the time of its release, with more than 1 billion masks across 11 million licensed, privacy-respecting images, generated through a data engine in which the model and human annotators bootstrapped each other.
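The encode-once, prompt-many split described above can be illustrated with a toy sketch. This is not SAM and not the released library's API: the class `ToyPromptableSegmenter`, its methods, and the flood-fill "decoder" are hypothetical stand-ins, chosen only to show where the expensive step and the cheap step sit.

```python
from collections import deque
from typing import List, Optional

Image = List[List[int]]   # toy grayscale image: rows of pixel intensities
Mask = List[List[bool]]   # same shape as the image


class ToyPromptableSegmenter:
    """Hypothetical stand-in for an encode-once, prompt-many segmenter."""

    def __init__(self) -> None:
        self.encoder_calls = 0
        self._features: Optional[Image] = None

    def set_image(self, image: Image) -> None:
        # Stand-in for the heavy image encoder: run once per image, then cached.
        # (Assumption: here the "embedding" is just a copy of the pixels.)
        self.encoder_calls += 1
        self._features = [list(row) for row in image]

    def predict_point(self, x: int, y: int, tol: int = 10) -> Mask:
        # Stand-in for the light prompt encoder and mask decoder: a flood fill
        # from the clicked pixel over the cached features. A real decoder is a
        # learned network; flood fill is used only to keep the sketch runnable.
        if self._features is None:
            raise RuntimeError("call set_image() before prompting")
        feats = self._features
        h, w = len(feats), len(feats[0])
        seed = feats[y][x]
        mask = [[False] * w for _ in range(h)]
        mask[y][x] = True
        queue = deque([(x, y)])
        while queue:
            cx, cy = queue.popleft()
            for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                inside = 0 <= nx < w and 0 <= ny < h
                if inside and not mask[ny][nx] and abs(feats[ny][nx] - seed) <= tol:
                    mask[ny][nx] = True
                    queue.append((nx, ny))
        return mask


image = [
    [0,   0,   0, 200],
    [0, 100, 100, 200],
    [0, 100, 100, 200],
]
segmenter = ToyPromptableSegmenter()
segmenter.set_image(image)                  # expensive step, done once
square = segmenter.predict_point(1, 1)      # cheap query: click on the square
stripe = segmenter.predict_point(3, 0)      # cheap query: click on the stripe
```

Both clicks reuse the same cached features, so `segmenter.encoder_calls` stays at 1 however many prompts follow. That is the property that makes interactive use practical when the encoder dominates the cost.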
SAM showed that segmentation, long a task that required training a separate model for each dataset, could be handled by a single general model whose zero-shot results are often competitive with fully supervised systems. It became a widely used building block for image and video editing, medical imaging, scientific analysis, and robotics, where it serves as a reusable “cut out any object” primitive. The release continued the trend of treating vision problems as something a large pretrained model can solve by prompting rather than by bespoke training.