Flamingo: a Visual Language Model for Few-Shot Learning

Flamingo, introduced in April 2022 by a DeepMind team with Jean-Baptiste Alayrac as first author, is a family of visual language models built to learn new tasks from only a handful of examples. Rather than training a single giant model from scratch, Flamingo connects a powerful pretrained vision-only model to a pretrained language-only model. Both are kept frozen, and only the new connecting components are trained: a Perceiver Resampler that condenses the vision model's output into a fixed number of visual tokens, and gated cross-attention layers interleaved with the frozen language-model layers.
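The frozen-backbone idea can be made concrete with a small sketch of a tanh-gated bridge between text and visual features. This is an illustration under simplifying assumptions, not the paper's implementation: it uses plain Python lists, a single attention head, and no learned projections, and the names cross_attend and gated_bridge are hypothetical. The point it shows is that when the gate starts at zero, the bridge adds nothing, so the combined model initially behaves exactly like the frozen language model.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_vec: Vector, visual_vecs: List[Vector]) -> Vector:
    """Single-head dot-product attention: one text vector queries the visual vectors.

    Simplification: no learned query/key/value projections are applied.
    """
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, v) / scale for v in visual_vecs])
    dim = len(text_vec)
    return [sum(w * v[i] for w, v in zip(weights, visual_vecs)) for i in range(dim)]


def gated_bridge(text_vec: Vector, visual_vecs: List[Vector], gate: float) -> Vector:
    """Add visual information to a text vector through a tanh gate.

    `gate` stands in for the trainable scalar; the text and visual vectors
    stand in for outputs of the frozen language and vision models.
    """
    attended = cross_attend(text_vec, visual_vecs)
    g = math.tanh(gate)
    return [t + g * a for t, a in zip(text_vec, attended)]


text = [1.0, 0.0]
visual = [[0.0, 2.0], [2.0, 0.0]]

untrained = gated_bridge(text, visual, gate=0.0)  # gate closed: text passes through unchanged
trained = gated_bridge(text, visual, gate=1.0)    # gate open: visual features are mixed in
```

With gate=0.0 the output equals the input text vector, which is why training can start from the language model's existing behavior and open the gate gradually.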

The key design choice is that Flamingo accepts sequences of arbitrarily interleaved images and text, the same shape as a web page. Training on large web corpora of interleaved images and text gives the model in-context few-shot learning: you prompt it with a few image-question-answer examples and it answers a new query in the same format, much as GPT-3 picks up tasks from text-only prompts. A single Flamingo model can do visual question answering, image captioning, and multiple-choice question answering just by being shown examples.
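A few-shot prompt of this kind can be sketched as a single text string with image placeholders plus a parallel list of images. The sketch below is illustrative only: the placeholder strings, the Shot class, the build_prompt helper, and the file names are assumptions made for this example, not a documented Flamingo interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

IMAGE_TAG = "<image>"   # assumed placeholder marking where an image sits in the text
END_OF_CHUNK = "<EOC>"  # assumed placeholder closing one worked example


@dataclass
class Shot:
    image_path: str  # hypothetical file name
    question: str
    answer: str


def build_prompt(
    shots: List[Shot], query_image: str, query_question: str
) -> Tuple[str, List[str]]:
    """Interleave worked examples with a final unanswered query.

    Returns the prompt text and the images in the order their
    placeholders appear in that text.
    """
    parts: List[str] = []
    images: List[str] = []
    for shot in shots:
        images.append(shot.image_path)
        parts.append(
            f"{IMAGE_TAG}Question: {shot.question} Answer: {shot.answer}{END_OF_CHUNK}"
        )
    # The query repeats the pattern but leaves the answer for the model to complete.
    images.append(query_image)
    parts.append(f"{IMAGE_TAG}Question: {query_question} Answer:")
    return "".join(parts), images


shots = [
    Shot("cat.jpg", "What animal is this?", "a cat"),
    Shot("dog.jpg", "What animal is this?", "a dog"),
]
prompt_text, prompt_images = build_prompt(shots, "bird.jpg", "What animal is this?")
```

Switching tasks means changing only the examples: captions instead of question-answer pairs would turn the same model into a captioner, with no retraining.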

The headline result was that, on several vision-language benchmarks, few-shot Flamingo outperformed models that had been fine-tuned on thousands of times more task-specific labeled data, setting a new state of the art without any task-specific fine-tuning. This made it a landmark demonstration that the few-shot, prompt-driven paradigm from language models transfers to the multimodal setting.

Why business readers should care: Flamingo showed that the expensive part of multimodal AI, the vision and language backbones, could be reused frozen, with only a comparatively small set of connecting layers trained on top. That bootstrapping pattern, refined further by BLIP-2 and the LLaVA line, is why capable image-understanding assistants became cheap enough to build on top of existing models rather than from scratch.
