Vision-Language Model

A vision-language model (VLM) is an AI system that takes images and text together and reasons across both: captioning a photo, answering a question about a chart, reading a screenshot, or following an instruction that refers to something in a picture. VLMs bridge computer vision, which traditionally classified or detected objects, and language models, which manipulate text.

Most modern VLMs share a common shape: a vision encoder turns an image into features, a projection or adapter maps those features into the language model’s input space, and a large language model does the reasoning and produces text. The big practical insight, established by Flamingo in 2022 and refined by BLIP-2 in 2023, is that you do not need to train this from scratch. You can keep a pretrained vision encoder and a pretrained LLM frozen and train only the small connector between them: Flamingo’s cross-attention layers, BLIP-2’s Q-Former, or LLaVA’s simple projection (LLaVA also fine-tunes the LLM during its instruction-tuning stage). This makes building a capable image-understanding assistant far cheaper than training both backbones from scratch.
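
The encoder, connector, and LLM pipeline can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any model's actual implementation: the sizes, the random weights, and the names encode_image, project, and build_llm_input are all hypothetical, and the connector is assumed to be a single linear projection in the LLaVA style. The point is only to show which part is trainable and how image features end up as extra input tokens for the language model.

```python
import random

random.seed(0)

def random_matrix(rows, cols):
    """Toy stand-in for a weight matrix or a batch of embeddings."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical toy sizes, far smaller than any real model.
NUM_PATCHES, VISION_DIM, LLM_DIM = 4, 8, 16

# Frozen: stands in for the weights of a pretrained vision encoder.
vision_weights = random_matrix(VISION_DIM, VISION_DIM)
# Trainable: the connector, assumed here to be one linear projection.
projector_weights = random_matrix(VISION_DIM, LLM_DIM)

def encode_image(patches):
    """Frozen vision encoder: patches -> features (NUM_PATCHES x VISION_DIM)."""
    return matmul(patches, vision_weights)

def project(features):
    """Trainable connector: features -> LLM input space (NUM_PATCHES x LLM_DIM)."""
    return matmul(features, projector_weights)

def build_llm_input(patches, text_embeddings):
    """Place projected image tokens in front of the text token embeddings."""
    return project(encode_image(patches)) + text_embeddings

patches = random_matrix(NUM_PATCHES, VISION_DIM)   # toy image patches
text_embeddings = random_matrix(3, LLM_DIM)        # toy embeddings for 3 text tokens
llm_input = build_llm_input(patches, text_embeddings)

# Only the connector would receive gradient updates in this setup.
trainable_params = VISION_DIM * LLM_DIM
frozen_vision_params = VISION_DIM * VISION_DIM
print(len(llm_input), len(llm_input[0]))  # 7 tokens, each of width LLM_DIM
```

In this sketch the frozen LLM would consume llm_input as an ordinary token sequence. In a real system the frozen components hold vastly more parameters than the connector, which is where the cost saving comes from.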

A related but distinct lineage consists of contrastive models such as CLIP and ALIGN, which learn a shared embedding space for images and text rather than generating language. These models supply the vision encoders and the zero-shot recognition that many generative VLMs build on.
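
Zero-shot recognition in a shared embedding space amounts to comparing an image embedding against the embeddings of candidate text labels and picking the closest. The sketch below assumes cosine similarity as the comparison and uses made-up three-dimensional vectors in place of real model outputs; zero_shot_classify is a hypothetical helper, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Made-up embeddings; a real contrastive model would produce these
# from its text encoder and image encoder respectively.
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

No label-specific training happens here: adding a new category only requires embedding a new text prompt, which is what makes the approach zero-shot.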

Why business readers should care: VLMs are what let a chatbot accept a photo and reason about it, whether that means reading documents and receipts, describing images for accessibility, checking visual quality on a production line, or guiding a user through a screenshot. Because they can reuse pretrained backbones instead of training everything from scratch, they became practical to build and deploy quickly, and image input is now a standard feature of frontier assistants.