BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Encoders and LLMs

BLIP-2, published in January 2023 by Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi at Salesforce Research, made vision-language pretraining dramatically cheaper by reusing off-the-shelf models. Rather than training a large multimodal model end to end, it keeps both a pretrained image encoder and a pretrained large language model frozen, and learns only a small module that connects them.

That connector is the Querying Transformer, or Q-Former: a lightweight transformer with a set of learnable query vectors that extract the most language-relevant visual features from the frozen image encoder and hand them to the frozen LLM. It is trained in two stages: first to align visual features with text, then to feed those features into the language model so it can generate grounded text. Because only the Q-Former is trained, the training cost is small compared with training either backbone.
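The query mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head with no learned key/value projections, omits the self-attention and feed-forward layers a real transformer block would have, and replaces the frozen image encoder with random features. All names (`cross_attend`, `project`, `visual_prompt`) and the toy sizes are assumptions chosen for illustration. What it shows is the key property of the design: a fixed set of queries turns any number of image features into a fixed-length sequence sized for the language model.

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random matrix standing in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, image_feats):
    """Each query takes a softmax-weighted average of the image features."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim)
                  for f in image_feats]
        weights = softmax(scores)
        out.append([sum(w * f[d] for w, f in zip(weights, image_feats))
                    for d in range(dim)])
    return out

def project(vectors, weight):
    """Linear map from the query width to the LLM embedding width."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weight]
            for v in vectors]

# Toy sizes (hypothetical, far smaller than any real model).
NUM_QUERIES, DIM, LLM_DIM = 4, 8, 16

# The only "trainable" pieces in this sketch: the queries and the projection.
queries = rand_matrix(NUM_QUERIES, DIM)
proj_weight = rand_matrix(LLM_DIM, DIM)

def frozen_image_encoder(num_patches):
    """Stand-in for a frozen encoder: one feature vector per image patch."""
    return rand_matrix(num_patches, DIM)

def visual_prompt(num_patches):
    """Image features -> fixed-length sequence shaped for the frozen LLM."""
    feats = frozen_image_encoder(num_patches)
    return project(cross_attend(queries, feats), proj_weight)

small = visual_prompt(16)    # image represented by 16 patch features
large = visual_prompt(256)   # image represented by 256 patch features
print(len(small), len(small[0]))
print(len(large), len(large[0]))
```

Both calls yield a 4 x 16 result even though one image has sixteen times as many patch features. In this sketch the language model would receive the same short visual sequence either way, which is why the connector can stay small while the backbones around it stay frozen.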

The efficiency result was striking: BLIP-2 outperformed DeepMind’s 80-billion-parameter Flamingo by 8.7 percentage points on zero-shot VQAv2 while using 54 times fewer trainable parameters, and it set state-of-the-art results on a range of vision-language tasks.

Why business readers should care: BLIP-2 generalized the frozen-backbone idea into a reusable bridge. Swap in a better image encoder or a better LLM, and only the small connector needs retraining. That modularity is why capable image-understanding models could be assembled quickly and cheaply from existing components, and the pattern shows up across the open multimodal ecosystem.
