Visual Instruction Tuning (LLaVA)

LLaVA (Large Language and Vision Assistant), introduced by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee in April 2023, is an end-to-end trained multimodal model that connects a vision encoder and a large language model for general-purpose visual and language understanding. It became one of the most widely copied open recipes for turning a text LLM into a model that can see.

The paper’s central idea is visual instruction tuning. Instruction tuning had made text-only models follow natural commands; LLaVA extended it to images. The trick was to prompt the language-only GPT-4 with textual image annotations (captions and bounding boxes) and have it write multimodal instruction-following data: conversations, detailed descriptions, and complex reasoning questions. The combined vision-plus-language model was then fine-tuned on that synthetic data. Architecturally, LLaVA pairs a CLIP visual encoder with a language model through a simple projection layer, a deliberately lightweight bridge.
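To make the "lightweight bridge" concrete, here is a minimal sketch of a linear projection that maps visual patch features into a language model's embedding space, in plain Python. It is not the authors' code. The dimensions, the random initialisation, and the function names (`make_projection`, `project`, `build_input_sequence`) are hypothetical, and placing the visual tokens before the text tokens is a simplifying assumption.

```python
import random

def make_projection(d_vision, d_model, seed=0):
    """Return a randomly initialised weight matrix (d_vision x d_model) and a zero bias."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_vision)]
    bias = [0.0] * d_model
    return weight, bias

def project(patch_features, weight, bias):
    """Map each visual patch feature into the language model's embedding space."""
    projected = []
    for feat in patch_features:
        row = [
            sum(f * weight[i][j] for i, f in enumerate(feat)) + bias[j]
            for j in range(len(bias))
        ]
        projected.append(row)
    return projected

def build_input_sequence(visual_tokens, text_embeddings):
    """Concatenate projected visual tokens with text token embeddings (assumed order)."""
    return visual_tokens + text_embeddings

# Toy sizes, chosen only for illustration; real encoders use far larger dimensions.
D_VISION, D_MODEL, NUM_PATCHES, NUM_TEXT = 8, 4, 6, 3
rng = random.Random(1)
# Stand-ins for the vision encoder's patch features and the LM's text embeddings.
patch_features = [[rng.random() for _ in range(D_VISION)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.random() for _ in range(D_MODEL)] for _ in range(NUM_TEXT)]

weight, bias = make_projection(D_VISION, D_MODEL)
visual_tokens = project(patch_features, weight, bias)
sequence = build_input_sequence(visual_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))  # 9 tokens, each of width 4
```

The point of the design is that only this small mapping has to be learned from scratch: the vision encoder and the language model both arrive pretrained, and the language model sees the projected image features as if they were ordinary token embeddings.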

On a synthetic multimodal instruction-following dataset, LLaVA achieved a relative score of 85.1 percent compared with GPT-4. When fine-tuned on Science QA, the combination of LLaVA and GPT-4 reached 92.53 percent accuracy, a new state of the art at the time. The work was accepted as a NeurIPS 2023 oral.
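A relative score of this kind can be read as a ratio: a judge rates the candidate model's answers and a reference model's answers on the same questions, and the candidate's total is expressed as a percentage of the reference's total. The sketch below shows that arithmetic under stated assumptions. The ratings are made up for illustration, the function name `relative_score` is hypothetical, and the paper's exact aggregation may differ in detail.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Candidate's total rating as a percentage of the reference's total rating."""
    if len(candidate_ratings) != len(reference_ratings):
        raise ValueError("both models must be rated on the same questions")
    if not reference_ratings or sum(reference_ratings) == 0:
        raise ValueError("reference ratings must be non-empty and non-zero")
    return 100.0 * sum(candidate_ratings) / sum(reference_ratings)

# Hypothetical judge ratings on a 1-10 scale for four questions.
candidate = [7, 8, 6, 9]    # model under evaluation
reference = [9, 9, 8, 10]   # reference model's answers to the same questions

print(round(relative_score(candidate, reference), 1))  # 83.3
```

Read this way, a score of 85.1 percent means the model's answers were rated, on aggregate, at about 85 percent of the reference's ratings on that dataset. It is not an accuracy figure.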

Why business readers should care: LLaVA showed that a small team could stand up a competent visual assistant by gluing existing open models together and training on data that a stronger LLM generated. That cheap, reproducible recipe seeded a large ecosystem of open multimodal models and lowered the barrier to building products that reason over images.
