WebLI is the training dataset behind PaLI, introduced in “PaLI: A Jointly-Scaled Multilingual Language-Image Model,” submitted to arXiv on September 14, 2022 by Xi Chen, Xiao Wang, Soravit Changpinyo, and colleagues at Google. While many vision-language datasets are English-centric, WebLI was built to be broadly multilingual, with image-text pairs in over 100 languages drawn from the public web.
The dataset is web-scale. The paper describes “a new image-text training set containing 10B images and texts in over 100 languages,” which let the authors scale the model and its training data together. Beyond raw alt-text annotations, the pipeline also ran optical character recognition on the images, adding text read directly from pictures, which is useful for tasks that require reading signs, documents, or labels within an image.
WebLI underpinned PaLI’s strong results across captioning, visual question answering, OCR, and other tasks in many languages, illustrating how multilingual web data can produce models that work well beyond English. For a general reader, WebLI is a clear example of the scale and breadth of modern training data, and of how quietly the multilingual web has become the raw material for systems used worldwide.