ImageBind, published in May 2023 by a Meta FAIR team including Rohit Girdhar and Ishan Misra, learned a single joint embedding space spanning six modalities: images, text, audio, depth, thermal, and IMU (motion-sensor) data. Anything from a photo to a sound clip to a depth map could be mapped into one shared space where related items land near each other.
The clever part is the training shortcut. Building a joint space across six modalities would naively require paired data for every combination - audio-with-depth, text-with-thermal, and so on - and most of those pairings do not exist at scale. ImageBind showed that they are unnecessary: image-paired data alone is sufficient to bind the modalities together. Because images naturally co-occur with text, audio, depth, and the rest, training each modality to align with images implicitly aligns the modalities with each other. This is sometimes called emergent alignment.
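For readers who want to see the mechanics, here is a minimal sketch of the kind of contrastive (InfoNCE-style) objective that pulls each image embedding toward its paired embedding from another modality and pushes it away from the non-matching ones in the batch. It is an illustration under stated assumptions, not the authors' code: the function names, the toy two-dimensional vectors, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature):
    """Mean cross-entropy of matching queries[i] to keys[i] against every key."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

def symmetric_loss(image_embs, other_embs, temperature=0.07):
    """Image-to-modality loss plus modality-to-image loss.

    The temperature value is an assumption chosen for illustration.
    """
    images = [normalize(v) for v in image_embs]
    others = [normalize(v) for v in other_embs]
    return (info_nce(images, others, temperature)
            + info_nce(others, images, temperature))

# Toy batch: row i of each list is assumed to come from the same video clip.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]    # audio i sits near image i
audio_shuffled = [[0.1, 0.9], [0.9, 0.1]]   # audio i sits near the wrong image

aligned_loss = symmetric_loss(image_embs, audio_aligned)
shuffled_loss = symmetric_loss(image_embs, audio_shuffled)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The loss is low when each audio vector already sits next to its own image and high when it sits next to someone else's, so minimizing it moves the audio encoder toward the image space. Running the same objective separately for text, depth, thermal, and IMU data, always against images, is what leaves all of them in one shared space.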
The payoff was zero-shot cross-modal behavior the model was never explicitly trained for - retrieving images from sounds, or composing modalities arithmetically by adding their embeddings - plus cross-modal detection and generation, all from one embedding space.
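Once everything lives in one space, retrieval and composition reduce to simple vector operations. The sketch below ranks a small image gallery by cosine similarity to an audio query, then adds an audio embedding to an image embedding and searches again. The three-dimensional vectors and the names (image_gallery, bark_audio, beach_image) are invented for illustration; they are not real ImageBind outputs, and real embeddings have far more dimensions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def retrieve(query, gallery):
    """Return gallery names ordered from most to least similar to the query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)

def compose(a, b):
    """Combine two embeddings by adding their unit vectors and renormalizing."""
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])

# Hypothetical image embeddings; axes loosely stand for "dog", "grass", "beach".
image_gallery = {
    "dog_on_grass": [0.9, 0.1, 0.0],
    "beach": [0.0, 0.2, 0.9],
    "dog_on_beach": [0.7, 0.0, 0.7],
}

bark_audio = [1.0, 0.0, 0.1]    # hypothetical embedding of a barking sound
beach_image = [0.0, 0.1, 1.0]   # hypothetical embedding of a beach photo

# Audio-to-image retrieval: the audio query is compared directly with images.
audio_ranking = retrieve(bark_audio, image_gallery)

# Composition: bark sound + beach photo, then search the same gallery.
combined_ranking = retrieve(compose(bark_audio, beach_image), image_gallery)

print(audio_ranking[0], combined_ranking[0])
```

With these toy numbers the bark alone ranks the dog-on-grass image first, while the bark plus the beach photo ranks the dog-on-beach image first. No audio-to-image search model was built separately; the shared space does the work.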
Why business readers should care: ImageBind pointed at a general recipe for multimodal systems - bind new sensor types to a shared space cheaply by pairing them with images. That matters for robotics, AR devices, and any product that fuses cameras, microphones, and depth or motion sensors without collecting every pairwise dataset.