On May 13, 2024, OpenAI released GPT-4o, where the “o” stands for “omni.” The GPT-4o System Card, published to arXiv on October 25, 2024, describes it as “an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs.” Crucially, “all inputs and outputs are processed by the same neural network,” meaning a single model trained end-to-end across modalities rather than a pipeline of separate systems.
The headline capability was real-time voice interaction. GPT-4o “can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds,” which is close to human conversational response time. On English text and code it matched GPT-4 Turbo's performance while being much faster and 50% cheaper in the API, and the system card notes it is “especially better at vision and audio understanding compared to existing models.”
GPT-4o mattered as the moment native multimodality and low-latency voice became a mainstream product feature rather than a research demo. By folding speech, vision, and text into one model, it pointed toward conversational assistants that listen and see in real time, and set expectations that competitors moved to meet.
(Sourcing note: the System Card on arXiv is the Tier 1 primary source that could be fetched directly. The canonical announcement, openai.com/index/hello-gpt-4o/, is cited, but openai.com blocks automated fetching; its May 13, 2024 date and claims were confirmed through multiple independent references to that page.)