GPT-4o can respond to audio in as little as 232 milliseconds

The GPT-4o System Card reports that the model “can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.” This low latency was achieved by training a single model end-to-end across text, vision, and audio, so speech did not have to pass through a separate transcription-and-synthesis pipeline before the model could reply.

Last verified June 6, 2026