OSWorld

OSWorld, full title “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments,” was posted to arXiv on April 11, 2024 by a team led by Tianbao Xie and Tao Yu, with collaborators including authors from the University of Hong Kong and Salesforce Research. It provides a real, scalable computer environment in which a multimodal agent must operate an actual desktop the way a person would.

The benchmark comprises 369 computer tasks that run on real Ubuntu, Windows, and macOS systems, involving genuine web and desktop applications, file input and output, and workflows that span multiple programs. Crucially, tasks are graded by execution - the environment checks whether the desired end state was actually achieved - rather than by matching a fixed answer string, which is what makes open-ended computer use measurable. The results showed a wide gap: humans completed over 72 percent of the tasks while the best model managed only about 12 percent, with the main failures traced to weak GUI grounding (knowing where to click) and missing operational knowledge of how applications behave.

OSWorld became a standard reference for the computer-use line of agents - systems that drive a screen with mouse and keyboard rather than calling clean APIs. Its execution-based grading and real operating systems made it a demanding and credible test as vendors began shipping agents that claim to operate a computer on a user’s behalf.

Sources

Related