WebVoyager is both a web-browsing agent and a benchmark for testing such agents. It was introduced in the paper “WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models” by Hongliang He, Wenlin Yao, Kaixin Ma, and colleagues, posted to arXiv on January 25, 2024, and accepted to ACL 2024. Its distinguishing idea is that the agent sees the page the way a person does, through screenshots, and uses a large multimodal model to decide where to click and what to type, rather than relying on raw HTML alone.
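The observe-then-act cycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Action` type, the `Browser` interface, the `policy` callable (a stand-in for a call to a large multimodal model), and the `max_steps` default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol


@dataclass
class Action:
    kind: str          # "click", "type", or "answer" (hypothetical action set)
    target: int = -1   # label of the on-screen element to act on, if any
    text: str = ""     # text to type, or the final answer


class Browser(Protocol):
    """Hypothetical browser interface; a real one would wrap a browser driver."""

    def screenshot(self) -> bytes: ...
    def click(self, target: int) -> None: ...
    def type_text(self, target: int, text: str) -> None: ...


# Stand-in for the multimodal model: given the task, the current screenshot,
# and the actions taken so far, it returns the next action.
Policy = Callable[[str, bytes, List[Action]], Action]


def run_episode(task: str, browser: Browser, policy: Policy,
                max_steps: int = 15) -> Optional[str]:
    """Loop: look at the page, ask the model what to do, then do it."""
    history: List[Action] = []
    for _ in range(max_steps):
        image = browser.screenshot()            # observe the rendered page
        action = policy(task, image, history)   # model picks the next step
        history.append(action)
        if action.kind == "answer":
            return action.text                  # the agent declares it is done
        if action.kind == "click":
            browser.click(action.target)
        elif action.kind == "type":
            browser.type_text(action.target, action.text)
        else:
            raise ValueError(f"unknown action kind: {action.kind!r}")
    return None                                 # step budget exhausted
```

The point of the sketch is that the model's only view of the site is the screenshot passed in at each step. Nothing in the loop depends on a site-specific API, which is why the approach needs no custom integration per website.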
The associated benchmark compiles real-world tasks from 15 popular live websites, so the agent must cope with the web as it is actually rendered rather than a simplified sandbox. The authors also proposed an automatic evaluation method that uses GPT-4V to judge from the agent’s screenshots whether a task was completed, reducing the need for slow human grading. The WebVoyager agent reached a 59.1 percent task success rate, outperforming both a text-only variant of WebVoyager and GPT-4 (All Tools).
WebVoyager matters because it pushed web agents toward vision: many sites are hard to operate from HTML alone, and looking at the rendered screen is often more robust. This screenshot-driven approach foreshadowed the computer-use agents that arrived later in 2024. For a business reader, it points to the goal of an assistant that can drive any website by seeing it, just as an employee would, without custom integrations.