WebVoyager

WebVoyager is both a web-browsing agent and a benchmark for testing such agents. It was introduced in the paper “WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models” by Hongliang He, Wenlin Yao, Kaixin Ma, and colleagues, posted to arXiv on January 25, 2024 and accepted to ACL 2024. Its distinguishing idea is that the agent sees the page the way a person does, through screenshots, and uses a large multimodal model to decide where to click and what to type, rather than reading raw HTML alone.
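The screenshot-driven loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the action strings (`Click [3]`, `Type [7]; text`, `ANSWER; text`), the `Action` class, and the `observe`, `decide`, and `act` callbacks are all assumptions chosen for the example, and the multimodal model and browser are replaced by stubs so the code runs on its own.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str                    # "click", "type", or "answer"
    label: Optional[int] = None  # numeric tag of the on-screen element, if any
    text: Optional[str] = None   # text to type, or the final answer


def parse_action(reply: str) -> Action:
    """Parse one model reply written in the hypothetical format used here."""
    reply = reply.strip()
    m = re.fullmatch(r"Click \[(\d+)\]", reply)
    if m:
        return Action("click", label=int(m.group(1)))
    m = re.fullmatch(r"Type \[(\d+)\]; (.+)", reply)
    if m:
        return Action("type", label=int(m.group(1)), text=m.group(2))
    m = re.fullmatch(r"ANSWER; (.+)", reply)
    if m:
        return Action("answer", text=m.group(1))
    raise ValueError(f"unrecognized action: {reply!r}")


def run_episode(
    task: str,
    observe: Callable[[], bytes],                       # returns a screenshot
    decide: Callable[[str, bytes, List[Action]], str],  # stands in for the model
    act: Callable[[Action], None],                      # performs the action
    max_steps: int = 15,
) -> Optional[str]:
    """Observe, decide, act; stop when the model answers or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = observe()
        action = parse_action(decide(task, screenshot, history))
        history.append(action)
        if action.kind == "answer":
            return action.text
        act(action)
    return None


def demo():
    """Run the loop with a scripted 'model' and a browser that only records."""
    scripted = iter(["Click [3]", "Type [7]; wireless mouse", "ANSWER; $19.99"])
    performed: List[Action] = []
    result = run_episode(
        task="Find the price of a wireless mouse",
        observe=lambda: b"fake-png-bytes",
        decide=lambda task, shot, hist: next(scripted),
        act=performed.append,
    )
    return result, performed
```

In a real agent, `observe` would capture a browser screenshot, `decide` would send the task, the screenshot, and the action history to a multimodal model, and `act` would carry out the click or keystrokes in the browser. The step cap keeps an episode from looping forever when the model never produces an answer.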

The associated benchmark compiles real-world tasks from 15 popular live websites, so the agent must cope with the actual rendered web rather than a simplified sandbox. The authors also proposed an automatic evaluation method in which GPT-4V judges from the agent’s screenshots whether a task was completed, reducing the need for slow human grading. Their WebVoyager agent reached a 59.1 percent task success rate on this benchmark, outperforming both a text-only variant of the same agent and GPT-4 (All Tools).
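A task success rate like the one reported above is the share of tasks that the judge marks as completed. The sketch below shows that arithmetic, overall and per website, assuming the judge's verdicts are already available as true/false values; the function name, the input shape, and the site names are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def success_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Turn (website, succeeded) pairs into success percentages.

    Returns one entry per website plus an "overall" entry computed
    over all tasks, so sites with more tasks weigh more.
    """
    per_site: Dict[str, List[bool]] = defaultdict(list)
    for site, succeeded in results:
        per_site[site].append(bool(succeeded))
    if not per_site:
        raise ValueError("no results to aggregate")

    rates = {
        site: 100.0 * sum(verdicts) / len(verdicts)
        for site, verdicts in per_site.items()
    }
    all_verdicts = [v for verdicts in per_site.values() for v in verdicts]
    rates["overall"] = 100.0 * sum(all_verdicts) / len(all_verdicts)
    return rates


# Toy verdicts for two made-up sites; not data from the benchmark.
example = [
    ("site_a", True),
    ("site_a", False),
    ("site_b", True),
    ("site_b", True),
]
rates = success_rates(example)
```

With the toy input, `site_a` scores 50.0, `site_b` scores 100.0, and the overall rate is 75.0. The overall figure is computed over tasks rather than by averaging the per-site rates; the two differ whenever sites have different numbers of tasks.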

WebVoyager matters because it pushed web agents toward vision: many sites are hard to operate from HTML alone, and looking at the rendered screen is often more robust. This screenshot-driven approach foreshadowed the computer-use agents that arrived later in 2024. For a business reader, it points toward an assistant that can drive any website by seeing it, just as an employee would, without custom integrations.

Last verified June 7, 2026