Ego4D

“Ego4D: Around the World in 3,000 Hours of Egocentric Video,” posted to arXiv on October 13, 2021 by Kristen Grauman and 84 co-authors across many institutions, introduced a first-person video dataset far larger than earlier egocentric collections. Egocentric video is footage captured by a wearable camera at roughly eye level. It shows the viewpoint a person or a robot actually has, which differs sharply from the curated third-person clips that dominated earlier vision datasets.

The collection contains 3,670 hours of daily-life activity recorded by 931 camera wearers across 74 locations in 9 countries, spanning household chores, outdoor activity, work, and leisure. Portions of the footage carry additional signals: audio, 3D meshes of the environment, eye-gaze tracking, stereo video, and synchronized multi-camera capture of shared events. Ego4D also defined benchmark tasks organized around time: episodic memory for querying the past, hand-object interaction and social understanding for the present, and forecasting for future activity. The authors emphasized consent and de-identification procedures given the sensitive nature of first-person footage.

Ego4D mattered because progress in vision had been limited by the third-person perspective of its training data, while machines that must help people or navigate the world need to understand the messy first-person stream. For a general reader, the takeaway is that Ego4D serves as foundational infrastructure for the wave of work on AR glasses, household robots, and assistants that perceive the world the way their users do.

Sources

Last verified June 7, 2026