“Ego4D: Around the World in 3,000 Hours of Egocentric Video,” posted to arXiv on October 13, 2021, by Kristen Grauman and 84 co-authors across many institutions, introduced a first-person video dataset and benchmark suite at a scale not previously available. Egocentric video is footage captured from a wearable camera at roughly eye level. This is the viewpoint a person or a robot actually has, and it differs sharply from the curated third-person clips that dominated earlier vision datasets.
The collection contains 3,670 hours of daily-life activity recorded by 931 camera wearers across 74 locations in 9 countries, spanning household chores, outdoor activity, work, and leisure. Portions include extra signals such as audio, 3D environment meshes, eye-gaze tracking, stereo, and synchronized multi-camera capture of shared events. Ego4D also defined benchmark tasks organized around time: episodic memory about the past, hand-object interaction and social understanding in the present, and forecasting of future activity. The authors emphasized consent and de-identification procedures given the sensitive nature of first-person footage.
Ego4D mattered because progress in vision had been bottlenecked by the perspective of its data: machines that must help people or navigate the world need to understand the messy first-person stream, not just tidy third-person clips. For a general reader, Ego4D is best understood as foundational infrastructure for the wave of work on AR glasses, household robots, and assistants that perceive the world the way their users do.