Ego4D Challenges 2022: A look back

In 2022, we ran the first set of challenges on the Ego4D dataset. Although the dataset was brand new (less than a year old!), we saw strong uptake and healthy competition, leading to significant improvements (up to 300%) over the initial baselines we had published only a few months earlier. We organized workshops at premier international computer vision conferences (CVPR and ECCV), where we concluded the two challenges. This note summarizes the challenges, their outcomes, our learnings, key technical trends, and some of our next steps to carry this momentum into future challenges.

TL;DR:

  • Ego4D, released publicly in February 2022, is currently the largest egocentric video dataset. It was collected across 74 locations worldwide in 9 countries and contains over 3,670 hours of daily-life activity video. It also introduced a number of benchmark tasks designed to measure progress in egocentric video understanding.
  • We ran the first set of Ego4D challenges in 2022, on 16 different tasks, encompassing all 5 benchmarks in the dataset. They were run in 2 stages: a “teaser” challenge on 6 tasks at CVPR (concluded in June 2022), and a full challenge on all 16 tasks at ECCV (concluded in September 2022). Each challenge had a cash prize of $6K, with starter code and baselines available on GitHub.
  • The ECCV challenge received over 46 submissions from 18 teams, increases of 170% and 50% respectively over the teaser challenge run at CVPR.
  • State-of-the-art on all tasks improved considerably, with the largest improvements on the Episodic Memory (Natural Language Queries and Moment Queries) and Hands and Objects (Object State-change Classification) tasks. These advances could improve existing AR applications and enable new ones. For instance, Episodic Memory tasks focus on answering questions such as “where did I last see my keys?”, whereas Hands and Objects tasks focus on understanding and helping with the actions the camera wearer is performing, for example cooking or building furniture.
  • We organized workshops at CVPR and ECCV to share results of the challenges, as well as to bring together eminent speakers and researchers from related areas to share their learnings and experience with egocentric computer vision and Ego4D. The list of winners, along with their reports and code has been published on the CVPR and ECCV workshop websites.
  • Interesting technical trends emerged from multiple winning solutions, including large-scale vision-language pretraining (especially on egocentric videos from Ego4D), self-supervised learning using masked autoencoding, and adoption of transformer-based architectures.
  • The challenges have been re-opened for a future challenge. A subset of the challenges will also feature a never-before-seen test set! Stay tuned to the Ego4D Forum for updates.
  • Running a challenge of this scale was a considerable undertaking and we benefited from the contributions of 16 collaborators from the Ego4D consortium, representing eight different universities and Meta (Reality Labs and FAIR). Many thanks to all organizers and POCs for making these challenges a success!

The Challenges

The challenges encompassed all 5 benchmarks in Ego4D. All challenges were hosted on EvalAI and came with starter code to reproduce the baselines, available on GitHub. Each challenge also had a cash prize for the top-3 winners, amounting to $3K, $2K and $1K respectively. The CVPR challenge ran from March 1, 2022 to June 1, 2022, while the ECCV challenge ran from July 1, 2022 to September 18, 2022. We briefly describe the challenges next; please see the linked EvalAI pages or the paper for more details on the tasks.

Benchmark 1: Episodic Memory

Episodic memory challenges included the following tasks: Visual Queries 2D (VQ; POC: Santhosh Kumar Ramakrishnan, University of Texas, Austin), Visual Queries 3D (VQ3D; POC: Vincent Cartillier, Georgia Tech), Natural Language Queries (NLQ; POC: Satwik Kottur, FAIR), and Moment Queries (MQ; POC: Chen Zhao, KAUST). These tasks require reasoning about the past experience of the camera wearer and answering questions such as when they saw a certain object or observed a certain action/moment, as well as other general natural language questions about the video.

Benchmark 2: Hands + Objects

Hands+Objects challenges focused on understanding human-object interactions through the lens of understanding the objects, tools, hands, and their interactions. A crucial element of each hand-object interaction is the “Point of No Return” (PNR), the time point where the state change is considered to be inevitable. The challenge tasks include: PNR temporal localization (PNR; POC: Yifei Huang, University of Tokyo), Object State-change Classification (OSCC; POC: Siddhant Bansal, IIIT-Hyderabad) and State-change Object Detection (SCOD; POC: Qichen Fu, CMU). These tasks require identifying whether a PNR exists in a clip, localizing where in the clip it occurs, and detecting objects in the PNR frame.

Benchmark 3: Audio-Visual (AV)

The audio-visual understanding benchmark focuses on understanding human-human interactions using the audio-visual information in videos, for example by detecting and tracking speakers and transcribing speech. It included the following tasks: AV Localization (AVLoc; POC: Hao Jiang, Meta Reality Labs), Audio-only Diarization (ADiar; POC: Jachym Kolar, Meta Reality Labs), AV Diarization (AVDiar; POC: Hao Jiang, Meta Reality Labs), and AV Transcription (AVTrans; POC: Leda Sari, Meta Reality Labs).

Benchmark 4: Social

The social benchmark is related to the audio-visual benchmark, and focuses on using audio-visual cues to detect social interactions such as who is talking to whom, and who is looking at whom. The tasks involved exactly that: Looking at me (LAM; POC: Eric Zhongcong Xu, National University of Singapore) and Talking to me (TTM; POC: Yunyi Zhu, National University of Singapore).

Benchmark 5: Forecasting

Finally, the forecasting benchmark evaluates a model’s ability to predict what the camera wearer may do in the future. This is evaluated using tasks such as Future Hand Prediction (FHP; POC: Wenqi Jia, Georgia Tech), Short-term Anticipation (STA; POC: Antonino Furnari, University of Catania) and Long-term Anticipation (LTA; POC: Tushar Nagarajan, University of Texas, Austin).

Participation

Even though the data was released only a couple of months before the CVPR challenge, and about 6 months before the ECCV challenge, both challenges received strong participation. Interestingly, once we opened all challenges at ECCV, the number of final submissions increased much more than the number of teams. This suggests that quite a few teams participated in and won multiple challenges, leveraging the synergy between the different tasks. In fact, 7 out of the 18 teams at ECCV placed top-3 in at least 2 challenges each.

With regard to the different challenges and tasks, we found that the Episodic Memory (EM) and Hands and Objects (HO) challenges received the largest number of submissions (shown here for ECCV). This could be because participants were able to repurpose existing video question-answering and action recognition techniques for EM and HO tasks. For the relatively more novel problem settings in AV+Social and Forecasting, we expect participation to grow in future iterations as new techniques are developed to tackle those tasks.

Interestingly, a large portion of our participants came from universities. In this pie chart, we show the relative size of teams, and color them blue if they mostly (>50%) consist of researchers affiliated with universities. Moreover, many of these university teams performed exceedingly well in the challenges, obtaining top-3 positions on multiple tasks. This shows that Ego4D is not limited to industrial labs with significant compute resources, and in fact can be used by researchers everywhere.

Just like the Ego4D data, our participants spanned the globe! Here is just a subset of the institutions the participants came from, in the ECCV challenges:

In terms of performance on benchmarks, we saw a trend similar to the one for participation: the largest gains came on the Episodic Memory and Hands and Objects tasks. Interestingly, the gains were smaller on tasks that had already been run at CVPR, suggesting that the second instance of the challenge had become much more competitive and exciting, pushing participants to innovate to beat the previous winners.
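As a reminder of how relative gains like the ones quoted in this post are typically computed, here is a minimal sketch. The function name and the scores are hypothetical and chosen only for illustration; they are not actual challenge results.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Percentage improvement of `new` over `baseline` for a higher-is-better metric."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Hypothetical scores, for illustration only (not real leaderboard numbers).
print(relative_gain(10.0, 40.0))  # a 300% relative gain means 4x the baseline
print(relative_gain(20.0, 50.0))  # a 150% relative gain means 2.5x the baseline
```

Note that a relative gain of 300% corresponds to four times the baseline score, not three times.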

The Winners

A list of our winners, with their team details, validation reports and code, has been posted to our challenge pages for CVPR and ECCV. A huge congratulations to the 12 and 30 winning teams from the CVPR and ECCV challenges respectively! Most notably, the teams “VideoIntern” and “Red Panda@IMAGINE” won 5 and 3 challenges respectively at ECCV. Here is a list of teams that won multiple challenges at ECCV, and deserve a special mention:

Here is a picture from the ECCV workshop congratulating some of the winners who attended the workshop in person.

Key Technical Trends

While many novel approaches were proposed by different teams, as described in their challenge reports shared above, a few techniques appeared in multiple submissions:

Trend 1: Egocentric vision-language pretraining

In line with the recent growth in learning representations from web-scale vision-text data, pretraining on Ego4D’s videos paired with narrations emerged as a strong approach in both the CVPR and ECCV challenges. Most notably, the EgoVLP model won multiple challenges at CVPR, and was the basis of many submissions at ECCV. Other teams also leveraged vision-language pretraining, such as the NLQ winner at the CVPR challenge from UTS, as well as one of the LTA winners, which used third-person vision-language pretraining (CLIP).


(Figure from the EgoVLP paper)
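To give a feel for what pretraining on video-narration pairs optimizes, here is a minimal sketch of a standard symmetric InfoNCE contrastive loss in plain Python. This is a generic illustration and not the objective of EgoVLP or any winning team (their papers describe the exact formulations); the function names, the toy embeddings and the temperature of 0.07 are all assumptions made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diagonal(sim):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]
    return total / len(sim)


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE over a batch of paired embeddings.

    Pair i (video i, narration i) is the positive; all other
    combinations in the batch act as negatives.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    sim = [[dot(vi, tj) / temperature for tj in t] for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))


# Toy 2-D embeddings (hypothetical): matched pairs versus swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_info_nce(videos, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

The loss is low when each clip embedding is closest to its own narration and high when the pairs are mismatched, which is what pushes the two modalities into a shared embedding space.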

Trend 2: Strong visual backbones, use of Transformer-based architectures, and joint image-video modeling

Another popular approach used by multiple teams was leveraging modern transformer-based architectures, typically with large-scale pretraining. One such model was the Swin Transformer, which appeared in many submissions, including the OSCC and PNR winning submission at CVPR and the SCOD winning submission at ECCV, which outperformed prior work by a relative gain of more than 150%. Similarly, ViT-based models also appeared for videos, including TimeSformer for OSCC at CVPR and Video ViTs at ECCV. Another trend was the use of features trained for both image and video recognition. For instance, the SViT approach that won the PNR task at CVPR trained a joint model on images and videos. Furthermore, features from the Omnivore model (which is trained jointly on images and videos) appeared in multiple submissions, including the NLQ winner at CVPR and the second-place winners for NLQ and MQ at ECCV.

Trend 3: Self-supervised learning for the win

Finally, we saw self-supervised learning emerge as a powerful technique for video understanding tasks. Most notable was the quick adoption of Masked Autoencoder (MAE) style pretraining on videos. This technique was particularly popular at the ECCV challenge (understandably, as video extensions of MAE came out in early 2022), and yielded strong gains on many tasks. Examples include VideoIntern’s 5-challenge winning solution and the 2nd place winner for PNR and OSCC. Text-level masked training was also effective in the winning submission to the transcription challenge.
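The core idea behind MAE-style pretraining is to hide a large random fraction of the input tokens, encode only the visible ones, and train the model to reconstruct the hidden ones. Below is a minimal sketch of just the masking step in plain Python. The function name, the 8 x 14 x 14 token grid and the 0.9 mask ratio are illustrative assumptions, not the settings of any particular submission.

```python
import random


def random_token_mask(num_tokens, mask_ratio=0.9, seed=None):
    """Randomly split token indices into (visible, masked) lists.

    In MAE-style training the encoder would only process the visible
    tokens, and a decoder would be trained to reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Hypothetical clip: 8 temporal slices, each split into a 14 x 14 patch grid.
num_tokens = 8 * 14 * 14
visible, masked = random_token_mask(num_tokens, mask_ratio=0.9, seed=0)
print(len(visible), len(masked))
```

Because the encoder only sees the small visible subset, high mask ratios also make pretraining considerably cheaper per clip, which is part of the appeal for video.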

What’s next?

2022 was a great year for Ego4D and the egocentric vision community. Since Ego4D’s release in February, nearly 800 institutions and individuals from around the world have started using the dataset. It has pushed the state-of-the-art in a number of fields, including Computer Vision, Audio-Visual/Multimodal modeling, and even Robotics. Nearly $100K in prize money was distributed to researchers from around the world, to reward their efforts in this challenge and to encourage them to continue pushing the boundaries of egocentric computer vision.

Going forward, we have already re-opened all challenges for submissions toward future editions. We are also planning to introduce new, never-before-seen test sets in future challenges, to make them even more exciting! Stay tuned to the Ego4D Forum for updates about our plans for next year.

Intrigued? Join us next year!

You don’t need industrial-scale computing infrastructure to participate in these challenges! While the full Ego4D data is large, the different challenges only use a subset of the full data. And even for those subsets, pre-extracted features are available here. If you are interested in egocentric vision and want to help us push the state-of-the-art, please join our challenges next year!

Acknowledgements

Running these challenges was a huge undertaking, and would not have been possible without the hard work and support of many people and institutions. We note a few below:

  • The participants! Without their hard work in taking on this new and challenging dataset to push the results on many novel tasks, we would not have had such a successful challenge.
  • The challenge POCs noted above! They worked tirelessly to set up and maintain the challenge portals, responded to issues, fixed bugs, analyzed reports and finalized the winners. Performing these duties in parallel to working on their own research was no small feat and we are super grateful to all our POCs for making these challenges a success!
  • The Ego4D engineering team, especially Gene Byrne, Devansh Kukreja and Miguel Martin. They helped release and maintain the data and annotations used in the challenges, in addition to maintaining the code, baseline, and challenge repositories.
  • The EvalAI team, and specifically, Rishabh Jain and Ram Ramrakhya, for hosting the Ego4D challenges and providing regular technical support. EvalAI operates with voluntary support and as an open source platform for the community to track progress. We thank them for their work on Ego4D and their contributions to the field generally.
  • CVDF for hosting the Ego4D data and making it available to researchers from around the world.
  • The Ego4D consortium and PIs for helping work through many intricacies of running a large international challenge, as well as giving advice on many technical and non-technical issues.

Finally, we thank you all for being a part of the growing Ego4D community, and for making 2022 a successful year for Ego4D! As the year comes to a close, we wish you all a happy and restful holiday break, and hope to see you re-energized at the Ego4D Challenges 2023 :smiley:

Rohit Girdhar and Andrew Westbury
