More detail about how the ground-truth EgoPose annotations were created

Hi, I’m interested in understanding more about how you generated the ground-truth labels for your EgoPose datasets (both 2D keypoints and 3D joint positions).

According to your documentation, the labels were produced via both “human annotated & automatically generated” processes. Could you please explain in more detail:

  • the automatic generation process: Were these labels produced by existing Meta tracking networks? If so, how accurate/robust/reliable are those networks, and is that sufficient to produce ground-truth labels, or only labels that are close to ground truth? What is the margin of error in the ground-truth data you have published?
  • the human annotation process: What exactly did this involve?

We developed a specialized pipeline. You can see the code for it here: Ego4d/ego4d/internal/human_pose at main · facebookresearch/Ego4d · GitHub - the steps are roughly:

The codebase only works on internal versions of the dataset (shared within the consortium). You will likely have to fork it or clean up the code in order to use it yourself.

There are obvious improvements that could be made, but we just didn’t have the time to do so.
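To make the description more concrete, here is a minimal, illustrative sketch of what the automatic stage of a multi-camera pipeline like this typically looks like: a 2D keypoint detector is run on each calibrated camera view, and each joint is then triangulated across the views in which it was detected confidently. This is not the actual Ego4D code; the array shapes, the confidence threshold, and the function names (`triangulate_point`, `lift_frame_to_3d`) are assumptions made for the example.

```python
# Illustrative sketch only, not the actual Ego4D pipeline code.
# Assumes calibrated cameras (known 3x4 projection matrices) and per-camera
# 2D keypoints of shape (num_joints, 3) = (x, y, confidence), e.g. produced
# by an off-the-shelf 2D pose detector.
import numpy as np


def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one joint observed in several cameras.

    proj_mats: list of 3x4 projection matrices.
    points_2d: (num_cams, 2) pixel coordinates of the same joint.
    Returns the 3D point (3,) in world coordinates.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def lift_frame_to_3d(keypoints_2d, proj_mats, min_conf=0.5):
    """Triangulate every joint seen confidently in at least two cameras.

    keypoints_2d: {camera_name: (num_joints, 3) array of (x, y, confidence)}.
    proj_mats:    {camera_name: 3x4 projection matrix}.
    Returns (num_joints, 3) 3D joints, NaN where triangulation was not possible.
    """
    num_joints = next(iter(keypoints_2d.values())).shape[0]
    joints_3d = np.full((num_joints, 3), np.nan)
    for j in range(num_joints):
        cams = [c for c, kp in keypoints_2d.items() if kp[j, 2] >= min_conf]
        if len(cams) < 2:
            continue  # need at least two confident views to triangulate
        joints_3d[j] = triangulate_point(
            [proj_mats[c] for c in cams],
            np.array([keypoints_2d[c][j, :2] for c in cams]),
        )
    return joints_3d
```

How close such automatic labels get to ground truth depends on the 2D detector and the calibration quality, which is why a human correction pass over selected frames is still needed.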

From the automatic annotations, we requested corrections for individual frames selected based on heuristics and diversity (w.r.t. pose, environment, etc.). The annotation tool worked on a per-frame basis and showed all cameras for the corresponding frame. During annotation, the tool had the option to triangulate points automatically (to reduce the time required to perform the annotation task). @suyogjain can provide additional details if needed.
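For intuition about the “triangulate points automatically” assist, one common design is: once the annotator has placed the same joint in at least two camera views, triangulate it and reproject the 3D estimate into the remaining views as a pre-fill that only needs adjusting. A minimal sketch of that idea, reusing the `triangulate_point` helper from the previous snippet and the same calibration assumptions (the function names are ours, not the tool’s):

```python
import numpy as np


def project_point(P, X):
    """Reproject a 3D world point X (3,) into a camera with 3x4 projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def prefill_remaining_views(annotated, proj_mats):
    """Suggest a joint's 2D location in the cameras the annotator has not clicked yet.

    annotated: {camera_name: (x, y)} clicks placed by the annotator so far.
    proj_mats: {camera_name: 3x4 projection matrix} for every camera in the frame.
    Returns {camera_name: (x, y)} proposals for the not-yet-annotated cameras.
    """
    if len(annotated) < 2:
        return {}  # triangulation needs at least two views
    cams = list(annotated)
    # `triangulate_point` is the DLT helper from the previous sketch.
    X = triangulate_point([proj_mats[c] for c in cams],
                          np.array([annotated[c] for c in cams]))
    return {cam: tuple(project_point(P, X))
            for cam, P in proj_mats.items() if cam not in annotated}
```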