Bad Annotations in the Dataset (VQ2D)


There seem to be a lot of bad annotations in the dataset. Can we assume that the VQ2D test set is free of such annotation mistakes?

For example, bad annotations persist in v1_0_5 of the dataset. For “video_uid”: “14a05360-fcc4-4bf6-97d3-2d77bc282c84” and “clip_uid”: “f9c9c2ec-c5fd-46b8-bbb6-49c7dda702af”, the “object_title”: “vacuum cleaner” has “x”: 966.67, “y”: 172.71, “width”: 700.13, “height”: 660.46. If we crop the frame with these annotations using the baseline code, the crop shows only half of the vacuum cleaner, whereas the online visualizer shows it completely visible.
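One thing worth ruling out for a crop that shows only part of the object is a mismatch between the resolution the box was annotated at and the resolution of the decoded frame. The following is a minimal sketch, assuming (x, y) is the top-left corner in pixels. `bbox_visible_fraction` is a hypothetical helper, and both frame sizes are illustrative guesses rather than values from the annotation file.

```python
def bbox_visible_fraction(x, y, width, height, frame_w, frame_h):
    """Fraction of the box area that lies inside a frame_w x frame_h frame.

    Assumes (x, y) is the top-left corner in pixels; this is an assumption,
    not something stated in the thread.
    """
    if width <= 0 or height <= 0:
        raise ValueError("bounding box must have positive width and height")
    # Clamp the box edges to the frame boundaries.
    left = max(x, 0.0)
    top = max(y, 0.0)
    right = min(x + width, float(frame_w))
    bottom = min(y + height, float(frame_h))
    # A box entirely outside the frame has zero visible area.
    visible_w = max(right - left, 0.0)
    visible_h = max(bottom - top, 0.0)
    return (visible_w * visible_h) / (width * height)


# Box values quoted above for the "vacuum cleaner" visual crop.
vacuum = {"x": 966.67, "y": 172.71, "width": 700.13, "height": 660.46}

# Hypothetical frame sizes, used only to illustrate the check.
full = bbox_visible_fraction(**vacuum, frame_w=1920, frame_h=1080)
small = bbox_visible_fraction(**vacuum, frame_w=1280, frame_h=720)
print(f"1920x1080: {full:.2f}, 1280x720: {small:.2f}")
```

If the fraction is well below 1 at the resolution the frames are actually decoded at, the box coordinates probably need rescaling before cropping. If it is 1, the problem lies elsewhere.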

Similarly, for “video_uid”: “3f0bd238-228d-4796-a3e4-820308fb04b0” and “clip_uid”: “6b93fc6d-ed92-42da-886a-0e532e5f66cb”, the visual crop with “object_title”: “microwave” is listed with “frame_number”: 1093 and “video_frame_number”: 14657. The video_frame_number matches the online visualizer, but the frame_number does not make sense to me: since “video_start_frame”: 8099 and “video_end_frame”: 15959, shouldn't the frame_number for this visual crop be 14657 - 8099 = 6558? I am also unsure about the bounding box, since it does not match the online visualizer either. The visualizer shows x: 298.3, y: 40.96, width: 768.69, height: 584.29, but the annotation file says “x”: 593.64, “y”: 679.35, “width”: 404.54, “height”: 180.34.
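The frame arithmetic above can be checked directly. This is a small sketch using only the numbers quoted in this post. The comparison against `frame_number` at the end is an exploratory check, not a statement about how the dataset defines that field.

```python
# Values quoted above for the "microwave" visual crop.
video_start_frame = 8099
video_end_frame = 15959
video_frame_number = 14657
frame_number = 1093

# Offset of the crop frame from the start of the clip, in video frames.
offset = video_frame_number - video_start_frame

# The video-level index should fall inside the clip's frame range.
in_range = video_start_frame <= video_frame_number <= video_end_frame

# Exploratory: how does the offset relate to the annotated frame_number?
quotient, remainder = divmod(offset, frame_number)

print(offset, in_range, quotient, remainder)  # 6558 True 6 0
```

The offset is an exact multiple of the annotated frame_number (6558 = 6 × 1093). That would be consistent with frame_number indexing a clip sampled at a lower frame rate than the source video, but that reading is an assumption and is not confirmed anywhere in this thread.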

I have been seeing many such instances where either the response track is incorrectly labelled (the object is not even visible in those frames), the object meant to serve as the visual crop is not in the frame at all, or the crop contains something other than the object.
Can someone please confirm this? I have been using the baseline code from the EGO4D GitHub repository.

Hey @asjad.s,

Thanks for your post. Can you help us verify these?

  1. I don’t see any vacuum cleaners in vq for 14a05360-fcc4-4bf6-97d3-2d77bc282c84; is this the right video_uid?

  2. Video 3f0bd238-228d-4796-a3e4-820308fb04b0 has two objects labeled ‘microwave’ with visual crops on the same frame. One has the first bbox you mentioned and the other has the second, which should explain the difference.

Can you share any other inconsistencies you’ve found? We’ll take a look.

  1. Apologies, the correct video_uid is 7f09822a-87b9-4eac-bb34-3f1059c704d1.
  2. For 3f0bd238-228d-4796-a3e4-820308fb04b0 you are correct regarding the bounding boxes (I had missed the other instance of the microwave). However, the frame_number 1093 in test_val.json gives us a crop of the user’s hand and not the other microwave.

Thank you so much for your help. I guess we can modify our code to simply use video_frame_number instead if there are issues with frame_number.
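A minimal sketch of that workaround, assuming the visual crop is available as a dict with the field names quoted in this thread. `crop_frame_index` is a hypothetical helper, not part of the baseline code, and treating video_end_frame as inclusive is also an assumption.

```python
def crop_frame_index(visual_crop, video_start_frame, video_end_frame):
    """Return the full-video frame index for a visual crop.

    Uses video_frame_number and ignores frame_number. The dict layout and
    the inclusive end bound are assumptions made for this sketch.
    """
    index = int(visual_crop["video_frame_number"])
    # Reject indices outside the clip's frame range in the source video.
    if not video_start_frame <= index <= video_end_frame:
        raise ValueError(
            f"video_frame_number {index} is outside "
            f"[{video_start_frame}, {video_end_frame}]"
        )
    return index


# Values quoted earlier in the thread for the "microwave" visual crop.
microwave_crop = {"frame_number": 1093, "video_frame_number": 14657}
index = crop_frame_index(microwave_crop, 8099, 15959)
print(index)  # 14657
```

The returned index is meant for decoding from the full video rather than from the extracted clip, so the frame-reading code has to open the source video accordingly.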