Ambiguity in NLQ annotations

I am looking through the NLQ annotations and it seems that the queries + ground truths can be quite ambiguous.

One example is video id 3534864b-2289-4aaf-b3ed-10eeeee7acd2, with the query “Where did I put the scooper”.

The ground truth is given as around 1675s. However, the scooper is seen being placed onto the tabletop and subsequently on the weighing scale at around timestamp 1785s. There seem to be multiple windows in the clip that could serve as appropriate responses to the query. How should one go about resolving these conflicts?

Hello @davidyao99 ,

Thanks for your question. In general, the annotation process followed these guidelines:

(a) Natural Language Queries are designed to be issued at the end of the clip. Therefore, the last meaningful temporal window should be selected for non-specific queries like “Where did I put the scooper?”. This aligns well with downstream use cases where users would want to locate the particular item.

(b) If a query results in multiple disjoint temporal windows, annotators have been instructed to add specificity to eliminate the ambiguity. In the above example, the annotator could have specified, “Where did I put the scooper before weighing the flour?”.
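
As a rough illustration of guidelines (a) and (b), here is a minimal Python sketch. It is not part of the actual annotation tooling: the `(start_s, end_s)` window representation, the function names, and the window boundaries in the usage example are all hypothetical, chosen only to bracket the two timestamps mentioned in this thread.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a candidate answer window, in seconds.
Window = Tuple[float, float]  # (start_s, end_s)


def merge_overlapping(windows: List[Window]) -> List[Window]:
    """Merge overlapping or touching windows into disjoint ones, sorted by start."""
    merged: List[Window] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def resolve_query(windows: List[Window]) -> Tuple[Optional[Window], bool]:
    """Apply guidelines (a) and (b) to the candidate windows of one query.

    Returns the last window (guideline a) and a flag that is True when the
    candidates are disjoint, i.e. the query should be made more specific
    (guideline b).
    """
    disjoint = merge_overlapping(windows)
    if not disjoint:
        return None, False
    return disjoint[-1], len(disjoint) > 1


# Illustrative windows around the ~1675s and ~1785s timestamps (made-up bounds).
candidates = [(1670.0, 1680.0), (1783.0, 1790.0)]
selected, needs_specificity = resolve_query(candidates)
print(selected, needs_specificity)
```

Under these assumptions, the later window would be selected for the non-specific query, and the flag signals that the query itself is a candidate for rewording.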

That said, it is possible that a (hopefully small) percentage of annotations are noisy due to human error. Do you have an estimate of how often such ambiguous queries occur?

Hope this answers your question more broadly.

From looking through a few videos and considering only “Where did I put X” prompts, ambiguous queries like this seem to occur around 10% of the time. This might also be due to the sometimes slightly vague nature of the prompt.

Additionally, I realize that there are also issues with the template labelling of the annotations. For example, a “Where did I put X” query might be labelled with a different template. This happens perhaps < 5% of the time, so it's not a big issue.