VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

Abstract

Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark that requires methods to localize the best-matched moment in the corpus among other partially matched candidates.

To improve dataset construction efficiency and guarantee high-quality annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline that generates captions with RelIable FInE-grained statics and Dynamics. Specifically, we use large language models (LLMs) and large multimodal models (LMMs) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out inaccurate annotations caused by LLM hallucination, we propose a Fine-Granularity Aware Noise Evaluator, in which we fine-tune a video foundation model with contrastive and matching losses augmented by disturbed hard negatives.
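To make the idea of a hard-negative augmented contrastive loss concrete, here is a minimal sketch in plain Python. It is not the evaluator's actual training code: it assumes an InfoNCE-style formulation over cosine similarities, and the function name, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def hard_negative_contrastive_loss(video, positive, hard_negatives,
                                   batch_negatives=(), temperature=0.07):
    """InfoNCE-style loss for one video embedding (assumed formulation).

    The matching caption competes against ordinary in-batch negatives and
    against disturbed hard negatives, i.e. captions that differ from the
    positive only in a fine-grained detail.
    """
    v = normalize(video)
    candidates = [positive] + list(hard_negatives) + list(batch_negatives)
    logits = [dot(v, normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive caption (index 0).
    return log_denominator - logits[0]

# Toy 2-D embeddings: a negative close to the positive is penalized far
# more than an unrelated one.
video = [1.0, 0.0]
positive = [1.0, 0.0]
loss_close = hard_negative_contrastive_loss(video, positive, [[0.9, 0.1]])
loss_far = hard_negative_contrastive_loss(video, positive, [[0.0, 1.0]])
print(loss_close > loss_far)
```

The point of the sketch is that a near-duplicate caption contributes a large term to the denominator, so the model is pushed to separate captions that differ only in fine-grained statics or dynamics.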

With VERIFIED, we construct a more challenging fine-grained VCMR benchmark comprising Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG, all of which demonstrate a high level of annotation quality. We evaluate several state-of-the-art VCMR models on the proposed datasets, revealing that there is still significant room for improvement in fine-grained video understanding for VCMR.
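For readers unfamiliar with how VCMR predictions are typically scored, the sketch below shows recall at K under a temporal IoU threshold. This page does not state the exact evaluation protocol, so the metric choice, the 0.5 threshold, and the function names are assumptions based on common practice in moment retrieval.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_predictions, ground_truth, k=1, iou_threshold=0.5):
    """Return 1 if any of the top-k predictions hits the ground truth.

    ranked_predictions: list of (video_id, start, end), best first.
    ground_truth: a single (video_id, start, end) tuple.
    A hit requires the correct video AND sufficient temporal overlap,
    which is what makes corpus-level retrieval harder than
    single-video grounding.
    """
    gt_video, gt_start, gt_end = ground_truth
    for video_id, start, end in ranked_predictions[:k]:
        overlap = temporal_iou((start, end), (gt_start, gt_end))
        if video_id == gt_video and overlap >= iou_threshold:
            return 1
    return 0

# Hypothetical ranked list for one query; video ids are made up.
predictions = [("vid_b", 0.0, 10.0), ("vid_a", 4.0, 12.0)]
truth = ("vid_a", 5.0, 12.0)
print(recall_at_k(predictions, truth, k=1))  # top-1 is the wrong video
print(recall_at_k(predictions, truth, k=2))  # second candidate matches
```

In a fine-grained setting, partially matched moments from other videos tend to occupy the top ranks, so a check of this kind fails unless the model distinguishes the details in the query.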

Datasets Visualization

The fine-grained static and dynamic content is marked in green and blue, respectively, and inaccurate content is marked in red.

Previous coarse-grained caption: A child is hitting a ball over a net in a gym.

Statics enhanced caption: In a lively gymnasium with a green floor, a child in a green shirt hits a tennis ball over a net, surrounded by scattered tennis balls and with two chairs visible on the right. (Confidence score = 2.42)

Dynamics enhanced caption: Youngster at gym hits single shuttlecock, lifting high, swinging overhand, and sending it over net. (Confidence score = 0.42)

Inaccurate fine-grained caption: A child in gym hits a shuttlecock over a net with a large, overhand swing, following through towards the left. (Confidence score = -3.26)

Previous coarse-grained caption: The person puts down the bag.

Statics enhanced caption: The person, who is a woman wearing a pink hoodie and glasses, puts down the bag. (Confidence score = 2.11)

Dynamics enhanced caption: With a still gaze, the figure in pink attire and glasses sets a logoed bag on a circular table. (Confidence score = 2.14)

Inaccurate fine-grained caption: The person puts down a white bag on the floor. (Confidence score = -2.24)

Previous coarse-grained caption: Cat looks up for the first time.

Statics enhanced caption: The cat, with its mix of brown and white fur and green eyes, looks up for the first time. (Confidence score = 1.79)

Dynamics enhanced caption: Cat lifts head and gazes upwards, then looks back at camera. (Confidence score = 0.38)

Inaccurate fine-grained caption: Tabby cat momentarily lifts head to look at sky before returning focus. (Confidence score = -3.73)
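The confidence scores in the examples above suggest a simple way the evaluator's output could be used to filter annotations. The following is a minimal sketch only: the cutoff of 0.0 and the function name are hypothetical and are not stated on this page; the captions and scores are copied from the cat example.

```python
def filter_captions(scored_captions, threshold=0.0):
    """Keep captions whose confidence score reaches the threshold.

    scored_captions: list of (caption, score) pairs.
    threshold: hypothetical cutoff; the real value is not given here.
    """
    return [caption for caption, score in scored_captions if score >= threshold]

cat_captions = [
    ("The cat, with its mix of brown and white fur and green eyes, "
     "looks up for the first time.", 1.79),
    ("Cat lifts head and gazes upwards, then looks back at camera.", 0.38),
    ("Tabby cat momentarily lifts head to look at sky before "
     "returning focus.", -3.73),
]

kept = filter_captions(cat_captions)
print(len(kept))  # the negatively scored caption is dropped
```

With this assumed cutoff, the statics and dynamics enhanced captions survive and the caption labeled inaccurate is removed, which matches the labels shown in the three scored examples.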

Previous coarse-grained caption: A person is opening a bag.

Our fine-grained caption: A person, wearing a green sweatshirt and glasses, sits on a couch and opens a colorful bag with both hands to reveal a cell phone.

Previous coarse-grained caption: A person is opening a bag.

Our fine-grained caption: Person in kitchen, holding cherry-patterned bag, opens it by pulling drawstring.

Previous coarse-grained caption: A person opens a bag.

Our fine-grained caption: A person, wearing a blue shirt and shorts, opens a plastic bag, revealing a box and a green object.

Previous coarse-grained caption: A dog runs through tubes.

Our fine-grained caption: A dog, encouraged by its handler, runs quickly through a series of red tubes from left to right.

Previous coarse-grained caption: Person throwing things on the floor.

Our fine-grained caption: Sitting figure repeatedly throws single bottle caps to the floor.

Previous coarse-grained caption: The man puts ingredients in a regular glass while showing the bottles of ingredients he is using.

Our fine-grained caption: The bartender displays and pours Bacardi rum, Funkin Sugar Cane, and Funkin Pro Puree into a glass, revealing the labels beforehand.

Previous coarse-grained caption: A woman is seen speaking to the camera while standing in front of a board.

Our fine-grained caption: A woman performs a step aerobics routine in front of a board, gesturing with her hands.

Previous coarse-grained caption: The person pours himself a glass of water.

Our fine-grained caption: The man pours water into a glass and then eats a granola bar.

Previous coarse-grained caption: A group of people meet at the top of a hill.

Our fine-grained caption: A man on a ski lift smiles, begins his descent, and showcases tricks on a snowy slope.

Previous coarse-grained caption: A man throws the frisbee to the dog in the yard who continues chasing after him.

Our fine-grained caption: A man engages in a frisbee game with an energetic dog near a waterfall.

Previous coarse-grained caption: Horizontal railing is not visible.

Our fine-grained caption: Railing concealed by boat's motion and camera's rightward pan amidst flying seagulls.

Previous coarse-grained caption:

Lights streak upwards before going towards left.

Dynamics oriented questions:

In what direction do the lights move after initially streaking upwards?
What is the speed and trajectory of the lights as they move upwards?
Is there any pattern or rhythm to the lights' movement, such as regular intervals or synchronization?
Do the lights emit any sounds or produce any other effects as they move?
Are there any other objects or elements in the video that interact with the moving lights?

Dynamics oriented answers:

After initially streaking upwards, the lights move towards the left.
The lights move upwards with a moderate speed and a slightly curved trajectory.
There doesn't appear to be a clear pattern or rhythm to the lights' movement.
The video doesn't provide audio, so it's unclear if the lights emit any sounds.
There are no other visible objects or elements in the video that interact with the moving lights.

Dynamics oriented descriptions:

The video starts with a dark background and a stream of lights originating from the bottom center of the frame. The lights ascend upwards in a somewhat curved path, resembling a firework's trajectory. As the lights move upwards, they gradually disperse and become less dense, creating a trail-like effect. The overall movement of the lights is smooth and continuous, without any sudden changes in speed or direction. The video focuses solely on the lights, with no other objects or elements visible in the frame.

Our fine-grained caption:

Lights streak upwards in a curved path, then disperse and move left.

Previous coarse-grained caption:

Person they laugh.

Dynamics oriented questions:

What are the facial expressions or body language indicators of the person's laughter?
Is the laughter continuous or intermittent throughout the clip?
Are there any other people or objects in the scene that may be eliciting the person's laughter?
Does the person's laughter seem genuine or forced?
Is there any noticeable change in the person's demeanor or emotional state before and after the laughter?

Dynamics oriented answers:

The person's laughter is indicated by a wide smile, crinkled eyes, and head tilted back slightly.
The laughter appears to be continuous throughout the clip.
There are no other people or objects visible in the scene that could be causing the laughter.
The person's laughter seems genuine, with relaxed facial muscles and a natural expression.
The person's demeanor appears consistently happy and amused throughout the clip.

Dynamics oriented descriptions:

The video starts with a medium shot of a person bent over, then they stand up and face the camera, revealing a young woman with glasses and a blue shirt. The background is a simple room with pink walls and a doorway. The woman is smiling and laughing throughout the clip, with her head tilted back slightly and her eyes crinkled. She holds a pink cloth in her hands and occasionally brings it up to her face, as if to hide her laughter. The camera remains static, and the lighting is consistent throughout the video.

Our fine-grained caption:

A woman in a blue shirt and glasses, holding a pink cloth, laughs continuously with a wide smile and crinkled eyes.
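The two walkthroughs above follow the same flow: a coarse caption, dynamics oriented questions, answers, a longer description, and finally one fine-grained caption. The sketch below shows one way these intermediate artifacts could be bundled and turned into a rewriting prompt for an LLM. The class name, function name, and prompt wording are hypothetical and are not taken from the VERIFIED implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DynamicsRecord:
    """Intermediate artifacts for one moment (hypothetical container)."""
    coarse_caption: str
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)
    description: str = ""

def build_rewrite_prompt(record: DynamicsRecord) -> str:
    """Assemble a text prompt asking an LLM for one fine-grained caption."""
    lines = [
        f"Coarse caption: {record.coarse_caption}",
        "Question-answer pairs:",
    ]
    for index, (question, answer) in enumerate(record.qa_pairs, start=1):
        lines.append(f"{index}. Q: {question} A: {answer}")
    lines.append(f"Description: {record.description}")
    # Illustrative instruction; the real prompt is not shown on this page.
    lines.append(
        "Rewrite the coarse caption as one sentence that adds only the "
        "motion details supported above."
    )
    return "\n".join(lines)

record = DynamicsRecord(
    coarse_caption="Person they laugh.",
    qa_pairs=[(
        "Is the laughter continuous or intermittent throughout the clip?",
        "The laughter appears to be continuous throughout the clip.",
    )],
    description="The woman is smiling and laughing throughout the clip.",
)
print(build_rewrite_prompt(record))
```

Keeping the questions, answers, and description together in one record makes it easy to trace which observation supports each detail in the final caption.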

*These video clips are excerpted from Charades, DiDeMo, and ActivityNet. Access to the raw videos complies with the licenses of Charades, DiDeMo, and ActivityNet as set by their original collectors.

BibTeX

@misc{chen2024verifiedvideocorpusmoment,
  title={VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding},
  author={Houlun Chen and Xin Wang and Hong Chen and Zeyang Zhang and Wei Feng and Bin Huang and Jia Jia and Wenwu Zhu},
  year={2024},
  eprint={2410.08593},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.08593}
}


Our VERIFIED pipeline framework.
