In Stanford's ongoing collaboration with Toyota Research Institute (TRI), our research focuses on understanding the deep temporal and causal relationships of events in videos and language. Such capability is critical to the development of interactive agents which can learn about social dynamics and visual concepts in the world from natural supervision.
The promise of understanding videos lies in what can be discerned not only from a single image (such as scenes, people, and objects), but from what can be perceived over multiple images -- such as event temporality, causality, and dynamics (Figure 1). Correspondingly, our latest work [1] considers a critical question at the heart of video understanding research:
What makes a video task uniquely suited for videos, beyond what can be understood from a single image?
This foundational question has been considered many times before in the context of action recognition on trimmed clips [2]. Our analysis builds on this work and extends it to settings with both video and language, where natural language can describe richer and more complex event properties, relationships, and dynamics.
Standard baselines for assessing "image-constrained" (or atemporal) understanding of videos typically revolve around sampling a random frame or averaging information across frames. But this may not be representative, since videos are naturally noisy, correlated collections of frames. Consider the following video, showing a puppy in a human hand:
We can see in Figure 2 that not all frames carry clear semantic information, due to camera motion blur, unusual camera views, and similar artifacts. As a result, these standard approaches may not accurately capture the boundary where image-level understanding ends and video-level understanding begins!
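To make the two standard baselines above concrete, here is a minimal sketch of what they typically look like in practice. It assumes frozen CLIP-style encoders that embed frames and text into a shared space; the function names, shapes, and normalization details are illustrative assumptions, not our implementation.

```python
# Hypothetical sketch of the two standard "atemporal" baselines discussed above,
# assuming frozen CLIP-style encoders mapping frames and text into a shared space.
import torch

def random_frame_baseline(frame_embs: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Score a video-text pair using a single randomly sampled frame.

    frame_embs: (num_frames, dim) L2-normalized frame embeddings.
    text_emb:   (dim,) L2-normalized text embedding.
    """
    idx = torch.randint(len(frame_embs), (1,)).item()
    return frame_embs[idx] @ text_emb  # cosine similarity of one (possibly noisy) frame

def mean_pool_baseline(frame_embs: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Score a video-text pair by averaging frame embeddings before comparison."""
    pooled = frame_embs.mean(dim=0)
    pooled = pooled / pooled.norm()  # re-normalize the averaged embedding
    return pooled @ text_emb
```

Both variants can be dragged down by blurry or uninformative frames, which is exactly the concern raised above.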
We introduce the Atemporal Probe (ATP), a new technique for video-language analysis that builds on recent progress in image-language foundation models to help us better answer our analysis question:
What makes a video task uniquely suited for videos, beyond what can be understood from a single image (that is well-selected, without temporal context)?
To this end, ATP learns to select a single image-level input from a sparsely sampled set of video frames. Importantly, we incorporate a number of bottleneck restrictions into ATP's design so that this selection is made without temporal information. In this way, ATP provides a much stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding.
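For illustration, below is a minimal, hypothetical sketch of how an ATP-style selector can be structured; it is not the exact model from the paper. The idea: a small transformer encoder with no positional encodings (so the candidate frames are treated as an unordered set, with no access to temporal order) scores frozen frame embeddings and selects exactly one, using Gumbel-softmax during training and argmax at inference. All class names and hyperparameters here are illustrative assumptions.

```python
# Minimal, hypothetical sketch of an ATP-style atemporal frame selector
# (not the authors' exact implementation). Assumes frozen image-language
# embeddings for each sparsely sampled candidate frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtemporalSelector(nn.Module):
    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.score = nn.Linear(dim, 1)  # one selection logit per candidate frame

    def forward(self, frame_embs: torch.Tensor) -> torch.Tensor:
        """frame_embs: (batch, num_frames, dim) frozen frame embeddings.
        Returns the embedding of a single selected frame per video: (batch, dim)."""
        # No positional encodings are added, so candidates form an unordered set
        # and the selector cannot exploit temporal ordering.
        h = self.encoder(frame_embs)
        logits = self.score(h).squeeze(-1)                   # (batch, num_frames)
        if self.training:
            weights = F.gumbel_softmax(logits, hard=True)    # discrete yet differentiable
        else:
            weights = F.one_hot(logits.argmax(-1), logits.size(-1)).float()
        # Return the *original* embedding of the chosen frame, preserving the
        # image-level bottleneck: downstream reasoning sees only one frame.
        return (weights.unsqueeze(-1) * frame_embs).sum(dim=1)
```

Returning the raw embedding of the selected frame, rather than a pooled or transformer-processed feature, is what keeps the probe's output strictly image-level in this sketch.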
By applying ATP to standard tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. Surprisingly, we find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts (e.g. [3]) specifically intended to benchmark deeper video-level understanding.
We also demonstrate how ATP can be useful to help the field advance video-level understanding going forward, both as (1) a potential in-the-loop tool for improving dataset design, and (2) as a way to improve efficiency and accuracy of video-level reasoning models.
You can find more details about our proposed ATP model design and our analysis findings in our paper.
We are excited to continue collaborating with TRI to discover research breakthroughs that will help us move toward our long-term goals of creating interactive agents which learn about complex events in the world through video and language.
References:
- [1] Buch et al. "Revisiting the 'Video' in Video-Language Understanding." CVPR 2022.
- [2] Huang et al. "What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets." CVPR 2018.
- [3] Xiao et al. "NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions." CVPR 2021.