
Revisiting the "Video" in Video-Language Understanding

In Stanford's ongoing collaboration with Toyota Research Institute (TRI), our research focuses on understanding the deep temporal and causal relationships of events in videos and language. Such capability is critical to the development of interactive agents which can learn about social dynamics and visual concepts in the world from natural supervision.

Figure 1: Understanding video and language offers the promise of understanding not only image-level aspects (e.g. static scenes, objects, etc.) but also how and why events change over time. We introduce the Atemporal Probe (ATP), a new model for video-language analysis that helps us "revisit the video" and better understand both the limitations and potential of standard video-language benchmarks towards these capabilities.

The promise of understanding videos lies in what can be discerned not only from a single image (such as scenes, people, and objects), but from what can be perceived over multiple images -- such as event temporality, causality, and dynamics (Figure 1). Correspondingly, our latest work [1] considers a critical question at the heart of video understanding research:

What makes a video task uniquely suited for videos, beyond what can be understood from a single image?

This foundational question has been considered many times before in the context of action recognition on trimmed clips [2]. Our analysis builds beyond this work towards settings with video and language, where natural language has the potential to describe more complex and richer event properties, relationships, and dynamics.

Standard baselines for assessing "image-constrained" (or atemporal) understanding of videos typically revolve around sampling a random frame or averaging information across frames. But this may not be representative, since videos are naturally noisy, correlated collections of frames. Consider the following video, showing a puppy in a human hand:

Figure 2: Videos are naturally noisy, correlated collections of frames: while some frames (highlighted in blue) have clear image-level semantics, a significant number of frames contain camera motion blur, difficult perspectives, or are generally uninformative. A baseline for image-level understanding operating on such noisy input may not really represent the boundary between image and video multimodal understanding. This challenge motivates the design of our proposed ATP model.

We can see in Figure 2 that not all frames carry clear semantic information, due to camera motion blur, unusual camera views, etc. This means these standard approaches may not really represent the boundary where image-level understanding stops and video-level understanding begins!

We introduce the Atemporal Probe (ATP), a new technique for video-language analysis which builds on the progress in recent image-language foundation models to help us better answer our analysis question: 

What makes a video task uniquely suited for videos, beyond what can be understood from a single image (that is well-selected, without temporal context)?

With this goal, ATP learns to select a single image-level input from a sparsely sampled set of video frames. Importantly, we incorporate a number of bottleneck restrictions to ATP's design, so that this selection is done without temporal information. In this way, ATP provides a much stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding.
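To make the selection mechanism concrete, here is a minimal NumPy sketch of the core idea: score each sparsely sampled frame independently (no positional encoding, no cross-frame context) and pick one frame's embedding for the downstream task. This is an illustrative simplification, not the paper's implementation; the CLIP-style frozen image embeddings and the learned scoring weights are stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def atp_select(frame_embs: np.ndarray, score_w: np.ndarray) -> int:
    """Pick one frame from a sparsely sampled set, atemporally.

    Each frame is scored independently of its neighbors -- there is no
    positional information and no cross-frame interaction, so the
    selection cannot exploit temporal ordering (the bottleneck that
    makes the probe "atemporal" by construction).
    """
    scores = frame_embs @ score_w   # (N,) one score per frame
    return int(np.argmax(scores))   # index of the single chosen frame

# Toy demo: 8 sparsely sampled frames with 512-dim embeddings from a
# frozen image-language encoder (dimensions assumed for illustration).
frames = rng.standard_normal((8, 512))
w = rng.standard_normal(512)        # stand-in for learned scoring weights
best = atp_select(frames, w)
selected_emb = frames[best]         # passed on to the downstream task head
```

Because only one well-selected frame (and no temporal context) reaches the downstream head, any accuracy this pipeline achieves is attributable to image-level understanding alone.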

By applying ATP to standard tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. Surprisingly, we find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts (e.g. [3]) specifically intended to benchmark deeper video-level understanding.

We also demonstrate how ATP can be useful to help the field advance video-level understanding going forward, both as (1) a potential in-the-loop tool for improving dataset design, and (2) as a way to improve efficiency and accuracy of video-level reasoning models.
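For use (1), the dataset-design idea can be sketched as a simple filter: benchmark examples that an atemporal model already answers correctly likely need only image-level understanding, while the remainder are better candidates for probing true video-level reasoning. The helper and field names below are hypothetical, not from the paper.

```python
def split_by_atemporal_solvability(examples, atp_predict):
    """Partition benchmark examples by whether an atemporal model solves them.

    `examples` is a list of dicts with an "answer" key, and `atp_predict`
    maps an example to a predicted answer (both names are hypothetical).
    Examples the atemporal model gets right probably do not require
    temporal understanding; the rest are stronger video-level probes.
    """
    image_solvable, needs_video = [], []
    for ex in examples:
        bucket = image_solvable if atp_predict(ex) == ex["answer"] else needs_video
        bucket.append(ex)
    return image_solvable, needs_video

# Toy usage with a stub predictor that always answers "a"
demo = [
    {"q": "what color is the ball?", "answer": "a"},
    {"q": "what happened after the ball dropped?", "answer": "b"},
]
easy, hard = split_by_atemporal_solvability(demo, lambda ex: "a")
```

Run in-the-loop during dataset construction, this kind of check could steer annotation toward questions that genuinely demand multi-frame, multi-event understanding.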

Figure 3: (left) Our proposed ATP helps identify settings in standard video-language benchmarks (e.g. video question answering) where image-level understanding may be (unintentionally) all we need to answer the question, and settings where multi-frame, multi-event video-level understanding is really required. This can help us improve the dataset design of video-language benchmarks going forward. (right) In our paper, we also describe how we can use ATP to further improve accuracy and efficiency in video-level models.

You can find more details about our proposed ATP model design and our analysis findings in our paper.

We are excited to continue collaborating with TRI to discover research breakthroughs that will help us move toward our long-term goals of creating interactive agents which learn about complex events in the world through video and language.


  • [1] Buch et al. "Revisiting the 'Video' in Video-Language Understanding." CVPR 2022.
  • [2] Huang et al. "What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets." CVPR 2018.
  • [3] Xiao et al. "NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions." CVPR 2021.
