Learning Traffic Screenplays through 3D Object Detection and Tracking

Stanford Investigators

Partner University Investigators

Katerina Fragkiadaki (CMU)

TRI Investigators

Project Summary

The overall aim of this project is to develop a holistic representation of traffic scenes that encodes both spatial and temporal information about all the actors in the scene. This representation will aggregate point clouds and images captured at regular or irregular time intervals, using the temporal dimension to compensate for data sparsity or sensor artifacts in the spatial dimension. By considering time and space jointly, our representation will naturally encode fundamental physical priors such as object permanence and consistency across time. The resulting representation can then be queried for the past, present, and potential futures of all the mobile entities in the scene, much like a traffic screenplay. The impact of such a representation is manifold. It may serve downstream control and planning tasks for autonomous vehicles over different time horizons and at different levels of abstraction (pixels or points, mobile entities, scenes). It may also allow sparse dataset annotations to be extrapolated and aggregated through label propagation.
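
To make the idea of querying a scene across time concrete, the following is a minimal sketch, not the project's method. It assumes a hypothetical Track structure that stores one entity's 3D positions at irregular timestamps, interpolates linearly between observations, and extrapolates with a constant-velocity assumption outside the observed span. The project itself targets a learned representation; the names Track, add, and query are illustrative only.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Track:
    """Timestamped 3D positions of one mobile entity (hypothetical structure)."""
    times: list = field(default_factory=list)      # seconds, strictly increasing
    positions: list = field(default_factory=list)  # (x, y, z) tuples in metres

    def add(self, t, xyz):
        # Observations may be irregularly spaced but must arrive in time order.
        if self.times and t <= self.times[-1]:
            raise ValueError("observations must be added in increasing time order")
        self.times.append(t)
        self.positions.append(tuple(xyz))

    def query(self, t):
        """Position at time t: interpolated inside the observed span,
        extrapolated at constant velocity (an assumption) outside it."""
        if not self.times:
            raise ValueError("empty track")
        if len(self.times) == 1:
            return self.positions[0]
        # Pick the pair of observations that brackets t, or the nearest end pair.
        i = bisect_left(self.times, t)
        i = min(max(i, 1), len(self.times) - 1)
        t0, t1 = self.times[i - 1], self.times[i]
        p0, p1 = self.positions[i - 1], self.positions[i]
        w = (t - t0) / (t1 - t0)  # w < 0 or w > 1 means extrapolation
        return tuple(a + w * (b - a) for a, b in zip(p0, p1))


# Usage: three observations at irregular intervals along the x axis.
car = Track()
car.add(0.0, (0.0, 0.0, 0.0))
car.add(1.0, (2.0, 0.0, 0.0))
car.add(3.0, (6.0, 0.0, 0.0))
past = car.query(0.5)    # between the first two observations
future = car.query(4.0)  # one second after the last observation
```

In this toy example the same call answers questions about the past and about one possible future; a learned representation would replace the constant-velocity rule with predictions that respect scene context.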

Research Goals:

Novel dynamic scene representation aggregating temporal and spatial information from frame sequences;
End-to-end differentiable detection and tracking system;
Mid- and long-term dynamic scene estimation and prediction.