RealTraj: Towards Real-World Pedestrian Trajectory Forecasting

Keio University¹, NVIDIA²


Our paper addresses three key limitations of the existing pedestrian trajectory forecasting task. (Top) Pedestrian perception errors can significantly degrade trajectory forecasting performance. (Middle) Real-world data collection requires substantial manual effort. (Bottom) Person ID annotations require extensive manual labor.

Abstract

This paper jointly addresses three key limitations of conventional pedestrian trajectory forecasting: pedestrian perception errors, real-world data collection costs, and person ID annotation costs. We propose a novel framework, RealTraj, that enhances the real-world applicability of trajectory forecasting. Our approach comprises two training phases—self-supervised pretraining on synthetic data and weakly-supervised fine-tuning with limited real-world data—to minimize data collection effort. To improve robustness to real-world errors, we focus on both model design and training objectives. Specifically, we present Det2TrajFormer, a trajectory forecasting model that remains invariant to tracking noise by taking past detections as inputs. We further pretrain the model with multiple pretext tasks, which enhance robustness and improve forecasting performance from detection data alone. Unlike previous trajectory forecasting methods, our approach fine-tunes the model using only ground-truth detections, significantly reducing the need for costly person ID annotations. In the experiments, we comprehensively verify the effectiveness of the proposed method against these limitations and show that it outperforms state-of-the-art trajectory forecasting methods on multiple datasets.

Approach

Our proposed framework consists of two training phases and an inference phase. (1) Self-supervised pretraining on synthetic data using multiple pretext tasks. (2) Weakly-supervised fine-tuning on real ground-truth detections. (3) Future trajectory inference based solely on detection inputs.
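The three phases above can be sketched in code. This is a minimal, hedged illustration only: every function and variable name is hypothetical, and the real Det2TrajFormer is a transformer, stubbed here as a constant-velocity extrapolator so the end-to-end flow stays runnable.

```python
# Hypothetical sketch of the RealTraj pipeline (not the paper's code).
# The model consumes raw per-frame detections -- no person IDs, no
# tracking -- mirroring the paper's ID-free, detection-based inputs.

def det2trajformer_stub(past_detections, horizon=3):
    """Forecast future (x, y) positions from past detections.

    Stand-in for Det2TrajFormer: extrapolates the last two detections
    at constant velocity. The real model is a transformer.
    """
    (x0, y0), (x1, y1) = past_detections[-2], past_detections[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * t, y1 + vy * t) for t in range(1, horizon + 1)]

def pretrain_phase(synthetic_scenes):
    """Phase 1 (sketch): self-supervised pretext tasks on synthetic
    data, e.g. recovering clean positions from perturbed detections."""
    for scene in synthetic_scenes:
        _ = det2trajformer_stub(scene)  # forward pass; loss/update omitted

def finetune_phase(real_detection_scenes):
    """Phase 2 (sketch): weakly-supervised fine-tuning on real
    ground-truth detections only (no person ID labels needed)."""
    for scene in real_detection_scenes:
        _ = det2trajformer_stub(scene)  # forward pass; loss/update omitted

# Phase 3: inference from detection inputs alone.
future = det2trajformer_stub([(0.0, 0.0), (1.0, 0.5)], horizon=2)
print(future)  # [(2.0, 1.0), (3.0, 1.5)]
```

The point of the stub is the interface, not the predictor: each phase sees only detections, which is what removes the person ID annotation cost at fine-tuning time.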