As motion prediction systems are deployed in increasingly diverse and unpredictable environments,
generalization becomes a central challenge—requiring models not only to perform well on known domains
but also to adapt, recover, or remain robust in the face of distributional shifts and open-world settings.
Several key themes that emerge throughout our paper remain open challenges and present opportunities for future research.
Data Scaling and Distribution Understanding: Despite the demonstrated success of
large-scale datasets in NLP and computer vision, the field of motion prediction remains largely in a small-data regime due to the
costly and labor-intensive process of collecting and annotating motion data. For instance, leading autonomous driving datasets
such as Argoverse, Argoverse 2, and the Waymo Open Motion Dataset (WOMD) contain only 320K, 250K, and 480K data sequences,
respectively—orders of magnitude smaller than datasets in NLP (e.g., GPT-3's 400B-token CommonCrawl) or vision (e.g., JFT's 303M images).
This scarcity of data constrains models’ ability to learn rich and transferable representations, ultimately limiting their robustness
and generalization.
Therefore, a critical priority is scaling up high-quality datasets, either by unifying
existing sources to overcome discrepancies in formats and definitions, by developing efficient data collection pipelines, or by leveraging data synthesis methods.
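As a minimal illustration of what such unification could involve, the sketch below defines a hypothetical common schema and a per-dataset adapter; all field names, the agent-type vocabulary, and the shared 10 Hz rate are illustrative assumptions rather than an established standard.

```python
# Hypothetical sketch of a common schema for unifying heterogeneous motion
# datasets. All field names, the agent-type vocabulary, and the shared 10 Hz
# rate are illustrative assumptions, not an established standard.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Agent:
    agent_id: str
    agent_type: str        # normalized vocabulary, e.g. "vehicle", "pedestrian"
    xy: np.ndarray         # (T, 2) positions in meters, in a shared local frame
    valid: np.ndarray      # (T,) boolean observation mask

@dataclass
class Scene:
    scene_id: str
    dt: float              # shared sampling period in seconds
    agents: list = field(default_factory=list)

def resample(xy, src_dt, dst_dt):
    """Linearly resample a (T, 2) track to the shared rate."""
    t_src = np.arange(len(xy)) * src_dt
    t_dst = np.arange(0.0, t_src[-1] + 1e-9, dst_dt)
    return np.stack([np.interp(t_dst, t_src, xy[:, i]) for i in range(2)], axis=1)

def from_native(record, dst_dt=0.1):
    """One adapter per source dataset maps native records into the schema."""
    agents = []
    for a in record["agents"]:
        xy = resample(np.asarray(a["xy"], dtype=float), record["dt"], dst_dt)
        agents.append(Agent(a["id"], a["type"].lower(), xy,
                            np.ones(len(xy), dtype=bool)))
    return Scene(record["scene_id"], dst_dt, agents)
```

Adapters like this hypothetical from_native carry most of the real burden: reconciling map formats, label taxonomies, and coordinate conventions across sources is precisely the discrepancy problem noted above.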
Meanwhile, scaling alone is not sufficient: understanding and quantifying the underlying data distribution is essential to ensure diversity,
avoid redundancy, and guide efficient data usage. Tools are required to characterize dataset coverage, diversity, and scenario
difficulty, moving beyond low-level statistics and handcrafted metrics. Such understanding is crucial for nearly every aspect of
the generalization lifecycle: fair and informative cross-dataset benchmarking, designing efficient learning paradigms with balanced data coverage (e.g., dataset distillation,
active learning), evaluating the novelty of newly collected or synthetic data, and guiding the design of generalizable models themselves.
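As a toy example of such a tool, the sketch below scores a trajectory set's behavioral coverage by clustering simple hand-picked motion descriptors and measuring the entropy of cluster occupancy. The descriptors, the use of k-means, and the entropy-based score are all illustrative assumptions, standing in for the richer, learned characterizations the field still needs.

```python
# Toy sketch: scoring a trajectory set's behavioral coverage. The descriptors,
# k-means behavior modes, and entropy score are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def descriptors(trajs):
    """Map each (T, 2) trajectory to a small hand-picked feature vector."""
    feats = []
    for xy in trajs:
        v = np.diff(xy, axis=0)
        speed = np.linalg.norm(v, axis=1)
        heading = np.unwrap(np.arctan2(v[:, 1], v[:, 0]))
        feats.append([speed.mean(), speed.std(), heading[-1] - heading[0]])
    return np.asarray(feats)

def coverage(labels, k):
    """Normalized entropy of cluster occupancy: ~1 means the k behavior
    modes are covered uniformly, ~0 means mass concentrates in one mode."""
    p = np.bincount(labels, minlength=k) / len(labels)
    return -(p[p > 0] * np.log(p[p > 0])).sum() / np.log(k)

# Usage: a redundant (straight-only) set vs. a set with varied turning.
rng = np.random.default_rng(0)
straight = [np.cumsum(rng.normal([1, 0], 0.05, (50, 2)), axis=0) for _ in range(500)]
curved = [np.cumsum(0.8 * np.stack([np.cos(th), np.sin(th)], 1), axis=0)
          for th in (np.linspace(0, e, 50) for e in rng.uniform(-3, 3, 500))]

k = 16  # shared behavior modes, fit once on the pooled, standardized features
pool = descriptors(straight + curved)
mu, sd = pool.mean(0), pool.std(0) + 1e-8
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit((pool - mu) / sd)
print(coverage(km.predict((descriptors(straight) - mu) / sd), k))  # expected: low
print(coverage(km.predict((descriptors(curved) - mu) / sd), k))    # expected: higher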
Revolutionize Modeling and Learning Strategies:
Motion prediction introduces unique structural and temporal challenges that set it apart from vision
and language tasks. Unlike static images or discrete text, motion data consists of continuous trajectories
shaped by physical laws, spatial constraints, and multi-agent interactions. Generalizable models in this domain
must therefore learn representations that generalize across diverse scene layouts, agent types, behavioral patterns,
and geographies—while respecting the causal and temporal dependencies inherent in real-world motion.
To this end, several key questions remain open. What inductive biases, tokenization strategies, or architectural modules
are best suited to represent agents, maps, and interactions effectively? Should representations be built on trajectory-level
abstractions or raw sensor streams like point clouds and videos—which bypass annotation and offer richer context, but
introduce high computational costs and struggle to capture discrete agent behaviors? More broadly, the optimal approach
for learning informative, transferable features across heterogeneous motion datasets remains unclear—highlighting the
need for further research into motion-specific pretraining objectives, scalable architectures, and adaptation strategies.
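As one concrete, deliberately simple answer to the tokenization question, the sketch below bins per-step speed and heading change into a small discrete motion vocabulary, one common way to make continuous trajectories digestible for sequence models; the bin counts, ranges, and timestep are illustrative assumptions, not a recommended configuration.

```python
# Hedged sketch: a discrete motion vocabulary built by binning per-step
# (speed, heading-change) increments. Bin counts, ranges, and the timestep
# are illustrative assumptions.
import numpy as np

N_SPEED_BINS, N_TURN_BINS = 8, 8            # vocabulary of 8 * 8 = 64 tokens

def tokenize(xy, dt=0.1, max_speed=20.0, max_turn=0.5):
    """Map a (T, 2) trajectory to T - 2 discrete token ids."""
    v = np.diff(xy, axis=0) / dt
    speed = np.linalg.norm(v, axis=1)
    heading = np.unwrap(np.arctan2(v[:, 1], v[:, 0]))
    turn = np.diff(heading)                  # heading change per step (rad)
    s = np.clip((speed[1:] / max_speed * N_SPEED_BINS).astype(int),
                0, N_SPEED_BINS - 1)
    t = np.clip(((turn + max_turn) / (2 * max_turn) * N_TURN_BINS).astype(int),
                0, N_TURN_BINS - 1)
    return s * N_TURN_BINS + t               # flatten the 2-D bin to one id

def detokenize_step(token, dt=0.1, max_speed=20.0, max_turn=0.5):
    """Recover the bin-center (speed, heading change) for one token."""
    s, t = divmod(int(token), N_TURN_BINS)
    return ((s + 0.5) / N_SPEED_BINS * max_speed,
            (t + 0.5) / N_TURN_BINS * 2 * max_turn - max_turn)

# Usage: a smooth left turn becomes a near-constant token sequence.
th = np.linspace(0, np.pi / 4, 30)
xy = 30 * np.stack([np.sin(th), 1 - np.cos(th)], axis=1)
print(tokenize(xy)[:8], detokenize_step(tokenize(xy)[0]))
```

A decoder over such tokens trades continuous precision for the ability to reuse standard sequence-model machinery; the right granularity, and whether agents and maps should share one vocabulary, are exactly the open questions raised above.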
Ultimately, developing such models will require rethinking not only training but also deployment—encompassing how to fine-tune
models for specific downstream tasks, and how to ensure robust generalization under distribution shifts through
test-time OOD detection, generalization, and adaptation.
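For the test-time OOD detection mentioned above, one simple and widely used recipe is to flag inputs on which an ensemble of predictors disagrees. The sketch below applies this idea to predicted trajectory endpoints; the ensemble construction, disagreement measure, and threshold are illustrative assumptions, and the threshold would need calibration on held-out in-distribution data.

```python
# Hedged sketch: test-time OOD flagging via ensemble disagreement on
# predicted endpoints. The disagreement measure and threshold are
# illustrative assumptions; the threshold needs in-distribution calibration.
import numpy as np

def ood_score(history, predictors):
    """Mean spread of predicted endpoints across an ensemble of models,
    each mapping an observed history (T_obs, 2) to a future (T_fut, 2)."""
    futures = np.stack([predict(history) for predict in predictors])  # (M, T_fut, 2)
    endpoints = futures[:, -1, :]                                     # (M, 2)
    return np.linalg.norm(endpoints - endpoints.mean(axis=0), axis=1).mean()

def is_ood(history, predictors, threshold=2.0):
    """Flag a scene when ensemble members disagree beyond the threshold."""
    return ood_score(history, predictors) > threshold

# Usage with toy constant-velocity "models", each perturbed differently.
def make_cv_model(noise, horizon=30, dt=0.1, seed=0):
    rng = np.random.default_rng(seed)
    def predict(history):
        v = (history[-1] - history[-2]) / dt + rng.normal(0.0, noise, 2)
        return history[-1] + np.arange(1, horizon + 1)[:, None] * dt * v
    return predict

models = [make_cv_model(noise=0.3, seed=i) for i in range(5)]
history = np.stack([np.arange(10) * 1.0, np.zeros(10)], axis=1)  # straight drive
print(f"OOD score: {ood_score(history, models):.2f}")
```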
Establish Standardized Benchmarks and Unified Evaluation:
Although recent works have explored generalizable motion prediction through diverse perspectives—ranging from self-supervised learning and
domain adaptation to continual learning, OOD detection, and foundation models—the field remains in its early stages. Many methods
are still exploratory and fragmented, hindered by the lack of standardized evaluation protocols. In particular, the absence of
widely accepted cross-dataset benchmarks contributes to this fragmentation: each paper often adopts its own customized setting—e.g., specific
data scales, shift types, or evaluation metrics—making it difficult to fairly compare methods, assess their generality, or track progress across
the field.
Future efforts should focus on establishing unified and realistic evaluation protocols that span multiple tasks,
data regimes, and evaluation metrics. In contrast to classification, where OOD can be defined by unseen labels, benchmarking in regression tasks
like motion prediction requires more nuanced definitions of distribution shift, moving beyond coarse-grained approaches that treat entire datasets as OOD. Evaluations
should also go beyond accuracy to include uncertainty estimation, which is essential for safety-critical downstream applications. Moreover,
fair comparisons under equal compute budgets are necessary to ensure meaningful assessment of progress.
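To make "beyond accuracy" concrete, the sketch below computes the field's standard minADE/minFDE displacement metrics alongside one simple uncertainty-aware score: the negative log-likelihood of the ground truth under a mixture of Gaussians centered on the predicted modes. The isotropic Gaussian with a fixed sigma is an illustrative assumption; a model that predicts its own covariances would be scored with those instead.

```python
# Hedged sketch: accuracy metrics (minADE / minFDE) plus an uncertainty-aware
# metric (mixture negative log-likelihood). The isotropic Gaussian with a
# fixed sigma is an illustrative assumption.
import numpy as np

def min_ade_fde(pred, gt):
    """pred: (K, T, 2) multi-modal forecasts; gt: (T, 2) ground truth."""
    dists = np.linalg.norm(pred - gt, axis=-1)       # (K, T) per-step errors
    return dists.mean(axis=1).min(), dists[:, -1].min()

def mixture_nll(pred, probs, gt, sigma=1.0):
    """NLL of gt under a mixture of isotropic 2-D Gaussians centered on the
    K predicted trajectories, weighted by predicted mode probabilities."""
    sq = ((pred - gt) ** 2).sum(axis=-1)             # (K, T) squared distances
    log_comp = -sq / (2 * sigma**2) - np.log(2 * np.pi * sigma**2)
    log_traj = log_comp.sum(axis=1)                  # (K,) joint over timesteps
    m = log_traj.max()                               # log-sum-exp over modes
    return -(m + np.log((probs * np.exp(log_traj - m)).sum()))

# Usage with two toy modes: one near the truth, one a lateral deviation.
gt = np.stack([np.arange(12) * 1.0, np.zeros(12)], axis=1)
pred = np.stack([gt + 0.2, gt + np.array([0.0, 3.0])])   # (2, 12, 2)
probs = np.array([0.7, 0.3])
ade, fde = min_ade_fde(pred, gt)
print(f"minADE={ade:.2f} minFDE={fde:.2f} NLL={mixture_nll(pred, probs, gt):.1f}")
```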
Unleash the Power of Foundation Models: As data understanding, generalization methods,
and benchmarking continue to mature, foundation models are emerging to unify and extend these capabilities—offering a path toward robust
generalization in the open world. This paradigm builds on transformative advances in computer vision and natural language processing, where
large models trained on massive data exhibit strong zero-shot and few-shot generalization across diverse tasks and modalities.
This emerging trend includes two complementary directions:
1) Developing Motion-Specific Foundation Models: There is a growing need to
design large models tailored for the structured and dynamic nature of motion data. This involves unifying diverse datasets,
rethinking tokenization and architectural priors, and developing adaptation strategies across agents, geographies, and scenarios.
2) Adapting Existing Foundation Models: Given the relatively limited motion
data, adapting general-purpose foundation models (e.g., LLMs, video generation models) trained on internet-scale corpora is
a promising but nascent direction. These models offer emergent reasoning and common-sense capabilities that could benefit
motion tasks, but challenges remain in bridging modality gaps, aligning spatial-temporal reasoning with physical constraints,
and ensuring safe integration into real-world systems. Exploring their potential for zero- and few-shot generalization in
motion prediction is a key research opportunity.
In essence, the robustness to open-world distribution shifts that comes naturally to humans may be an emergent property of
large-scale models, and the path to achieving human-like performance in prediction systems may be a matter of further scaling
up large foundation models. This aligns with the "Bitter Lesson", which emphasizes the importance of scaling up
computation and data over custom-tailored algorithms and architectures.