AI2 Research

MolmoMotion

Forecasting Point Trajectories in 3D with Language Instruction

¹Allen Institute for AI     ²University of Washington     ³UNC-Chapel Hill

*Equal contribution     Core contributors

Given 3D query points on an object, a short RGB history, and a language instruction, MolmoMotion predicts each point's future 3D trajectory in a metric world frame.

3D Motion Forecasting

Given a visual history, query points on an object, and a language instruction, MolmoMotion predicts the future 3D trajectory of each point, capturing rigid, articulated, and deformable motion across indoor, egocentric, and outdoor scenes.

I. Robotics Planning

Object motion in 3D is largely embodiment-agnostic, so the prior learned from human video transfers to robots. Fine-tuned on real-robot videos from DROID, MolmoMotion forecasts coherent object trajectories, and initializing a manipulation policy from it improves downstream pick-and-place.

Robotics planning results
A pick-and-place policy initialized from MolmoMotion reaches higher task success and learns faster than a Molmo 2 baseline (76.3% vs 56.0% on MolmoSpaces); higher is better.

II. Video Generation

MolmoMotion's predicted 3D trajectories act as an explicit motion-control signal for image-to-video generation. Conditioning on them makes the generated motion follow the instruction more faithfully than text alone.

Video generation metrics
MolmoMotion-guided generation (DaS + MolmoMotion) improves all five motion-related metrics over the base model and beats a much larger image-to-video model on four of five; higher is better.

Our MolmoMotion Framework

MolmoMotion builds on the Molmo 2 vision-language backbone, which grounds the language instruction to objects and points in the image. From image tokens, the action description, and 2D query-point features, it decodes each point's future 3D trajectory in two variants: an autoregressive model (MolmoMotion-AR) that emits coordinates as quantized text, and a flow-matching model (MolmoMotion-FM) that generates continuous trajectories from noise.
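To make "coordinates as quantized text" concrete, here is a minimal sketch of one way a 3D trajectory could be binned and serialized for an autoregressive decoder. The coordinate range, the number of bins, and the `ix iy iz ; ...` text layout are assumptions chosen for illustration, and the function names are hypothetical; none of them is taken from MolmoMotion-AR itself.

```python
def quantize(value, lo, hi, bins):
    """Map a metric coordinate to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * bins), bins - 1)


def dequantize(index, lo, hi, bins):
    """Return the center of a bin, in the same metric units as the input."""
    return lo + (index + 0.5) * (hi - lo) / bins


def encode_trajectory(points, lo=-2.0, hi=2.0, bins=1000):
    """Serialize (x, y, z) points as text: one 'ix iy iz' triple per time step."""
    # Assumed layout: triples separated by ';', indices separated by spaces.
    return " ; ".join(
        " ".join(str(quantize(c, lo, hi, bins)) for c in point)
        for point in points
    )


def decode_trajectory(text, lo=-2.0, hi=2.0, bins=1000):
    """Parse the text produced by encode_trajectory back into metric points."""
    return [
        tuple(dequantize(int(tok), lo, hi, bins) for tok in step.split())
        for step in text.split(";")
    ]
```

With these assumed settings each bin is 4 mm wide, so a round trip through text changes an in-range coordinate by at most 2 mm. Finer bins reduce that error at the cost of a larger coordinate vocabulary.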

MolmoMotion architecture
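For the flow-matching variant, "generating continuous trajectories from noise" generally means integrating a velocity field from a noise sample to a data sample. The sketch below shows that sampling loop with plain Euler steps. It assumes the common convention that t = 0 is noise and t = 1 is data, and `toy_velocity` is a hypothetical closed-form stand-in for a learned network. It is not the MolmoMotion-FM model, which would also condition on image tokens, the instruction, and query points.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line path from noise to `target`, the velocity that
    reaches the target at t = 1 is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_trajectory(velocity_fn, dim, steps=10, seed=0):
    """Draw Gaussian noise and integrate dx/dt = v(x, t) from t=0 to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # flattened trajectory
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # always below 1, so the toy field never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Usage: two future time steps of one point, flattened to six numbers.
target = [0.10, 0.00, 1.20, 0.15, 0.02, 1.25]
sample = sample_trajectory(lambda x, t: toy_velocity(x, t, target), dim=6)
```

Because the toy field points straight at a known target, the loop lands on that target regardless of the noise draw. A trained model has no such target and instead predicts the velocity from its conditioning inputs.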

Training Data: MolmoMotion-1M

MolmoMotion-1M supplies the supervision this task requires: large-scale video paired with object-grounded 3D point trajectories and action descriptions. Because no existing dataset offers this combination, we generate it with an automatic pipeline that extracts object-grounded 3D trajectories from unconstrained video.

The pipeline lifts dense 2D tracks into a shared metric 3D frame, removes points that do not move coherently with the object, smooths the rest, and clips each video to the window of real motion. At scale it yields MolmoMotion-1M, to our knowledge the largest corpus of action-described, object-grounded 3D point trajectories to date, spanning 736 motion types and 5.6K distinct objects.
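The lifting step can be illustrated with standard pinhole geometry. The text above does not say how lifting is implemented, so this sketch assumes that each frame comes with a metric depth value at the tracked pixel, pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and a camera-to-world pose (`R`, `t`). All function and parameter names are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(p_cam, R, t):
    """Apply a camera-to-world rigid transform: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def lift_track(track_2d, depths, intrinsics, poses):
    """Lift one 2D track into a shared world frame.

    track_2d:   list of (u, v) pixel positions, one per frame
    depths:     list of metric depths at those pixels, one per frame
    intrinsics: (fx, fy, cx, cy), assumed constant over the clip
    poses:      list of (R, t) camera-to-world transforms, one per frame
    """
    fx, fy, cx, cy = intrinsics
    return [
        to_world(unproject(u, v, d, fx, fy, cx, cy), R, t)
        for (u, v), d, (R, t) in zip(track_2d, depths, poses)
    ]
```

Expressing every frame in one world frame is what separates object motion from camera motion: a static point seen by a moving camera maps to the same 3D location in every frame.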

Example MolmoMotion-1M training samples: object-grounded 3D point trajectories extracted from unconstrained video by our pipeline.
Overview of the data annotation pipeline. We ground the moving object and sample query points, track dense 2D points, lift them into a shared metric 3D frame, filter unreliable trajectories with object-level spatial and temporal consistency priors, and clip the video to intervals of meaningful motion.
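As an illustration of the final clipping step, the sketch below finds the span of frames in which an object's tracked points actually move. The rule (mean per-point displacement between consecutive frames above a fixed threshold) and the 1 cm default are assumptions made for this example; the pipeline's actual criterion is not specified here.

```python
import math


def motion_window(frames, threshold=0.01):
    """Return (start, end) frame indices, inclusive, spanning the motion.

    frames: list of per-frame point lists, each a list of (x, y, z) in meters,
            with the same points in the same order in every frame.
    Returns None if no consecutive pair of frames moves more than `threshold`.
    """
    moving = []
    for prev, cur in zip(frames, frames[1:]):
        # Mean displacement of the tracked points between two frames.
        step = sum(math.dist(p, q) for p, q in zip(prev, cur)) / len(cur)
        moving.append(step > threshold)
    if not any(moving):
        return None
    first = moving.index(True)
    last = len(moving) - 1 - moving[::-1].index(True)
    # moving[i] describes the step from frame i to frame i + 1.
    return first, last + 1
```

A clip that starts and ends with a stationary object is trimmed to the frames between the first and last significant step, so the forecast target covers real motion rather than idle frames.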

PointMotionBench

PointMotionBench measures 3D motion-forecasting accuracy on held-out data. It comprises 2.7K human-validated clips across 111 object categories and 61 motion types, spanning indoor manipulation, egocentric hand-object interaction, and outdoor dynamic scenes. Each method receives the current observation, query points, and an action description, and is scored on how closely its predicted 3D trajectories match the object's true future motion. This is a direct quantitative test of accuracy, not a judgment of whether a track merely looks plausible.

3D motion forecasting results on PointMotionBench
On PointMotionBench, MolmoMotion beats pixel-space video generators, parametric-3D methods, and constant-velocity extrapolation. Bars show 3D average displacement error in meters; lower is better.
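For reference, average displacement error and a constant-velocity baseline can be written in a few lines. This sketch assumes the error is averaged uniformly over all query points and future time steps, and that the baseline has at least two observed 3D positions per point. The benchmark's exact averaging and baseline setup may differ, and the function names are hypothetical.

```python
import math


def average_displacement_error(pred, truth):
    """Mean Euclidean distance in meters over all points and time steps.

    pred, truth: lists of trajectories, one per query point, each a list of
                 (x, y, z) positions with matching lengths.
    """
    total, count = 0.0, 0
    for pred_traj, true_traj in zip(pred, truth):
        for p, q in zip(pred_traj, true_traj):
            total += math.dist(p, q)
            count += 1
    return total / count


def constant_velocity(history, horizon):
    """Extrapolate one point by repeating its last observed per-step velocity.

    history: list of at least two observed (x, y, z) positions
    horizon: number of future steps to predict
    """
    last, prev = history[-1], history[-2]
    velocity = tuple(a - b for a, b in zip(last, prev))
    return [
        tuple(l + (k + 1) * v for l, v in zip(last, velocity))
        for k in range(horizon)
    ]
```

For example, a point observed at (0, 0, 0) and then (1, 0, 0) is extrapolated to (2, 0, 0) and (3, 0, 0). If the true future is (2, 0, 0) then (3, 0, 1), the per-step errors are 0 m and 1 m, giving an average of 0.5 m.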

Limitations

MolmoMotion uses only 8 query points per object, which limits its ability to capture dense geometry and complex deformable motion. Broader real-world and closed-loop evaluation remains future work.

Acknowledgements

This work would not have been possible without the support of our colleagues at Ai2. We thank David Albright, Kristin Cha, Byron Bischoff, David Everhart, Jon Borchardt, Kyle Wiggers, Will Smith, Peter Clark, Dieter Fox, and Noah Smith for their important work on the MolmoMotion public release. We thank Ropedia for providing access to the Xperience dataset used in this work and for granting permission for the release of MolmoMotion under the Apache License 2.0. Chenhao Zheng is partially funded through an Apple grant. We thank Oncel Tuzel, Pavan Kumar, and Rick Chang for helpful discussions and support on this project.