Fast Encoder-Based 3D from Casual Videos via Point Track Processing


Yoni Kasten1
Wuyue Lu2
Haggai Maron1,3
1NVIDIA Research
2Simon Fraser University
3Technion
NeurIPS 2024

TracksTo4D takes as input video frames together with a set of pre-extracted point tracks, shown in corresponding colors (left side), and maps them to a dynamic 3D structure and camera positions (right side). The output camera trajectory is shown as gray frustums, with the current camera marked in red. The reconstructed 3D scene points are colored to match the input tracks. Note that the outputs presented on this web page are obtained at inference time, with a single feed-forward prediction, without any optimization or fine-tuning, on unseen test cases.


Overview




We tackle the long-standing challenge of reconstructing 3D structure and camera positions from a set of scene images. The problem is particularly hard when objects deform non-rigidly. Current approaches to this problem make unrealistic assumptions or require long optimization times. We present TracksTo4D, a learning-based approach for inferring 3D structure and camera positions from in-the-wild videos. Specifically, we build on recent progress in point tracking to extract long-term point tracks from videos and learn class-agnostic features. We then design a deep equivariant neural network architecture (Figure 2) that maps the point tracks of a given video to corresponding camera poses, 3D points, and per-point non-rigid motion level values. After training on pet videos, we observe that, simply by "watching" point tracks in videos, our network learns to infer their 3D structure and the camera motion.
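To make the shape of this mapping concrete, here is a minimal PyTorch-style sketch of the interface, not the authors' implementation: the module names, feature dimension, and pose parameterization (quaternion + translation) are illustrative assumptions, and the equivariant attention backbone described below is abstracted away as a placeholder.

```python
# Illustrative sketch (not the authors' code) of the mapping TracksTo4D learns:
# 2D point tracks over N frames and P points -> cameras, 3D points, motion levels.
import torch
import torch.nn as nn

class TracksTo4DSketch(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(3, feat_dim)        # (x, y, visibility) -> features
        self.backbone = nn.Identity()              # placeholder for the equivariant attention layers
        self.camera_head = nn.Linear(feat_dim, 7)  # assumed per-frame pose: quaternion + translation
        self.point_head = nn.Linear(feat_dim, 3)   # per-frame 3D point positions
        self.motion_head = nn.Linear(feat_dim, 1)  # per-point non-rigid motion level

    def forward(self, tracks: torch.Tensor):
        # tracks: (B, N_frames, P_points, 3) with 2D locations and a visibility flag
        x = self.backbone(self.embed(tracks))      # (B, N, P, feat_dim)
        cameras = self.camera_head(x.mean(dim=2))  # (B, N, 7): one pose per frame
        points3d = self.point_head(x)              # (B, N, P, 3): per-frame 3D points
        motion = self.motion_head(x.mean(dim=1))   # (B, P, 1): per-point motion level
        return cameras, points3d, motion
```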

Our model is trained on the Common Pets dataset using only 2D point tracks extracted by CoTracker, without any 3D supervision, simply by minimizing reprojection errors. We evaluate our method on test data with ground-truth depth maps and demonstrate that it generalizes well across object categories. With a short bundle adjustment step, we achieve the most accurate camera poses compared to previous methods, while running faster. More specifically, compared to the state of the art, we reduce the Absolute Translation Error by 18%, the Relative Translation Error by 21%, and the Relative Rotation Error by 15%. Moreover, our method produces depth accuracy comparable to the state-of-the-art method while being ~10× faster. We also demonstrate a certain degree of generalization to entirely out-of-distribution inputs.
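As a rough illustration of this training signal, the following is a hedged sketch of a reprojection-error loss of the kind described above: predicted 3D points in camera coordinates are projected with a simple pinhole model and compared to the observed 2D tracks, with invisible points masked out. The exact projection model, weighting, and robust-loss details of the paper are not reproduced here.

```python
# Hedged sketch of a reprojection-error objective (assumed pinhole model and masking).
import torch

def reprojection_loss(points_cam, intrinsics, tracks_2d, visibility):
    """
    points_cam: (N, P, 3) predicted 3D points in each frame's camera coordinates
    intrinsics: (3, 3) camera intrinsic matrix K
    tracks_2d:  (N, P, 2) observed 2D track locations
    visibility: (N, P) 1.0 where the track is visible, 0.0 otherwise
    """
    proj = points_cam @ intrinsics.T                     # (N, P, 3) homogeneous image points
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)  # perspective divide
    err = (uv - tracks_2d).norm(dim=-1)                  # per-point reprojection error
    return (err * visibility).sum() / visibility.sum().clamp(min=1.0)
```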

Pipeline


Our pipeline. Our network takes as input a set of 2D point tracks (left) and applies several multi-head attention layers, alternating between the time dimension and the track dimension (middle). The network predicts cameras, per-frame 3D points, and a per-world-point movement value (right). The interior colors of the 3D points illustrate the predicted 3D movement level values: points with high 3D motion are shown in red, and points with low motion in purple. These outputs are used to reproject the predicted points into the frames to compute the reprojection error losses. See the paper for details.
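The alternating-attention pattern described in the caption can be sketched as follows. This is an illustrative layer under assumed shapes and names (pre-norm residual multi-head self-attention applied first along the time axis per track, then along the track axis per frame), not the paper's exact architecture; the design keeps the layer equivariant to permutations of the input tracks.

```python
# Minimal sketch (all names assumed) of attention alternating between time and track axes.
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.track_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N_frames, P_tracks, dim)
        B, N, P, D = x.shape

        # Attention over time: treat each track as a length-N sequence of frames.
        t = x.permute(0, 2, 1, 3).reshape(B * P, N, D)
        tn = self.norm1(t)
        t = t + self.time_attn(tn, tn, tn)[0]
        x = t.reshape(B, P, N, D).permute(0, 2, 1, 3)

        # Attention over tracks: treat each frame as an unordered set of P points.
        s = x.reshape(B * N, P, D)
        sn = self.norm2(s)
        s = s + self.track_attn(sn, sn, sn)[0]
        return s.reshape(B, N, P, D)
```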

Results

Citation