Fast Encoder-Based 3D from Casual Videos via Point Track Processing

Wuyue Lu²
¹NVIDIA Research
²Simon Fraser University
³Technion

TracksTo4D receives video frames together with a set of pre-extracted point tracks, shown in corresponding colors, as inputs (left), and maps them into dynamic 3D structures and camera positions (right). The output camera trajectory is shown as gray frustums, and the current camera is marked in red. The reconstructed 3D scene points are shown in colors matching the input tracks. Note that all outputs presented on this web page are obtained at inference time, with a single feed-forward prediction, without any optimization or fine-tuning, on unseen test cases.


Overview




We tackle the long-standing challenge of reconstructing 3D structures and camera positions from multiple images of a scene. The problem is particularly hard when objects deform in a non-rigid way. Current approaches to this problem make unrealistic assumptions or require long optimization times. We present TracksTo4D, a learning-based approach for inferring 3D structure and camera positions from in-the-wild videos. Specifically, we build on recent progress in point tracking to extract long-term point tracks from videos and to learn class-agnostic features. We then design a deep equivariant neural network architecture (Figure 2) that maps the point tracks of a given video into corresponding camera poses, per-frame 3D points, and per-point non-rigid motion level values. Training on pet videos, we observe that just by "watching" point tracks in videos, our network learns to infer their 3D structure and the camera motion.
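To make the interface concrete, the minimal sketch below shows the kind of mapping described above: from a tensor of 2D point tracks to per-frame cameras, per-frame 3D points, and per-point motion levels. This is an illustration only, not the authors' code; the tensor shapes, head parameterizations, and the `TracksTo4DSketch` class are assumptions.

```python
# Minimal interface sketch (illustrative, not the released model).
# Assumes point tracks are packed as a tensor of shape (T, N, 2):
# N tracks observed over T frames.
import torch
import torch.nn as nn


class TracksTo4DSketch(nn.Module):
    """Hypothetical wrapper illustrating the inputs/outputs described above."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(2, dim)             # lift each 2D observation to a feature
        self.backbone = nn.Identity()              # stand-in for the equivariant attention trunk
        self.camera_head = nn.Linear(dim, 6)       # per-frame pose (e.g., axis-angle + translation)
        self.point_head = nn.Linear(dim, 3)        # per-frame, per-track 3D point
        self.motion_head = nn.Linear(dim, 1)       # per-track non-rigid motion level

    def forward(self, tracks: torch.Tensor):
        # tracks: (T, N, 2) pixel coordinates of N point tracks over T frames
        x = self.backbone(self.embed(tracks))      # (T, N, dim)
        cameras = self.camera_head(x.mean(dim=1))  # (T, 6): one camera pose per frame
        points3d = self.point_head(x)              # (T, N, 3): dynamic 3D structure
        motion = self.motion_head(x.mean(dim=0))   # (N, 1): per-point motion level
        return cameras, points3d, motion
```

The per-frame and per-track outputs mirror the prediction targets listed above; the actual trunk is the alternating attention architecture sketched under Pipeline below.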

Our model is trained on the Common Pets dataset using only 2D point tracks extracted by CoTracker, without any 3D supervision, simply by minimizing reprojection errors. We evaluate our method on test data with ground-truth depth maps and demonstrate that it generalizes well across object categories. With a short bundle adjustment, we achieve the most accurate camera poses compared to previous methods, while running faster. More specifically, compared to the state of the art, we reduce the Absolute Translation Error by 18%, the Relative Translation Error by 21%, and the Relative Rotation Error by 15%. Moreover, our method produces depth accuracy comparable to the state-of-the-art method while being ~10× faster. We also demonstrate a certain level of generalization to entirely out-of-distribution inputs.
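The training signal described above (2D tracks only, no 3D supervision) can be illustrated with a simple reprojection objective. The sketch below is hedged: it assumes a pinhole camera with known intrinsics `K` and per-frame world-to-camera poses `(R, t)`, and the function name and argument layout are hypothetical rather than the paper's exact loss.

```python
# Hedged sketch of a reprojection loss of this kind (not the exact formulation).
import torch


def reprojection_loss(points3d, R, t, K, tracks2d, visible):
    """points3d: (T, N, 3) predicted world points per frame; R: (T, 3, 3) and
    t: (T, 3) world-to-camera rotations/translations; K: (3, 3) intrinsics;
    tracks2d: (T, N, 2) observed 2D tracks; visible: (T, N) visibility mask."""
    cam = torch.einsum('tij,tnj->tni', R, points3d) + t[:, None, :]  # to camera frame
    proj = torch.einsum('ij,tnj->tni', K, cam)                       # apply intrinsics
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)              # perspective divide
    err = (uv - tracks2d).norm(dim=-1)                               # per-observation pixel error
    return (err * visible).sum() / visible.sum().clamp(min=1)        # mean over visible observations
```

In practice, the poses and 3D points fed to such a loss would come from the network's feed-forward prediction, and the objective would be minimized over a collection of training videos.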

Pipeline


Our pipeline. Our network takes as input a set of 2D point tracks (left) and processes them with several multi-head attention layers, alternating between the time dimension and the track dimension (middle). The network predicts cameras, per-frame 3D points, and a movement value per world point (right). The internal colors of the 3D points illustrate the predicted 3D movement level values: points with high/low 3D motion are shown in red/purple, respectively. These outputs are used to reproject the predicted points into the frames for computing the reprojection error losses. See the paper for details.
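The alternating attention described in the caption can be sketched roughly as follows: self-attention is applied along the time axis (each track attends over its frames) and then along the track axis (each frame attends over its tracks). The block below is an assumed structure for illustration, not the released architecture; the class name and hyperparameters are placeholders.

```python
# Assumed alternating-attention block over a (frames x tracks x feature) grid.
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """Illustrative block: attention over time, then attention over tracks."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.track_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_time = nn.LayerNorm(dim)
        self.norm_track = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, dim) features for N point tracks over T frames
        # Attention over time: each track attends across its T frame features.
        h = x.permute(1, 0, 2)                                   # (N, T, dim)
        q = self.norm_time(h)
        h = h + self.time_attn(q, q, q, need_weights=False)[0]
        # Attention over tracks: each frame attends across its N track features.
        h = h.permute(1, 0, 2)                                   # (T, N, dim)
        q = self.norm_track(h)
        h = h + self.track_attn(q, q, q, need_weights=False)[0]
        return h


# Example usage on a 40-frame clip with 128 tracks and 256-dim features:
# y = AlternatingAttentionBlock()(torch.randn(40, 128, 256))   # y: (40, 128, 256)
```

Stacking several such blocks corresponds to the middle stage of the figure, after which per-frame and per-track prediction heads produce the cameras, 3D points, and movement values.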

Results

Citation