Given a 3D point cloud stream and language-3D captions, our method achieves real-time, causal 6-DoF pose tracking while reconstructing the 3D shape in the current observation. We demonstrate that our method (a) enables zero-shot inference for unseen objects with known categories, and (b) also exhibits zero-shot capability for unseen objects with unknown classes.
Illustration of the pipeline of the proposed methodology. Given an input point cloud stream along with the corresponding segmented masks, we first encode each frame with the 2D/3D backbones and the cross-coupled fusion module to obtain inter-frame embeddings f_{t-1}, f_t. These paired embeddings are then used to model an energy-based hypothesis about the change in pose and to learn a neural pose-aligned field that generates a shape query while aligning its pose for an arbitrary object. Meanwhile, these embeddings are aligned with the additional multi-level language instructions using the proposed GPT-assisted association and alignment modules to achieve zero-shot inference. Note that the caption embeddings f_c are added to f_{t-1} and f_t during the inference stage to enhance performance.
@inproceedings{sun2024l4d,
  author    = {Jingtao Sun and Yaonan Wang and Mingtao Feng and Yulan Guo and Ajmal Saeed Mian and Mike Zheng Shou},
  title     = {L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     = {xxx--xxx},
  year      = {2024}
}