Abstract
This paper presents an efficient system for simultaneous dense scene reconstruction and object labeling in real-world environments captured with an RGB-D sensor. The system first generates object proposals in the scene. It then tracks spatio-temporally consistent object proposals across multiple frames and produces a dense reconstruction of the scene. In parallel, it runs an efficient inference algorithm in which object class probabilities are computed at the object level and fused into a voxel-based prediction hypothesis defined over the voxels of the reconstructed scene. Extensive experiments on challenging RGB-D object and scene datasets, and on live video streams from a Microsoft Kinect, show that the proposed system achieves 3D scene reconstruction and object labeling results competitive with state-of-the-art methods.
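To make the object-to-voxel fusion idea concrete, the following is a minimal sketch, not the system's actual inference algorithm. The abstract does not specify the fusion rule, so the sketch assumes a simple multiplicative (recursive Bayesian-style) update with renormalization, starting from a uniform prior per voxel. The names `fuse_object_probs` and `label_voxels`, the dictionary-based voxel store, and the `eps` floor are all hypothetical and introduced only for illustration.

```python
def fuse_object_probs(voxel_probs, voxel_ids, object_probs, eps=1e-6):
    """Fuse one object-level class distribution into the voxels it covers.

    voxel_probs  : dict mapping voxel id -> list of class probabilities
    voxel_ids    : iterable of voxel ids covered by the object proposal
    object_probs : class probabilities predicted for the whole object
    eps          : floor that keeps a single zero from vetoing a class (assumed)
    """
    num_classes = len(object_probs)
    for vid in voxel_ids:
        # Voxels seen for the first time start from a uniform prior (assumption).
        prior = voxel_probs.get(vid, [1.0 / num_classes] * num_classes)
        # Assumed fusion rule: element-wise product, then renormalize.
        unnorm = [max(p, eps) * max(q, eps) for p, q in zip(prior, object_probs)]
        total = sum(unnorm)
        voxel_probs[vid] = [u / total for u in unnorm]
    return voxel_probs


def label_voxels(voxel_probs):
    """Return the most probable class index for every voxel."""
    return {
        vid: max(range(len(probs)), key=probs.__getitem__)
        for vid, probs in voxel_probs.items()
    }


# Two overlapping object proposals observed in different frames.
voxels = {}
fuse_object_probs(voxels, [(0, 0, 0), (0, 0, 1)], [0.7, 0.2, 0.1])
fuse_object_probs(voxels, [(0, 0, 1), (0, 0, 2)], [0.1, 0.8, 0.1])
labels = label_voxels(voxels)
```

In this sketch the voxel shared by both proposals accumulates evidence from each observation, while voxels covered by only one proposal simply inherit that proposal's distribution. A real system would index voxels through its reconstruction volume rather than a Python dictionary.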