Recent advances in camera-based occupancy prediction aim to jointly estimate 3D semantics and scene flow. We propose VoxelSplat, a novel regularization framework that leverages 3D Gaussian Splatting to improve learning in two key ways:
(i) 2D-Projected Semantic Supervision: During training, sparse semantic Gaussians decoded from 3D features are projected onto the 2D camera view, providing camera-visible 2D supervision that guides 3D semantic learning.
(ii) Enhanced Scene Flow Learning: Motion is modeled by propagating Gaussians with the predicted scene flow, so that flow can be supervised with labels from adjacent frames.
VoxelSplat integrates easily into existing occupancy models, improving both semantic and motion predictions without increasing inference time.
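Mechanism (i) hinges on projecting 3D Gaussian centers into the camera view so that 2D labels can supervise them. The sketch below illustrates the standard pinhole projection step under assumed conventions (world-frame centers, a 3x3 intrinsic matrix `K`, and a 4x4 world-to-camera extrinsic `T_cam_from_world`); the function name and thresholds are ours, not from the paper.

```python
import numpy as np

def project_gaussian_centers(means_3d, K, T_cam_from_world):
    """Project 3D Gaussian centers into a 2D camera view.

    means_3d: (N, 3) Gaussian centers in world coordinates.
    K: (3, 3) camera intrinsics.
    T_cam_from_world: (4, 4) world-to-camera extrinsics.
    Returns pixel coordinates for visible Gaussians and a visibility mask.
    """
    # Homogeneous world coordinates -> camera frame.
    homo = np.concatenate([means_3d, np.ones((len(means_3d), 1))], axis=1)
    cam = (T_cam_from_world @ homo.T).T[:, :3]
    # Keep only Gaussians in front of the camera.
    visible = cam[:, 2] > 1e-3
    # Pinhole projection with intrinsics, then perspective divide.
    uv = (K @ cam[visible].T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv, visible
```

In the full method, each projected Gaussian would also carry semantic logits that are splatted and compared against 2D labels; this sketch shows only the geometric projection.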
Predict 3D semantics (occupancy) and object motion (flow) from temporal surround-view images.
Overview of our framework: (1) Employing an occupancy model integrated with our flow decoder to predict occupancy and scene flow. (2) Sampling coordinates from the centers of occupied voxels, selected with ground-truth labels, to extract features, semantic logits, and scene flow, from which 3D semantic Gaussians are decoded. (3) Dividing the Gaussians into static and dynamic types, with dynamic ones updated by the predicted scene flow. (4) Rendering static and dynamic Gaussians separately for 2D supervision.
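Step (3) of the pipeline above can be sketched as follows. This is a minimal illustration, assuming the static/dynamic split is made by thresholding the per-Gaussian flow magnitude; the threshold, time gap, and function name are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def advect_dynamic_gaussians(means, flow, dt=0.5, speed_thresh=0.1):
    """Split Gaussians into static/dynamic and propagate the dynamic ones.

    means: (N, 3) Gaussian centers.
    flow: (N, 3) predicted per-Gaussian scene flow (m/s, assumed).
    dt: time gap to the adjacent frame in seconds (assumed).
    speed_thresh: speed above which a Gaussian is treated as dynamic.
    Returns the propagated centers and the dynamic mask.
    """
    speed = np.linalg.norm(flow, axis=1)
    dynamic = speed > speed_thresh
    moved = means.copy()
    # Only dynamic Gaussians move; static ones stay in place and can be
    # rendered separately, as in step (4).
    moved[dynamic] += flow[dynamic] * dt
    return moved, dynamic
```

Rendering the propagated dynamic Gaussians against adjacent-frame labels is what lets 2D supervision reach the scene flow head.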
3D occupancy prediction performance on the nuScenes validation set. RayIoU and mAVE are computed with the OpenOcc annotations, while mIoU is based on the Occ3D annotations. BEVDet-Occ-flow and FB-Occ-flow denote the original architectures with our scene flow decoder integrated.
Qualitative comparison of our occupancy prediction with other methods. Red boxes highlight the regions where the performance differences are most noticeable and our method shows clear superiority.
Qualitative comparison of our scene flow prediction with the ground truth. Flow magnitude is shown on a color scale; red arrows indicate both the direction and magnitude of the flow.
Explicitly modeling the occupancy field with 3D Gaussians and supervising it through splat rendering guides the original loss functions, including the occupancy and flow losses, toward a better convergence direction.
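One way to read this regularization view is that the rendering losses are simply added on top of the original objectives. The sketch below is a hypothetical weighted sum; the weight `w_render` and the exact loss terms are illustrative assumptions, not the paper's reported configuration.

```python
def total_loss(l_occ, l_flow, l_render_sem, l_render_depth, w_render=0.1):
    """Hypothetical combined objective: splat-rendering losses act as a
    regularizer alongside the original occupancy and flow losses.

    All arguments are scalar loss values; w_render is an illustrative
    weight balancing the rendering-based supervision.
    """
    return l_occ + l_flow + w_render * (l_render_sem + l_render_depth)
```

Because the Gaussians are used only for training-time supervision, this extra term adds no cost at inference.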
Visualization of rendered semantics and depth on the validation set. The first, second, and third rows show the input images, the rendered semantics, and the rendered depth maps, respectively.