Recent advances in camera-based occupancy prediction aim to jointly estimate 3D semantics and scene flow. We propose VoxelSplat, a novel regularization framework that leverages 3D Gaussian Splatting to improve learning in two key ways:
(i) 2D-Projected Semantic Supervision: During training, sparse semantic Gaussians decoded from 3D features are projected onto the 2D camera view, providing camera-visible 2D supervision that guides 3D semantic learning.
(ii) Enhanced Scene Flow Learning: Motion is modeled by propagating Gaussians with the predicted scene flow, so that flow can be supervised with labels from adjacent frames.
VoxelSplat integrates easily into existing occupancy models, improving both semantic and motion predictions without increasing inference time.
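Mechanism (i) hinges on projecting 3D Gaussian centers into the camera view so that 2D labels can supervise them. The sketch below illustrates the standard pinhole projection step under assumed conventions (world-frame centers, a 3x3 intrinsic matrix `K`, and a 4x4 world-to-camera extrinsic `T_cam_from_world`); the function name and thresholds are ours, not from the paper.

```python
import numpy as np

def project_gaussian_centers(means_3d, K, T_cam_from_world):
    """Project 3D Gaussian centers into a 2D camera view.

    means_3d: (N, 3) Gaussian centers in world coordinates.
    K: (3, 3) camera intrinsics.
    T_cam_from_world: (4, 4) world-to-camera extrinsics.
    Returns pixel coordinates for visible Gaussians and a visibility mask.
    """
    # Homogeneous world coordinates -> camera frame.
    homo = np.concatenate([means_3d, np.ones((len(means_3d), 1))], axis=1)
    cam = (T_cam_from_world @ homo.T).T[:, :3]
    # Keep only Gaussians in front of the camera.
    visible = cam[:, 2] > 1e-3
    # Pinhole projection with intrinsics, then perspective divide.
    uv = (K @ cam[visible].T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv, visible
```

In the full method, each projected Gaussian would also carry semantic logits that are splatted and compared against 2D labels; this sketch shows only the geometric projection.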
Predict 3D semantics (occupancy) and object motion (flow) from temporal surround-view images.
Overview of our framework: (1) Employing an occupancy model integrated with our flow decoder to predict occupancy and scene flow. (2) Sampling coordinates from the centers of occupied voxels, selected with ground-truth labels, to extract features, semantic logits, and scene flow, from which 3D semantic Gaussians are decoded. (3) Dividing the Gaussians into static and dynamic types, with dynamic ones updated by the predicted scene flow. (4) Rendering static and dynamic Gaussians separately for 2D supervision.
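Step (3) of the pipeline above can be sketched as follows. This is a minimal illustration, assuming the static/dynamic split is made by thresholding the per-Gaussian flow magnitude; the threshold, time gap, and function name are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def advect_dynamic_gaussians(means, flow, dt=0.5, speed_thresh=0.1):
    """Split Gaussians into static/dynamic and propagate the dynamic ones.

    means: (N, 3) Gaussian centers.
    flow: (N, 3) predicted per-Gaussian scene flow (m/s, assumed).
    dt: time gap to the adjacent frame in seconds (assumed).
    speed_thresh: speed above which a Gaussian is treated as dynamic.
    Returns the propagated centers and the dynamic mask.
    """
    speed = np.linalg.norm(flow, axis=1)
    dynamic = speed > speed_thresh
    moved = means.copy()
    # Only dynamic Gaussians move; static ones stay in place and can be
    # rendered separately, as in step (4).
    moved[dynamic] += flow[dynamic] * dt
    return moved, dynamic
```

Rendering the propagated dynamic Gaussians against adjacent-frame labels is what lets 2D supervision reach the scene flow head.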
3D occupancy prediction performance on the nuScenes validation set. RayIoU and mAVE are computed with the OpenOcc annotations, while mIoU is based on the Occ3D annotations. BEVDet-Occ-flow and FB-Occ-flow denote the original architectures with our scene flow decoder integrated.
Qualitative comparison of our occupancy prediction with other methods. Red boxes highlight the regions where the performance differences are most noticeable and our method shows clear superiority.
Qualitative comparison of our scene flow prediction with the ground truth. Flow magnitude is shown on a color scale; red arrows indicate both the direction and magnitude of the flow.
Explicitly modeling the occupancy field with 3D Gaussians and supervising it through splat rendering guides the original loss functions, including the occupancy and flow losses, toward a better convergence direction.
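One way to read this regularization view is that the rendering losses are simply added on top of the original objectives. The sketch below is a hypothetical weighted sum; the weight `w_render` and the exact loss terms are illustrative assumptions, not the paper's reported configuration.

```python
def total_loss(l_occ, l_flow, l_render_sem, l_render_depth, w_render=0.1):
    """Hypothetical combined objective: splat-rendering losses act as a
    regularizer alongside the original occupancy and flow losses.

    All arguments are scalar loss values; w_render is an illustrative
    weight balancing the rendering-based supervision.
    """
    return l_occ + l_flow + w_render * (l_render_sem + l_render_depth)
```

Because the Gaussians are used only for training-time supervision, this extra term adds no cost at inference.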
Visualization of rendered semantics and depth on the validation set. The first, second, and third rows show the input images, the rendered semantics, and the rendered depth maps, respectively.