LiDAR point cloud compression for 3D object detection
Chief investigator
Yue-Ru Hou, Jui-Chiu Chiang
Fig. 1: The architecture of proposed CompressDet3D.
ABSTRACT
Real-time downstream tasks, such as 3D object detection, face significant challenges due to the large volume of 3D LiDAR point cloud data, which affects transmission, storage, and processing efficiency. Traditional point cloud compression methods focus on geometric reconstruction, but complete reconstruction is often unnecessary for many applications. We propose an end-to-end point cloud compression technique specifically designed for 3D object detection. By extracting and compressing point cloud features with an encoder and feeding the decompressed features directly into a partial detection network, our method enables joint training of the compression and detection networks. This approach reduces the bit-rate requirement while maintaining detection performance, achieving substantial bit-rate savings compared to conventional reconstruction-based methods.
PROPOSED METHOD
The proposed architecture, illustrated in Fig. 1, consists of two key components: the Main Encoder and the Partial Detection Network. The input point cloud is first processed by the Main Encoder, which extracts and compresses its features; the decompressed features are then fed directly into the Partial Detection Network.
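To make the joint training concrete, the following is a minimal PyTorch-style sketch of how the two components could be wired together. The module interfaces, the likelihood-based rate proxy, and the weighting factor lam are illustrative assumptions rather than the exact CompressDet3D implementation.

```python
# Minimal sketch of a joint compression + detection pipeline (assumed structure).
# The encoder and detector are placeholders, not the exact CompressDet3D layers.
import torch
import torch.nn as nn

class CompressDetPipeline(nn.Module):
    def __init__(self, encoder: nn.Module, detector: nn.Module, lam: float = 1.0):
        super().__init__()
        self.encoder = encoder      # extracts and compresses point-cloud features
        self.detector = detector    # partial detection network fed with decoded features
        self.lam = lam              # rate/detection trade-off weight (assumed hyperparameter)

    def forward(self, points: torch.Tensor):
        # Assumed encoder interface: returns decoded features and entropy-model likelihoods.
        feats, likelihoods = self.encoder(points)
        detections = self.detector(feats)
        # Rate term: expected bits per input point (standard learned-compression proxy).
        rate = -torch.log2(likelihoods).sum() / points.shape[0]
        return detections, rate

def training_step(pipeline, points, targets, detection_loss_fn, optimizer):
    # One joint optimization step: detection loss + lam * rate,
    # so the compression and detection networks are trained together.
    detections, rate = pipeline(points)
    loss = detection_loss_fn(detections, targets) + pipeline.lam * rate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Optimizing the detection loss together with a rate term in a single backward pass is what allows the encoder to discard information the detector does not need, which is the source of the bit-rate savings over full reconstruction.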
A. Main Encoder
Fig. 2: The network architecture of proposed encoder module.
A temporal fusion module is employed to effectively exploit spatiotemporal features from the two previous frames, as shown in Fig. 2. This module intersects and unions features across the previous frames, enriching the feature information and extracting the essential information common to different frames. These features are then concatenated at each scale to achieve comprehensive spatiotemporal fusion. Finally, a "convolution on target coordinates" maps the features from the previous frames' coordinates to those of the current frame, enabling prediction of the current frame's features from the multi-scale coordinates of the previous frames.
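As a rough illustration of the intersection/union step, the sketch below fuses per-voxel features from two previous frames onto the union of their occupied coordinates using a simple coordinate hash. The helper names (_hash, fuse_two_frames) are hypothetical; the actual module operates on multi-scale sparse tensors and applies a convolution on the current frame's target coordinates, which is omitted here.

```python
# Illustrative sketch (not the exact module) of fusing features from two previous
# frames on the union of their occupied voxel coordinates.
# coords*: integer voxel indices of shape (N, 3); feats*: features of shape (N, C).
import torch

def _hash(coords: torch.Tensor, bits: int = 21) -> torch.Tensor:
    # Pack x, y, z voxel indices into one 64-bit key (assumes indices < 2**bits).
    c = coords.long()
    return (c[:, 0] << (2 * bits)) | (c[:, 1] << bits) | c[:, 2]

def fuse_two_frames(coords1, feats1, coords2, feats2):
    k1, k2 = _hash(coords1), _hash(coords2)
    # Union of voxels occupied in either previous frame.
    union_keys, inverse = torch.unique(torch.cat([k1, k2]), return_inverse=True)
    C = feats1.shape[1]
    f1 = torch.zeros(union_keys.numel(), C)
    f2 = torch.zeros(union_keys.numel(), C)
    f1[inverse[: k1.numel()]] = feats1    # scatter frame-1 features onto the union grid
    f2[inverse[k1.numel():]] = feats2     # scatter frame-2 features onto the union grid
    # Voxels in the intersection carry both frames' features after concatenation;
    # the remaining voxels are zero-padded. A convolution on the current frame's
    # coordinates would then map these fused features to the target frame.
    return union_keys, torch.cat([f1, f2], dim=1)
```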
B. Partial Detection Network
Fig. 3: The network architecture of proposed partial detection module.
To build an accurate context model for dynamic point cloud compression, it is crucial to leverage the temporal relationships between neighboring point cloud frames. We therefore introduce a Transformer-based Entropy Model (TEM) to encode the residual features, as shown in Fig. 3. However, unlike video, point clouds are inherently unordered and sparse, and directly feeding large-scale feature representations into a Transformer incurs high computational costs. To tackle this, we first apply a Morton scan to the 3D points, transforming the spatially irregular 3D points into a one-dimensional (1D) sequence that serves as input to the Transformer.
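A Morton (Z-order) scan interleaves the bits of the x, y, z voxel indices so that sorting by the resulting key places spatially neighboring voxels close together in the 1D sequence. The sketch below shows one common way to compute it; the bit width and the x/y/z interleaving order are conventions assumed here, not taken from the paper.

```python
import numpy as np

def morton_code_3d(coords: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the bits of integer (x, y, z) voxel indices into one Morton key."""
    c = coords.astype(np.int64)
    codes = np.zeros(len(coords), dtype=np.int64)
    for b in range(bits):
        codes |= ((c[:, 0] >> b) & 1) << (3 * b + 2)   # x bit (assumed most significant)
        codes |= ((c[:, 1] >> b) & 1) << (3 * b + 1)   # y bit
        codes |= ((c[:, 2] >> b) & 1) << (3 * b)       # z bit
    return codes

def morton_order(coords: np.ndarray, bits: int = 10) -> np.ndarray:
    """Permutation that sorts the points along the Z-order (Morton) curve."""
    return np.argsort(morton_code_3d(coords, bits))

# Example: serialize a sparse set of voxels into a 1D sequence for the Transformer.
coords = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]])
sequence = coords[morton_order(coords)]  # spatially close voxels end up adjacent
```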
Fig. 4: Regular sparse convolution block & Submanifold sparse convolution block.
EXPERIMENT
A. Dataset
For training, we utilize the 8i Voxelized Full Bodies (8iVFB) dataset, which comprises four sequences: longdress, loot, redandblack, and soldier, as shown in Fig. 4, each with 300 frames. All sequences are quantized to 9-bit geometry precision. For testing, we employ the Owlii dynamic human sequence dataset, which contains four sequences: basketball, dancer, exercise, and model, as shown in Fig. 5, all quantized to 10-bit geometry precision. The division of training and testing samples adheres to the Exploration Experiment (EE) guidelines recommended by the MPEG AI-PCC group.
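For reference, N-bit geometry precision means the XYZ coordinates are mapped onto a (2^N)^3 voxel grid. The snippet below is a generic sketch of such quantization, assuming min-max scaling and duplicate removal; the exact MPEG preprocessing of the 8iVFB and Owlii sequences may differ in its scaling convention.

```python
import numpy as np

def quantize_geometry(points: np.ndarray, bits: int) -> np.ndarray:
    """Quantize raw XYZ coordinates onto a (2**bits)^3 voxel grid and drop duplicates."""
    mins = points.min(axis=0)
    extent = (points - mins).max()            # uniform scale preserves the aspect ratio
    scale = (2 ** bits - 1) / extent
    voxels = np.round((points - mins) * scale).astype(np.int64)
    return np.unique(voxels, axis=0)          # points falling in the same voxel collapse to one
```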
Fig. 5: Training Dataset & Testing Dataset. (KITTI)
Fig. 6: Scene display. (KITTI)
B. Performance Comparison
We evaluate the performance of our proposed method against VPCCv18 (as specified in MPEG EE 5.3) as well as learned dynamic point cloud geometry compression methods, including DPCC-SC, SparsePCGCv3, D-DPCC, Hie-ME/MC, and DPCC-STTM. We also compare the results with Mamba-PCGC. Rate-distortion curves for the different methods are presented in Fig. 6, and the corresponding BD-Rate gains are shown in Table 1 and Table 2.
Fig. 7: Comparison set.
Fig. 8: Bpp-AP curves comparison.
Our proposed method significantly outperforms VPCCv18 across all testing sequences, achieving an average BD-Rate gain of 82.53%. Notably, the dancer sequence exhibits the highest BD-Rate gain among all sequences. Furthermore, our method achieves up to a 43.39% BD-Rate gain compared to other existing learning-based solutions for dynamic point cloud geometry compression (DPCGC).
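For context, the BD-Rate numbers above follow the Bjøntegaard metric: both rate-quality curves are fitted in the log-rate domain and integrated over their overlapping quality range (PSNR for the RD curves, or AP for the Bpp-AP curves in Fig. 8). The function below is a common cubic-fit variant of this computation; it is an assumption about the evaluation script, not the authors' exact tool.

```python
import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Bjontegaard delta rate (%) of the test codec against the anchor.
    Negative values mean the test codec needs fewer bits at equal quality."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Cubic fit of log-rate as a function of quality (e.g., D1 PSNR or detection AP).
    p_a = np.polyfit(quality_anchor, lr_a, 3)
    p_t = np.polyfit(quality_test, lr_t, 3)
    lo = max(min(quality_anchor), min(quality_test))
    hi = min(max(quality_anchor), max(quality_test))
    # Integrate both fits over the overlapping quality range and average the gap.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100
```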
Table 1: Complexity comparison.