Abstract

In recent years, the mining industry has faced a shortage of human resources, a continuing emphasis on safety, and stricter ecological preservation requirements. Autonomous mining trucks have emerged as a solution to these issues in open-pit mining operations. Because open-pit mines subject sensors to intense vibration and extreme temperature variation, hybrid solid-state LiDAR has become the primary choice of perception sensor. The point clouds produced by the nonrepetitive scanning of hybrid solid-state LiDAR differ in data structure and distribution from those of traditional mechanical LiDAR. This paper therefore proposes a LiDAR 3D object detection model, PointPillars-HSL (PointPillars-Hybrid Solid-State LiDAR), that accounts for the characteristics of both open-pit mining environments and hybrid solid-state LiDAR point clouds. It optimizes the model’s preprocessing, increases the dimensionality of the pillar features, adjusts the loss function, and employs transfer learning to reduce the reliance on scene-specific datasets. The result is a 3D object detection algorithm for hybrid solid-state LiDAR deployed in open-pit mining, with an overall vehicle recognition rate of 89.72%.

1. Introduction

1.1. Application Background

In recent years, the autonomous driving industry has prospered, but very few autonomous driving projects have delivered actual benefits. For short-distance, fixed-site, high-intensity open-pit mine transportation, autonomous driving technology has high practical value, and closed mining areas are among the scenarios where autonomous driving is most likely to be deployed. The automatic driving system of a mining truck generally consists of the autonomous mining truck, a communication system, and a central monitoring system, as shown in Figure 1. These systems work together to achieve efficient scheduling and operation in open-pit mining. Autonomous mining trucks not only solve transportation problems in smart mines and improve the efficiency of transportation operations but also reduce safety accidents, optimize production management, ease the difficulty of recruiting mining truck drivers, and reduce labor costs.

The transportation road in an open-pit mine is an unstructured road without obvious structural characteristics, and the road surface is relatively undulating. Heavy dust in dry weather and accumulated water and snow after rain and snowfall also strongly affect road conditions. In such a complex and changeable environment, and during nighttime operation without external lighting, camera-based perception cannot reliably obtain information about the surroundings. LiDAR-based perception is more adaptable to the mining environment and can meet the requirement of round-the-clock (24 × 7) operation of autonomous mining trucks.

1.2. Hybrid Solid-State LiDAR and Point Cloud

Although LiDAR is the best perception sensor for autonomous mining trucks, the production conditions in the open-pit mine are very harsh. The LiDAR mounted on the mining truck not only has to withstand the test of high and low temperatures and strong vibration but also faces severe mechanical shock during loading and unloading. Therefore, the performance of LiDAR directly determines the accuracy and stability of environmental perception.

A LiDAR mainly comprises a laser emitter, a receiver, a scanner, and signal processing circuits. According to the scanning method, LiDAR can be divided into three types: mechanical LiDAR, hybrid solid-state LiDAR (semi-solid-state LiDAR), and solid-state LiDAR. A mechanical LiDAR uses a motor to rotate its optical–mechanical structure through 360° and can scan the surrounding environment in all directions to form a point cloud. However, mechanical LiDAR adopts a traditional discrete design that is bulky and relies heavily on moving parts, so it is not suitable for the extreme temperature variations, strong vibration, and shock found in open-pit mines. Compared with mechanical LiDAR, hybrid solid-state and solid-state LiDAR can generally achieve a horizontal field of view of only 120°, but hybrid solid-state LiDAR has fewer rotating parts and solid-state LiDAR has no mechanical movement at all, so both are more stable and are shrinking in size and cost. Hybrid solid-state LiDAR is currently the popular LiDAR solution for mass-produced vehicles, and its technologies are relatively mature.

The data collected by LiDAR is called a point cloud. A point cloud is a large collection of points that expresses the spatial distribution and surface characteristics of objects in a 3D coordinate system. Each point contains 3D coordinates (XYZ) and reflection intensity (Intensity). The intensity is related to the surface material, roughness, incident angle, and direction of the object, as well as the emitted energy and wavelength of the laser.

The point cloud is inherently unordered, which means that exchanging any two points does not affect the 3D description of the spatial object. Similarly, translating, rotating, or scaling the whole cloud does not change the structure it describes. These properties can be used to augment point cloud data before network training. In addition, point clouds are sparse and have low resolution in 3D space, so 3D convolutions [1] performed on point clouds are not as effective as 2D convolutions on images.
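The augmentation idea described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `augment`, the angle and scale ranges, and the `(x, y, z, intensity)` tuple layout are all assumptions made for the example.

```python
import math
import random


def augment(points, max_angle=math.pi / 4, scale_range=(0.95, 1.05), seed=None):
    """Shuffle, rotate about the z-axis, and uniformly scale a point cloud.

    points: list of (x, y, z, intensity) tuples (assumed layout).
    The ranges are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(*scale_range)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    out = []
    for x, y, z, intensity in points:
        # Rotate in the ground plane, then scale all three coordinates.
        xr = cos_a * x - sin_a * y
        yr = sin_a * x + cos_a * y
        out.append((scale * xr, scale * yr, scale * z, intensity))

    # Point order carries no information, so shuffling is a free augmentation.
    rng.shuffle(out)
    return out
```

In a detection setting, the same rotation and scale would also have to be applied to the annotated bounding boxes so that labels stay aligned with the points.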

LiDAR point clouds from open-pit mines have several distinguishing characteristics. Because the roads are undulating, the z-coordinates of object centers vary widely, and objects are generally larger than those in urban scenes. The roads are unstructured, and dust is commonly present. Dust not only introduces noise into the point cloud but also accumulates on the specular surfaces of the LiDAR, lowering the overall quality of the point cloud data. Additionally, vehicles moving in the mining area vibrate much more than in an urban scene, which can interfere with the point cloud and distort the data.

1.3. LiDAR Perception Algorithm

Open-pit mines generally impose strict restrictions on the entry and exit of vehicles and pedestrians, so the obstacles appearing in the mining and stripping section are mainly vehicles, including electric shovels, mining trucks, bulldozers, road graders, and sprinklers. These vehicles are large and easy to detect, and pedestrians are extremely unlikely to appear, which makes the scene well suited to deep learning object detection methods.

3D object detection methods in deep learning are data-driven [2]: they rely on massive sample data, and their detection accuracy can exceed that of traditional feature extraction methods. Data therefore become the key to deep learning algorithms. The number of public autonomous driving datasets containing point cloud data is increasing year by year, covering tasks such as object detection, object tracking, semantic segmentation, mapping, and positioning; examples include KITTI [3], SemanticKITTI [4], BLVD [5], Waymo, and nuScenes. However, these datasets mainly cover structured urban roads, while open-pit mines are unstructured scenes. The object types, object sizes, and perceived environments of the two scenarios are quite different, so the existing public datasets cannot be used directly as training data for 3D object detection on unstructured roads in open-pit mines. Moreover, the road conditions of open-pit mines are complex and change continually as mining advances, which makes data collection very challenging. In addition, mining trucks account for more than 50% of the objects in the current open-pit mine dataset, while the proportion of other objects is relatively small, resulting in poor class balance.

The traditional LiDAR point cloud processing method typically begins by segmenting and extracting nonground points, then clusters obstacle points to obtain obstacle features, and performs multiobject matching and tracking based on object features. The complex and changeable road conditions of open-pit mines bring challenges to the nonground point segmentation and extraction of traditional point cloud detection methods. Moreover, the traditional method cannot distinguish the types of vehicles in open-pit mines, and the generalization ability of point cloud features based on rule extraction is limited. For problems such as false positives of vehicles caused by street signs and retaining walls, as shown in Figure 2, traditional point cloud processing methods may not be as good as deep learning methods.

Early deep learning algorithms such as the point-based PointNet [6] and PointNet++ [7] must map the point-set features back to the original point cloud after computing neighborhood features, which incurs high time complexity. In voxel-based algorithms represented by VoxelNet [8], only a small proportion of voxels is nonempty, so the data representation is inefficient; moreover, the 3D convolutions used for feature extraction are computationally intensive, and inference is time-consuming. Some methods, such as AVOD [9], project point clouds onto a 2D plane and apply image-based object detection. However, the projection inevitably loses some geometric spatial information, resulting in shortcomings in depth prediction. In this context, PointPillars [10, 11] has attracted widespread attention because it strikes a favorable balance between inference speed and detection accuracy. PointPillars converts the point cloud into a compact representation of vertical columns (pillars) and processes it on that columnar structure. Compared with other Bird’s Eye View (BEV)-based methods, PointPillars raises inference speed to 60 frames per second (60 Hz) while maintaining a certain level of detection accuracy. Figure 3 compares PointPillars with other BEV-based methods, such as Frustum PointNet [12] and PIXOR++ [13], in terms of detection accuracy and speed. Baidu Apollo 6.0 pioneered the adoption of a PointPillars-based algorithm for LiDAR point cloud detection, achieving a threefold increase in detection frequency compared with SECOND [14].

However, although the features extracted by PointPillars include the offset of each point from the arithmetic mean of all points within the pillar, the subsequent pooling operation does not effectively preserve this information. A major characteristic of the open-pit mine dataset is that the obstacles are predominantly vehicles and the targets are large. A pillar that contains vehicle points differs considerably from pillars that contain no obstacles, both in point cloud density and in the distribution of points in the z-direction.

In light of the challenges posed by the complex and dynamic nature of unstructured road environments in open-pit mining, as well as the harsh operating conditions and lack of suitable datasets, this paper presents a novel 3D object detection model for LiDAR, named PointPillars-HSL (PointPillars-Hybrid Solid-State LiDAR). This model is designed by taking into account the characteristics of hybrid solid-state LiDAR point clouds in practical applications. This approach builds upon the foundation of the PointPillars model by optimizing the preprocessing structure and loss function, enhancing the dimensions of point cloud features, and utilizing transfer learning [15] methods to mitigate the need for specific datasets. As a result, improved detection performance is achieved on both the training and deployment platforms.

2. PointPillars-HSL Object Detection Algorithm

2.1. Data Preprocessing

Data, algorithms, and computing power are the three elements of artificial intelligence, and data plays the leading role among them. The open-pit mine detection dataset was collected with an Innovusion LiDAR mounted on a mining truck; Figure 4 shows one frame of mining-area point cloud data from this sensor. The LiDAR has a maximum detection range of 500 m, with a detection range of 250 m at 10% reflectivity. It features a 120° horizontal field of view, a 25° vertical field of view, and a resolution of 0.18° × 0.24°. The ranging accuracy is ±5 cm.

The PointPillars algorithm filters out point clouds that are projected outside the image bounds based on their projection in the image. However, due to distortion and instability in the fisheye cameras on the mining trucks, the PointPillars-HSL algorithm cannot perform filtering based on images. During the data preprocessing, an appropriate region of interest for point clouds is defined, allowing for initial downsampling of the point cloud data.

Because the LiDAR is fixed on the mining truck and the terrain of the open-pit mine is uneven and steep, the z-coordinates of the detected objects are widely distributed. If a narrow point cloud filtering range is applied in the z-direction, many obstacle targets will be excluded. This paper therefore compiles statistics on object heights in the mining truck dataset to obtain the height range of vehicle point clouds and adjusts that range dynamically in the data preprocessing step of model training.
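The region-of-interest cropping and the data-driven z range described above could look roughly like the sketch below. The function names, the safety `margin`, and every numeric range in the usage are hypothetical; the paper does not state the values it uses.

```python
def z_range_from_boxes(box_z_centers, box_heights, margin=0.5):
    """Derive a z filtering range that covers every annotated box.

    margin (metres) is an assumed safety buffer, not a value from the paper.
    """
    low = min(zc - h / 2 for zc, h in zip(box_z_centers, box_heights)) - margin
    high = max(zc + h / 2 for zc, h in zip(box_z_centers, box_heights)) + margin
    return low, high


def crop_to_roi(points, x_range, y_range, z_range):
    """Keep only (x, y, z, intensity) points inside an axis-aligned ROI."""
    (x_min, x_max), (y_min, y_max), (z_min, z_max) = x_range, y_range, z_range
    return [
        p for p in points
        if x_min <= p[0] < x_max and y_min <= p[1] < y_max and z_min <= p[2] < z_max
    ]


# Usage with made-up annotations: two boxes at very different heights.
z_range = z_range_from_boxes([-2.0, 6.0], [4.0, 6.0])
cloud = [(10.0, 0.0, 1.0, 0.5), (10.0, 0.0, 12.0, 0.5), (-5.0, 0.0, 1.0, 0.5)]
kept = crop_to_roi(cloud, (0.0, 200.0), (-40.0, 40.0), z_range)
```

Deriving the z range from the annotations, rather than fixing it as in urban configurations, is what keeps vehicles on steep ramps inside the crop.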

Compared with anchor-free algorithms [16], the PointPillars-HSL algorithm is particularly sensitive to the setting of anchor boxes. It is difficult for the detection network to learn object bounding boxes directly from the data, so the anchor serves as a template that tells the network the object size in advance. Starting from this template, the network learns to regress accurate 3D bounding boxes. The anchor parameters in the general PointPillars model represent vehicles in urban road scenarios and may differ from the vehicles encountered in open-pit mines. Consequently, after precise point cloud annotation of a subset of the data, anchor box dimensions were determined for the different vehicle categories in the open-pit mine.
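One simple way to derive per-class anchor sizes from an annotated subset is to average the labeled box dimensions per category. The sketch below assumes that approach; the paper does not say which statistic was used, and the class names and dimensions in the usage are invented.

```python
from collections import defaultdict
from statistics import mean


def anchor_sizes(annotations):
    """Average (length, width, height) per class.

    annotations: iterable of (class_name, length, width, height) in metres.
    Using the mean is an assumption; a median or k-means would also work.
    """
    by_class = defaultdict(list)
    for name, length, width, height in annotations:
        by_class[name].append((length, width, height))
    # zip(*dims) transposes the list of boxes into three per-axis sequences.
    return {
        name: tuple(mean(axis) for axis in zip(*dims))
        for name, dims in by_class.items()
    }


# Usage with hypothetical labels.
labels = [
    ("mining_truck", 10.0, 5.0, 6.0),
    ("mining_truck", 12.0, 7.0, 8.0),
    ("bulldozer", 8.0, 4.0, 4.0),
]
anchors = anchor_sizes(labels)
```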

2.2. Transfer Learning

Open-pit mines have a limited number of vehicles, and the close coupling between vehicle categories and operational scenarios makes data collection relatively inefficient. The steep and unstructured terrain also makes annotation difficult. As a result, the number of open-pit mine samples is currently limited, and if training starts from scratch, the stability and reliability of the model cannot be guaranteed. This paper therefore uses transfer learning to reduce the amount of training data required. Transfer learning is a branch of machine learning that exploits the similarity between data or tasks to initialize a model in a new domain from a model trained in an old domain. The approach used in this study is parameter transfer within a supervised learning framework. Because shallow network layers learn generic features that transfer well, the model parameters trained on the KITTI urban road dataset were used to train a detection model tailored to open-pit mining scenarios. The pretrained model serves as the starting point for learning, and the parameters of the deeper layers are then adjusted to improve the model’s generalization performance.

Transfer learning is founded on two fundamental concepts: domain and task. Domains are divided into source and target domains. In this paper, the source domain is the urban road scene, while the target domain is the open-pit mine. The task learned in both domains is the same, namely 3D object detection. Moreover, the model structures are highly similar, which makes them suitable for parameter-based transfer learning. By leveraging the abundant labeled samples available in the urban road scene, together with domain adaptation, transfer learning can alleviate the difficulty of learning vehicle features in the open-pit mine.
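Parameter transfer between two similar but not identical networks usually means copying only the weights whose names and shapes agree. The sketch below shows that selection step in a framework-agnostic way; the layer names, the shapes, and the `Param` stand-in are hypothetical and only illustrate the idea.

```python
from collections import namedtuple

# Stand-in for a tensor: anything exposing a .shape attribute works.
Param = namedtuple("Param", ["shape"])


def select_transferable(pretrained, target):
    """Return the pretrained entries whose name and shape match the target model."""
    return {
        name: value
        for name, value in pretrained.items()
        if name in target and value.shape == target[name].shape
    }


# Hypothetical parameter tables for a source (urban) and target (mine) model.
source = {
    "pfn.linear.weight": Param((64, 10)),
    "backbone.conv1.weight": Param((64, 64, 3, 3)),
    "head.cls.weight": Param((3, 384)),
}
target = {
    "pfn.linear.weight": Param((64, 12)),   # input widened from 10 to 12 features
    "backbone.conv1.weight": Param((64, 64, 3, 3)),
    "head.cls.weight": Param((5, 384)),     # different number of classes
}
transferable = select_transferable(source, target)
```

In PyTorch, a dictionary filtered this way could be passed to `model.load_state_dict(filtered, strict=False)` so that layers with changed shapes keep their fresh initialization.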

2.3. Model Structure

The PointPillars-HSL network structure proposed in this paper is mainly divided into three parts: PFN+ (Pillar Feature Net+), 2D Backbone [9], and SSD detection head, as shown in Figure 5.

In the PFN+ module, the downsampling of point clouds is achieved by adjusting the division of Pillars (cuboidal point cloud segments). This approach effectively addresses the issue of nonuniform distribution of distant and nearby point clouds. Compared to the 360° rotating scan of mechanical LiDARs, the nonrepetitive scanning approach of hybrid solid-state LiDAR ensures that the scanning path does not repeat and the illuminated area within the field of view increases over time. Figure 6 illustrates the accumulation of point clouds from the hybrid solid-state LiDAR as time progresses.

Hybrid solid-state LiDAR point clouds are characteristically dense near the sensor and sparse far from it. PointPillars-HSL therefore changes the pillar generation phase: instead of discretizing the point cloud into a grid with uniform spacing, the model adapts the grid to the actual characteristics of the point clouds in the open-pit mine. Smaller pillars are used for the dense point clouds close to the sensor, which generates more features and achieves finer localization, while the sparse point clouds in the distance are mapped to larger pillars. This strategy not only reduces the number of pillars but also strengthens the features of distant point clouds, thus improving detection performance. The entire process and its details are shown in Figure 7.
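A distance-dependent pillar grid of the kind described above can be sketched with two zones. The zone boundary (`near_limit`), the two pillar sizes, and the use of planar range as the switching criterion are all assumptions for illustration; the actual division is the one shown in the paper's Figure 7.

```python
import math
from collections import Counter


def pillar_index(x, y, near_limit=40.0, near_size=0.25, far_size=0.5):
    """Map a point to (zone, ix, iy), using finer pillars close to the sensor.

    All three parameters are hypothetical values in metres.
    """
    if math.hypot(x, y) < near_limit:
        zone, size = "near", near_size
    else:
        zone, size = "far", far_size
    return zone, math.floor(x / size), math.floor(y / size)


def count_points_per_pillar(points, **grid):
    """Count how many points fall into each pillar (points are (x, y, ...) tuples)."""
    return Counter(pillar_index(x, y, **grid) for x, y, *_ in points)


# Usage: two nearby points share a fine pillar, two distant points share a coarse one.
cloud = [(1.0, 1.1, 0.0), (1.1, 1.2, 0.5), (60.0, 1.1, 2.0), (60.3, 1.4, 2.0)]
counts = count_points_per_pillar(cloud)
```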

Because the pooling operation in the PFN+ module yields sparse features, PointPillars-HSL appends additional features to each point in order to retain the pillar density and the z-direction distribution information extracted from all points within a pillar. Specifically, each point is described by its x, y, z coordinates and intensity, its offsets to the center point of the pillar and to the arithmetic mean of all points within the pillar, and the density and z-direction distribution features of its pillar. After the point cloud is partitioned into pillars, limits are applied to the number of nonempty pillars per sample (P) and the number of points per pillar (N) in order to construct a tensor of size (D, P, N), where D = 12 is the feature dimensionality of the points within a pillar: the original 10 dimensions plus the density and distribution features. The tensorized point cloud then undergoes feature extraction to yield a tensor of size (C, P, N). Finally, a max-pooling operation is applied over the point dimension N, resulting in an output tensor of size (C, P). The encoded features are then scattered back to the original pillar positions, forming a pseudo-image of size (C, H, W), where P = H × W.
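The last two steps, pooling over the points of each pillar and scattering the result back onto the grid, can be illustrated with plain nested lists. This is a sketch only: for readability it indexes the features as `[pillar][point][channel]` rather than the (C, P, N) layout used in the text, and the function names are made up.

```python
def max_pool_pillars(features):
    """features[p][n] is a length-C vector; return the per-pillar max over its points."""
    # zip(*pillar) regroups the N point vectors into C per-channel sequences.
    return [[max(channel) for channel in zip(*pillar)] for pillar in features]


def scatter_to_pseudo_image(pooled, coords, height, width):
    """Write each pillar's C-vector to its (row, col) cell; empty cells stay 0.0.

    Returns a nested list indexed as image[c][row][col].
    """
    channels = len(pooled[0])
    image = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for vector, (row, col) in zip(pooled, coords):
        for c, value in enumerate(vector):
            image[c][row][col] = value
    return image


# Usage: 2 pillars, 2 points each, 2 channels, scattered onto a 2 x 2 grid.
feats = [
    [[1.0, -2.0], [0.5, 3.0]],
    [[-1.0, 0.0], [-4.0, 2.0]],
]
pooled = max_pool_pillars(feats)
image = scatter_to_pseudo_image(pooled, [(0, 1), (1, 0)], height=2, width=2)
```

Because the max is taken independently per channel, pillar-level statistics such as point count are not recoverable from the pooled vector unless they are carried as explicit features.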

The density calculation formula for each pillar is as follows, where N represents the number of points within each Pillar:

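The vertical distribution of the point cloud can be reflected by the variance of the z coordinates of all points within each Pillar, where $z_i$ is the z coordinate of the i-th point and $\bar{z}$ is the mean z coordinate of the N points in the Pillar:

$$\sigma_z^{2} = \frac{1}{N} \sum_{i=1}^{N} \left(z_i - \bar{z}\right)^{2}$$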
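The two extra per-pillar features can be computed and appended to each point as in the sketch below. The density definition used here (point count divided by the maximum number of points per pillar) is an assumption, since the density formula is not reproduced in this text; the z-variance is the population variance of the z coordinates. The feature layout, with z at index 2, is also assumed.

```python
from statistics import pvariance


def pillar_statistics(z_values, max_points):
    """Return (density, z_variance) for one nonempty pillar.

    density = N / max_points is an assumed normalisation, not the paper's formula.
    """
    density = len(z_values) / max_points
    z_variance = pvariance(z_values)
    return density, z_variance


def append_statistics(point_features, max_points):
    """Extend every point's feature list with its pillar's density and z-variance.

    point_features: list of per-point feature lists with z stored at index 2.
    """
    z_values = [f[2] for f in point_features]
    density, z_variance = pillar_statistics(z_values, max_points)
    return [f + [density, z_variance] for f in point_features]


# Usage: a pillar with two 10-D points grows to 12-D per point.
pillar = [
    [0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 3.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
extended = append_statistics(pillar, max_points=32)
```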

The Backbone (2D convolution) consists of two components: one sub-network performs progressive downsampling on the pseudo-image to extract features at different scales, while another network performs upsampling on features extracted from top to bottom, resizing the feature maps to match the original pseudo-image size. This facilitates channel-wise concatenation at a consistent scale. The PointPillars-HSL algorithm continues to adopt the SSD detection head [17], predicting the categories, positions, and orientations of the objects.

2.4. Loss Function and Model Training

Analysis of the object categories in the training dataset shows that mining trucks occur most frequently, while categories such as water trucks, graders, and bulldozers account for much smaller proportions. This class imbalance can bias the final predictions toward mining trucks and reduce the overall detection accuracy. To address it, the algorithm uses Focal Loss [18] as the classification loss. Focal Loss introduces a tuning factor (focusing parameter) that adjusts the weight of each sample: easily classified samples contribute less to the loss, while challenging samples, which are prone to being misclassified, contribute more. The model therefore focuses on learning from difficult samples, which mitigates the impact of class imbalance in object detection. The Focal Loss is as follows:

$$FL(p_t) = -\left(1 - p_t\right)^{\gamma} \log\left(p_t\right)$$

In the equation, $p_t$ represents the probability predicted by the model for the true class, and $\gamma$ is the tuning factor used to control the weight of challenging samples. When $\gamma$ is set to 0, Focal Loss becomes equivalent to the cross-entropy loss function. Increasing the value of $\gamma$ further increases the relative weight of difficult samples.
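The effect of the tuning factor can be checked numerically with a few lines of code. This is a scalar sketch of the focal loss without a class-balancing term; the default `gamma=2.0` and the clamping constant `eps` are assumptions chosen for the example, not settings reported in the paper.

```python
import math


def focal_loss(p_t, gamma=2.0, eps=1e-12):
    """Focal loss for one sample, where p_t is the predicted probability of the true class."""
    p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# With gamma = 0 the modulating factor is 1 and the loss is plain cross-entropy.
cross_entropy_easy = focal_loss(0.9, gamma=0.0)

# With gamma > 0 an easy sample (p_t = 0.9) is down-weighted far more
# strongly than a hard one (p_t = 0.3).
easy_ratio = focal_loss(0.9) / focal_loss(0.9, gamma=0.0)  # (1 - 0.9)^2
hard_ratio = focal_loss(0.3) / focal_loss(0.3, gamma=0.0)  # (1 - 0.3)^2
```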

The loss function includes the position regression loss function, classification loss function, and orientation regression loss function.

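The position regression involves the components (x, y, z, w, l, h, θ). Taking the center point coordinate x as an example, $x^{gt}$ represents the true value of the center point x in the annotated target, while $x^{a}$ represents the x value of the 3D anchor, with $d^{a} = \sqrt{(w^{a})^{2} + (l^{a})^{2}}$. The calculation of the components for the position regression loss is as follows:

$$\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{h^{a}}$$

$$\Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}, \quad \Delta\theta = \sin\left(\theta^{gt} - \theta^{a}\right)$$

$$L_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}\left(\Delta b\right)$$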

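Smooth L1 [19] is defined as follows:

$$\mathrm{SmoothL1}(x) = \begin{cases} 0.5\,x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$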
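The position regression term can be sketched end to end as below. The residual encoding follows the convention commonly used by PointPillars-style detectors (center offsets normalized by the anchor diagonal or height, log size ratios, and the sine of the angle difference); treating that as the encoding used here is an assumption, as are the function names and the tuple layout.

```python
import math


def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear beyond |x| = beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def encode_residuals(gt, anchor):
    """Encode a ground-truth box relative to an anchor.

    Both boxes are (x, y, z, w, l, h, theta) tuples (assumed layout).
    """
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    diagonal = math.hypot(wa, la)  # anchor footprint diagonal
    return (
        (xg - xa) / diagonal,
        (yg - ya) / diagonal,
        (zg - za) / ha,
        math.log(wg / wa),
        math.log(lg / la),
        math.log(hg / ha),
        math.sin(tg - ta),
    )


def localization_loss(gt, anchor):
    """Sum of Smooth L1 over the seven encoded residuals."""
    return sum(smooth_l1(r) for r in encode_residuals(gt, anchor))


# Usage: a box shifted by one anchor diagonal along x.
anchor = (0.0, 0.0, 0.0, 3.0, 4.0, 2.0, 0.0)
shifted = (5.0, 0.0, 0.0, 3.0, 4.0, 2.0, 0.0)
loss = localization_loss(shifted, anchor)
```

Normalizing the center offsets by the anchor size keeps the residuals on a similar scale for small and very large vehicles, which matters when anchors differ as much as they do between urban cars and mining trucks.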

The position regression loss function incorporates the orientation regression loss. If only the raw angle difference were used, a prediction with the correct position but the opposite orientation would produce an excessively high angle loss and distort the overall loss value. The sine function is therefore applied to the angle difference. However, the sine function alone cannot distinguish, and therefore cannot penalize, two predicted angles that are opposite in orientation. The algorithm proposed in this paper introduces an orientation classifier that defines two directions, positive and negative: angles are constrained to the (0, 2π) interval, angles in the range [0, π) are mapped to 0 and angles in the range (π, 2π) are mapped to 1, and the result is represented using one-hot encoding [20]. The softmax and cross-entropy loss functions are then used to compute a second orientation loss value, denoted as $L_{dir}$.
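The ambiguity of the sine term and the role of the direction classifier can be illustrated with a short sketch. The function names are hypothetical, and assigning the boundary angle π to class 1 is an arbitrary choice made here because the text leaves that single value unassigned.

```python
import math


def direction_label(theta):
    """Return 0 for headings in [0, pi) and 1 for [pi, 2*pi), after wrapping."""
    return 0 if (theta % (2 * math.pi)) < math.pi else 1


def direction_loss(logits, theta_gt):
    """Softmax cross-entropy over the two direction classes.

    logits: two raw scores (class 0, class 1) from a direction head.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[direction_label(theta_gt)]


# A prediction flipped by 180 degrees has (numerically) zero sine residual,
# so only the direction classifier can penalise it.
theta_gt = 0.3
theta_flipped = theta_gt + math.pi
sine_residual = math.sin(theta_gt - theta_flipped)
```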

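In summary, the loss function of the entire model is composed as follows:

$$L = \frac{1}{N_{pos}} \left(\beta_{loc} L_{loc} + \beta_{cls} L_{cls} + \beta_{dir} L_{dir}\right)$$

where $N_{pos}$ represents the number of positive samples, $L_{loc}$, $L_{cls}$, and $L_{dir}$ are the position regression, classification, and orientation classification losses, and in training, the values for $\beta_{loc}$, $\beta_{cls}$, and $\beta_{dir}$ are set to 2, 1, and 0.5, respectively.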
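A weighted combination of the three terms might be written as below. The assignment of the weights 2, 1, and 0.5 to the position, classification, and orientation terms in that order follows the usual PointPillars convention and is an assumption; the guard against zero positive samples is also an addition for the sketch.

```python
def total_loss(loc, cls, direction, num_pos,
               beta_loc=2.0, beta_cls=1.0, beta_dir=0.5):
    """Combine the summed per-sample losses, normalised by the positive-anchor count."""
    weighted = beta_loc * loc + beta_cls * cls + beta_dir * direction
    return weighted / max(num_pos, 1)  # avoid division by zero on empty frames


# Usage with made-up loss sums for a batch containing two positive anchors.
value = total_loss(loc=1.0, cls=2.0, direction=4.0, num_pos=2)
```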

The batch size is set according to the GPU configuration of the training platform. A larger batch size feeds more samples into the network at once, so the gradient direction determined at each step better represents the overall dataset, but it also requires more GPU memory. With a fixed dataset size, a larger batch size also means fewer iterations per epoch, so the number of epochs must be increased to achieve more iterations and potentially better results.
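The trade-off between batch size and epoch count is simple arithmetic, sketched below. The dataset size and iteration budget in the usage are hypothetical numbers chosen only to show the relationship.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of optimiser steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def epochs_for_iterations(target_iterations, num_samples, batch_size):
    """Epochs required to reach at least the target number of optimiser steps."""
    return math.ceil(target_iterations / iterations_per_epoch(num_samples, batch_size))


# Usage: the same 10,000-step budget on a hypothetical 1,000-frame dataset.
epochs_small_batch = epochs_for_iterations(10_000, 1_000, batch_size=4)
epochs_large_batch = epochs_for_iterations(10_000, 1_000, batch_size=16)
```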

3. Model Evaluation and Deployment

3.1. Model Optimization Evaluation Results

The final detection results of the PyTorch model are shown in Figure 8.

Figure 8 shows that the model performs well in detecting vehicles of different classes, even when they are occluded or only partially visible. Analysis of the test dataset shows that the model also retains a degree of detection capability in dusty conditions, with high confidence in vehicle detection, which makes it suitable for vehicle detection in complex and dynamic environments such as open-pit mines. Detection results in dusty conditions are shown in Figure 9.

The loss function curve during model training is depicted in Figure 10.

To evaluate the performance of the PointPillars-HSL model with optimized feature preprocessing, a comparison was made against the PointPillars model without pretrained parameters on the same batch of mining area dataset, using identical batch size and learning rate strategy. The detection metrics on the open-pit mine dataset are shown in Table 1.

To verify the effect of the added pillar density features and z-direction point cloud distribution features, an ablation study was designed, as shown in Table 2. Feature Enhancement means adding the density feature and the z-direction point cloud distribution feature to the original 10D features. The model’s mean average precision (mAP) over all classes improves by about 8.13% with the two added feature dimensions.

To verify the impact of pillar grid division on model detection accuracy and speed, an ablation study, as shown in Table 2, was designed, and the detection results were measured on a device equipped with an Nvidia GeForce RTX 3050 Ti using the test dataset. The PointPillars model takes about 98.5 ms (10.15 Hz) to process each sample, and PointPillars-HSL uses a novel pillar grid division method to reduce the detection time by 11.3 ms without significant loss of accuracy.
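The reported per-sample latencies convert to frame rates as follows. The code only restates the arithmetic on the figures given in the text (98.5 ms and an 11.3 ms reduction); the helper name is made up.

```python
def frames_per_second(latency_ms):
    """Convert a per-frame latency in milliseconds to a rate in Hz."""
    return 1000.0 / latency_ms


baseline_ms = 98.5                # PointPillars, as reported
hsl_ms = baseline_ms - 11.3       # PointPillars-HSL after the 11.3 ms reduction

baseline_hz = round(frames_per_second(baseline_ms), 2)
hsl_hz = round(frames_per_second(hsl_ms), 2)
```

On these numbers the new pillar division corresponds to roughly 87.2 ms per sample, or about 11.5 Hz on the same GPU.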

3.2. Model Deployment

The model is deployed on an NVIDIA Tegra Xavier. After training, the PyTorch model file is converted to an ONNX file, and then inference acceleration is achieved using TensorRT to obtain the detection results.

The detection results of the PyTorch model, as shown in Figure 11, are expressed in the coordinate system defined in the OpenPCDet framework: the positive x-axis points in the forward direction, clockwise rotation is positive, and angles lie in the range (−π, π).

In contrast, the reference coordinate system for the deployed inference results, as shown in Figure 12, is consistent with the training annotation data reference system. It uses the rightward direction as a reference, clockwise as positive, and also has an angle range of (−π, π). The two coordinate systems differ by 90°.
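Converting a heading between two conventions that differ by 90° amounts to adding a fixed offset and wrapping the result back into range. In the sketch below, the sign of the offset (−π/2, derived from taking both references as clockwise-positive with "right" lying 90° clockwise of "forward") is an assumption that should be checked against real data, and the function names are hypothetical.

```python
import math


def wrap_angle(theta):
    """Wrap an angle in radians into [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


def to_deployment_heading(theta_forward_ref, offset=-math.pi / 2):
    """Re-express a heading measured from the forward axis relative to the rightward axis.

    offset is the assumed 90-degree difference between the two references.
    """
    return wrap_angle(theta_forward_ref + offset)


# Usage: an object pointing to the right has heading pi/2 from the forward
# axis and heading 0 from the rightward axis under the assumed convention.
right_facing = to_deployment_heading(math.pi / 2)
wrapped = to_deployment_heading(-3 * math.pi / 4)
```

Keeping the wrap step explicit avoids headings that drift outside (−π, π) after the offset is applied, which would otherwise break the direction classification described earlier.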

The inference detection speed on the NVIDIA Tegra Xavier development board is approximately 65 ms/frame.

4. Discussion

In summary, our research is the first to implement a sensing solution based on hybrid solid-state LiDAR in an open-pit mining scene. Due to the low probability of auxiliary vehicles such as sprinklers and command vehicles appearing in open-pit mines, sufficient data are lacking. In the future, with enough data, the detection solution can cover all types of vehicles in the mine. Combined with the traditional point cloud detection and segmentation algorithm, it can detect irregular objects such as stones and walls, taking the autonomous driving of mining trucks one step further.

5. Conclusions

This paper addresses the challenges of environment perception for autonomous driving in mining, which is characterized by harsh operating conditions and complex, ever-changing road environments. We propose the PointPillars-HSL 3D object detection algorithm, which is suited to mining environments and uses a hybrid solid-state LiDAR. By analyzing the hybrid solid-state LiDAR point cloud data and applying pillar-based downsampling and feature optimization, together with data preprocessing and transfer learning, we have effectively addressed the overly dense point clouds of open-pit mines and the significant slopes of unstructured road scenes. Optimizing the loss function has improved the stability of the model in predicting obstacle orientation and category. Furthermore, deploying the trained model on the NVIDIA Tegra Xavier has enabled real-time inference for 3D point cloud object detection.

Data Availability

Data is available upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Conceptualization was done by Cheng Li and Teng Long; software-related task was done by Cheng Li; validation was done by Gang Yao and Peijie Li; investigation was done by Gang Yao; resources were provided by Xiwen Yuan; data curation was done by Gang Yao and Peijie Li; writing—original draft preparation was done by Cheng Li; writing—review and editing was done by Teng Long; visualization was done by Cheng Li and Gang Yao; project administration was done by Xiwen Yuan. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

We sincerely appreciate CRRC Zhuzhou Institute Co., Ltd. for providing valuable experimental opportunities for this study. We also deeply value the significant contributions of our esteemed reviewers, whose prompt and insightful evaluations have enhanced the scientific value of the research. This research was funded by the National Key Research and Development Program of China, grant number 2022YFB4300405.