Abstract

Deep learning has recently received extensive attention in the field of rolling-bearing fault diagnosis owing to its powerful feature expression capability. With the help of deep learning, the deep features hidden in the data can be fully extracted, significantly improving the accuracy and efficiency of fault diagnosis. Despite this progress, deep learning still faces two outstanding problems. (1) Each layer uses the same convolution kernel to extract features, so the network cannot adaptively select convolution kernels based on the features of the input image, which limits its adaptability to different inputs and leads to weak feature extraction. (2) The models have a large number of parameters and require long training times. To solve these problems, this paper proposes an integrated deep neural network, named SIR-CNN, that combines an improved selective kernel network (SKNet) with an enhanced Inception-ResNet-v2. First, a new three-branch SKNet is designed based on SKNet. Second, the new SKNet is embedded into a depthwise separable convolution network so that the model can adaptively select convolution kernels of different sizes during training. Furthermore, the convolution structure in the Inception-ResNet-v2 network is replaced by the improved depthwise separable convolution network to achieve effective feature extraction. Finally, time-frequency maps of the raw vibration signals are obtained through the short-time Fourier transform (STFT) and fed to the proposed SIR-CNN for experiments. The experimental results show that the proposed SIR-CNN achieves superior performance compared to other methods.

1. Introduction

Rolling bearings are widely used in nuclear energy, wind power, aerospace, petrochemicals, electric power, and other industrial fields [1]. With advantages such as high precision, good interchangeability, and low price, rolling bearings are core components of mechanical equipment, especially rotating machinery [2]. However, as the “joint” between the rotating and fixed parts, rolling bearings inevitably experience changing operating conditions, owing to the long-term effects of high temperature, high speed, and a variety of alternating loads [3]. In addition, owing to processing errors, poor lubrication, thermal fatigue, wear, and other factors, faults easily develop in the rolling element, inner race, outer race, and cage of the bearing [4]. According to statistics, 45% to 55% of mechanical failures are caused by bearing failure [5]. When a rolling bearing fails, it affects the normal operation of other parts of the equipment, triggering a chain of damage that can lead to machine destruction and even casualties. Therefore, the fault diagnosis of rolling bearings is of great significance [6, 7].

Traditional bearing fault diagnosis methods include noise analysis, acoustic diagnosis, temperature measurements, oil film resistance diagnosis, and vibration signal analysis [8]. Among them, diagnostic technology based on vibration signal analysis is applicable to rolling bearings under various working conditions, which has the advantages of obvious fault characteristics and high diagnostic accuracy [9]. Therefore, it has been widely used for bearing fault diagnoses. More than 80% of the existing literature on bearing fault diagnosis employs vibration signal-analysis methods. This method usually uses manual approaches such as the fast Fourier transform (FFT) [10], wavelet transform (WT) [11], and empirical mode decomposition (EMD) [12] to extract signal features and then uses a support vector machine (SVM) [13], K-nearest neighbor (KNN) [14], and BP neural network (BPNN) [15] to obtain diagnostic results. However, these feature extraction methods rely on expert experience and knowledge, which can easily introduce artificial errors and have poor generalization ability. In addition, it is difficult to establish complex data-mapping mechanisms for these shallow models; therefore, they have certain limitations.

In recent years, deep learning has been widely used for fault diagnosis [16, 17]. Compared to traditional bearing fault diagnosis methods, deep learning does not require a manual feature extraction process. Instead, it automatically extracts representative information and sensitive features from the original data, and it has a strong representation learning ability and excellent recognition performance. Therefore, it has several advantages in the field of fault diagnosis. Representative approaches include the deep discriminative transfer learning network (DDTLN) [18], maximum mean square discrepancy (MMSD) [19], relationship transfer domain generalization network (RTDGN) [20], convolutional neural network (CNN) [21], recurrent neural network (RNN) [22], generative adversarial network (GAN) [23], autoencoder (AE) [24], and ResNet [25]. Among them, the CNN, as a typical deep learning model, has a strong feature extraction ability due to its local connections, weight sharing, pooling operations, and other characteristics; therefore, it has received extensive attention from researchers [9, 16, 26, 27].

Generally, to improve the performance of a CNN, it is necessary to stack convolutional layers to obtain deeper convolutional networks. However, as the number of network layers increases, the number of parameters increases dramatically, and the computing resources required are considerable. Worse still, when the number of layers is too large, the model is prone to vanishing gradients and overfitting during backpropagation, which makes it difficult for the model to converge and leads to a decline in identification accuracy [28].

To solve the above problems, Google proposed Inception-ResNet [29], which not only solved the problems of gradient vanishing and loss value increasing but also deepened the network and achieved higher recognition accuracy, receiving widespread attention from experts. For example, Liu et al. [30] achieved transfer learning based on the Inception-ResNet-v2 model by converting raw data into RGB images. Li et al. [31] proposed a bearing fault diagnosis method that combined fault signal spectrum images with the Inception-ResNet-v2 model, achieving good classification accuracy. To address the issue of important feature loss, Jigyasu et al. [32] developed a two-dimensional (2D) image dataset using time-frequency methods and utilized the Inception-ResNet-v2 model for effective feature extraction. Deveci et al. [33] compared commonly used time-frequency images in bearing fault detection and analyzed which time-frequency methods can more clearly display fault features. Peng et al. [34] replaced the convolution module in Inception-ResNet-v2 with depthwise separable convolution to extract fault features under different receptive fields. Zheng [35] improved the feature extraction layer in the Inception-ResNet-v2 structure, improving the detection accuracy of the network for small-scale targets. Liu et al. [36] proposed a transfer learning method based on Inception-ResNet-v2. By studying different methods for converting one-dimensional (1D) signals into 2D graphics, the best method for structural health monitoring was found. Song et al. [37] reduced the computational complexity of the model by segmenting the input image and replacing the convolution module in Inception-ResNet with depthwise separable convolution. Das et al. [38] introduced a multimodel-integrated network based on Inception-ResNet-v2, which achieved high accuracy. Kasireddy et al. [39] developed a binary classification model based on Inception-ResNet-v2 and a small Inception-ResNet-v2 model. 
Meel and Kumar Vishwakarma [40] proposed a multimodal fusion model based on Inception-ResNet-v2, which achieved high recognition accuracy through multiple fusions in the early and late stages.

However, the abovementioned Inception-ResNet networks still have the following problems: (1) they are unable to adaptively select convolution kernels according to image features; (2) the Inception-ResNet model has a large number of parameters and a long training time. To solve these problems, this study proposes an integrated deep neural network that combines an improved selective kernel network (SKNet) with an enhanced Inception-ResNet-v2. The main contributions of this study are as follows.
(1) Considering that Inception-ResNet-v2 usually has three different sizes of convolutional kernels, such as 3 × 3, 5 × 5, and 7 × 7, a new three-branch SKNet (NewSKNet) is designed so that the network can adaptively select the important features extracted by convolutional kernels of different sizes during the training process;
(2) NewSKNet is embedded into a depthwise separable convolution network. Then, the improved depthwise separable convolution network is used to replace the convolutional structure in the Inception-ResNet-v2 network to reduce the parameters and thus shorten the training time of the network;
(3) A new intelligent bearing fault diagnosis framework built on NewSKNet and the enhanced Inception-ResNet-v2 is proposed.

The remainder of this paper is arranged as follows: Section 2 briefly introduces the main theoretical background, including the short-time Fourier transform (STFT), Inception-ResNet-v2 network, and SKNet; Section 3 describes the proposed method in detail; in Section 4, two datasets are used for experimental verification; Section 5 discusses the proposed method. Finally, the results and future work are summarized in Section 6.

2. Theoretical Background

2.1. Short-Time Fourier Transform (STFT)

The STFT can obtain both time and frequency domain features of bearing vibration signals, and the transformed 2D matrix is more suitable for CNN processing [41]. The essence of the STFT is a Fourier transform with a window: the signal is multiplied by a window function $h(t)$ and is assumed to be stationary during the short interval of the analysis window. Subsequently, $h(t)$ is shifted along the time axis to obtain the spectrum over the entire time domain. The STFT is calculated as follows:

$$\mathrm{STFT}(t, f) = \int_{-\infty}^{+\infty} x(\tau)\, h(\tau - t)\, e^{-j 2 \pi f \tau}\, d\tau \tag{1}$$

where $x(\tau)$ is the input time domain signal and $h(\tau - t)$ is the analysis window function. It can be seen that the STFT is the Fourier transform of the input signal $x(\tau)$ multiplied by a window function $h(\tau - t)$. The variables $T$ and $F$ are the time and frequency resolutions, respectively, and the formula is as follows:

$$T = \left[ \frac{N_x - N_o}{N_w - N_o} \right], \qquad F = \left[ \frac{N_{fft}}{2} \right] + 1 \tag{2}$$

where $N_x$ is the sample length, $N_w$ is the window width, $N_o$ is the window overlap width, $N_{fft}$ is the number of points participating in the Fourier transform, and $[\cdot]$ is the rounding down function.
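As an illustration of how the sample length, window width, and overlap determine the size of the time-frequency map, the following minimal sketch computes a naive one-sided STFT magnitude in pure Python. The function names, the parameter names (`n_x`, `n_w`, `n_o`, `n_fft`), the Hamming window, and the 100 Hz test tone are assumptions made for this example; they are not the settings or implementation used in the experiments.

```python
import cmath
import math


def stft_map_shape(n_x, n_w, n_o, n_fft):
    """Return (frequency bins, time frames) of a one-sided STFT magnitude map."""
    n_frames = (n_x - n_o) // (n_w - n_o)  # rounded down
    n_bins = n_fft // 2 + 1
    return n_bins, n_frames


def hamming(n_w):
    """Symmetric Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_w - 1)) for i in range(n_w)]


def stft_magnitude(x, n_w, n_o, n_fft):
    """Naive STFT magnitude; rows are frequency bins, columns are time frames."""
    window = hamming(n_w)
    hop = n_w - n_o
    n_bins, n_frames = stft_map_shape(len(x), n_w, n_o, n_fft)
    spec = [[0.0] * n_frames for _ in range(n_bins)]
    for t in range(n_frames):
        # Windowed segment starting at sample t * hop.
        frame = [x[t * hop + i] * window[i] for i in range(n_w)]
        for k in range(n_bins):
            acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                      for i in range(n_w))
            spec[k][t] = abs(acc)
    return spec


# Hypothetical example: a 100 Hz tone sampled at 1024 Hz for 0.5 s.
fs = 1024
signal = [math.sin(2 * math.pi * 100 * n / fs) for n in range(512)]
tf_map = stft_magnitude(signal, n_w=64, n_o=32, n_fft=64)
print(len(tf_map), len(tf_map[0]))  # prints: 33 15
```

In this toy setting, a longer window would give finer frequency bins but fewer time frames, which is the trade-off that the choice of sample length and window width has to balance.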

As can be seen from equation (2), the sample length and window width affect the resolution in both the time and frequency domains, thus affecting the transformation effect of the STFT. Generally, the sample length is selected based on the first harmonic of the fault characteristic frequency and the window width of the STFT is selected based on the second harmonic. The fault characteristic frequency can be calculated using the following formula:

$$f_i = \frac{Z}{2} f_r \left( 1 + \frac{d}{D} \cos\alpha \right), \quad f_o = \frac{Z}{2} f_r \left( 1 - \frac{d}{D} \cos\alpha \right), \quad f_b = \frac{D}{2d} f_r \left[ 1 - \left( \frac{d}{D} \cos\alpha \right)^2 \right], \quad f_r = \frac{n}{60} \tag{3}$$

where $f_i$ and $f_o$ are the frequencies of the rolling element passing through the inner and outer races, respectively; $f_b$ is the rotation frequency of the rolling element; $f_r$ is the rotation frequency of the inner race; $n$ is the rotational speed (r/min); $d$ and $D$ are the rolling element diameter and pitch circle diameter, respectively; $Z$ is the number of balls; and $\alpha$ is the contact angle.
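The fault characteristic frequencies can be evaluated with the standard kinematic formulas for a bearing with a rotating inner race and a fixed outer race. The sketch below is only illustrative: the function name and dictionary keys are chosen for this example, and the bearing geometry in the usage lines is hypothetical (it is not the geometry of the bearings tested in this paper).

```python
import math


def bearing_fault_frequencies(speed_rpm, n_balls, d_ball, d_pitch, contact_angle_deg=0.0):
    """Return the characteristic frequencies (Hz) of a rolling bearing.

    Assumes the inner race rotates with the shaft and the outer race is fixed.
    d_ball and d_pitch only need to share the same length unit.
    """
    f_r = speed_rpm / 60.0  # rotation frequency of the inner race
    ratio = d_ball / d_pitch * math.cos(math.radians(contact_angle_deg))
    return {
        "f_r": f_r,
        "f_inner": 0.5 * n_balls * f_r * (1 + ratio),  # ball pass frequency, inner race
        "f_outer": 0.5 * n_balls * f_r * (1 - ratio),  # ball pass frequency, outer race
        "f_ball": 0.5 * (d_pitch / d_ball) * f_r * (1 - ratio ** 2),  # ball spin frequency
    }


# Hypothetical geometry: 8 balls, 15 mm ball diameter, 65 mm pitch diameter, 3000 r/min.
example = bearing_fault_frequencies(speed_rpm=3000, n_balls=8, d_ball=15.0, d_pitch=65.0)
for name, value in example.items():
    print(f"{name}: {value:.2f} Hz")
```

A quick sanity check on any result is that the inner-race and outer-race pass frequencies always sum to the number of balls times the rotation frequency.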

2.2. Inception-ResNet-v2

In 2016, Google introduced ResNet to the Inception network, thus proposing the Inception-ResNet-v2. The inception module extracts multiscale features from different receptive fields using convolution kernels of different sizes in parallel computing. The introduction of the ResNet structure can avoid overfitting and network degradation problems caused by deepening of the model layers.

This study is based on the Inception-ResNet-v2 model, which is mainly composed of Stem, Inception-ResNet, Reduction, Average pooling, Dropout, SoftMax, and other modules, as shown in Figure 1. As can be seen, the original pooling operation inside the Inception module is replaced with a residual connection, which constitutes a new Inception-ResNet module. The Inception-ResNet module comes in three types: A, B, and C. Their structures are similar; however, the size and number of convolution kernels differ. The Stem module preprocesses the input data before the deeper layers of the network. The Reduction module changes the size of the feature map to prevent bottlenecks. For the detailed structure of the Inception-ResNet modules, refer to [29].

2.3. Selective Kernel Network (SKNet)

In recent years, studies on the mechanism of animal visual nerve action have found that when cats look at objects of different sizes and distances, the size of the receptive field of their visual layer neurons is not fixed but automatically adjusts with the size of the stimulus. Therefore, when constructing a CNN, the size of the convolution kernel should differ for different stimuli. However, existing CNN models generally employ only one type of convolutional kernel in the same layer and rarely consider the role of multiple convolutional kernels.

To solve this problem, in 2019, Li et al. proposed SKNet [42], as shown in Figure 2. The network consists of three main steps: Split, Fuse, and Select. In the Split stage, the input image is convolved by two kinds of convolution kernels, 3 × 3 and 5 × 5; in the Fuse stage, the features calculated in the Split stage are combined by element-wise summation and compressed into a compact channel descriptor; finally, in the Select stage, SoftMax weights computed from this descriptor are used to aggregate the results of the different convolution kernels into the new feature map.

3. Proposed Method

3.1. A New Selective Kernel Network (NewSKNet)

The Inception model can adapt to images of different scales by using multiple convolution kernels in parallel; however, the convolution kernels of each layer are given the same weight. In contrast, the branches of SKNet differ in both kernel size and learned weight, and SKNet can be easily embedded into other deep learning models. Therefore, based on SKNet, we designed a new three-branch lightweight embedded module called NewSKNet, as shown in Figure 3.

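For a given feature map $X$, three different convolution kernels, $3 \times 3$, $5 \times 5$, and $7 \times 7$, are used for the calculation, and three feature maps, namely, $U_1$, $U_2$, and $U_3$, are obtained. To enable the model to adjust the size of the local receptive field according to the size of the input feature map, a “gate” is used to control the information passing through the three convolution kernels. To achieve this goal, element-wise summation is used to fuse the results of the three convolution kernels:

$$U = U_1 + U_2 + U_3$$

The statistical information of different channels is then obtained by global average pooling (GAP), where the statistical information of channel $c$ is

$$s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)$$

where $H$ and $W$ are the height and width of $U$.

Furthermore, a compact feature map $Z$ is generated through a fully connected (FC) layer so that the parameters in the network can be reduced, thus improving computational efficiency:

$$Z = \delta\left( \mathcal{B}\left( W_{fc}\, s \right) \right)$$

where $\delta$ is the ReLU activation function, $\mathcal{B}$ denotes the batch normalization, and $W_{fc}$ is the weight of the FC layer. Subsequently, $Z$ is divided into three branches using the SoftMax function:

$$\alpha_c = \frac{e^{A_c Z}}{e^{A_c Z} + e^{B_c Z} + e^{C_c Z}}, \quad \beta_c = \frac{e^{B_c Z}}{e^{A_c Z} + e^{B_c Z} + e^{C_c Z}}, \quad \gamma_c = \frac{e^{C_c Z}}{e^{A_c Z} + e^{B_c Z} + e^{C_c Z}}$$

where $\alpha$, $\beta$, and $\gamma$ denote the convolution kernel weights of $U_1$, $U_2$, and $U_3$, respectively, with $\alpha_c + \beta_c + \gamma_c = 1$, and $A_c$, $B_c$, and $C_c$ are the $c$-th rows of the projection matrices of the three branches. We then obtain $V_1$, $V_2$, and $V_3$ through an element-wise product with the convolution kernel weights:

$$V_1 = \alpha \cdot U_1, \quad V_2 = \beta \cdot U_2, \quad V_3 = \gamma \cdot U_3$$

Finally, the output feature map $V$ can be obtained by summation:

$$V = V_1 + V_2 + V_3$$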
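The fuse and select steps described above can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the trained module: the three branch outputs are taken as given (no convolutions are performed), batch normalization is omitted, the weights are random, and the names `new_sk_select`, `w_fc`, and `w_branch` are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def new_sk_select(branches, w_fc, w_branch):
    """Fuse branch feature maps and re-weight them per channel.

    branches: list of feature maps, each indexed [channel][row][col].
    w_fc:     d x C matrix of the compact FC layer.
    w_branch: one C x d matrix per branch, producing per-channel logits.
    """
    n_ch = len(branches[0])
    height, width = len(branches[0][0]), len(branches[0][0][0])
    # Fuse: element-wise summation of the branch outputs.
    fused = [[[sum(b[c][i][j] for b in branches) for j in range(width)]
              for i in range(height)] for c in range(n_ch)]
    # Global average pooling gives one statistic per channel.
    s = [sum(map(sum, fused[c])) / (height * width) for c in range(n_ch)]
    # Compact descriptor: FC + ReLU (batch normalization omitted in this sketch).
    z = [max(0.0, v) for v in matvec(w_fc, s)]
    # One logit vector per branch, then SoftMax across branches for every channel.
    logits = [matvec(w, z) for w in w_branch]
    weights = []
    for c in range(n_ch):
        peak = max(branch_logits[c] for branch_logits in logits)
        exps = [math.exp(branch_logits[c] - peak) for branch_logits in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    # Select: weight each branch per channel and sum the weighted maps.
    out = [[[sum(weights[c][k] * branches[k][c][i][j] for k in range(len(branches)))
             for j in range(width)] for i in range(height)] for c in range(n_ch)]
    return out, weights


# Hypothetical toy sizes: 3 branches, 4 channels, 3 x 3 maps, descriptor length 2.
rng = random.Random(0)
n_ch, d, size = 4, 2, 3
branches = [[[[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]
             for _ in range(n_ch)] for _ in range(3)]
w_fc = [[rng.uniform(-1, 1) for _ in range(n_ch)] for _ in range(d)]
w_branch = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_ch)]
            for _ in range(3)]
output, weights = new_sk_select(branches, w_fc, w_branch)
print([round(sum(w), 6) for w in weights])  # each channel's branch weights sum to 1
```

Because the branch weights of every channel sum to one, the output stays on the same scale as the branch feature maps; the module only redistributes emphasis among the kernel sizes.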

3.2. Depthwise Separable Convolution with Embedded NewSKNet

Depthwise separable convolution is a lightweight network that includes depthwise convolution and pointwise convolution [43]. In depthwise convolution, each convolution kernel is responsible for one channel; that is, each channel is calculated using only one convolution kernel. Because this operation is an independent convolution operation for each channel, it does not effectively use the feature information of different channels at the same spatial location. Therefore, pointwise convolution is typically used after depthwise convolution. Specifically, it was used to combine the obtained feature maps again to generate a new feature map.
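To make the two stages concrete, the following sketch implements a naive depthwise convolution followed by a pointwise (1 × 1) convolution on nested lists. It is a didactic example only, assuming stride 1, no padding, and no bias; the function names and the toy input are hypothetical, and a practical model would rely on a deep learning framework instead.

```python
def depthwise_conv(x, kernels):
    """Convolve each channel with its own kernel (stride 1, no padding).

    x:       input indexed [channel][row][col].
    kernels: one square kernel per channel, indexed [channel][u][v].
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        height = len(channel) - k + 1
        width = len(channel[0]) - k + 1
        out.append([[sum(kernel[u][v] * channel[i + u][j + v]
                         for u in range(k) for v in range(k))
                     for j in range(width)] for i in range(height)])
    return out


def pointwise_conv(x, weights):
    """Mix channels at every spatial position with 1 x 1 kernels.

    weights: indexed [output channel][input channel].
    """
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(w_out[c] * x[c][i][j] for c in range(len(x)))
              for j in range(width)] for i in range(height)]
            for w_out in weights]


# Toy input: 2 channels of 3 x 3 ones, 2 x 2 all-ones depthwise kernels.
x = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
dw_kernels = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]
pw_weights = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]  # 3 output channels

depthwise_out = depthwise_conv(x, dw_kernels)
separable_out = pointwise_conv(depthwise_out, pw_weights)
print(separable_out[0][0][0], separable_out[1][0][0], separable_out[2][0][0])
# prints: 8.0 0.0 2.0
```

The depthwise stage never mixes channels, so the second output channel (weights 1 and -1) can only cancel the two inputs once the pointwise stage combines them.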

In this study, a depthwise separable convolutional network with an embedded NewSKNet is proposed, as shown in Figure 4. First, the features of the input image are extracted using 3 × 3, 5 × 5, and 7 × 7 convolution kernels. Then, three 1 × 1 convolution kernels are used to merge the extracted features. Finally, the NewSKNet proposed in Section 3.1 is embedded into the model to further extract significant features. Specifically, after the depthwise separable convolution network obtains the feature map, NewSKNet applies three different convolution kernels to it, thereby obtaining three feature maps. Then, the weighted probability of each feature map is calculated using the SoftMax function. Finally, the weight with the highest probability ranking is selected and multiplied with the corresponding feature map to obtain the final feature map. Thus, after the NewSKNet operation, feature maps with more informative features can be selected for improved fault classification.

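In a traditional CNN, a certain relationship exists between the number of parameters in the network and the number of feature maps. A standard convolution can be calculated as follows:

$$O_{i,j,n} = \sum_{u=1}^{k} \sum_{v=1}^{k} \sum_{m=1}^{M} K_{u,v,m,n} \cdot I_{s(i-1)+u,\; s(j-1)+v,\; m}$$

where $K$ is the convolution kernel; $k$ and $M$ represent the size of the convolution kernel and the number of channels; $I$ represents the input feature map with dimensions $H \times W \times M$; $O$ represents the output feature map with dimensions $H' \times W' \times N$; $s$ represents the step length; and $N$ is the number of output channels. Therefore, the number of parameters in the network is $k \times k \times M \times N$.

Depthwise convolution is calculated as follows:

$$\hat{O}_{i,j,m} = \sum_{u=1}^{k} \sum_{v=1}^{k} \hat{K}_{u,v,m} \cdot I_{s(i-1)+u,\; s(j-1)+v,\; m}$$

Pointwise convolution is calculated as follows:

$$O_{i,j,n} = \sum_{m=1}^{M} \tilde{K}_{m,n} \cdot \hat{O}_{i,j,m}$$

where $\tilde{K}$ is the convolution kernel; its size is $1 \times 1 \times M \times N$; and $N$ is the number of output channels. The number of depthwise convolution parameters is $k \times k \times M$ and the number of pointwise convolution parameters is $M \times N$, so the number of depthwise separable convolution parameters is $k \times k \times M + M \times N$.

Therefore, the ratio of the number of parameters between depthwise separable convolution and traditional convolution is as follows:

$$\frac{k \times k \times M + M \times N}{k \times k \times M \times N} = \frac{1}{N} + \frac{1}{k^2} \tag{14}$$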

From equation (14), it can be seen that the parameter ratio of depthwise separable convolution depends on the number of output channels and the size of the convolution kernel used. For example, when the convolutional kernel size is 3 × 3 and the number of output channels is 64, the ratio is 1/64 + 1/9 ≈ 0.127, so depthwise separable convolution reduces the number of parameters by a factor of about 8 compared to traditional convolution.
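The parameter comparison can be checked numerically with a short script. This is a minimal sketch that counts weights only (biases are ignored); the function names are illustrative, and the number of input channels (32) is a hypothetical value, since the ratio does not depend on it.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def param_ratio(k, c_in, c_out):
    """Depthwise separable parameters divided by standard convolution parameters."""
    return depthwise_separable_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)


# 3 x 3 kernel, 64 output channels, hypothetical 32 input channels.
k, c_in, c_out = 3, 32, 64
print(standard_conv_params(k, c_in, c_out))        # prints: 18432
print(depthwise_separable_params(k, c_in, c_out))  # prints: 2336
print(round(param_ratio(k, c_in, c_out), 4))       # prints: 0.1267
print(round(1 / param_ratio(k, c_in, c_out), 2))   # prints: 7.89
```

Changing `c_in` leaves the ratio unchanged, while larger kernels or more output channels make the saving larger.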

3.3. The Proposed Fault Diagnosis Framework Based on SIR-CNN

As noted in Section 2.3, when the cat’s visual nerve is stimulated, the size of its receptive field is not fixed but automatically adjusts with the change in stimulation. In contrast, in CNN models constructed by imitating the visual features of cats, the convolution kernels of each layer are given the same weight, and the convolution kernel cannot be adaptively selected according to the features of the input image. Therefore, this study designed a novel bearing fault diagnosis method based on the proposed SIR-CNN. The core of the diagnosis method is to design a new three-branch SKNet and embed it into a depthwise separable convolution network. The convolution structure in Inception-ResNet-v2 is then replaced by the improved depthwise separable network.

The model can adaptively select a feature map with more informative features during training and extract the multilayer sensitive features in the input image. The established fault diagnosis method and its application process are shown in Figure 5 and Table 1. The implementation steps are summarized in Figure 6 and Table 2.

4. Experimental Validation

To validate the effectiveness of the proposed fault diagnosis method based on SIR-CNN, this section uses a comprehensive fault simulation test bench and the XJTU-SY bearing dataset to conduct experiments. All experiments were implemented on a PC with a Win 10 operating system, Core i7 CPU, 2.9 GHz, 16 GB RAM, and RTX2060 GPU. The software used in the experiments was Python 3.7, TensorFlow 2.3.

4.1. Case 1: Laboratory-Measured Dataset
4.1.1. Experiment Description and Data Acquisition

The HZXT-DS-001 comprehensive fault simulation test bench can be used to simulate various fault types of rolling bearings. As shown in Figure 7, the test bench is mainly composed of a three-phase motor, acceleration sensor, eddy current sensor, shaft, and coupling. In this experiment, electrical discharge machining (EDM) was used to machine faults in different parts of a rolling bearing (NSK6308). Two radial (X, Y) and one axial (Z) vibration signals of the bearing were measured by three accelerometers. The sampling rate was 8192 Hz and the sampling time was 10 s. In this way, five common health conditions of rolling bearings were simulated: normal (NL), rolling element failure (RF), inner race failure (IF), outer race failure (OF), and cage failure (CF). Figure 8 shows photographs of the four types of failed bearings.

During the experiment, the motor drove the rotor at 2600 r/min, 2800 r/min, 3000 r/min, and 3200 r/min, and the data acquisition system (shown in Figure 9) collected the vibration signals of the bearing in different states. The resulting datasets are denoted A, B, C, and D, respectively, as shown in Table 3. Each speed includes a rolling element fault signal, cage fault signal, inner-race fault signal, normal signal, and outer-race fault signal. Therefore, each dataset contains five different types of data samples, and their time-domain and frequency-domain diagrams are shown in Figure 10.

To fully utilize the powerful ability of the CNN model in image processing, the 1D bearing vibration signal is converted into a 2D time-frequency diagram through STFT, as shown in Figure 11. Then, the obtained time-frequency graph was divided into training, validation, and testing sets. The training and validation sets were used to train the proposed SIR-CNN and the testing set was used to verify the trained model.

4.1.2. Diagnostic Results and Analysis

To explain the classification effect of the rolling bearing fault diagnosis method based on SIR-CNN in more detail, the four datasets in Table 3 were used for the experiments, and the confusion matrix was used to display the experimental results. The abscissa of the confusion matrix represents the predicted category label and the ordinate represents the real category label. The number on the diagonal line indicates the number of samples correctly classified for each sample type. The larger the number, the better is the classification effect of the model for this type of sample. The number outside the diagonal indicates the number of samples of one type incorrectly identified as another. Figure 12 shows the confusion matrix of the proposed method for the four datasets. The diagnostic accuracies were 99.6%, 99.8%, 100%, and 99.2%, with the average diagnostic accuracy of the four datasets at 99.65%. This shows that the proposed method has a strong feature learning ability on the four datasets and achieves a high classification accuracy.
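For readers who want to reproduce this kind of summary, the sketch below builds a confusion matrix with true labels on the rows and predicted labels on the columns, and derives the accuracy from its diagonal. The label sequences are made-up toy data, and the function names are illustrative; only the four accuracy values in the last lines are taken from the text above.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for three classes (not experimental data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)                      # prints: [[2, 0, 0], [0, 1, 1], [0, 0, 2]]
print(round(accuracy(cm), 4))  # prints: 0.8333

# Average of the four dataset accuracies reported in the text.
reported = [99.6, 99.8, 100.0, 99.2]
mean_accuracy = sum(reported) / len(reported)
print(round(mean_accuracy, 2))  # prints: 99.65
```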

4.1.3. Performance Comparison of Different Methods

To further validate the advantages of the proposed method, shallow models, a conventional CNN model, and deep models based on Inception were selected for comparative experiments.
(1) Comparison with shallow models. First, 15 features of the vibration signal are extracted [44] and then input into the SVM and BPNN for fault identification. The SVM uses a Gaussian radial basis function (RBF) kernel with C = 10 and gamma = 0.015. The hidden layer structure of the BPNN is (32, 16), and the activation function is ReLU.
(2) Comparison with the conventional CNN. The network is built from convolutional layers (each defined by its kernel size and number of kernels n), pooling layers (each defined by its pooling size and quantity k), and a fully connected (FC) layer.
(3) Comparison with the deep models based on Inception. In this experiment, Inception-v4, Inception-ResNet-v2, Inception-ResNet-v2+SKNet, and Inception-ResNet-v2+NewSKNet were used.
The sample data in Table 3 are input into each model for iterative training. To reduce the impact of randomness, each experiment was conducted ten times.

As shown in Figure 13, the diagnosis accuracy of the shallow networks (BPNN and SVM) is low. This is because the BPNN and SVM need to manually extract features before training. However, the method of manually extracting features will miss some important information, which will affect the recognition accuracy. Correspondingly, several other methods have obtained high diagnostic accuracy, among which the proposed SIR-CNN has the best classification performance. This is owing to the improved SKNet embedded in the SIR-CNN model, which enables the model to automatically extract more important features during training.

To verify the antinoise performance of the SIR-CNN, Gaussian white noise with different signal-to-noise ratios (SNR) was added to the bearing vibration signal to simulate the actual working state of the bearing. The SNR is defined as follows:

$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \left( \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \right)$$

where $P_{\mathrm{signal}}$ and $P_{\mathrm{noise}}$ represent the powers of the original vibration and noise signals, respectively. The smaller the SNR value, the stronger the noise interference. 70% of the dataset with added noise was divided into training samples, 20% into validation samples, and the remaining 10% as testing samples.
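One common way to generate such noisy signals is to scale zero-mean Gaussian noise so that its power matches the target SNR. The sketch below illustrates this under stated assumptions: the sine wave stands in for a vibration signal, the 5 dB target is an arbitrary example value, and the function names are hypothetical.

```python
import math
import random


def signal_power(x):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in x) / len(x)


def add_white_noise(x, snr_db, rng):
    """Return x plus Gaussian white noise scaled to the requested SNR (dB)."""
    noise_power = signal_power(x) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in x]


def measured_snr_db(clean, noisy):
    """SNR (dB) estimated from a clean signal and its noisy version."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(signal_power(clean) / signal_power(noise))


# Stand-in signal and an arbitrary target SNR of 5 dB.
rng = random.Random(42)
clean = [math.sin(0.01 * n) for n in range(20000)]
noisy = add_white_noise(clean, snr_db=5.0, rng=rng)
print(round(measured_snr_db(clean, noisy), 1))
```

The measured value fluctuates slightly around the target because the noise power is estimated from a finite number of samples.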

Figure 14 shows the average diagnostic accuracy of the various methods under different SNRs. As the noise increases, the fault diagnosis accuracy of all methods decreases correspondingly. The influence of random noise on the SIR-CNN model is relatively small, and its antinoise performance remains stable. The shallow models (BPNN and SVM) are more susceptible to noise interference, and their diagnostic accuracy is lowest under the strongest noise. The other methods are also affected by random noise. In conclusion, compared with the other diagnostic methods, the SIR-CNN has the best antinoise performance.

4.1.4. Comparison with Other Preprocessing Methods

In order to obtain the optimal STFT, four different types of window functions and window widths were used for comparative analysis, and the experimental results are shown in Figure 15. It can be seen that when the window function is Hamming window and the window width is 64, the recognition accuracy is the highest, reaching 99.91%.

Two data preprocessing methods, Wigner–Ville distribution (WVD) and continuous wavelet transform (CWT), are used to compare with STFT to verify the superiority of STFT-based SIR-CNN. As shown in Figures 11 and 16, the time-frequency map obtained by STFT contains the least amount of information compared with WVD and CWT, and some important fault features may be lost. However, it reduces invalid noise and interference to some extent, and the differences between fault types are more obvious. Correspondingly, the time-frequency map obtained by CWT retains more time-frequency information and therefore is more susceptible to noise [42]. The time-frequency map generated by WVD is of low quality, with low similarity to the original samples, and it is difficult to distinguish between fault types, which increases the difficulty of the model in feature extraction.

Figure 17 shows the test results of the three preprocessing methods under different SNRs. Both CWT-SIRCNN and STFT-SIRCNN achieved high recognition accuracy when the noise interference was small. However, as the noise interference increased, the recognition accuracy of CWT-SIRCNN decreased significantly, whereas the impact of random noise on STFT-SIRCNN was small and its noise resistance remained stable. Although WVD-SIRCNN is not easily affected by noise interference, its overall recognition accuracy is relatively low.

4.2. Case 2: The XJTU-SY Bearing Dataset
4.2.1. Experimental Setup

To further validate the effectiveness of the SIR-CNN algorithm and its fault diagnosis framework proposed in this paper, this section conducted experiments using the XJTU-SY bearing dataset [45] provided by Xi’an Jiaotong University (XJTU) and the Changxing Sumyoung Technology Co., Ltd. (SY). As shown in Figure 18, the test rig consists of an AC motor, motor speed controller, support bearing, horizontal accelerometer, vertical accelerometer, and tested bearing. A total of three types of working conditions were designed for the test: (1) 2100 rpm with a radial force of 12 kN; (2) 2250 rpm with a radial force of 11 kN; and (3) 2400 rpm with a radial force of 10 kN. Five bearings were used for each working condition for the test, and pictures of the bearings with typical types of failure are given in Figure 19, where it can be seen that the causes of failure of the test bearings include inner race wear, outer race wear, cage fracture, and outer race fracture. In the test, two accelerometers were used to collect the horizontal and vertical vibration signals of the test bearings, respectively, with a sampling frequency of 25.6 kHz, a sampling interval of 1 min, and a sampling duration of 1.28 s each time.

In this experiment, the outer race dataset is selected from working condition 1, the inner race and cage datasets are selected from working condition 2, and the composite fault dataset (outer race, inner race, rolling element, and cage) is selected from working condition 3. Together they constitute a dataset of four fault types, namely, IR, OR, Cage, and IBCO, as shown in Table 4.

Similarly, in order to fully utilize the powerful image processing capabilities of CNN models, 1D bearing vibration signals were converted into 2D time-frequency maps through STFT, as shown in Figure 20. Then, the obtained time-frequency map is divided into training set, validation set, and testing set. The training set and validation set are used to train the proposed SIR-CNN, while the testing set is used to validate the trained model.

4.2.2. Diagnostic Results and Comparison

To further verify the superiority of the proposed SIR-CNN model, seven other models were selected for comparison, namely, STFT + CNN, STFT + ResNet, enhanced Inception-ResNet-v2, RGB + Inception-ResNet-v2, SKNet + Inception-v4, SKNet + Inception-ResNet-v2, and NewSKNet + Inception-ResNet-v2. Each experiment was repeated 10 times and the results were averaged, as shown in Table 5. The diagnostic accuracies of the first four models are relatively low, while SKNet + Inception-v4 and SKNet + Inception-ResNet-v2 improve the diagnostic accuracy by at least 3.8% compared to the first four models owing to the integration of SKNet. Overall, the proposed SIR-CNN obtains the highest recognition accuracy, with a 1.8% improvement in classification accuracy over NewSKNet + Inception-ResNet-v2 and a 12% improvement over STFT + CNN.

In order to illustrate the classification details of the various methods on different fault category samples in more detail, the confusion matrix is used to show the experimental results. Figure 21 shows the confusion matrices of the proposed method and the other seven methods under the XJTU-SY dataset, and it can be seen that the proposed SIR-CNN method has a strong feature learning capability on each fault category dataset, and a high classification accuracy is obtained.

In order to more intuitively analyze the diagnostic performance of different methods under the XJTU-SY dataset, the features extracted by the model are visualized using t-SNE, and the results are shown in Figure 22. It can be seen that the features extracted by STFT + CNN, STFT + ResNet, enhanced Inception-ResNet-v2, and RGB + Inception-ResNet-v2 are all severely confounded; SKNet + Inception-v4, SKNet + Inception-ResNet-v2, and NewSKNet + Inception-ResNet-v2 extracted features are also slightly aliased. In contrast, in the features extracted by SIR-CNN, the samples of the same category are completely aggregated together, and the samples of different categories are completely separated, which again indicates that the SIR-CNN-based fault identification method has a more powerful feature extraction capability and superior classification performance.

4.2.3. The Effectiveness of SIR-CNN

To further validate the effectiveness of the model improvements, we compared the number of parameters, the training time, and the testing time of the proposed SIR-CNN with those of four other networks. As shown in Table 6, the proposed SIR-CNN has fewer network parameters, which is consistent with the analysis of equation (14). The number of parameters also affects the training and testing times of the network.
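
The source of the parameter reduction can be illustrated with the standard weight counts of a regular convolution and a depthwise separable convolution. The sketch below is not a reproduction of equation (14); it assumes square kernels, ignores bias terms, and uses hypothetical layer sizes that are not taken from Table 6.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
k, c_in, c_out = 3, 64, 128
standard = conv_params(k, c_in, c_out)                   # 73728
separable = depthwise_separable_params(k, c_in, c_out)   # 8768
ratio = separable / standard                             # equals 1/c_out + 1/k**2, about 0.119
```

For this hypothetical layer the separable form needs roughly 12% of the weights of the standard convolution, and the ratio 1/c_out + 1/k² shrinks further as the kernel or the number of output channels grows.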

Next, we studied the effect of combining different convolutional kernels. To limit the search space, we used only four different convolutional kernels. A check mark under “NewSKNet” in Table 7 means that NewSKNet is applied to the kernels checked in the same row; otherwise, the outputs of these kernels are simply added to form the output of the model. The results in Table 7 indicate that lower losses are achieved when the designed NewSKNet is used, which is attributed to the use of multiple convolutional kernels and the adaptive selection mechanism among them. Because a lower loss generally corresponds to higher accuracy, this is consistent with the experimental results above and further supports the effectiveness of the model improvement.
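
The difference between the two fusion strategies compared in Table 7 can be sketched as follows. This is a simplified illustration, not the NewSKNet implementation: each branch is reduced to one value per channel, and the selection scores (`logits`) are passed in as hypothetical inputs, whereas in a selective kernel network they are produced by fully connected layers applied to the pooled, fused features.

```python
import math


def softmax_over_branches(logits):
    """logits[m][c] is the score of branch m for channel c.

    Returns weights of the same shape that sum to 1 over the branches for each channel.
    """
    n_branches, n_channels = len(logits), len(logits[0])
    weights = [[0.0] * n_channels for _ in range(n_branches)]
    for c in range(n_channels):
        column = [logits[m][c] for m in range(n_branches)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for m in range(n_branches):
            weights[m][c] = exps[m] / total
    return weights


def selective_fuse(branches, logits):
    """Channel-wise weighted sum of the branch outputs (adaptive selection)."""
    weights = softmax_over_branches(logits)
    n_channels = len(branches[0])
    return [
        sum(weights[m][c] * branches[m][c] for m in range(len(branches)))
        for c in range(n_channels)
    ]


def additive_fuse(branches):
    """Plain element-wise addition of the branch outputs (no selection)."""
    return [sum(values) for values in zip(*branches)]


# Three hypothetical branches (one per kernel size), two channels each.
branches = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
# Hypothetical scores: channel 0 strongly prefers the third branch, channel 1 is neutral.
logits = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0]]

selected = selective_fuse(branches, logits)
added = additive_fuse(branches)
```

With plain addition every kernel contributes equally to every channel, whereas the softmax weights let each channel emphasize the kernel size whose response is most useful for the current input.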

5. Results and Discussion

Feature extraction is a crucial part of the fault diagnosis process: it extracts fault-related information from the original signal so that it can be matched against known fault templates to recognize the fault. However, traditional deep learning methods cannot adaptively select convolutional kernels based on the features of the input image, which results in weak extracted features. In addition, deep learning models have a very large number of parameters and usually take a long time to train. To solve these problems, this paper fuses the multibranch SKNet, the depthwise separable convolution network, and the improved Inception-ResNet-v2 network; the fused network extracts the features in the input data more effectively and reduces the number of network parameters. Thus, the accuracy of fault diagnosis is improved and network convergence is accelerated. The performance of the method is verified on two datasets. However, some potential problems remain that need further improvement and in-depth research.

At high bearing speeds, multiple faults may occur simultaneously or sequentially rather than as a single fault. These faults may be correlated or interact with each other, causing the system to exhibit complex fault behavior. However, the present model was tested only on two datasets with single faults, so its performance on compound faults has not been verified.

According to empirical values reported in the literature, the number of convolutional layers, the size of the convolutional kernel, the type and size of pooling, and the activation function all have a significant impact on the performance of the model. Therefore, how to choose the hyperparameters of SIR-CNN remains a matter to be considered.

It should be noted that the data collected at industrial sites usually contain real noise and are more complex, whereas the noisy data in this paper are simulated by adding Gaussian white noise, which differs somewhat from real noise.

The model was tested on only two datasets; in the next step of this research, more datasets need to be collected to further validate the effectiveness of the proposed method.
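
For reference, the Gaussian white noise model mentioned above is commonly implemented by scaling the noise variance to reach a target signal-to-noise ratio (SNR). The following is a minimal sketch under that assumption; the function names, the SNR value, and the test signal are hypothetical and are not taken from the experiments in this paper.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian white noise so that the expected SNR equals snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, computed from the clean signal and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - x) ** 2 for x, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical test signal: a 50 Hz sine sampled at 1 kHz for 20 s.
clean = [math.sin(2 * math.pi * 50 * n / 1000) for n in range(20000)]
noisy = add_gaussian_noise(clean, snr_db=10.0, rng=random.Random(0))
snr = measured_snr_db(clean, noisy)
```

Noise generated this way is stationary and spectrally flat, which is exactly the simplification relative to field data that the limitation above refers to.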

6. Conclusion

To solve the problem that existing intelligent fault diagnosis methods cannot adaptively select the convolution kernel according to different input images, which leads to weak extracted features, this paper proposes an integrated deep neural network, SIR-CNN, which combines the proposed NewSKNet with an enhanced Inception-ResNet-v2. First, the 1D raw vibration signal is converted into a 2D time-frequency diagram using STFT. In this way, more time-frequency features can be obtained, and the powerful image processing capability of the CNN can be fully utilized. Then, based on SKNet, a new three-branch SKNet is designed, and the designed NewSKNet is embedded in the depthwise separable convolution network. Finally, the convolution structure in Inception-ResNet-v2 is replaced by the improved depthwise separable convolution network. The performance of SIR-CNN was validated on open and measured bearing datasets. The experiments show that the NewSKNet designed in this study can adaptively select convolutional kernels and extract more important signal features, thereby greatly improving the diagnostic accuracy of the model. In addition, because Inception-ResNet-v2 is a complex model with numerous network parameters, we embed NewSKNet into a depthwise separable convolution and then replace the convolutional module in Inception-ResNet-v2 with this NewSKNet-embedded depthwise separable convolution, which significantly reduces the number of network parameters and accelerates model fitting.

In the future, we plan to collect more datasets to validate the model and to consider introducing transfer learning so that the model can be further validated on real industrial data.

Data Availability

Some or all data, models, and codes used in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (Grant no. 71861025), the National Key Research and Development Program of China (Grant no. 2018YFB1703105), the Hongliu First-class Disciplines Development Program of Lanzhou University of Technology, and the Fundamental Ability Enhancement Project for Young and Middle-Aged University Teachers in Guangxi Province (Grant no. 2023KY0810).