Abstract

Transformer models are gradually being studied and applied in bearing fault diagnosis tasks, as they can overcome the feature extraction defects related to long-term dependencies in convolutional neural networks (CNN) and recurrent neural networks (RNN). To optimize the structure of existing transformer-like methods and improve diagnostic accuracy, we propose a novel method based on the multiscale time-frequency sparse transformer (MTFST) in this paper. First, a novel tokenizer based on the short-time Fourier transform (STFT) is designed, which processes the 1D raw signals into 2D discrete time-frequency sequences in the embedding space. Second, a sparse self-attention mechanism is designed to eliminate the feature mapping defect of the naive self-attention mechanism. Then, a novel encoder-decoder structure is presented: multiple encoders are employed to extract the hidden features of the different time-frequency sequences obtained by STFT with different window widths, and the decoder is used to remap the deep information and connect to the classifier for discriminating fault types. The proposed method is tested on the XJTU-SY bearing dataset and a self-made experiment rig dataset, and the following work is conducted. The influences of the hyperparameters on diagnosis accuracy and the number of parameters are analysed in detail. The weights of the attention mechanism (AM) are visualized and analysed to study the interpretability, which partly explains the working pattern of the network. In the comparison test with existing CNN, RNN, and transformer models, the diagnosis accuracy of the different methods is statistically analysed and the feature vectors are presented via the t-distributed stochastic neighbor embedding (t-SNE) method; the proposed MTFST obtains the best accuracy and feature distribution form. The results demonstrate the effectiveness and superiority of the proposed method in bearing fault diagnosis.

1. Introduction

Rotating machinery plays a pivotal role in modern industrial systems and is widely used in aerospace engineering, the motor industry, manufacturing, and other important fields [1]. As the core component of rotating machinery, the bearing and its failure mechanism, especially the monitoring and identification of faults, has become a research hotspot. The study of compact and effective online condition monitoring and fault diagnosis methods is essential for the operation of complex mechanical systems [2, 3].

Generally, bearing fault diagnosis approaches fall into two categories: model-based [4] and data-driven methods [5]. Model-based methods establish fault feature detection and classification models from a large amount of prior knowledge, but their diagnosis accuracy is not satisfactory under complex conditions. Data-driven methods aim to establish complex nonlinear projection relationships between sensor data and fault types, and they are becoming more and more attractive with the development of big data and the various bearing fault diagnosis algorithms in machine learning (ML). Currently, the most common ML methods utilized in bearing fault diagnosis include the K-nearest neighbor (KNN), support vector machine (SVM) [6], multilayer perceptron (MLP) [7], hidden Markov model (HMM) [8], and variational mode decomposition (VMD) [9, 10]. However, traditional ML methods can no longer meet the requirements due to their shallow feature extraction and representation frameworks. Recently, deep learning (DL) has achieved great success in bearing fault diagnosis owing to its strong model-fitting and generalization abilities.

On the other hand, deep learning networks can conveniently stack and combine learning layers to handle diagnosis under different equipment and working conditions. Common deep learning approaches include the auto-encoder (AE) [11], deep belief network (DBN) [12], convolutional neural network (CNN), and recurrent neural network (RNN). Among these methods, CNNs have attracted the most attention because they are more suitable for processing periodic signals and have a stronger ability to learn features from mechanical vibration signals [13]. CNN-based frameworks extract and connect local features by sharing convolutional kernels in the deep layers, which guarantees the effectiveness of bearing fault diagnosis. Gao et al. [14] proposed a method based on parameter-optimized maximum correlated kurtosis deconvolution (MCKD) and CNN for bearing fault diagnosis, in which MCKD is used to filter and denoise the raw signals and the results are then input to the CNN model for fault classification. Liu et al. [15] proposed a two-stage framework for rolling bearing fault severity recognition via data mining integrated with CNN, which introduced the matrix profile (MP) to mine the impulses from the raw vibration signals and then used a CNN combined with softmax regression for fault recognition. Current work on CNNs proceeds in the direction of model structure optimization and combination with traditional ML methods. Researchers attempt to learn more effective features with more compact and effective structures to avoid problems such as gradient failure [16]. For instance, Wang et al. [17] proposed a squeeze-and-excitation-enabled CNN (SECNN) that can assign a weight to each channel and force the model to focus on the major features. Xu et al. [18] combined the variational mode decomposition (VMD) method and a deep CNN to develop a bearing fault classification network.

As an effective model for sequence data processing, the RNN is widely used in bearing fault diagnosis. Researchers proposed the gated recurrent unit (GRU) and the long short-term memory (LSTM) unit to solve problems such as long-term dependencies and gradient vanishing in the vanilla RNN model, and the improved RNN models achieve more attractive results than the baseline approach. An et al. [19] employed an RNN framework with LSTM, based on the idea of the infinitesimal method, to realize intelligent fault diagnosis under time-varying working conditions. Zhang et al. [20] proposed a method based on an RNN with GRU and an MLP to implement fault recognition, which achieves excellent diagnosis results and exhibits robustness against noise. Zhao et al. [21] proposed a complex deep learning model combining CNN and LSTM, denoted as the convolutional bidirectional long short-term memory network (CBLSTM); CBLSTM adopts a CNN to learn local features and then inputs the results into a bidirectional LSTM to extract global features. The emerging bearing diagnosis methods based on CNN and RNN continue to mature. However, CNN and RNN still have inherent defects such as information loss, overly small receptive fields, and the lack of long-term dependencies.

Recently, the attention mechanism (AM) has been introduced to solve the problems mentioned previously. AM can associate features at different positions or channels of a sequence and pay more attention to the informative data; it is designed as a component combined with CNN or RNN and widely applied in tasks such as natural language processing (NLP), computer vision (CV), and fault diagnosis [22]. AM enhances the performance of CNN or RNN backbones but fails to completely avoid the shortcomings of these classical models. Furthermore, in 2017, Vaswani et al. [23] came up with a new architecture called the transformer, which abandons all convolutional and recurrent modules and is based only on the attention mechanism and fully connected layers. The transformer attained the best performance in machine translation at that time. BERT, a transformer-based framework proposed by Devlin et al. [24] to generate word vectors, achieved excellent results in NLP tasks. In the field of NLP, the transformer broke new ground and has by now almost entirely replaced the RNN. In 2021, a pioneering transformer-based framework named the vision transformer (ViT) [25] achieved encouraging performance in image classification tasks in computer vision (CV). The test results indicate that ViT outperforms other state-of-the-art methods when pretrained on a larger dataset. Meanwhile, ViT shows strong data extensibility: its performance continues to improve as the data amount and model scale increase. Furthermore, the powerful parallelism of ViT's computation is a great advantage in large-scale data processing. A variety of modified models that diverge from ViT have been proposed and achieve excellent performance in CV tasks, such as CrossViT [26] and PVT [27]. Clearly, the transformer has become an important branch of deep learning besides CNN and RNN.

Its outstanding performance in encoding and extracting the hidden features of sequences makes the transformer neural network a promising method in the field of bearing fault diagnosis, where vibration data are the main input. Ding et al. [28] proposed a transformer framework named TFT for bearing fault diagnosis, which designs a tokenizer and encoder module to extract abstractions from the input time-frequency representations (TFRs) of vibration signals. BAFT [29], proposed by Jiao et al., developed a partly interpretable network based on the transformer and a binary arborescent filter to classify bearing faults effectively and visually presented part of the hidden features inside the model, achieving superior performance and excellent antinoise validity. Jin et al. [30] proposed a time-series transformer (TST) to recognize bearing fault modes, which designs a sequence generation method that handles raw vibration signals as 1D time-series segments; the series are then input into the transformer encoder to learn the features. The test results show that TST has a better fault identification capability than traditional CNN and RNN models. Du et al. [31] proposed a transformer-like framework for fault diagnosis under complex conditions, which extracts features from the high-dimensional noisy raw signals with a stacked denoising auto-encoder (SDAE) module and obtains the target features by the self-attention mechanism of the transformer deep neural network.

The most common data-driven works for bearing fault diagnosis are conducted through the analysis of vibration signals. However, the data input into the deep learning model are preprocessed by different approaches in different frameworks. The preprocessing methods can be roughly divided into three categories: (1) sampling raw signals or their simple processing results [32]; in the time-series transformer (TST) [30], the input vibration time series is trimmed into several subsequences of a given length, and Huang et al. [16] applied maximum pooling and average pooling layers to extract information at different scales as the input of the AM module in the transformer; (2) preprocessing by a feature-based model [33]; Du et al. [31] established a stacked denoising auto-encoder (SDAE) module to generate low-dimensional features of the input signals, and Jiao et al. [29] developed a binary arborescent filter to extract statistical features that are then input to the encoder module of the transformer network; (3) preprocessing by domain transformation, in which the time-series signals are transformed into a frequency representation (FR) [34, 35] or a time-frequency representation (TFR) [36, 37]; in TCN [38], the FR obtained from the vibration signals by a fast Fourier transform (FFT) module is input into the transformer network, and in TFT [28], the input signals are first processed into 2D TFRs by the synchrosqueezed wavelet transform (SWT) and then flattened and mapped by the tokenizer of the transformer module. In general, the methods based on domain transformation have better performance.

As mentioned previously, transformer-like approaches have achieved excellent performance in bearing fault diagnosis due to the powerful modelling and feature extraction ability of the self-attention mechanism. However, the existing transformer-like bearing fault diagnosis models have some limitations:
(1) Almost all methods use only part of the components of the transformer framework, which weakens the model's ability to model sequence information.
(2) The interference of secondary information in the self-attention weights is ignored, which can reduce the fault diagnosis performance of the transformer.

The motivation of this paper is to develop a new transformer-based method that can extract more effective hidden representations for bearing fault diagnosis in a simple and generalized way. The proposed end-to-end approach, named the multiscale time-frequency sparse transformer (MTFST), establishes a diagnosis model between the TFRs and the bearing fault types. MTFST achieves good results in evaluation, and its superiority over other deep learning models is demonstrated on the test datasets. The main contributions of this paper are summarized as follows:
(1) The STFT method is employed to obtain multiscale TFRs of the raw vibration signals by varying the window width, and a novel tokenizer based on the different-scale TFRs is designed to present discriminant features at multiple levels.
(2) A sparse self-attention mechanism (SSAM) is studied to focus on the primary information of self-attention, enabling the hidden features to be more discriminative.
(3) A novel encoder-decoder structure is developed to extract the hidden features and long-term dependencies of the multiscale TFRs. The proposed framework is more compatible with the vanilla transformer than existing models and better at fault diagnosis, and the visualization analysis of the model weights mitigates, to some extent, the interpretability problem of traditional deep learning.

The rest of the paper is arranged as follows. The theoretical foundations of the transformer are introduced in the second section. The structural framework and algorithmic flow of the proposed MTFST are introduced in the third section. The fourth section covers the datasets, experiment settings, and ablation analysis of the hyperparameters, and the bearing fault diagnosis results on the two datasets are also evaluated and analysed in this section. The conclusions of the paper and the future research plan are given in the last section.

2. Preliminaries

This section will introduce the basic structure and core components of the vanilla transformer.

2.1. Transformer

The transformer framework was proposed by Vaswani et al. [23] to optimize the traditional Seq2Seq pattern. The novel structure is based entirely on the attention mechanism to draw global dependencies between the input sequence and output results, which solves the difficulty of modelling global relationships between local information in traditional convolution operations. Furthermore, this model increases parallel efficiency and reduces computing consumption. The overall architecture of the transformer is shown in Figure 1: it consists of encoder and decoder modules, and these two components are stacked from multiple basic transformer blocks. The basic transformer block includes a multihead self-attention mechanism, a position-wise feed-forward network, layer normalization modules, and residual connections. The embedding layers before the encoder and decoder convert the one-hot tokens into a new tensor, and a sinusoidal position encoding is added to the tensor. The encoder receives the input sequences, and the decoder remaps the output of the encoder to obtain the results.

In the encoder, identical blocks are stacked; these blocks share the same structure but have different parameters. Each block consists of two layers: the first layer includes a multihead attention module and a residual connection, and the second layer includes a position-wise fully connected feed-forward network and a residual connection. The output of each layer can be presented as follows:

$$y = \mathrm{LayerNorm}\left(x + \mathrm{Sublayer}(x)\right),$$

where $y$ and $x$ denote the output and input of each layer, respectively, $\mathrm{Sublayer}(\cdot)$ denotes the multihead self-attention or position-wise feed-forward network, and $\mathrm{LayerNorm}$ is the layer normalization.
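For concreteness, the following is a minimal sketch of one such encoder block in TensorFlow/Keras, illustrating the residual-plus-LayerNorm composition above; the layer sizes (d_model, num_heads, d_ff) are illustrative placeholders, not values taken from the paper.

```python
import tensorflow as tf

class EncoderBlock(tf.keras.layers.Layer):
    """One transformer encoder block: MSA + FFN, each wrapped in a residual
    connection followed by layer normalization."""
    def __init__(self, d_model=128, num_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.msa = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.drop = tf.keras.layers.Dropout(dropout)

    def call(self, x, training=False):
        # y = LayerNorm(x + Sublayer(x)) for the self-attention sublayer ...
        attn = self.msa(x, x, x, training=training)
        x = self.norm1(x + self.drop(attn, training=training))
        # ... and again for the position-wise feed-forward sublayer
        ff = self.ffn(x, training=training)
        return self.norm2(x + self.drop(ff, training=training))
```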

There are three layers in the decoder, and they consist of the basic transformer blocks presented before. The first layer is masked multihead attention, which extracts the hidden features of the input sequence with AM and an attached mask coding that prevents label leakage in Seq2Seq tasks [29]. The second layer maps the outputs from the first layer of the decoder and from the encoder. The position-wise feed-forward network and residual operation are used in the third layer to extract local and global deep information. Finally, the output of the decoder is fed into a linear layer and a softmax activation function to obtain the probabilities.

2.2. Multihead Self-Attention Mechanism

The multihead self-attention mechanism (MSA) is built on the self-attention mechanism, which is the core component of the transformer model and is employed to gather information from the input sequence and learn the hidden features. The self-attention mechanism can be regarded as a method that maps the different weight information of the input sequence, obtaining the output from a query (Q), a set of keys (K), and a value (V) vector. As shown in equation (2), the output of the self-attention mechanism is a weighted sum of V, and the weight matrix is related to the dot product of Q and K:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V. \tag{2}$$

It can be seen that self-attention extracts information in V based on the similarity between K and Q, where $\sqrt{d_k}$ denotes the scaling factor, $d_k$ is the dimension of Q and K, and softmax denotes the activation function.

In order to obtain hidden information from different subspaces rather than only one nonlinear transformation result, multihead self-attention concatenates and maps different projections of the input tokens that are computed in parallel by multiple independent self-attention mechanisms. The calculation process is as follows:

$$\mathrm{MSA}(X) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\left(XW_i^{Q}, XW_i^{K}, XW_i^{V}\right),$$

where $h$ denotes the number of heads, i.e., the number of self-attention modules; $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ denote the weight matrices of Q, K, and V in the $i$th self-attention module, respectively; $X$ denotes the embeddings; $\mathrm{Concat}(\cdot)$ is the concatenation function; and $W^{O}$ denotes the weight matrix of the linear projection applied to the concatenated heads.
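A minimal NumPy sketch of scaled dot-product attention (equation (2)) and its multihead extension follows; the toy shapes and random weight matrices are illustrative only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, as in equation (2)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multihead_self_attention(X, Wq, Wk, Wv, Wo):
    # Each head projects the embeddings X with its own Wq/Wk/Wv; the head
    # outputs are concatenated and projected by Wo.
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# Toy example: 10 tokens, d_model = 16, h = 4 heads of dimension d_k = 4.
rng = np.random.default_rng(0)
T, d_model, h, d_k = 10, 16, 4, 4
X = rng.normal(size=(T, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
print(multihead_self_attention(X, Wq, Wk, Wv, Wo).shape)  # (10, 16)
```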

2.3. Positionwise Forward Network

The feed-forward network is a fully connected layer that includes two linear transformations and a ReLU activation, expressed as follows:

$$\mathrm{FFN}(x) = \max\left(0, xW_1 + b_1\right)W_2 + b_2,$$

where $W_1$, $W_2$, $b_1$, and $b_2$ denote the weights and biases of the two linear transformations, respectively.
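Written out directly, this amounts to the following minimal NumPy sketch (shapes illustrative):

```python
import numpy as np

def positionwise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = np.random.default_rng(0).normal(size=(10, 16))    # 10 tokens, d_model = 16
W1, b1 = np.random.default_rng(1).normal(size=(16, 32)), np.zeros(32)
W2, b2 = np.random.default_rng(2).normal(size=(32, 16)), np.zeros(16)
y = positionwise_ffn(x, W1, b1, W2, b2)               # shape (10, 16)
```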

2.4. The Transformer-Like Methods for Bearing Fault Diagnosis

The vanilla transformer employs an encoder-decoder structure to solve Seq2Seq tasks as mentioned above. The encoder receives the embedding information for feature learning, and the decoder generates a new sequence from the encoder output and the last layer's result of the decoder itself. Generally, transformers are divided into three categories: (1) encoder-decoder (e.g., for Seq2Seq), (2) encoder only (e.g., for classification), and (3) decoder only (e.g., for language modelling) [28]. Existing transformer-like approaches usually adopt the encoder-only model for fault diagnosis tasks; for example, in the TST [30], TFT [28], and BAFT [29] mentioned previously, the series are embedded into token sequences with class information and input to the encoder, and the hidden features with category information are mapped by the classifier to obtain the fault types. In PRT [32], the framework adopts an enhanced encoder network that includes an embedding patch encoder and a class information encoder to learn the hidden features and dependencies. The framework proposed by Du et al. [31] remaps the forward eigenmatrix by a position transformation to obtain a backward eigenmatrix, and the two feature matrices are input into paired attention-mechanism neural networks to better learn the essential characteristics of the fault data. This network is closer to the vanilla transformer in structure, but the different attention-based neural modules lack feature interaction; in essence, the encoder used to extract the features of the backward eigenmatrix can be considered a learning enhancement module of the other one.

3. Multiscale Time-Frequency Sparse Transformer

In this section, the proposed multiscale time-frequency sparse transformer (MTFST) is introduced in detail. The core components include the tokenizer, encoders, decoder, and classifier.

3.1. Tokenizer
3.1.1. Raw Signal Preprocessing

Vibration signals sampled from sensors are 1D time series; in our work, the input raw signals are processed into 2D TFRs. Thus, a specific tokenizer based on the short-time Fourier transform (STFT) was designed. STFT is a domain transform method based on the windowed Fourier transform algorithm, which assumes that the signal to be processed is stationary within the short analysis window. By moving the window function along the time axis, STFT analyzes the signal segments to obtain the local spectrum [39]. STFT is defined as follows:

$$\mathrm{STFT}(t, f) = \int_{-\infty}^{+\infty} x(\tau)\, w(\tau - t)\, e^{-j2\pi f\tau}\, \mathrm{d}\tau,$$

where $x(\tau)$ is the raw signal and $w(\cdot)$ is the window function.

Given a certain TFR $X \in \mathbb{R}^{F \times T}$, where $T$ and $F$ denote the lengths along the time and frequency axes, it is represented as a patch sequence $X = [x_1, x_2, \ldots, x_T]$, where each subsequence $x_t \in \mathbb{R}^{F}$ is one column of the TFR. For consistency of the subsequent operations, the token embeddings are obtained by projecting the TFR sequence into another space by a linear transformation. The process is expressed as follows:

$$E = \left[x_1 W_e, x_2 W_e, \ldots, x_T W_e\right],$$

where $W_e \in \mathbb{R}^{F \times d_{\mathrm{model}}}$ represents the learnable linear mapping of the TFR along the time axis, and $E \in \mathbb{R}^{T \times d_{\mathrm{model}}}$.

Finally, the TFRs are discretely represented as temporal sequences of the instantaneous frequency spectrum. As mentioned above, processing such a sequence is the strength of a transformer-like structure [28].
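The following sketch illustrates this tokenization step with SciPy and NumPy: each STFT time step becomes one token, which is then projected to the embedding dimension. The sampling rate, window widths, and embedding dimension are illustrative assumptions, and the randomly initialized W_e stands in for the learnable projection.

```python
import numpy as np
from scipy.signal import stft

def tfr_tokens(signal, fs, window_width, d_model, seed=0):
    """Turn a 1D vibration signal into a sequence of token embeddings.

    Each STFT time step (one column of the magnitude spectrogram) becomes one
    token, linearly projected from F frequency bins to d_model dimensions.
    """
    _, _, Z = stft(signal, fs=fs, window="hann", nperseg=window_width)
    X = np.abs(Z)                              # TFR of shape (F, T)
    F, T = X.shape
    rng = np.random.default_rng(seed)
    W_e = rng.normal(scale=F ** -0.5, size=(F, d_model))  # stand-in for the learnable W_e
    return X.T @ W_e                           # shape (T, d_model): one embedding per time step

# Example: three TFR scales of the same signal from different window widths.
fs = 25600
x = np.random.default_rng(1).normal(size=32768)
tokens = [tfr_tokens(x, fs, w, d_model=64) for w in (64, 128, 256)]
print([t.shape for t in tokens])
```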

3.1.2. Position Encoding

The vanilla transformer framework designed position encoding to represent the relative or absolute position information of the embedding sequence.

There are two position encoding formats, 1D and 2D [30], and the results in reference [25] show that there is no significant performance gap between them. In our work, a sinusoidal encoding is adopted only to mark the location information of the sequence, which is expressed as follows:

$$PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),$$

where $pos$ denotes the position of the patch in the sequence, and $d_{\mathrm{model}}$ and $i$ denote the dimension of the position vector and the current dimension, respectively.

Finally, the token sequence is defined as

$$S = E + PE,$$

where $S \in \mathbb{R}^{T \times d_{\mathrm{model}}}$.
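A minimal NumPy sketch of the sinusoidal encoding and its addition to the embeddings (shapes illustrative):

```python
import numpy as np

def sinusoid_position_encoding(T, d_model):
    """Standard sinusoidal position encoding of shape (T, d_model)."""
    pos = np.arange(T)[:, None]                    # token positions 0 .. T-1
    i = np.arange(d_model)[None, :]                # dimension index
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions: cosine
    return pe

# Token sequence S = E + PE for an embedding matrix E of shape (T, d_model)
T, d_model = 120, 64
E = np.random.default_rng(0).normal(size=(T, d_model))
S = E + sinusoid_position_encoding(T, d_model)
```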

On the other hand, there are two main ways to represent the deep features extracted from the token sequence: taking the information of the last transformer layer or learning the feature by adding a class token to the sequence [30]. The comparison results in reference [40] show that class tokens are nonessential. Therefore, the class token is abandoned, and the other way of expressing characteristics is employed in our work.

3.2. Sparse Self-Attention Mechanism

The sparse self-attention mechanism (SSAM) is designed to eliminate the reduction of feature discrimination caused by focusing on secondary information. Inspired by the authors of reference [41], in the SSAM each attention feature of the TFRs is determined by the top P input elements that are most similar to it, which differs from the naive SAM that calculates features from all input information. As shown in the middle of Figure 2, the similarities of the input K and Q are calculated first, and the index matrix of the P largest elements is selected to mask the softmax results. A calculation example is shown on the right of Figure 2. The sparse self-attention, which replaces equation (2), is defined as follows:

$$\mathrm{SSAM}(Q, K, V) = \left[M_P \odot \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)\right]V,$$

where $M_P$ is the binary mask that retains the $P$ largest elements in each row of the attention weight matrix and $\odot$ denotes element-wise multiplication.

And the sparse ratio is defined as follows:

$$\rho = \frac{P}{n},$$

where $m \times n$ denotes the size of the attention weight matrix, i.e., $P$ of the $n$ weights in each row are retained.
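A minimal NumPy sketch of this top-P masking follows; renormalizing the kept weights after masking is an assumption here, since the paper only states that the remaining softmax results are masked out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_self_attention(Q, K, V, P):
    """Keep only the P largest attention weights per query (top-P masking)."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # Binary mask that keeps the P largest weights in each row
    idx = np.argsort(weights, axis=-1)[:, -P:]
    mask = np.zeros_like(weights)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    masked = weights * mask
    # Renormalize each row so the kept weights sum to 1 (an assumption; the
    # paper only states that the secondary weights are masked).
    masked /= masked.sum(axis=-1, keepdims=True)
    return masked @ V
```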

3.3. Encoder and Decoder

The proposed MTFST employs three encoders to extract multi-perspective deep features from the multiscale TFR embeddings; the encoders share the same structure as the one in the vanilla transformer described in Section 2 but differ in parameters. Each encoder consists of basic blocks that include a multihead self-attention module, a feed-forward layer, and normalization layers with residual connections.

The decoder is used to extract the dependencies and fuse the corresponding information from the outputs of the different encoders. The structure of the proposed decoder differs from that of the vanilla transformer and is similar to the encoder. Note that the decoder takes the outputs of the different encoders as its input.

3.4. Training of MTFST

As a general deep learning scheme, the proposed MTFST framework adopts labelled fault datasets for supervised training, and the error back-propagation (BP) algorithm is employed to minimize the loss. For a given training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, which contains $N$ samples, the loss function is defined as follows:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CE}\left(\hat{y}_i, y_i\right),$$

where $\hat{y}_i$ and $y_i$ denote the prediction output and the ground truth of sample $i$, respectively, $\theta$ denotes the trainable parameters of the network, and $\mathrm{CE}(\cdot)$ presents the cross-entropy loss function.

Additionally, the Adam optimizer [42] is employed to train MTFST, which adopts an adaptive and exponentially smoothed gradient strategy to accelerate the loss convergence. Similar to the vanilla transformer, the dropout training manner [43] is employed in the network, which randomly masks the connections of some neurons to reduce overfitting. Algorithm 1 shows the training steps of MTFST, and the architecture is shown in Figure 3.
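A minimal TensorFlow training-loop sketch consistent with this description (cross-entropy loss, Adam optimizer, dropout active during training) is given below; `mtfst_model` is a placeholder for a Keras model implementing the MTFST architecture, which is not shown here.

```python
import tensorflow as tf

def train(mtfst_model, train_ds, epochs=30, lr=1e-3):
    """Supervised training with cross-entropy loss and the Adam optimizer.

    train_ds yields ((tfr1, tfr2, tfr3), labels) batches, where the three
    inputs are the multiscale TFR token sequences produced by the tokenizer.
    """
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    for epoch in range(epochs):
        for inputs, labels in train_ds:
            with tf.GradientTape() as tape:
                logits = mtfst_model(inputs, training=True)  # dropout enabled
                loss = loss_fn(labels, logits)
            grads = tape.gradient(loss, mtfst_model.trainable_variables)
            optimizer.apply_gradients(zip(grads, mtfst_model.trainable_variables))
```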

4. Case Study and Analysis

In this section, two case studies are implemented to verify and analyze the effectiveness of the proposed MTFST in rolling bearing fault diagnosis. The data collected from the XJTU-SY open dataset and a self-made mine motor traction dataset are employed for testing and for comparison with other state-of-the-art methods. All validation experiments are conducted on a computer with an Intel 10700F CPU, an NVIDIA RTX 3080 GPU, and 32 GB RAM. Besides, Ubuntu 18.06, Python 3.6, TensorFlow 2.6, and CUDA 11.02 are adopted for the whole network construction.

Input: Three multiscale TFRs $X_1$, $X_2$, $X_3 \in \mathbb{R}^{F \times T}$ obtained by STFT with window widths $w_1$, $w_2$, $w_3$, and labels $Y \in \{1, 2, \ldots, C\}$, where $C$ denotes the number of fault types.
(1) Set the training batch size $B$, training epoch max_epoch, token embedding dimension $d_{\mathrm{model}}$, self-attention weight matrix size $d_k$, number of heads $h$, position-wise forward network weight matrix size $d_{ff}$, block number of encoder $N_e$, block number of decoder $N_d$, and number of fault types $C$.
(2) Initialize the trainable parameters of MTFST
(3) for epoch in 1, 2, …, max_epoch do
(4)  for step in 1, 2, …, max_step do
(5)   //Tokenizer
(6)   for each $x_1$ in $X_1$, $x_2$ in $X_2$, and $x_3$ in $X_3$ do
(7)    Reshape $x_1$, $x_2$, $x_3$ and slice them into the patch sequences $[x_1^1, \ldots, x_1^T]$, $[x_2^1, \ldots, x_2^T]$, $[x_3^1, \ldots, x_3^T]$, then project each patch with the embedding matrix $W_e$;
(8)    Add position encoding, obtain $S_1$, $S_2$, $S_3$;
(9)   end; stack the batches to obtain the sequences $S_1$, $S_2$, $S_3$.
(10)  //Encoders
(11)  for $n$ in 1, 2, …, $N_e$ do
(12)   $Z_1 = \mathrm{LayerNorm}(S_1 + \mathrm{SSAM}(S_1))$,
(13)   $S_1 = \mathrm{LayerNorm}(Z_1 + \mathrm{FFN}(Z_1))$;
(14)   $Z_2 = \mathrm{LayerNorm}(S_2 + \mathrm{SSAM}(S_2))$,
(15)   $S_2 = \mathrm{LayerNorm}(Z_2 + \mathrm{FFN}(Z_2))$;
(16)   $Z_3 = \mathrm{LayerNorm}(S_3 + \mathrm{SSAM}(S_3))$,
(17)   $S_3 = \mathrm{LayerNorm}(Z_3 + \mathrm{FFN}(Z_3))$.
(18)  end
(19)  //Decoder
(20)  for $n$ in 0, 1, 2, …, $N_d - 1$ do
(21)   if ($n$ == 0)
(22)    $Z_d = \mathrm{LayerNorm}(S_0 + \mathrm{SSAM}(S_0))$, where $S_0 = \mathrm{Concat}(S_1, S_2, S_3)$,
(23)    $D = \mathrm{LayerNorm}(Z_d + \mathrm{FFN}(Z_d))$;
(24)   else
(25)    $Z_d = \mathrm{LayerNorm}(D + \mathrm{SSAM}(D))$,
(26)    $D = \mathrm{LayerNorm}(Z_d + \mathrm{FFN}(Z_d))$;
(27)  end
(28)  //Classifier
(29)   Obtain the feature matrix $H$ from the decoder output $D$;
(30)  $o = H W_c + b_c$;
(31)  $\hat{y} = \mathrm{softmax}(o)$;
(32)  Compute the batch loss $L$;
(33)  Calculate the gradients $\partial L/\partial W$, $\partial L/\partial b$;
(34)  Update the parameters $W$, $b$;
(35) end
(36)end
Output: Weights $W$ and biases $b$
4.1. Case 1: XJTU-SY Bearing Dataset
4.1.1. Dataset Description and Experiment Settings

The XJTU-SY bearing datasets are provided by Xi'an Jiaotong University (XJTU) and the Changxing Sumyoung Technology Co., Ltd. (SY). The datasets contain complete run-to-failure data of 15 rolling element bearings acquired through accelerated degradation experiments [44]. The testbed of the rolling element bearings is shown in Figure 4, and the vibration signals are collected from 5 bearings under each of three operating conditions: (1) 2100 rpm (35 Hz) and a load rating of 12 kN; (2) 2250 rpm (37.5 Hz) and 11 kN; (3) 2400 rpm (40 Hz) and 10 kN. The sampling frequency is set to 25.6 kHz, and a total of 32768 points are recorded for each sampling. In our work, the recorded data of Bearing 2_1, Bearing 2_5, Bearing 3_3, Bearing 3_4, Bearing 1_4, Bearing 2_3, Bearing 3_1, Bearing 3_4, and Bearing 3_2 are employed in the experiment, which cover four fault types: inner race (IR), cage, outer race (OR), and inner race, ball, cage, and outer race (IBCO). In addition, the batch size is set to 40, and the Hanning window is used for the STFT.

4.1.2. Ablation Study

In this section, we discuss the influence of the hyperparameter settings on the diagnosis performance of the proposed model. The hyperparameters comprise the three STFT window widths $w_1$, $w_2$, $w_3$ used to obtain the multiscale TFRs, the token embedding dimension $d_{\mathrm{model}}$ (which also sets the nonlinear transformation dimension of the self-attention module), the hidden layer dimension $d_{ff}$ of the feed-forward network, the number of attention heads $h$, the block number of the encoder $N_e$, the block number of the decoder $N_d$, the dropout rate, and the sparse ratio $\rho$ of the sparse self-attention. Each model with different parameters is trained for 5 runs, and the test performance is displayed in Table 1, where the baseline row denotes the model used in the following experiments. It can be seen from the table that the STFT window width significantly affects the performance of MTFST: excessive emphasis on precision of the TFRs in either time or frequency reduces the performance of the model. It is worth noting that the input order of the TFRs with different window scales can also greatly affect the model's performance.

The test shows that the TFR features corresponding to the small window width are more effective as the input of the decoder. The dimensions $d_{\mathrm{model}}$ and $d_{ff}$ obviously affect the number of parameters in the network. There is a consistent trend that the number of attention heads $h$ and the block numbers of the encoder and decoder have a strong impact on model performance: too small a number leads to learning insufficient features, while too large a value leads to overfitting. An appropriate sparse ratio $\rho$ avoids the interference of secondary features and effectively improves model performance, while too small a $\rho$ causes overfitting. The corresponding accuracy statistics under different hyperparameters are shown in Figure 5.

4.1.3. Diagnosis Results Based on MTFST

In this section, the model established with the hyperparameters selected in the previous section is trained and tested on the XJTU-SY bearing dataset. The dataset is randomly divided into 80% training data and 20% test data. In each batch, the data with different fault labels used for training and testing are evenly distributed but randomly shuffled, and the data of fault classes with few samples are reused repeatedly during training. The proposed MTFST is trained for 30 epochs to learn a robust diagnosis model, and the training process is repeated for 10 runs under the same conditions to eliminate the effects of random initialization.
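The evaluation protocol described here (random 80/20 splits, repeated training runs) can be sketched as follows; `build_model` is a caller-supplied placeholder for the MTFST constructor, and the stratified split is an assumption about how the even label distribution is obtained.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_evaluation(build_model, features, labels, n_runs=10, epochs=30):
    """Average test accuracy over repeated runs with fresh random 80/20 splits.

    build_model is a factory returning a compiled Keras model with an accuracy
    metric (a placeholder for the MTFST constructor, not shown here).
    """
    accuracies = []
    for run in range(n_runs):
        x_tr, x_te, y_tr, y_te = train_test_split(
            features, labels, test_size=0.2, stratify=labels, random_state=run)
        model = build_model()
        model.fit(x_tr, y_tr, epochs=epochs, verbose=0)
        _, acc = model.evaluate(x_te, y_te, verbose=0)
        accuracies.append(acc)
    return float(np.mean(accuracies)), float(np.std(accuracies))
```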

The variation of the loss and accuracy on the XJTU-SY dataset during the training process is shown in Figure 6. It can be seen from the boxplot results on the training set that some fluctuations occur in both the loss value and the diagnosis accuracy in the early training stages, while the performance improves significantly after 5 epochs and becomes stable after 20 epochs, which indicates the good convergence of the model under the gradient back-propagation strategy. As training iterates, the performance on the test set improves. Although some fluctuations still occur in the accuracy, there is no great gap between the top and minimum accuracies, and the high and stable average accuracy demonstrates the effectiveness of MTFST. These results indicate that the proposed MTFST has strong and robust model-fitting and generalization ability. To further analyze the model performance, the confusion matrices of the top-accuracy and minimum-accuracy runs, selected from the 10 rounds of repeated training, are presented in Figure 7. The rows denote the ground truth of the samples, and the columns represent the fault labels predicted by MTFST.

4.1.4. Visualization of Network

In this section, first, the attention weights are visualized to gain a further understanding of how MTFST works. Instead of using the class token of conventional transformer architectures, MTFST maps the deep hidden features extracted by the attention mechanism directly to the diagnosis results. Thus, the attention weights reflect the relationships among the deep TFR patches in each attention-based layer; furthermore, these relationships indicate which features are considered valid and which are redundant. The attention weights, i.e., the softmax results of the scaled dot products of Q and K, are calculated and concatenated in the multihead self-attention network, and the weight matrices are averaged over the heads to show the attention level. We show part of the weight representations in the encoders and decoder for different fault labels in Figure 8. In the first layers of all encoders, attention is sparse in certain areas and the attention weights between patches are small. However, the network gradually assigns more attention weight to patches with significant characteristics layer by layer, and in the last layer there are strong weights between different patches and the attention focuses on fixed regional deep features. Furthermore, the encoders fed with TFRs of different scales work on distinctive areas to grasp complementary information. Regardless of the label, the same trend appears: the tokens around 40 in encoder1, the tokens between 1 and 22 in encoder2, and the tokens between 35 and 80 in encoder3 are the most active. This removes the suspicion that the multiple encoders in the network would generate redundant features. In the decoder module, a multilayer attention mechanism is employed to remap the deep features and connect to the classifier. As shown in Figure 8, the tokens between 1 and 15 are the most active in the first layer, and the strong weights fuse the output information of the different encoders and focus on the relationships between the feature patches corresponding to the different window widths, while the active attention tokens between 35 and 60 in the last layer extract the distinct components on which the classification decisions are made. It should be noted that, since IBCO is a compound fault type, its decoder attention map presents a larger span of salient tokens than the others, which is similar to human reasoning logic.
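The head-averaging step described here can be reproduced with a short snippet; the attention tensor `attn` is assumed to be exposed by the attention layers (e.g., via return_attention_scores in tf.keras.layers.MultiHeadAttention), which is an implementation assumption rather than a detail taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention(attn, title="Averaged attention weights"):
    """Average an attention tensor of shape (heads, tokens, tokens) over the
    heads and display it as a token-by-token heatmap."""
    avg = np.asarray(attn).mean(axis=0)
    plt.imshow(avg, cmap="viridis", aspect="auto")
    plt.xlabel("Key token index")
    plt.ylabel("Query token index")
    plt.title(title)
    plt.colorbar()
    plt.show()
```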

Second, the distribution of the feature vectors in the embedding space also reveals the working pattern of MTFST. In Figure 9, the feature vectors extracted from the encoders and decoder are visualized via t-SNE, which nonlinearly maps the high-dimensional features to two-dimensional vectors to visualize the clustering degree of the fault types. It can be seen that the visualization of the raw signals lacks clear boundaries for fault type identification, resulting in classification failure. Figure 9(b) presents the results of the TFRs, which possess linear separability, but there are large overlapping areas in the features, so it is hard to classify them with high accuracy. The results of the encoders in MTFST are shown in Figures 9(c)–9(e), and we can observe obvious decision boundaries between different fault types in the features generated by the multiscale TFR encoders, which demonstrates the effectiveness of the encoders in coding discriminative tokens. Nevertheless, many tokens in the encoders are still inevitably misclassified, and the distribution discreteness within the same fault type needs to be improved. The tokens in the decoder, which form the final output of MTFST, are shown in Figure 9(f). The distribution has well-defined interclass boundaries and compact intraclass distances, which illustrates that the decoder fuses the information of each encoder and further improves the ability of MTFST to extract and express hidden features.
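A t-SNE projection of this kind can be produced along the following lines; the perplexity and colour map are illustrative choices, as the paper does not state its t-SNE settings.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, perplexity=30):
    """Project high-dimensional feature vectors to 2D and colour by fault label."""
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(features)
    scatter = plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
    plt.legend(*scatter.legend_elements(), title="Fault type")
    plt.show()
```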

4.1.5. Comparison with Other Methods

In this section, the proposed network is compared with other deep learning methods to further demonstrate the effectiveness of MTFST. Among these methods, raw-signal and TFR-based networks are adopted, including CNN [45], CNN-LSTM [46], Bi-LSTM [47], and WRN-16-2 [48]. Furthermore, the up-to-date transformer-like methods TST [30], TFT [28], and BAFT [29] are employed for comparison. The parameters of the above methods are set as in the original papers.

The dataset is randomly divided into a training set and a test set with a train/test ratio of 0.8/0.2. The diagnosis results of the different methods are listed in Table 2. In general, the TFR-based methods obtain better performance than the vibration-signal-based methods.

The proposed MTFST achieves the best average accuracy of 99.34%. In addition, t-SNE is used to investigate the fault feature extraction and representation ability of the different models; the test run closest to the average accuracy of each network is taken as an example. As shown in Figure 10, the hidden features extracted by MTFST possess the best intraclass compactness and interclass separability. The results indicate that MTFST achieves the best fault diagnosis performance on the XJTU-SY dataset.

4.2. Case 2: Self-Made Experimental Rig

A proprietary uniaxial rolling gear test rig is used to simulate different working conditions in our experiments; it contains a motor, coupling, test bearing, adjustable magnetic loader, acceleration sensor, and data acquisition system, as shown in Figure 11. In the test, a three-phase asynchronous motor commonly used in mine water pumps and small hoists is employed. The rated power of the motor is 2.2 kW, and the working speed is 1430 r/min. The magnetic loader is controlled by an NX6000 dynamometer, and the working load torques are set to 2 Nm, 5 Nm, and 10 Nm. A set of deep-groove ball bearings (SKF-4306) with different defects is used for vibration monitoring, as listed in Table 3. As shown in Figure 12, there are 6 bearing states containing single and compound defects to simulate the faults frequently occurring in mining machines. The defects are created by a linear cutting machine; the inner race, outer race, and ball are defected with the same size of 1 mm width and 0.5 mm depth, and the cage is cut radially. The vibration signals under different working conditions are acquired by a CYQ9250 piezoelectric accelerometer and recorded by an NI USB-6009 data collector with a sampling frequency of 10 kHz for a duration of 20 minutes. Finally, the signals are randomly split into training and test sets with 500 and 120 samples per condition, respectively, and each sample contains 20,000 points.

The proposed MTFST and the other deep networks are tested on the self-made experiment dataset to validate the effectiveness and superior performance of our method. The hyperparameters of MTFST are shown in Table 4. Again, the experiment for each model is conducted for 5 runs to exclude the effect of data randomness. The average accuracy on the test set under different defect types and working conditions is listed in Table 5. The accuracy boxplots of the different methods, compiled from the results of the 5 runs, are shown in Figure 13. It can be seen that the accuracy for single defects is generally higher than that for compound defect types. In general, the proposed MTFST obtains better performance than the other comparative groups. As shown in Figure 14, the t-SNE of the extracted features is used to further illustrate the performance of MTFST. Note that, for each model, the test under the 10 Nm working condition whose accuracy is closest to the average is used as the example. From the results, it can be observed that MTFST is better at learning to distinguish the hidden characteristics of the fault types. The confusion matrices shown in Figure 15 further present the diagnosis results of MTFST in the test example.

5. Conclusions

In this paper, a novel transformer-like bearing fault diagnosis network that processes the TFRs of raw vibration signals is established. The XJTU-SY and self-made experiment rig datasets are used to verify its effectiveness, and the diagnosis results of existing networks based on CNN, RNN, and transformer are analysed in the experiments as comparison groups for the proposed MTFST. The main conclusions are as follows:
(1) A novel tokenizer based on TFRs obtained by STFT with different window widths is designed, which codes the multiscale complementary TFR information to grasp more discriminative features.
(2) The designed sparse self-attention mechanism (SSAM) effectively eliminates the interference of secondary information and obtains better performance than the naive self-attention mechanism.
(3) The proposed MTFST discards the recurrence structure and convolutional operations and relies on the multihead attention mechanism, which improves diagnostic performance and provides partial interpretability. Furthermore, the encoder-decoder framework of MTFST is closer to the vanilla transformer and better at extracting hidden features than existing transformer-like algorithms.

The experimental results indicate that MTFST can effectively detect rolling bearing faults, which extends the family of transformer-based diagnosis methodologies. Future research will focus on the following aspects for further improvement. First, CNN modules and the transformer can be integrated to enhance model performance by adding small-receptive-field features. Second, adaptive STFT window widths and an adaptive sparse ratio for the SSAM can be studied to improve generalization. Third, the method can be tested on different rotating machines and application scenarios, such as gearboxes and remaining useful life (RUL) estimation.

Data Availability

The data supporting the current study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors would like to thank the Youth Program of the Education Foundation of Guizhou Province, China (No. QianJiaoHeKYZi[2020]125).