Abstract

To address the problem that existing methods in big data environments extract the emotional features of microblogs insufficiently and yield low average accuracy, a microblog emotion analysis method using deep learning in a Spark big data environment is proposed. First, the Jieba word segmentation method is used to process text comments, reducing the interference of irregular grammar and nonstandard words with the emotion analysis of microblog text. Then, four kinds of features are selected: features based on affective rules, unigram features, syntactic features, and dependent word collocation features. To prevent the curse of dimensionality caused by an excessive number of features, information gain is used as the feature selection method to reduce the feature dimension. Finally, a microblog emotion analysis method based on a deep belief network (DBN) is established, and the DBN is parallelized on a Spark cluster to shorten the training time. Experiments show that when the feature set consists of the top 2000 features, the classification accuracy obtained by fusing the four feature types is 90.94%, which is higher than that of the comparison methods. In addition, the training time of the DBN algorithm parallelized on the Spark cluster is only 27.78% of that on a single machine. Therefore, compared with the comparison methods, the proposed method improves both the accuracy and the training efficiency of the microblog emotion analysis system.

1. Introduction

The development of network technology has made online communication more and more frequent, through blogs, forums, and e-commerce website comments [1]. Users express their feelings about certain events or things by publishing information. Analyzing the text in social networks can help the government and other management institutions understand fluctuations in social mood, conduct public opinion analysis, judge how a situation is developing, give reasonable guidance, and maintain social stability [2–5]. From a commercial perspective, with the rise and popularity of e-commerce platforms such as Taobao and Amazon, users can evaluate products after purchase, which makes product information more transparent. The quality of product comments greatly affects users' desire to purchase. Businesses can therefore analyze users' comments, improve goods or change sales strategies in time, further analyze users' consumption characteristics and hobbies, draw user portraits, and make decisions that maximize business profits. In addition, emotion analysis can be used to predict the stock market, election support, and other outcomes [6–9]. Emotion analysis thus has important value in the fields of society, business, politics, and management [10].

With the rapid development of domestic microblogs, many netizens participate in the discussion of various events, from personal trivia to enterprise marketing to major global events. Microblogs have become a social platform for the release of public opinion. Analyzing user emotion in microblogs is therefore of far-reaching significance for the development of government, enterprises, and individuals [11–15].

According to the granularity of research, emotion analysis tasks can be divided into three categories: document level, sentence level, and aspect level [16]. Document level emotion analysis regards the whole document as a basic unit and assumes that a document as a whole expresses only one emotional polarity. However, a document contains multiple sentences, and different sentences may have different emotional polarities [17–19]. Sentence level emotion analysis is more fine-grained than document level analysis and classifies the emotional polarity of a single sentence. Aspect level emotion analysis differs from both: it considers, at a finer granularity, the emotional polarity together with the target of the corresponding emotion. The target here is an attribute word or aspect, which usually exists in the form of an entity or an entity characteristic [20].

To address the problem that existing methods in big data environments extract the emotional features of microblogs insufficiently and yield low average accuracy, a microblog emotion analysis method using deep learning in a Spark big data environment is proposed. The main innovations are as follows:
(1) The Jieba word segmentation method is used to process text comments, which effectively reduces the interference of irregular grammar and nonstandard words with the emotion analysis of microblog text.
(2) Feature dimension reduction is carried out with the information gain feature selection method to prevent the curse of dimensionality caused by an excessive feature dimension.
(3) A microblog emotion analysis method based on DBN is established, and the DBN is parallelized on a Spark cluster, which effectively shortens the training time of the model.

The rest of the paper is arranged as follows: Section 2 is related work, which introduces the current research status of emotion analysis. Section 3 describes the structure and principle of the deep belief network. In Section 4, the proposed DBN microblog emotion classification model based on Spark parallel optimization is introduced in detail. Section 5 is the experiment. Section 6 summarizes this study.

2. Related Work

Deep learning methods can better capture the grammatical and semantic features of text and are a research focus of emotion analysis. Jebbara et al. used the bidirectional gated recurrent unit (GRU) to extract attribute words and aspect-specific emotion and to extract features from the text for the prediction of sentence labels [21]. Considering the characteristics of part of speech and corpus, Liu et al. proposed a method that completes the task of attribute word extraction with an RNN and achieved better performance than the traditional system based on conditional random fields [22]. To overcome the fixed window size of the convolutional neural network (CNN) model and better capture context information, Chen et al. drew on named entity recognition (NER) methods and proposed a text emotion analysis method based on the BiLSTM-CRF model to classify the BIO labels of entities in sentences [23]. Yin et al. proposed a long short-term memory (LSTM) model for cross-domain attribute word extraction, which used a rule-based method to generate the auxiliary label sequence of each sentence [24]. Li et al. incorporated attention into attribute word extraction and aspect category recognition and constructed a truncated history attention and selective transformation network on LSTM [25]. Wang et al. proposed a GRU-based coupled multilayer attention (CMLA) model to extract attribute words and opinion words [26]. During learning, it encodes and decodes the dual propagation of attribute words and opinion words and is not limited to syntactic relations. Zhang et al. proposed a text emotion classification model integrating content features and user features [27]. Jamal et al. proposed a Twitter emotion analysis framework based on the Internet of Things, which combined term frequency-inverse document frequency (TF-IDF) with a deep learning model for emotion analysis, filtered the original tweets by tokenization so as to capture useful features without noise, and used TF-IDF statistics to estimate the importance of local and global features. An adaptive comprehensive class balance technique was used to solve the class imbalance between different emotions [28]. Jelodar et al. used the LSTM method to classify comments about COVID-19; the results have a certain influence on guidance and decision-making for COVID-19-related issues [29]. Wei et al. proposed a BiLSTM model based on multipolarity orthogonal attention for implicit sentiment analysis. Compared with the traditional single attention mechanism, this method can effectively identify the differences between words and emotional tendencies, which was verified in experiments [30].

3. Deep Belief Network

3.1. DBN Model Structure

DBN is a neural network model with multiple hidden layers. Because the weights in such a deep structure are difficult to optimize directly, a greedy unsupervised training method is used. Figure 1 shows the structure of a deep belief network with three hidden layers $h_1$, $h_2$, and $h_3$, where $v$ is the input data and $y$ is the output label corresponding to the input data. In the first step, the DBN pairs every two adjacent layers, treating the lower layer as the input layer and the upper layer as the output layer, and trains the parameters between them. Moreover, propagation between the input layer and the hidden layer is bidirectional: it is divided into a forward process and a backward process to learn the data distribution. This way of building networks between layers is realized by the restricted Boltzmann machine (RBM) model. An RBM is a stochastic neural network with two layers. Nodes in the same layer are not connected to each other, while the nodes of the two layers are connected symmetrically and without direction, which is equivalent to an undirected graph. An RBM consists of a hidden layer composed of random hidden units and a visible layer composed of random visible units.

Because of the special structure of the RBM model, which has connections between layers and no connections within a layer, it has the following important property: when the states of the visible units are given, the activation probability of the $j$th neuron in the hidden layer is calculated from the neuron states of the visible layer as follows:

$$P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i} v_i w_{ij}\right) \tag{1}$$

where $\sigma$ is the sigmoid activation function, $v_i$ represents the $i$th visible unit, $h_j$ represents the $j$th hidden unit, $w_{ij}$ is the weight between the $i$th visible unit and the $j$th hidden unit, and $b_j$ is the offset threshold of the $j$th hidden unit.

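Similarly, when the states of the hidden units are given, the probability that the binary state of a visible unit is 1, that is, the activation probability of the visible unit, can be expressed as

$$P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j} w_{ij} h_j\right) \tag{2}$$

where $a_i$ is the offset threshold of the $i$th visible unit, and $\sigma$, $w_{ij}$, and $h_j$ are the sigmoid function, the weight, and the $j$th hidden unit as in formula (1).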
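
The two conditional probabilities can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the symbol `a` for the visible offsets, and the toy weights are all assumptions made for the example.

```python
import math

def sigmoid(x):
    """Logistic function; maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j; W[i][j] links visible i to hidden j."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    """P(v_i = 1 | h) for every visible unit i, reusing the same symmetric weights."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

# Toy RBM with 3 visible and 2 hidden units (illustrative values only).
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
a = [0.0, 0.0, 0.0]   # visible offsets (symbol name assumed)
b = [0.1, -0.1]       # hidden offsets
p_h = hidden_probs([1, 0, 1], W, b)
p_v = visible_probs([1, 0], W, a)
```

Because there are no connections within a layer, each unit's probability depends only on the other layer, which is why both functions are simple per-unit loops.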

To determine the deep belief network model, the first thing to know is the number of nodes in the visible layer and in the hidden layers. The number of nodes in the visible layer equals the input data dimension. In some research fields, the number of nodes in the hidden layer is tied to the number of nodes in the visible layer, for example when image data are processed with a convolutional restricted Boltzmann machine; that case is not analyzed here. In most cases, however, the number of hidden layer nodes has to be determined according to the application, or chosen as the number that minimizes the energy of the model under given parameters.

3.2. DBN Model Training

The training of DBN model is divided into two parts: unsupervised pretraining process based on RBM and supervised parameter adjustment process.

The unsupervised pretraining process of the DBN model adopts a layer-by-layer greedy learning strategy. The initial input layer is the visible layer, and the input data are the text feature vectors. The data vector of the visible layer $v$ is combined with the weight $W_1$ to infer the data vector of the hidden layer $h_1$, which is the training process of RBM1. Then, the data vector of the hidden layer $h_1$ is combined with the weight $W_2$ to infer the data vector of the hidden layer $h_2$, which is the training process of RBM2, and so on. That is, multiple RBMs are stacked: the output of the previous RBM is the input of the next RBM, and the hidden layer of the previous RBM is the visible layer of the next RBM. Training proceeds step by step to the last layer, which completes the pretraining process of the DBN. The specific steps are as follows:

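Step 1. Randomly initialize the parameters $\theta = \{W, a, b\}$, in which $W$ is the weight matrix, $a$ is the vector of offset coefficients of the visible layer, and $b$ is the vector of offset coefficients of the hidden layer. $v$ denotes the visible neurons, of which there are $n$; $h$ denotes the hidden neurons, of which there are $m$.

Step 2. Assign the training sample to the visible layer $v^{(0)}$ and calculate the probability that each hidden layer neuron is activated:

$$P\left(h_j^{(0)} = 1 \mid v^{(0)}\right) = \sigma\left(b_j + \sum_{i=1}^{n} w_{ij} v_i^{(0)}\right) \tag{3}$$

Step 3. Perform Gibbs sampling to obtain the value of each neuron in the hidden layer:

$$h^{(0)} \sim P\left(h^{(0)} \mid v^{(0)}\right) \tag{4}$$

Step 4. Reconstruct the visible layer with the $h^{(0)}$ obtained in formula (4) and calculate the activation probability:

$$P\left(v_i^{(1)} = 1 \mid h^{(0)}\right) = \sigma\left(a_i + \sum_{j=1}^{m} w_{ij} h_j^{(0)}\right) \tag{5}$$

Step 5. Perform Gibbs sampling again and reconstruct the value of each neuron in the visible layer:

$$v^{(1)} \sim P\left(v^{(1)} \mid h^{(0)}\right) \tag{6}$$

Step 6. Calculate the activation probability of the hidden layer neurons again with the reconstructed visible layer neurons:

$$P\left(h_j^{(1)} = 1 \mid v^{(1)}\right) = \sigma\left(b_j + \sum_{i=1}^{n} w_{ij} v_i^{(1)}\right) \tag{7}$$

where $\sigma$ is the sigmoid activation function, whose graph is shown in Figure 2. The sigmoid is used as the activation function because its domain is $(-\infty, +\infty)$ and its range is (0, 1). Therefore, whatever the range of the input data of the visible layer neurons, the activation probability of the nodes can be obtained with the sigmoid function.

Step 7. Obtain the new weight matrix $W$, visible layer offset coefficient $a$, and hidden layer offset coefficient $b$:

$$\begin{aligned} w_{ij} &\leftarrow w_{ij} + \eta\left[P\left(h_j^{(0)} = 1 \mid v^{(0)}\right) v_i^{(0)} - P\left(h_j^{(1)} = 1 \mid v^{(1)}\right) v_i^{(1)}\right] \\ a_i &\leftarrow a_i + \eta\left(v_i^{(0)} - v_i^{(1)}\right) \\ b_j &\leftarrow b_j + \eta\left[P\left(h_j^{(0)} = 1 \mid v^{(0)}\right) - P\left(h_j^{(1)} = 1 \mid v^{(1)}\right)\right] \end{aligned} \tag{8}$$

where $\eta$ is the learning rate.

To sum up, pretraining only needs to iteratively calculate the parameters of RBM1, RBM2, and RBM3 in turn to finally obtain the best weight $W$.
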
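
Steps 2 to 7 correspond to what is usually called one step of contrastive divergence (CD-1). The following is a minimal sketch under that reading, written in plain Python for a single binary training vector; the function names, the toy data, the learning rate, and the initialization scale are assumptions for illustration and do not come from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Gibbs sampling step: draw a binary state for each unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update on a single binary vector v0; W, a, b are updated in place."""
    n, m = len(a), len(b)
    ph0 = [sigmoid(b[j] + sum(v0[i] * W[i][j] for i in range(n))) for j in range(m)]  # Step 2
    h0 = sample(ph0, rng)                                                             # Step 3
    pv1 = [sigmoid(a[i] + sum(W[i][j] * h0[j] for j in range(m))) for i in range(n)]  # Step 4
    v1 = sample(pv1, rng)                                                             # Step 5
    ph1 = [sigmoid(b[j] + sum(v1[i] * W[i][j] for i in range(n))) for j in range(m)]  # Step 6
    for i in range(n):                                                                # Step 7
        for j in range(m):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(m):
        b[j] += lr * (ph0[j] - ph1[j])
    return v1

rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible   # visible offsets
b = [0.0] * n_hidden    # hidden offsets
data = [[1, 1, 0, 0], [0, 0, 1, 1]]   # toy binary "feature vectors"
for epoch in range(100):
    for v0 in data:
        cd1_step(v0, W, a, b, lr=0.1, rng=rng)
```

Stacking RBMs then amounts to feeding the hidden activations of one trained RBM as the training vectors of the next.
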
The supervised parameter optimization training of the DBN model first uses the forward propagation algorithm, with the parameters $W$ and $b$ obtained in pretraining, to determine whether the hidden layer neurons are activated. Let $l$ be the index of a layer of the neural network and $h^{(0)} = v$ the input; the excitation value of the hidden layer neurons in layer $l$ is calculated as

$$z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)} \tag{9}$$

Then, we propagate upward layer by layer, calculate the excitation values of the neurons in all hidden layers using formula (9), normalize them with the activation function, $h^{(l)} = \sigma\left(z^{(l)}\right)$, and finally calculate the excitation value and output vector of the output layer:

$$z^{(L)} = W^{(L)} h^{(L-1)} + b^{(L)}, \qquad \hat{y} = f\left(z^{(L)}\right) \tag{10}$$

where $L$ denotes the output layer and $f$ is its activation function.

Then, the back propagation algorithm is used to update the parameters of the whole DBN network. The back propagation algorithm adopts the reconstruction error criterion, and the cost function is as follows:

$$E\left(W^{(l)}, b^{(l)}\right) = \frac{1}{N}\sum_{k=1}^{N}\left\|\hat{y}_k - y_k\right\|^2 \tag{11}$$

where $E$ is the reconstruction error over the $N$ training samples, $\hat{y}_k$ is the actual output of the output layer, $y_k$ is the theoretical output of the output layer, and $\left(W^{(l)}, b^{(l)}\right)$ represents the weight and offset coefficient of layer $l$. The reconstruction error reflects the likelihood of the training data to a certain extent. Finally, the gradient descent (GD) algorithm is used to update the weight and offset coefficient of the whole DBN network:

$$W^{(l)} \leftarrow W^{(l)} - \eta\,\frac{\partial E}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta\,\frac{\partial E}{\partial b^{(l)}} \tag{12}$$

where $\eta$ is the learning rate.

To sum up, the training purpose of the DBN model is to fit the input data as closely as possible, and the output result is the reconstruction of the training data. The visible layer neurons transfer their own features to the hidden layer neurons. Through iterative training, the hidden layer neurons capture the higher-level features exhibited by the visible layer neurons, which enhances the feature extraction ability of the model.
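
The forward pass and the error-driven update can be sketched as follows. This is an illustrative sketch only: it assumes sigmoid activations on every layer, a mean squared error as the reconstruction error, and a row-per-unit weight layout, none of which is confirmed by the text; `forward`, `squared_error`, and `gd_update` are hypothetical helper names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, layers):
    """Propagate x upward; each layer is (W, b) with W[j][i] linking input i to unit j."""
    h = x
    for W, b in layers:
        # Excitation value of each unit, then activation.
        z = [b[j] + sum(W[j][i] * h[i] for i in range(len(h))) for j in range(len(b))]
        h = [sigmoid(zj) for zj in z]
    return h

def squared_error(outputs, targets):
    """Mean over samples of the squared difference between actual and theoretical outputs."""
    total = sum((o - t) ** 2
                for out, tgt in zip(outputs, targets)
                for o, t in zip(out, tgt))
    return total / len(outputs)

def gd_update(param, grad, lr):
    """Plain gradient-descent step on a flat parameter list: param <- param - lr * grad."""
    return [p - lr * g for p, g in zip(param, grad)]

# Two-layer toy network with all-zero parameters, so every activation is sigmoid(0) = 0.5.
layers = [([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]), ([[0.0, 0.0]], [0.0])]
y_hat = forward([1.0, 0.0], layers)
err = squared_error([y_hat], [[1.0]])
stepped = gd_update([1.0, 2.0], [0.5, -1.0], lr=0.1)
```

In a full implementation the gradients passed to `gd_update` would come from back propagation through every layer; they are supplied by hand here to keep the sketch short.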

4. DBN Microblog Emotion Classification Model Based on Spark Parallel Optimization

Figure 3 shows the workflow of microblog emotion analysis in the proposed method. Before microblog emotion can be classified, the text must be processed into a form that a computer can calculate with, that is, a representation model of the data. Then, an emotional dictionary is built, emotional features are extracted from the microblog text and taken as input, the whole Spark parallel DBN model is trained, and the classification results are obtained, realizing the emotional analysis of the microblog text.

4.1. Microblog Preprocessing and Feature Vector Construction
4.1.1. Preprocessing

Text preprocessing is an indispensable part of text emotion analysis. Because people differ greatly in how they think and speak about emotion, text comments are often filled with strong personal styles. All kinds of irregular grammar and nonstandard words interfere with text emotion analysis, so text preprocessing is very important. The text preprocessing in this study includes the following steps: filtering out repeated corpus entries, filtering out irregular words, removing stop words, processing emoticons, and Chinese word segmentation. Jieba is selected for Chinese word segmentation. Jieba supports user-defined dictionaries and segments Chinese text well, which meets the needs of this study.
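
The preprocessing steps listed above can be sketched as a small pipeline. The stop-word list, the emoticon mapping, and the rule that treats URLs and @mentions as "irregular words" are placeholders invented for this example; the segmenter is passed in as a function so that the sketch runs without third-party packages (for Chinese text one would pass `jieba.lcut` from the Jieba package instead of the whitespace split used here).

```python
import re

STOP_WORDS = {"the", "is", "a"}                    # hypothetical stop-word list
EMOTICONS = {":)": "EMO_POS", ":(": "EMO_NEG"}     # hypothetical emoticon mapping

def preprocess(comments, segment=str.split):
    """Deduplicate, map emoticons to tokens, strip URLs/mentions, segment, drop stop words."""
    seen, cleaned = set(), []
    for text in comments:
        if text in seen:                           # filter out repeated corpus entries
            continue
        seen.add(text)
        for emo, tag in EMOTICONS.items():         # emoticon processing
            text = text.replace(emo, f" {tag} ")
        text = re.sub(r"https?://\S+|@\w+", " ", text)   # assumed "irregular word" filter
        tokens = [t for t in segment(text) if t.lower() not in STOP_WORDS]
        cleaned.append(tokens)
    return cleaned

result = preprocess([
    "the service is good :)",
    "the service is good :)",      # exact duplicate, dropped
    "bad @bob http://x.y",
])
```

Keeping emoticons as dedicated tokens rather than deleting them is one possible design choice; it lets later feature extraction treat them like emotion words.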

4.1.2. Feature Construction

Text feature selection is a key step of machine learning, which determines the accuracy of emotion classification. This study selects four categories of features: features based on emotional rules, unigram features, syntactic features, and dependent word collocation features. The emotional rule features are obtained by improving on earlier rule-based methods and extracting the effective information that the new rules produce. Considering that phrase structure can reduce sentence ambiguity, we add bigrams and their combined part-of-speech tags to the feature set as syntactic features. The dependency feature is the dependency identifier obtained from the dependency parse tree. It plays an important role in annotating emotional category information and preserves the information directly related to emotional words as well as other hidden information.

The method based on an emotion dictionary plays an important role in the history of text emotion analysis. Its core idea is to sum the polarities of the emotion words and judge the emotional tendency of the text from the resulting value. The formula of the classical method is as follows:

$$\mathrm{Score} = \sum_{i=1}^{n} p(w_i) \tag{13}$$

In the above formula, $p(w_i)$ represents the polarity of emotional word $w_i$, and $n$ represents the number of emotional words in the text. The method based on an emotional dictionary can barely complete the task in some simple text tests, but given the complex grammar and the variety of language structures found in real use, its practical value is limited. Therefore, considering the defects of the classical method, a new emotion rule method is proposed. Because comment texts are generally short and basically consist of separated sentences, the method takes each clause as a basic unit. Taking negative words, connectives, and other grammatical structures into account, the emotion calculation formula (equation (14)) is proposed to calculate the emotional tendency of each unit. The final emotional tendency of the text is judged from the sum of the scores of all units: if the score is positive, the text emotion is classified as positive; if the score is negative, the text emotion is classified as negative. The parameters of equation (14) are the number of emotional words in the text, the emotional extremum of each emotional word, the number of words modifying each emotional word, the weight of each modifier, and the weakening or strengthening coefficient of the rules. The last coefficient exists to solve a problem often ignored in emotion analysis tasks: the deviation of emotion analysis results caused by subject confusion.
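
To make the clause-level scoring concrete, here is one possible sketch. It is not equation (14) itself: the way modifier weights combine (multiplied together and applied to the next emotion word), the dictionary entries, the single `rule_coeff` per clause, and the "neutral" label for a zero score are all assumptions made for the example.

```python
POLARITY = {"good": 1.0, "like": 1.0, "bad": -1.0}       # hypothetical emotion dictionary
MODIFIERS = {"very": 2.0, "really": 1.5, "not": -1.0}    # hypothetical modifier weights

def clause_score(tokens, rule_coeff=1.0):
    """Score one clause: each emotion word's polarity times the modifiers preceding it."""
    score, pending = 0.0, 1.0
    for tok in tokens:
        if tok in MODIFIERS:
            pending *= MODIFIERS[tok]      # accumulate modifiers for the next emotion word
        elif tok in POLARITY:
            score += POLARITY[tok] * pending
            pending = 1.0                  # modifiers are consumed by the emotion word
    return rule_coeff * score              # rule_coeff weakens or strengthens the clause

def text_polarity(clauses):
    """Sum the clause scores and map the sign to a label (zero is treated as neutral)."""
    total = sum(clause_score(c) for c in clauses)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

label = text_polarity([["service", "really", "good"], ["I", "very", "like"]])
```

Treating negation as a weight of -1 is a common simplification; a fuller rule set such as the one summarized in Table 1 would handle connectives and subject changes as well.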

Table 1 lists a brief description of the emotional rules designed by the proposed method. Generally speaking, the more complete the emotional rules are, the better the effect of the emotional rule method is. After combining the emotional rules, the final score is calculated according to formula (14), and then, three parameters are extracted as emotional features: the score of emotional words, the number of positive/negative emotional words, and the ratio of strengthening/weakening times of rules.

For the other three feature categories, the sentence “the scenic spot service is really good, I like it very much!” is taken as an example to show the feature extraction process. The sentence is input into Jieba word segmentation to obtain “scenic spot /n service /n really /ad good /a , /w I /rr very /d like /v ! /w”, where /n stands for noun, /ad for adverbial word, /a for adjective, /w for punctuation mark, /rr for pronoun, /d for adverb, and /v for verb.

Based on the above results of word segmentation and tagging, the syntactic features can be obtained: scenic spot service, service really, really good, good I, I very, like it very much, n, ad, a, rr, d, v. The number of features is 12. After the word segmentation result is obtained, the dependency and word collocation features of the example sentence can be obtained by calling the StanfordNlp natural language processing toolkit. The specific relationships and collocations are listed in Table 2.
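
The count of 12 syntactic features can be reproduced with a short sketch that builds adjacent-word bigrams and collects the distinct part-of-speech tags. The tags `w` (punctuation) and `v` (verb) are assumptions here, and the last bigram comes out as "very like" because the example works on the English glosses rather than the original Chinese tokens.

```python
def syntactic_features(tagged, punct_tag="w"):
    """Return adjacent-word bigrams plus the distinct POS tags, skipping punctuation."""
    words = [(w, t) for w, t in tagged if t != punct_tag]
    bigrams = [f"{w1} {w2}" for (w1, _), (w2, _) in zip(words, words[1:])]
    tags = list(dict.fromkeys(t for _, t in words))   # distinct tags, first-seen order
    return bigrams + tags

tagged = [("scenic spot", "n"), ("service", "n"), ("really", "ad"), ("good", "a"),
          (",", "w"), ("I", "rr"), ("very", "d"), ("like", "v"), ("!", "w")]
features = syntactic_features(tagged)
```

Seven non-punctuation tokens give six bigrams, and the six distinct tags bring the total to twelve.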

In practical use, to avoid the problems caused by an excessive feature dimension, the information gain (IG) feature selection method is adopted for feature dimension reduction. The formula is as follows:

$$IG(t) = -\sum_{i} P(c_i)\log P(c_i) + P(t)\sum_{i} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}) \tag{15}$$

where $P(c_i)$ is the probability of category $c_i$, $P(t)$ is the probability of feature $t$ and $P(\bar{t}) = 1 - P(t)$ is the probability that it does not appear, $P(c_i \mid t)$ is the probability that category $c_i$ appears when feature $t$ appears, and $P(c_i \mid \bar{t})$ is the probability that category $c_i$ appears when feature $t$ does not appear. The score of each feature is calculated according to the formula, and the top $N$ features by score are selected, which both selects the features and reduces their dimension.
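
Information gain and top-N selection can be sketched as below, using the standard identity IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t). Documents are modeled as sets of features; the base-2 logarithm, the toy corpus, and the function names are assumptions for illustration (the base only rescales the scores and does not change the ranking).

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, feature):
    """IG of a binary feature (present/absent) with respect to the class labels."""
    classes = sorted(set(labels))

    def class_dist(subset):
        return [sum(1 for l in subset if l == c) / len(subset) for c in classes] if subset else []

    with_t = [l for d, l in zip(docs, labels) if feature in d]
    without_t = [l for d, l in zip(docs, labels) if feature not in d]
    p_t = len(with_t) / len(docs)
    return (entropy(class_dist(labels))
            - p_t * entropy(class_dist(with_t))
            - (1 - p_t) * entropy(class_dist(without_t)))

def top_n_features(docs, labels, n):
    """Rank the vocabulary by information gain and keep the n best features."""
    vocab = sorted({f for d in docs for f in d})
    return sorted(vocab, key=lambda f: information_gain(docs, labels, f), reverse=True)[:n]

docs = [{"good", "service"}, {"good"}, {"bad", "service"}, {"bad"}]
labels = ["pos", "pos", "neg", "neg"]
selected = top_n_features(docs, labels, 2)
```

In this toy corpus "good" and "bad" each separate the classes perfectly (one bit of gain), while "service" appears equally in both classes and gains nothing, so it is the feature that gets dropped.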

4.2. Parallel Optimization of Emotion Classifier Based on Spark Platform

The master node provides the initialization parameters for training and distributes them to each worker node. Each worker node uses the training data on its split slices for parameter learning and updates the training parameters one minibatch at a time. When a worker node finishes training on a batch, it sends the resulting parameter change to the master node for the parameter update, and this repeats until all training is completed. The feature data processed in each training round are converted into RDD form for storage. The specific algorithm is shown in Algorithm 1.

Input: Training data set $D$, the feature vector set obtained after microblog preprocessing
Output: Emotion classification result set $R$
Determine the number of iterations $T$ and the parameter $\theta$ for initializing the RBM
For i = 0 to $T$ do
  The Master node broadcasts $\theta$ to each Worker node;
  Each Worker node uses the data on its Split to train the parameters of the RBM network;
  All Worker nodes send their parameter changes $\Delta\theta$ to the Master node;
  The Master node calculates the updated $\theta$ from the received $\Delta\theta$. The feedback mechanism of the BP network is used to adjust and fine-tune the DBN network model.
End

The parallelization structure of the DBN network on the Spark platform is shown in Figure 4.
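
The broadcast, local training, and aggregation loop of Algorithm 1 can be imitated without a cluster. The sketch below is a single-process simulation on a deliberately trivial model (fitting one scalar to the data mean), and it assumes that the master averages the workers' parameter changes, which the text does not specify. On an actual Spark cluster the broadcast and the per-split work would typically be expressed with `SparkContext.broadcast` and `RDD.mapPartitions`.

```python
def worker_delta(theta, split, lr):
    """Hypothetical worker: one minibatch pass on its split, returning the parameter change."""
    grad = sum(theta - x for x in split) / len(split)   # gradient of 0.5 * (theta - x)^2
    return -lr * grad

def parallel_train(splits, theta=0.0, iterations=50, lr=0.5):
    """Simulated master loop: broadcast theta, collect the deltas, apply their average."""
    for _ in range(iterations):
        deltas = [worker_delta(theta, s, lr) for s in splits]   # each worker trains locally
        theta += sum(deltas) / len(deltas)                       # master aggregates the changes
    return theta

splits = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy data, one list per worker
theta = parallel_train(splits)
```

With equally sized splits, averaging the deltas is equivalent to one gradient step on the full data set, so the simulated run converges to the overall mean of 3.5.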

5. Experiment and Analysis

5.1. Experimental Data and Evaluation Indices

The dataset of this experiment comes from COAE2015 Task 3. There are 133,201 microblog sentences, including a large number of interfering sentences. The data are divided into four domains for evaluation: books (BOO), audio products (DVD), electronic products (ELE), and kitchenware (KIT). Each domain dataset contains 2000 positive and 2000 negative comments.

In this study, accuracy is used as the evaluation index of the experiment, and the calculation formula is as follows:

$$\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}} \tag{16}$$

where $N_{\mathrm{correct}}$ is the number of samples correctly predicted by emotion classification and $N_{\mathrm{total}}$ is the total number of samples in the test corpus.

5.2. Relationship between Iteration Times and Prediction Accuracy

The advantage of a deep neural network over a shallow one is that it can iteratively learn, extract features, and constantly modify the model, but too many or too few iterations will hurt overall performance. If the number of iterations is below a certain value, the features are learned incompletely and the model does not reach its full performance. If the number of iterations is above a certain value, training takes too long and becomes inefficient. The choice of the number of iterations is therefore very important. In the experiment, with Ft1 as the feature, the relationship between prediction accuracy and the number of iterations is shown in Figure 5. It can be seen from the figure that when the number of iterations is less than 60, the recognition rate increases significantly as the number of iterations increases. At 65 iterations, the accuracy changes very little and has almost reached a balanced state. Based on this analysis, 65 iterations are selected to ensure the stability of the results.

5.3. Experimental Results and Analysis of Emotion Classification under Different Methods

Table 3 lists the experimental results of the text emotion classification method based on the deep belief network designed in this study. The input of the network is the vector composed of the 1000-, 2000-, or 4000-dimensional features with the highest information gain. Abstract text features are learned through the nonlinear mapping of the hidden layers. The specific settings are as follows: for the 1000-dimensional feature set, the restricted Boltzmann machine is trained for 100 iterations, and the node parameters corresponding to the network structure “input layer-hidden layer-output layer” are “1000-300-100.” For the 2000-dimensional feature set, the restricted Boltzmann machine is trained for 100 iterations, and the node parameters corresponding to the network structure are “2000-600-300.” For the 4000-dimensional feature set, the restricted Boltzmann machine is trained for 100 iterations per layer, and the node parameters corresponding to the network structure are “4000-600-300.” It can be seen from Table 3 that the method based on the deep belief network achieves the best classification accuracy of 90.94% when the structure is 2000-600-300 and the four features are combined.

In order to verify the learning and expression ability of the proposed method, it is compared with the methods in reference [27] and reference [28] using the same features. The results are listed in Table 4. The recognition rates of reference [27] and reference [28] are 87.11% and 87.69%, respectively. When the structure of the proposed method is 2000-600-300, the combination of four features achieves the best classification accuracy of 90.94%. The overall accuracy of the proposed method is thus higher than that of the methods in reference [27] and reference [28], because the proposed method obtains more emotional knowledge than the comparison methods when learning the features.

5.4. Microblog Emotion Analysis Results under Spark Platform

The DBN network is optimized in parallel on the Spark platform. The Spark cluster used in the experiment is composed of 10 servers. One server is used as the management node of the Spark cluster, and the other nine servers are used as its computing nodes. The hardware configuration of each server is a Xeon E5520 CPU, 20 GB of memory, and a 1 TB hard disk. In Figure 6, the abscissa represents the size of the training data and the ordinate represents the training time. It can be seen from the figure that when the amount of data increases to 60000, the Spark training time is only 27.78% of the single machine training time. The Jieba word segmentation method is used to reduce the interference of irregular grammar and nonstandard words with the emotion analysis of microblog text. Feature dimension reduction is carried out with the information gain feature selection method to avoid the curse of dimensionality. It can be concluded that the parallel DBN algorithm on the Spark platform effectively improves operating efficiency when processing massive data.
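
As a worked reading of the 27.78% figure, the ratio can be converted into a speedup $S$ and a per-node efficiency $E$. The efficiency value assumes that the single machine is comparable to one of the nine computing nodes, which the text does not state.

$$S = \frac{T_{\mathrm{single}}}{T_{\mathrm{Spark}}} = \frac{1}{0.2778} \approx 3.60, \qquad E = \frac{S}{9} \approx 0.40$$

Under that assumption, each computing node contributes roughly 40% of its ideal share, with the remainder presumably spent on communication and synchronization with the master node.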

6. Conclusion

To address the problem that existing methods in big data environments extract the emotional features of microblogs insufficiently and yield low average accuracy, a microblog emotion analysis method using deep learning in a Spark big data environment is proposed. The DBN is parallelized on a Spark cluster, which greatly shortens the training time. Experimental results show that the proposed algorithm analyzes microblog emotion well.

The factors considered in this study when designing the data-parallel fragmentation strategy are not comprehensive enough, and more data fragmentation strategies should be tried in the future. Follow-up work can also draw on other parallel optimization algorithms to improve the parallel speedup ratio of the algorithm. Moreover, in addition to word vector representation, researchers have developed new representation methods in recent years, such as Atlas and tree database, to represent text information. Therefore, the text emotion classification algorithm proposed in this study can be further improved. How to embed more, and more effective, text semantic information remains the focus of the next step.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the 2020 Horizontal Project (no. HX2020029).