Abstract

To enhance the performance of image classification and speech recognition, the optimizer is considered an important factor for achieving high accuracy. State-of-the-art optimizers serve well in applications that do not require very high accuracy, yet the demand for high-precision image classification and speech recognition is increasing. This study implements an adaptive method that applies the particle filter technique to a gradient descent optimizer to improve model learning performance. A pretrained model is used to reduce the computational time needed to deploy the image classification model, and a simple deep convolutional neural network is used for speech recognition. The applied method achieves a higher speech recognition accuracy on the test dataset (89.693%) than the conventional method (89.325%). The applied method also performs well on the image classification task, reaching an accuracy of 89.860% on the test dataset, better than the conventional method's 89.644%. Despite the slight difference in accuracy, the applied optimizer performs well on these datasets overall.

1. Introduction

Soft computing is available in several applications due to its usefulness in modeling and optimization. Numerous studies have focused on image and video processing with objectives such as detection and tracking. Various models have been proposed, including neural networks, deep learning, fuzzy logic, and hybrid methods [1]. However, their practical use in applications remains problematic because many applications require higher accuracy than the available models can supply. Hybrid methods that combine two or more soft computing techniques can often enhance the efficiency of image and video retrieval processes [2]. In an image context, a 3D geographical information system (GIS) data plan for the WiMax network was integrated to optimize both the network performance and the investment costs, both of which are relevant to the required number of base stations and sectors [3]. In addition, soft computing plays an important role in GIS research [4–7]. One important aspect of implementing soft computing is the quality of the dataset. Soft computing can also be used to generate meaningful and human-interpretable big datasets by defining an interface between the numerical and categorical spaces, i.e., the data definition and the linguistic space of human reasoning [8]. Furthermore, datasets applied to investigate soft computing methods should be benchmark datasets intended for validating various methods [1]. One example of applying soft computing to decision making is the neurofuzzy analytical network process [9], a new method that works based on both fuzzy logic and an artificial neural network. Another implementation of soft computing was proposed for tunneling optimization [10]. This model analyzes the relationship between the target tunneling responses and the impact of input parameters, including both geometrical and geological factors.
The proposed implementation is useful in reaching robust and low-cost soft computing solutions in the mining industry [11]. Soft computing can be applied in environmental management to predict vehicular traffic noise using data such as the volume per hour, percentage of heavy vehicles, and average speed of vehicles as inputs to neural networks or random forests [12]. Six methods have been used for modeling soil water capacity parameters that are important in the environmental management of targeted areas [13]. In the aviation industry, a multilayer perceptron neural network has been employed to diagnose aerospace structure defects, whereas the classical method uses signal processing and data interpretation [14]. Soft computing has also been implemented in the path categorization of airplanes [15] and can be applied to estimate the position and orientation of spacecraft, which is useful for space technology development [16].

Image classification and speech recognition remain demanding research topics, since they can be applied in various applications [17]. One example of an image classification method is a graph-based multiple rank regression model [18], for which the researchers presented a method that can reduce the losses in matrix data correlations that occur when an image is transformed into a vector suitable for image classification processes. A model integrating a recurrent neural network and a convolutional neural network (CNN), named the multipath x-D recurrent neural network (MxDRNN), has been proposed for image classification [19]. In addition, semisupervised deep neural networks have implemented a robust loss function to enhance image classification performance [20]. Hyperspectral image classification has been widely used in many earth observation tasks, including object detection, object recognition, and surveillance; a new joint spatial-spectral hyperspectral image classification method based on differently scaled two-stream convolutional networks and spatial enhancement achieved improved classification performance [21]. Image classification for very high-resolution imagery (VHRI) is another challenging task because of the rich detail captured in the images. Many studies have focused on object-based convolutional neural networks (OCNNs) and proposed various innovations, such as integrating a multilevel context-guided classification method with an OCNN to achieve higher VHRI classification accuracy [22]. Image classification techniques have also been applied to medical applications such as breast cancer screening through histopathological imaging [23]. In addition, speech recognition research is useful for native language tasks, such as the implementation of deep neural networks for the Algerian dialect [24] and for code-switching among Frisian languages [25].
Other speech recognition research has concentrated on recognizing emotion from speech with regard to age and sex using hierarchical models [26]. A new approach to speech recognition based on the specific coding of time and frequency characteristics of speech using CNNs has been presented [27]. Visual object tracking using an exponential quantum particle filter and mean shift optimization has been presented as another challenge for object tracking [28].

The applied method employs the particle filter technique, a state estimation technique, to optimize the gradient descent optimizer. State estimation is often used in navigation and guidance applications and has sometimes been applied to other optimization methods. For example, for real-time traffic estimation, state estimation has been implemented using an extended Kalman filter instead of Gaussian process regression models trained on historical data [29]. A particle filter has also been implemented to adjust various parameters to improve image classification [30–32] and for some applications, such as crack propagation filtering [33]. The gradient descent algorithm is mainly used to optimize an objective [34]. For instance, it was used to implement a demonstration of a morphing wing-tip for an aircraft to reduce low-speed drag [35]. Thermal power plants use state estimation to optimize various parameters [36]. The adaptive technique presented in this paper, which combines a particle filter with the gradient descent optimizer to improve the performance on image classification and speech recognition tasks, is evaluated using the PlanesNet [37] and TensorFlow speech recognition challenge [38] datasets.

2. Materials and Methods

2.1. Materials
2.1.1. PlanesNet Dataset

Future airport designs should provide improved passenger convenience, such as reducing airplane delays or requiring less check-in time. Air traffic management, as the backbone of the aviation industry, is one factor leading airports to become more intelligent [17]. Airplane detection is a fundamental task in tracking, positioning, and predicting the positions of airplanes. PlanesNet is a medium-resolution, labeled, remote sensing image dataset that can serve as training data for machine learning algorithms [37]. The dataset consists of 20 × 20 RGB images labeled as “plane” or “no-plane,” as shown in Figures 1 and 2, respectively. The “plane” images mainly show the wings, tail, and nose of an airplane. The images labeled “no-plane” may include land cover features such as water, vegetation, bare earth, or buildings and do not show any part of an airplane. Some example image data are presented in the following figures.

2.1.2. Speech Commands Dataset

Another dataset adopted in this study for testing the applied method is a public dataset for single-word speech recognition, which was initially compiled for use in the TensorFlow Speech Recognition Challenge [38]. The dataset consists of audio files in which a single speaker says one word. The objective is to classify each audio file in the testing dataset into one of twelve categories: “silence,” “unknown,” “yes,” “no,” “up,” “down,” “left,” “right,” “on,” “off,” “stop,” and “go.” It should be noted that the applied method is based on a CNN, which is normally applied to 2D spatial problems, whereas audio is inherently a one-dimensional continuous signal across time. The dataset was therefore preprocessed into images by defining a time window into which the spoken words fit; the captured audio signal is converted into an image by grouping the incoming audio samples into short segments, just a few milliseconds long, and calculating the strength of the frequencies across a set of bands. Each set of frequency strengths from a segment is treated as a vector of numbers, and those vectors are arranged in time sequence to form a two-dimensional array. This array of values can then be treated as a single-channel image called a spectrogram.
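This preprocessing can be sketched as follows with NumPy. The frame length, hop length (30 ms frames every 10 ms at a 16 kHz sampling rate), and the function name are illustrative assumptions, not values taken from the dataset description:

```python
import numpy as np

def spectrogram(audio, frame_len=480, hop=160):
    # Slice the 1-D signal into short overlapping frames, each a few
    # milliseconds long (here 30 ms frames every 10 ms at 16 kHz).
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Strength of the frequencies in each segment: one vector per frame,
    # stacked in time order to form a 2-D single-channel "image".
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 16 kHz -> a (98, 241) array
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = spectrogram(sig)
```

Each row is one time step and each column one frequency band, so the result can be fed to a 2D CNN like any grayscale image.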

2.2. Methods

The applied method is implemented based on a combination of a particle filter and a minibatch gradient descent optimizer, as expressed in equation (1), with the goal of obtaining a suitable optimizer for the target dataset:

w ← w − η∇J(w) (1)

where w is the weight, η is the learning rate, and ∇J(w) is the gradient of the cost function with respect to the weights. Stochastic gradient descent (SGD) performs a parameter update after processing each individual training example and its label, which means that the batch size is 1. The cost function in minibatch gradient descent is averaged over a small data batch, which usually ranges in size between 50 and 256 but can vary depending on the application.
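As a point of reference for equation (1), a minimal minibatch gradient descent loop can be sketched on a linear least-squares toy problem in NumPy. The data, hyperparameters, and function name here are illustrative, not taken from the paper:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Minibatch gradient descent: w <- w - lr * grad, with the
    gradient of the cost averaged over a small batch of examples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)  # shuffle, then walk through batches
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # Gradient of the mean squared error over this batch
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad  # the update in equation (1)
    return w

# batch_size=1 would reduce the loop above to plain SGD
X = np.array([[1.0, x] for x in np.linspace(0, 1, 16)])
y = X @ np.array([0.5, 2.0])  # targets generated from known weights
w = minibatch_gd(X, y)        # recovers approximately [0.5, 2.0]
```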

The applied method uses a generated particle process in combination with variables from the minibatch gradient descent optimizer. Consequently, the applied optimizer performs updates using the computed variables instead of the conventional variables from the minibatch gradient descent optimizer. The applied method can be expressed as shown in the following equation:

w ← w − η(∇J(w) + x) (2)

where x is an adjustment value obtained from the particle filter process. The adjustment x is multiplied by the deep learning rate before being added to the second term of the conventional minibatch gradient descent optimizer in equation (1). Figure 3 illustrates the working process of a particle filter, which operates on historical information from the prior stage. The particle filter works iteratively by generating a particle set, propagating it to the next time step, and then performing an update to obtain an accurate value at that time step. A workflow of the applied method to obtain the adjustment value x is depicted in Figure 4.
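A minimal sketch of the modified update in equation (2), assuming the adjustment value is available as a number `x_pf` (a hypothetical name for the value produced by the particle filter process):

```python
import numpy as np

def applied_update(w, grad, lr, x_pf):
    # One weight update of the applied optimizer: the particle filter
    # adjustment x_pf is scaled by the learning rate and added to the
    # gradient term of the conventional minibatch GD update.
    return w - lr * (grad + x_pf)

w = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
# With x_pf = 0 this reduces exactly to the conventional update (1)
w_conv = applied_update(w, grad, 0.1, 0.0)
w_pf = applied_update(w, grad, 0.1, 0.02)
```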

The applied method shown in Figure 4 is described as follows [32]:
(1) Initialization: at t = 0, generate N particles and set their weights to 1/N.
(2) For each time step t:
(a) Propagate the particle set through the system model equation, in which each new particle equals the previous particle plus a value drawn from a Gaussian process with zero mean and variance equal to the deep learning rate.
(b) Predict the observation value, with the measurement value assigned based on the mean of the prior iteration.
(c) Update each particle weight based on the observation vector and the observation model, which is set to 1, and calculate the importance weight.
(d) Normalize the weights so that they sum to 1. Particle rejection or retention depends on the weight and on multinomial resampling, which is determined by the resampling algorithm.
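The steps above can be sketched as a scalar particle filter in NumPy. The Gaussian observation likelihood and the use of the final particle mean as the adjustment value are assumptions made for illustration, not details confirmed by the text:

```python
import numpy as np

def particle_filter_adjustment(n_particles, n_iters, lr, seed=0):
    rng = np.random.default_rng(seed)
    # (1) Initialization: N particles with uniform weights 1/N
    particles = rng.normal(0.0, np.sqrt(lr), n_particles)
    weights = np.full(n_particles, 1.0 / n_particles)
    for _ in range(n_iters):
        # Measurement assigned from the mean of the prior iteration
        z = particles.mean()
        # (a) Propagate: previous particle plus zero-mean Gaussian
        #     noise whose variance equals the deep learning rate
        particles = particles + rng.normal(0.0, np.sqrt(lr), n_particles)
        # (b)-(c) Importance weights from the observation likelihood
        weights = np.exp(-0.5 * (z - particles) ** 2 / lr)
        # (d) Normalize, then multinomial resampling: particles are
        #     kept or rejected in proportion to their weights
        weights /= weights.sum()
        particles = particles[rng.choice(n_particles, n_particles, p=weights)]
        weights = np.full(n_particles, 1.0 / n_particles)
    # Mean of the final particle set serves as the adjustment value
    return particles.mean()

adj = particle_filter_adjustment(n_particles=50, n_iters=50, lr=0.001)
```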

3. Results and Discussion

3.1. Image Classification Result

This experiment uses the inception_v3 model, which is a pretrained model intended for image classification applications. The PlanesNet dataset deployed in this experiment has a total of 18,085 images divided into two classes (7,995 “plane” images and 10,090 “no-plane” images). The data are divided into a training set with 14,377 images and a testing set with 3,708 images. The training batch size is set to 100, the learning rate is 0.001, and training runs for 10,000 epochs.

The results of the applied method are compared with those of the conventional gradient descent optimizer. Three configurations of the applied method are evaluated (the numbers of particles and particle filter iterations are given in parentheses). The results of the applied method and those of the gradient descent optimizer for image classification in Table 1 reveal that the applied method (180, 300) achieves the best performance as measured by the mean cross entropy over all iterations (0.3193) and by the final test accuracy (89.860%). The applied method (50, 50) achieves the best performance with regard to mean accuracy (87.4291%), which is calculated after every iteration.

The accuracy and cross entropy after each deep learning iteration are shown in Figure 5. The graphs do not clearly distinguish the models' efficiencies because the performance improves only slightly, as shown in Table 1. However, both the accuracy and cross entropy plots (Figures 5(a) and 5(b), respectively) show matching trends for the applied method and the conventional method.

The confusion matrices for all cases are shown in Figure 6, clearly revealing that the applied method with 180 particles and 300 particle filter iterations achieves the best prediction result for the “no-plane” category; however, it shows poor prediction results for the “plane” category. The confusion matrices for the other three results in Figures 6(a), 6(b), and 6(d) show no large differences in either the “plane” or the “no-plane” category. These results imply that the number of particles and the number of particle filter iterations affect the overall performance of the applied method. Thus, each application should select the most appropriate model based on user requirements and acceptable model accuracy.
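For reference, a binary confusion matrix such as those in Figure 6 simply counts (true class, predicted class) pairs; a minimal sketch with illustrative labels (not the paper's data) follows:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    # Rows index the true class, columns the predicted class
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Illustrative labels only: 0 = "no-plane", 1 = "plane"
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()  # diagonal entries are correct predictions
```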

3.2. Speech Recognition Result

A simple deep CNN is used in this experiment to generate a model for the audio file. The models are trained for 25,000 epochs with a batch size of 100 and a learning rate of 0.001. The audio files include 105,829 individual files: 100,939 in the training dataset and 4,890 in the testing dataset. Similar to the image classification experiment, this experiment compares the results of the applied method under different numbers of particles and particle filter iterations with the results from the conventional minibatch gradient descent optimizer.

The results are presented in Table 2, which shows that the applied method (50, 50) outperforms the other models, obtaining the best mean accuracy (77.8163%), mean cross entropy (0.6772), and final test accuracy (89.693%). The conventional minibatch gradient descent optimizer is the second best. From these results, we can conclude that the applied method, configured with an appropriate number of particles and particle filter iterations, can achieve better performance than the conventional method. The accuracy and cross entropy results after each iteration are illustrated in Figure 7; they do not reveal obvious overall differences, so the improvements are listed numerically in Table 2. Confusion matrices are presented in Figure 8. The applied method (50, 50) shows exceptional performance on the “no,” “right,” and “off” classes. However, the conventional method achieves the best performance on the “yes,” “down,” and “go” classes. The other two versions of the applied method achieve good performance on the “unknown” class. Finally, the applied method (150, 100) achieves the best results on the “left” and “on” classes.

The overall results of the speech recognition experiment show that the applied method performs better than the conventional method in terms of both accuracy and cross entropy. However, the confusion matrix results should be considered in detail before selecting the most suitable model for a given application.

Overall, the applied method improves accuracy in both image classification and speech recognition. However, the confusion matrices for both tasks reveal failure cases that remain a challenge for further research. This is an important consideration for applications that require high-precision image classification, such as those in the health care industry, or high-precision speech recognition, such as rescue operations. The applied method in this experiment, based on state estimation and a well-known optimizer, is therefore helpful for slightly improving performance in both applications. Before this method is used in practical applications, the acceptable cases and failure cases should be examined in detail using confusion matrices to reach optimal performance.

4. Conclusions

The goal of this study was to use the particle filter technique to optimize a variable in a gradient descent optimizer. The applied method was validated on two different public datasets: the PlanesNet dataset (for image classification) and the Speech Commands dataset (for speech recognition). Moreover, three variations of the applied method using different numbers of particles and iterations were tested on those two datasets: the three model variations used 50 particles and 50 particle filter iterations, 150 particles and 100 particle filter iterations, and 180 particles and 300 particle filter iterations, respectively. The overall results show that the applied method performs well on both datasets, obtaining higher accuracy and lower cross entropy than the conventional method. The experiments also showed that the number of particles and the number of iterations used in the particle filter process affect the model’s overall performance. Therefore, to build a high-accuracy model, appropriate parameter values should be selected for the particle filter process in the applied method according to each application. A confusion matrix can be used as an assistive tool to select the most suitable model for a given application.

Data Availability

The data used to support this study are available at PlanesNet Dataset and Speech Commands Dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank the staff of the International Academy of Aviation Industry, King Mongkut’s Institute of Technology Ladkrabang, for their contributions to this article. This research was funded by Academic Melting Pot, the KMITL Research Fund, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand.