Abstract

Hand gesture recognition is an intuitive and effective way for humans to interact with a computer due to its high processing speed and recognition accuracy. This paper proposes a novel approach to identifying hand gestures in complex scenes using the Single-Shot Multibox Detector (SSD) deep learning algorithm with a 19-layer neural network. A benchmark gesture database is used, and general hand gestures in complex scenes are chosen as the processing objects. A real-time hand gesture recognition system based on the SSD algorithm is constructed and tested. The experimental results show that the algorithm quickly identifies human hands and accurately distinguishes different types of gestures. Furthermore, the maximum accuracy is 99.2%, which is significant for human-computer interaction applications.

1. Introduction

With the rapid development of computer technology and artificial intelligence, noncontact gesture recognition plays an important role in human-computer interaction (HCI) applications [1–4]. Due to its natural human-computer interaction characteristics, the hand gesture recognition system allows users to interact intuitively and effectively through a computer interface [5, 6]. Additionally, gesture recognition based on vision is widely applied in artificial intelligence, virtual reality, multimedia, and natural language communication [7–10].

However, traditional hand gesture recognition based on image processing algorithms was not widely applied in HCI because of its poor real-time capacity, low recognition accuracy, and algorithmic complexity. Recently, gesture recognition based on machine learning has developed rapidly in HCI due to the introduction of the graphics processing unit (GPU) and artificial intelligence (AI) image processing [11, 12]. Machine learning algorithms such as the local orientation histogram, support vector machine (SVM) [13], neural network, and elastic graph matching are widely used in gesture recognition systems [14–16]. Owing to its learning ability, the neural network does not need manual feature setting while simulating the human learning process and can be trained on gesture samples to form a network classification recognition map [17, 18]. Deep learning models are inspired by the information processing and communication patterns of biological nervous systems and involve neural networks with more than one hidden layer. They can acquire the characteristics of the learning object easily and accurately, even for complex objects, and exhibit superior performance in computer vision (CV) and natural language processing (NLP) [19–21]. Current state-of-the-art object detection systems are variants of Faster R-CNN [22]. The Single-Shot Multibox Detector (SSD) further optimizes object detection [23, 24]. Compared with Faster R-CNN, SSD is simpler and more efficient because it completely eliminates proposal generation and the subsequent pixel or feature resampling stages and encapsulates all computation in a single network, which makes SSD easy to train and straightforward to integrate into systems [5, 25–28].

This paper discusses hand gesture recognition in complex environments based on the Single-Shot Multibox Detector. The approach differs from that of [28]: the image pyramid method is adapted to gesture recognition. More precisely, the system crops the image into blocks to detect distant and small hand gestures. The experimental results show that SSD overcomes the interference of complex backgrounds and improves the accuracy and processing speed of gesture recognition.
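The block-cropping idea can be illustrated with a minimal sketch. The paper does not state the block size or whether blocks overlap, so the function name `tile_boxes` and the `tile` and `overlap` parameters are assumptions made only for illustration.

```python
def tile_boxes(width, height, tile, overlap):
    """Return (left, top, right, bottom) crops that cover a width x height image.

    `tile` and `overlap` are hypothetical parameters; the paper does not
    specify how the blocks are chosen.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap

    def starts(length):
        # One crop is enough when the image is not larger than the tile.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, step))
        positions.append(length - tile)  # last crop sits flush with the border
        return positions

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


crops = tile_boxes(1000, 600, tile=400, overlap=100)
```

Each crop would be passed to the detector separately, and a detection found in a crop is mapped back to the full image by adding that crop's `left` and `top` to the box coordinates.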

Generally, a vision-based hand gesture recognition system involves three steps: hand segmentation, gesture model building, and hand gesture classification. To increase efficiency, we simplify the process into two steps by using the SSD network. More precisely, we only need a convolutional neural network such as VGG16 [29] as the model that identifies the gesture features, and we then perform hand segmentation and gesture classification simultaneously with the SSD network. This makes our architecture much simpler and much faster than other methods based on the Faster R-CNN model.

The main purpose of gesture model building is to obtain useful semantic features, separate them from the complex background, and provide an effective input for the following stage. In the stage of hand segmentation and hand gesture classification, hand postures of different sizes are located with different bounding boxes. For each of these bounding boxes, we simultaneously acquire the confidence for all gesture categories. This unified framework is trained to obtain an effective recognition model; the recognition output is produced by the trained model, which identifies the gesture categories of the input data. In other words, given an input image, we can acquire the location and classification score of each hand gesture in the image end-to-end.

A standard hand gesture database is important for a hand gesture recognition system. Figure 1(a) shows the 36 hand gestures from Massey University’s 2D static hand gesture image dataset, which covers standard numbers and letters [30]. Note that some gestures are rather difficult to distinguish from each other, for example, “a” and “e,” “d” and “l,” “m” and “n,” or “i” and “j.” In this paper, we have chosen the characters “w,” “o,” “r,” and “k” as the study objects, which are shown in Figure 1(b). A Canon EOS 6D camera with an EF 24–105 mm f/4L IS USM lens and a shutter time of 1/100 s was employed to capture the gestures, at a maximum distance of about five meters. Each hand gesture sample was obtained under three different complex backgrounds, with the aim of demonstrating the applicability and reliability of the hand gesture recognition system.

Hand gesture model building plays a vital role in a gesture recognition system and is regarded as the first step in processing the original input gestures. The inputs of this stage are images. When human beings see an image, they perceive the scene it depicts. However, the computer cannot capture this scene from the original picture. To the computer, an image is just a matrix with a variety of values in different spatial locations and channels. In other words, the computer can only obtain pixel-level information from an image. It is difficult to distinguish different objects using low-level information such as pixel values. Therefore, if we want to recognize hand gestures, one of the most efficient methods is to extract and summarize high-level information such as their features and structures from the original image. This is exactly what gesture modeling does in our framework. We use the VGG16 convolutional neural network, which has 13 convolutional layers and is deep enough to obtain high-level information about hand gestures. Given the original image as the input, the VGG16-Net outputs feature maps of different resolutions that contain high-level information about the image. The reason for choosing 19 layers is that this depth is enough to extract high-level semantic information for classification and regression. Moreover, given the limited size of our dataset, using deeper layers can easily lead to overfitting.

The VGG-Nets are a series of convolutional neural networks of different depths, all of which use very small (3 × 3) convolution filters. The VGG16-Net (16 weight layers) is one of them; it has 13 convolutional layers and 3 fully connected layers. The structure of VGG16 is shown in Figure 2. In this figure, the convolutional layer parameters are denoted as “conv<receptive field size>-<number of channels>.” The ReLU activation function is not shown for brevity. The original image is passed through a stack of convolutional layers, which use filters with a small receptive field: 3 × 3 (the smallest size that captures the notion of left, right, up, down, and center). The convolution stride is fixed to 1 pixel; the spatial padding of a convolutional layer is such that the spatial resolution is preserved after convolution, i.e., the padding is 1 pixel for 3 × 3 convolution filters. Spatial pooling is carried out by five maxpooling layers, which follow some of the convolutional layers (not all the convolutional layers are followed by maxpooling layers). Maxpooling is performed over a 2 × 2 pixel window, with stride 2.
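The layer geometry described above can be checked with a short sketch that traces the spatial resolution through the convolutional part of the original VGG16. The 224-pixel input used in the example is the size commonly used with VGG16 and is an assumption here, not a value taken from this paper; the names `VGG16_CFG`, `conv_out`, `pool_out`, and `trace_resolution` are illustrative.

```python
# Convolutional part of the original VGG16: an integer is the number of output
# channels of a 3x3 convolution (stride 1, padding 1); "M" is a 2x2 max pooling
# layer with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def conv_out(size, kernel=3, stride=1, padding=1):
    """Output side length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output side length of a max pooling layer along one spatial dimension."""
    return (size - kernel) // stride + 1


def trace_resolution(size):
    """Return (layer, side length after that layer) for every layer in the stack."""
    trace = []
    for layer in VGG16_CFG:
        size = pool_out(size) if layer == "M" else conv_out(size)
        trace.append((layer, size))
    return trace


# Assumed input side length of 224 pixels (not stated in the paper).
for layer, side in trace_resolution(224):
    print(layer, side)
```

The 3 × 3 convolutions leave the resolution unchanged, so only the five pooling layers reduce it, each by a factor of two.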

All convolutional layers are equipped with the rectification nonlinearity (ReLU) [31]. After a stack of convolutional, maxpooling, and ReLU layers, we get feature maps with lower resolution and stronger semantic information. The original VGG16-Net also contains fully connected layers and a softmax layer, which are used for image classification. We replace these layers with SSD layers to implement hand segmentation and hand gesture classification.

The second stage, i.e., using the SSD network to perform hand segmentation and hand gesture classification, is the most important part of our framework. We have chosen the SSD model because it is both accurate and fast. The core of SSD is predicting category scores and bounding box offsets for a fixed set of default bounding boxes using very small (3 × 3) convolutional filters applied to feature maps. Beyond that, SSD produces predictions of different scales from feature maps of different scales and separates predictions by aspect ratio. This architecture leads to simple end-to-end training and high accuracy, further improving the speed versus accuracy trade-off [5].

SSD is based on a feed-forward convolutional neural network (VGG16) that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes. This approach produces a large number of bounding boxes, many of which overlap heavily. Therefore, a nonmaximum suppression step is executed to discard redundant bounding boxes and produce the final detections. The structure of SSD is shown in Figure 3. The input is an RGB image of fixed pixel size. The part in the dotted box is the truncated VGG16 network. The SSD model adds several feature layers of different scales to the truncated VGG16 network. These layers decrease in size as depth increases and allow detections to be predicted at multiple scales. Then, small convolutional filters are applied to every position in selected feature maps. More precisely, these filters are applied to a set of default boxes of different aspect ratios at each location in several selected feature maps to predict the shape offsets and the confidence scores for all object categories. In our work, the object categories are the four hand gestures and the background.
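The nonmaximum suppression step can be sketched as follows. This is a generic greedy variant written in plain Python, not the authors' implementation; the 0.5 overlap threshold and the example boxes are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy nonmaximum suppression; returns indices of the boxes to keep.

    Boxes are visited from highest to lowest score, and a box is kept only
    if it does not overlap an already kept box by more than the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover the same hand.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In a multi-class detector such as SSD, this step is typically run separately for each gesture category.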

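With the SSD framework in place, the next requirement is an objective function to train the model end-to-end. The overall objective function is a weighted sum of the localization loss (loc) and the confidence loss (conf); in the standard SSD formulation it reads

$$L(x, c, l, g) = \frac{1}{N}\left(L_{\mathrm{conf}}(x, c) + \alpha L_{\mathrm{loc}}(x, l, g)\right),$$

where N is the number of default boxes matched to ground truth boxes, $\alpha$ is the weight of the localization term, and $x_{ij}^{p} \in \{0, 1\}$ indicates whether the i-th default box is matched to the j-th ground truth box of category p. The localization loss is a smooth L1 loss between the ground truth box (g) and the predicted box (l) parameters. These parameters are offsets for the center coordinate (cx, cy) of the default bounding box (d) and for its width (w) and height (h), which is similar to Faster R-CNN [22]:

$$L_{\mathrm{loc}}(x, l, g) = \sum_{i \in \mathrm{Pos}}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\left(l_{i}^{m} - \hat{g}_{j}^{m}\right),$$

$$\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \quad \hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \quad \hat{g}_{j}^{w} = \log\frac{g_{j}^{w}}{d_{i}^{w}}, \quad \hat{g}_{j}^{h} = \log\frac{g_{j}^{h}}{d_{i}^{h}}.$$

The confidence loss is the softmax loss over multiple class confidences (c), as is usually used in multiclass classification tasks:

$$L_{\mathrm{conf}}(x, c) = -\sum_{i \in \mathrm{Pos}}^{N} x_{ij}^{p} \log \hat{c}_{i}^{p} - \sum_{i \in \mathrm{Neg}} \log \hat{c}_{i}^{0}, \quad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)}.$$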
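The two loss terms can be made concrete for a single matched default box with a minimal sketch. It follows the standard SSD definitions of the box offsets, the smooth L1 penalty, and the softmax loss; the function names and the example boxes are hypothetical, and practical details such as the selection of negative boxes are left out.

```python
import math


def encode(gt, default):
    """Offsets of a ground truth box relative to a default box.

    Both boxes are (cx, cy, w, h); the result is the regression target.
    """
    gcx, gcy, gw, gh = gt
    dcx, dcy, dw, dh = default
    return ((gcx - dcx) / dw, (gcy - dcy) / dh,
            math.log(gw / dw), math.log(gh / dh))


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def localization_loss(pred, gt, default):
    """Smooth L1 loss between predicted offsets and the encoded target."""
    target = encode(gt, default)
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def confidence_loss(logits, label):
    """Softmax cross-entropy for one box, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


# Hypothetical example: five classes (background plus the four gestures).
default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.55, 0.5, 0.25, 0.2)
loc = localization_loss((0.0, 0.0, 0.0, 0.0), gt_box, default_box)
conf = confidence_loss([0.1, 2.0, 0.3, 0.2, 0.1], label=1)
```

Summing these per-box terms over the matched boxes and dividing by N gives the overall objective described above.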

During training, we match the default boxes to the ground truth boxes to calculate and reduce the loss of the objective function. We do this iteratively to optimize the parameters of the SSD model and obtain the final model. We use k-means clustering to guide the choice of aspect ratios for the anchor boxes, which yields three ratios; after slight adjustment, they are 1.9, 1.6, and 1.1. The optimizer is Adam with an initial learning rate of 0.0001.
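The use of k-means to suggest anchor aspect ratios can be sketched in one dimension. The paper does not give the clustering details, so the deterministic initialization, the function name `kmeans_1d`, and the toy ratios below are assumptions; the toy values are invented and are not the paper's data.

```python
def kmeans_1d(values, k, iterations=50):
    """Cluster scalar values into k groups and return the sorted centers."""
    data = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


# Invented aspect ratios of labeled hand boxes, for illustration only.
toy_ratios = [0.9, 1.0, 1.1, 1.4, 1.5, 1.6, 2.0, 2.1, 2.2]
anchor_ratios = kmeans_1d(toy_ratios, k=3)
```

The resulting centers would then be rounded or adjusted by hand before being used as the default box aspect ratios, as the text describes.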

3. Results and Discussion

The hand gesture recognition system was built with the SSD algorithm, and each character gesture was trained with 1070 images captured against three different complex backgrounds. We then used 268 images that were not in the training set to test the trained recognition model. The recognition model performs well on the characters “w,” “o,” “r,” and “k.” Of the 268 images, 261 are recognized correctly; the accuracy for every gesture is more than 93.8%, and the highest recognition accuracy is 99.2%. The average prediction confidence for the 261 correctly recognized images is 0.96, which is very close to 1. Examples of visualization results are shown in Figures 4–7 for the characters “w,” “o,” “r,” and “k,” respectively.

To evaluate the overall performance of the gesture recognition system, the recognition accuracy and response time were tested for each hand gesture. The average accuracy and response time of the gesture recognition system are shown in Table 1. All the accuracies are more than 93.8%, and the character “o” achieves higher accuracy than the other characters. All response times are less than 20 ms, which shows that the system exhibits high real-time performance.
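The accuracy figures can be related to the reported counts with a short sketch. The overall value uses the 261 correct results out of 268 test images stated above; the per-gesture counts are not given in the text, so the labeled pairs passed to the hypothetical `per_class_accuracy` helper are toy values.

```python
from collections import Counter


def per_class_accuracy(pairs):
    """Accuracy per true label from (true label, predicted label) pairs."""
    total, correct = Counter(), Counter()
    for truth, predicted in pairs:
        total[truth] += 1
        correct[truth] += int(truth == predicted)
    return {label: correct[label] / total[label] for label in total}


# Overall accuracy from the counts reported in the text.
overall = 261 / 268
print(f"overall accuracy: {overall:.1%}")

# Toy predictions, for illustration only.
toy_pairs = [("w", "w"), ("w", "o"), ("o", "o"), ("r", "r"), ("k", "k")]
toy_accuracy = per_class_accuracy(toy_pairs)
```

The overall figure of roughly 97.4% lies between the reported per-gesture minimum and maximum, as expected for a count-weighted average of the per-gesture accuracies.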

The proposed work improves the accuracy of recognizing alphabet hand gestures (“w,” “o,” “r,” and “k”) by employing SSD and image cropping. The results show that the adopted classification approach performs well, which indicates that the proposed system is an effective method for hand gesture recognition. A comparison with other works, listed in Table 2, shows that the accuracy of the proposed method is higher than that of the others.

4. Conclusion

The Single-Shot Multibox Detector (SSD) deep learning algorithm is applied to hand gesture recognition. We chose the hand gestures of four characters under three different complex backgrounds as the objects of investigation. The 19-layer convolutional neural network is used as the recognition model and is trained end-to-end on the selected characters. The system test results show that the hand gesture recognition system based on the SSD model is efficient, reliable, fast, and accurate. The response time of the system is less than 20 ms, which indicates high real-time performance. The minimum accuracy is more than 93.8%, and the maximum is 99.2%. The results show that the SSD algorithm can be used in hand gesture recognition systems for human-computer interaction applications.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors acknowledge the financial support of the Shandong Key Research and Development Plan Project (grant no. 2019GGX105018), the National Key R&D Program of China (grant no. 2017YFE0112000), and the Shanghai Municipal Science and Technology Major Project (grant no. 2017SHZDZX01).