Abstract

Facial Expression Recognition (FER) is currently an active research field. Deep learning is widely used in this field, but it has demanding hardware requirements and is hard to deploy on ordinary terminal devices; hence, other methods are being researched to bring FER to such devices and systems. This work proposes a novel feature modeling, the Combined Gray Local Binary Pattern (CGLBP), for extracting features in facial expression recognition so as to improve the recognition rate on this kind of device and system. The work comprises the following main steps: cropping the input face image obtained from a camera or dataset, dividing the face image into nonoverlapping regions for extracting LBP features, applying the proposed Combined Gray Local Binary Pattern (CGLBP) for feature extraction, using uniform patterns to reduce the descriptor length, and finally applying a Support Vector Machine (SVM) for emotion classification. Experiments on four popular facial emotion datasets demonstrate that the recognition rate of the proposed method is better than that of two existing features: the Local Binary Pattern (LBP) and the Completed Local Binary Pattern (CLBP). The accuracy achieved on the four facial expression datasets of different sizes ranges from about 95% to more than 99%.

1. Introduction

Facial expressions are facial changes in response to a person's internal emotional states, intentions, or social communication, as in [1]. Facial expression recognition (FER) is an active research field concerned with extracting the emotional state of a person from his or her facial expression. FER is particularly useful for enhancing naturalness in human-machine interaction and robotics, and it contributes to intelligent human-machine interface systems that can be applied in behavioral science, clinical practice, and daily life. Many FER systems have been built for such applications, and an evaluation of three state-of-the-art FER systems in [2] shows that they classify basic emotions accurately and are mostly on par with human raters for standardized images.

Deep learning methods are widely used in this field and overcome the limitations of conventional approaches, but they require powerful hardware, so their application is rather limited, especially in real-time systems or ordinary terminal devices. For that reason, other methods without excessive hardware requirements continue to be proposed for facial expression recognition. For example, [3] proposed combining Local Binary Pattern (LBP) features and Histogram of Oriented Gradients (HOG) features into a feature vector and then applying K-NN for classification, and [4] presented an algorithm that combines Oriented FAST and Rotated BRIEF (ORB) features with LBP features extracted from facial expressions and uses a Support Vector Machine classifier.

There are commonly two main ways to enhance the accuracy of an automatic facial expression recognition system. The first is discovering suitable features that characterize different facial expressions more efficiently. The second is finding classifiers that classify facial expressions better. In addition, determining a face image preprocessing method that best suits the chosen features and classifiers is very useful for improving the recognition rate.

To select the kind of feature for FER, there are commonly two main approaches: geometric features and appearance features. Geometric features focus on the shape and locations of facial components, which are extracted to represent the face geometry. Appearance features, in contrast, capture the appearance changes (skin texture) of the face and are extracted by applying image filters to either the whole face or specific facial regions. In facial expression modeling, appearance features are often chosen because they are well suited to initialization, tracking, and encoding changes in skin texture.

There have been many local image descriptors developed to analyze image texture, such as Local Binary Pattern (LBP) [5, 6], Completed Local Binary Pattern (CLBP) [7], Local Directional Gradient Pattern (LDGP) [8], Local Gradient Hexa Pattern (LGHP) [9], Rotation and Scale-invariant Hybrid Descriptor (RSHD) [10], Local Quadruple Pattern (LQPAT) [11], Center Symmetric Local Binary Pattern (CSLBP) [12], Bag-of-Filters Local Binary Pattern (BoF-LBP) [13], R-Theta Local Neighborhood Pattern (RTLNP) [14], Improved Local Ternary Patterns (ILTP) [15], and Multi-Threshold Uniform-Based Local Ternary Patterns (MT-ULTP) [16].

Descriptor researchers try to meet two requirements: capturing texture information and reducing descriptor length. Moreover, depending on the specific research domain, different image descriptors are applied. For example, LBP and CLBP are used in facial recognition and facial emotion recognition [3, 4, 17, 18]; LDGP, LGHP, LQPAT, CSLBP, BoF-LBP, and RTLNP are used in facial image recognition and retrieval [8, 9, 11–14]; RSHD is used in color image recognition and retrieval [10]; ILTP is used in bark texture classification [15]; and MT-ULTP is used in cell phenotype classification [16].

This work proposes a novel feature modeling for facial expression recognition called the Combined Gray Local Binary Pattern (CGLBP). The feature combines the Local Binary Pattern (LBP) and the Local Gray Level Difference (LGLD) of face images, and it is extracted using an effective face acquisition method as in [17]. For emotion classification, a Support Vector Machine (SVM) is used.

2.1. Face Acquisition

The face acquisition process commonly has two main steps: the basic step and the enhancement step. The basic step usually uses a face detector to locate the face region in an input image from a camera or dataset and then eliminates redundant regions. The enhancement step aims to improve the face region for extracting facial expression features with cropping methods, image normalization methods, or image filter techniques. After that, the face images are rescaled and used for feature extraction. The preprocessing of face acquisition is shown in Figure 1.

The face detector most commonly used in the basic step in the literature is the robust real-time face detector [19] proposed by Viola and Jones. This detector performs fast face detection based on Haar-like features and a cascade of classifiers. Experimental results showed that this method achieves good performance, with both high accuracy and fast computation. However, to attain better facial emotion recognition performance, the facial images obtained from the robust real-time face detector can have redundant image regions eliminated by geometric standardization or cropping methods before being used for feature extraction. This normalization is usually performed based on eye locations, lip states, nasolabial furrows, or crow's-feet wrinkles, as in [20, 21].

For the enhancement preprocessing, reference [22] proposed a cropping method with two steps: first detecting facial landmarks and then removing the forehead region based on the horizontal distance between the eye centers, as shown in Figure 2(a).

This work uses the enhancement processing technique presented in [17]. The face images obtained from the robust real-time face detector are cropped to remove unnecessary information or pixels carrying little information, as in Figure 2(b).

Observations of human faces usually show that the forehead accounts for about a quarter of the face height and does not hold much essential facial emotion information. Therefore, to improve the accuracy of FER, the upper two-thirds of the forehead is cut off and the lower third, close to the eyebrows, is retained.
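To make the arithmetic concrete, the following is a minimal Python sketch of the forehead crop just described. The helper name `crop_forehead` is illustrative; the sketch assumes the input is the face box returned by the detector and that the forehead band occupies the top quarter of that box.

```python
import numpy as np

def crop_forehead(face: np.ndarray) -> np.ndarray:
    """Remove the upper two-thirds of the forehead band.

    Assumes `face` is the (grayscale) face box returned by the
    detector and that the forehead occupies roughly the top
    quarter of the face height, as described above.
    """
    h = face.shape[0]
    forehead = h // 4            # forehead ~ one quarter of face height
    cut = (2 * forehead) // 3    # drop the upper two-thirds of that band
    return face[cut:, :]
```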

The objective of the cropping method is to obtain a smaller face image region with a higher proportion of essential emotion information. Thus, the method not only reduces the processing time of the feature extraction and recognition steps but also enhances the accuracy of facial expression recognition and saves computational resources.

Figure 3 illustrates the steps for face acquisition: Figure 3(a) is a face image from a dataset of emotional face images; Figure 3(b) shows a part of the face image (in red squares) obtained from 3(a) by using the robust real-time face detector; Figure 3(c) presents a part of the face image (in yellow square or small square) obtained from 3(b) by using the cropping method; and Figure 3(d) is the face image obtained from the process of face acquisition (or from 3(c)) for extracting features.

2.2. Face Feature Extraction

Commonly, facial representation is performed in one of two ways: geometric features, as in [23], or appearance features, as in [17]. First, this work divides the face image into nonoverlapping regions and then extracts LBP features, as in Figure 4 presented in [17].

Then an LBP histogram (or LBP feature) is calculated for each region. Finally, the feature vector of the face image is formed by concatenating the LBP histograms of the regions from left to right and top to bottom, as shown in Figure 5.

3. The Feature Modeling of the Novel Combined Local Binary Pattern

3.1. Local Binary Pattern (LBP)

The LBP operator was introduced in [5, 6] and used as a complementary measure of local image contrast. The operator uses the value of a pixel and its eight circular neighbors to compute the pixel's LBP code. If a neighbor's value is greater than or equal to the value of the center pixel, it is labeled 1; otherwise, it is labeled 0. Reading the eight labels from the upper-left neighbor clockwise gives an 8-digit binary number (the LBP code) for the pixel. The histogram is then computed over the decimal numbers converted from these binary codes. In this way, each pixel of the image is labeled with an LBP code from 0 to 255 depending on its 3 × 3 neighborhood. An example of computing the binary and decimal numbers with the basic LBP operator is shown in Figure 6.
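As an illustration, here is a minimal Python sketch of the basic operator just described; the neighbor ordering is assumed to start at the upper-left pixel and proceed clockwise, matching the description of Figure 6.

```python
import numpy as np

def lbp_code(block: np.ndarray) -> int:
    """Basic LBP code of the centre pixel of a 3 x 3 block.

    Neighbours are read from the upper-left corner clockwise; each
    contributes bit 1 if its value >= the centre value, else bit 0.
    """
    c = block[1, 1]
    neighbours = [block[0, 0], block[0, 1], block[0, 2],   # top row
                  block[1, 2],                             # right
                  block[2, 2], block[2, 1], block[2, 0],   # bottom row
                  block[1, 0]]                             # left
    bits = ["1" if g >= c else "0" for g in neighbours]
    return int("".join(bits), 2)    # 8-bit code in the range 0..255
```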

To better capture the dominant features of large-scale structures in an image, the operator was extended to use neighborhoods of different sizes, as in [6]. With circular neighborhoods and bilinear interpolation of pixel values, any radius and any number of sampling points are allowed. The notation $(P, R)$ represents a neighborhood of $P$ equally spaced sampling points on a circle of radius $R$, forming a circularly symmetric neighbor set. Figure 7 shows examples of the extended LBP operators.
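The bilinear sampling used by the extended operator can be sketched as follows; this is a simplified version with an illustrative function name, and it omits image-border handling.

```python
import numpy as np

def circular_neighbours(img: np.ndarray, y: int, x: int,
                        P: int = 8, R: float = 1.0) -> np.ndarray:
    """Gray values of P points on a circle of radius R around (y, x),
    bilinearly interpolated at non-integer coordinates. (y, x) must
    lie at least ceil(R) + 1 pixels inside the image."""
    vals = np.empty(P)
    for p in range(P):
        ang = 2.0 * np.pi * p / P
        yy, xx = y - R * np.cos(ang), x + R * np.sin(ang)
        y0, x0 = int(np.floor(yy)), int(np.floor(xx))
        dy, dx = yy - y0, xx - x0
        vals[p] = ((1 - dy) * (1 - dx) * img[y0, x0]
                   + (1 - dy) * dx * img[y0, x0 + 1]
                   + dy * (1 - dx) * img[y0 + 1, x0]
                   + dy * dx * img[y0 + 1, x0 + 1])
    return vals
```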

Generally, to extract facial expression features, the basic LBP operator is used instead of the extended LBP operator for two reasons. First, although the extended LBP operator can capture dominant information in large-scale textures, it is time-consuming compared with the basic LBP. Second, in facial expression recognition, facial images are usually normalized to small sizes (such as 36 × 48, 48 × 48, 55 × 75, 64 × 64, or 110 × 150 pixels) and then split into smaller regions; facial emotion is thus analyzed based on changes in these small regions (as shown in Figure 4).

The 256-bin histograms of LBP features, which include the non-uniform patterns, usually take much time to process. Therefore, a uniform LBP operator was proposed in [6]. A local binary pattern is uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa when the binary string is considered circular. Some examples of uniform patterns are 11111111, 00001111, 01111000, 11000001, and 00011110. Experimental results with the uniform LBP operator in [6] showed that, in texture images, uniform patterns account for 87.2% of all patterns in the (8, 1) neighborhood and 66.9% of all patterns in the (16, 2) neighborhood.
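A small sketch of the uniformity test for an 8-bit code, counting 0/1 transitions around the circular string as defined above:

```python
def is_uniform(code: int) -> bool:
    """True if the 8-bit LBP code has at most two 0/1 transitions
    when the binary string is considered circular."""
    bits = [(code >> i) & 1 for i in range(8)]
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    return transitions <= 2
```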

The notation $\mathrm{LBP}_{P,R}^{u2}$ symbolizes a uniform LBP operator. The superscript $u2$ means a uniform pattern with at most two transitions; the subscript describes the operator using a $(P, R)$ neighborhood. Uniform LBP uses only uniform patterns and labels all remaining patterns with a single label. A local binary pattern code is computed for a pixel in an image by comparing it with its neighbors as in equation (1):

$$\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\,2^p, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \quad (1)$$

where $g_c$ is the gray value of the central pixel, $g_p$ is the gray value of its neighbors, $P$ is the total number of involved neighbors, and $R$ is the radius of the neighborhood. A histogram of a labeled image $f_l(x, y)$ can be defined as in equation (2):

$$H_i = \sum_{x, y} I\{f_l(x, y) = i\}, \qquad i = 0, \ldots, n - 1 \quad (2)$$

where $n$ is the number of different labels produced by the LBP operator and

$$I\{A\} = \begin{cases} 1, & A \text{ is true} \\ 0, & A \text{ is false} \end{cases} \quad (3)$$

This histogram comprises information about the distribution of local micropatterns, e.g., corners, edges, spots, or flat areas, over the whole image. To represent the face efficiently, the extracted features should retain spatial information. For this reason, the face image can be divided into $m$ small regions $R_0, R_1, \ldots, R_{m-1}$ as shown in Figure 5, and a spatially enhanced histogram is expressed as in equation (4):

$$H_{i,j} = \sum_{x, y} I\{f_l(x, y) = i\}\, I\{(x, y) \in R_j\}, \qquad i = 0, \ldots, n - 1,\; j = 0, \ldots, m - 1 \quad (4)$$
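A minimal sketch of the spatially enhanced histogram in equation (4), assuming a label image produced by one of the LBP-family operators and square regions. The 256-bin non-uniform variant is used here for simplicity; with uniform patterns and P = 8, `n_bins` would be 59.

```python
import numpy as np

def spatial_histogram(labels: np.ndarray, region: int = 8,
                      n_bins: int = 256) -> np.ndarray:
    """Equation (4) as code: split the label image into nonoverlapping
    region x region blocks and concatenate the per-block histograms
    left to right, top to bottom."""
    h, w = labels.shape
    feats = []
    for y in range(0, h - region + 1, region):
        for x in range(0, w - region + 1, region):
            block = labels[y:y + region, x:x + region]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist)
    return np.concatenate(feats)
```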

3.2. Completed Local Binary Pattern

The local binary pattern has been used as an effective feature for facial expression recognition. However, it uses only the conventional sign component and ignores the magnitude component. It has been shown that the information contained in the magnitude difference, or local gray level difference, can provide a significant performance improvement.

As stated in [7], the difference between $g_p$ and $g_c$ can be computed as $d_p = g_p - g_c$, where $g_c$ is the central pixel and $g_p$ ($p = 0, 1, \ldots, P - 1$) are its circularly and evenly spaced neighbors.

The local difference vector $[d_0, \ldots, d_{P-1}]$ describes the local image structure around $g_c$ and can be decomposed into two components as in equation (5):

$$d_p = s_p \cdot m_p, \qquad s_p = \operatorname{sign}(d_p), \qquad m_p = |d_p| \quad (5)$$

where $s_p$ is the sign of $d_p$ and $m_p$ is the magnitude of $d_p$. Equation (5) is called the local difference sign-magnitude transform, and it transforms the local difference vector $[d_0, \ldots, d_{P-1}]$ into a sign vector $[s_0, \ldots, s_{P-1}]$ and a magnitude vector $[m_0, \ldots, m_{P-1}]$.
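A sketch of the sign-magnitude transform of equation (5) on a 3 × 3 block, with the neighbor ordering assumed as in Figure 6:

```python
import numpy as np

def ldsmt(block: np.ndarray):
    """Local difference sign-magnitude transform (equation (5)) of a
    3 x 3 block: d_p = g_p - g_c is split into s_p and m_p."""
    c = float(block[1, 1])
    g = np.array([block[0, 0], block[0, 1], block[0, 2], block[1, 2],
                  block[2, 2], block[2, 1], block[2, 0], block[1, 0]],
                 dtype=float)
    d = g - c                       # local difference vector
    s = np.where(d >= 0, 1, -1)     # sign vector
    m = np.abs(d)                   # magnitude vector
    return s, m
```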

The completed local binary pattern includes the CLBP_S ($s_p$) operator and the CLBP_M ($m_p$) operator, which share the same format. To create a CLBP descriptor, the histograms of CLBP_S and CLBP_M can be combined in one of two ways: concatenation or jointly. In the first method, the CLBP_S and CLBP_M histograms are computed separately and then concatenated into one histogram; this descriptor is denoted CLBP_S_M. In the second method, a joint 2D histogram of the CLBP_S and CLBP_M codes is computed; this descriptor is denoted CLBP_S/M. This feature has been studied for facial expression recognition in [18].
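The two combination methods might be sketched as follows, assuming per-region histograms (for CLBP_S_M) or per-pixel code maps (for CLBP_S/M) are already available; the names and bin counts are illustrative.

```python
import numpy as np

def clbp_s_m_concat(hist_s: np.ndarray, hist_m: np.ndarray) -> np.ndarray:
    """CLBP_S_M: the two histograms are computed separately and
    concatenated into one descriptor."""
    return np.concatenate([hist_s, hist_m])

def clbp_s_m_joint(codes_s: np.ndarray, codes_m: np.ndarray,
                   n_bins: int = 256) -> np.ndarray:
    """CLBP_S/M: a joint 2D histogram of the per-pixel S and M codes,
    flattened into a 1D descriptor."""
    joint, _, _ = np.histogram2d(codes_s.ravel(), codes_m.ravel(),
                                 bins=n_bins,
                                 range=[[0, n_bins], [0, n_bins]])
    return joint.ravel()
```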

3.3. The Novel Modeling of Combined Local Binary Pattern

The approach using CLBP_S_M in Section 3.2 takes much time to calculate $m_p = |d_p|$. In this work, an improved modeling of the completed local binary pattern is proposed, called the Combined Gray Local Binary Pattern (CGLBP or CLBP_S_G). This modeling is based on the CLBP_S operator and the image's local gray level difference, called CLBP_G for short. Figure 8 shows an example of the CLBP_S and CLBP_G transformation on a 3 × 3 sample block.

The sign component of CLBP_S_G is the same as the sign component of CLBP_S_M, while the CLBP_G operator is defined as in equation (6):

$$\mathrm{CLBP\_G}_{P,R} = \sum_{p=0}^{P-1} t(g_p, c)\,2^p, \qquad t(x, c) = \begin{cases} 1, & x \ge c \\ 0, & x < c \end{cases} \quad (6)$$

where the threshold $c$ is determined adaptively. Here, $c$ is taken to be the mean value of $g_p$ over the CLBP_G operator.
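A minimal sketch of equation (6) on a 3 × 3 block follows. The adaptive threshold c is taken here as the mean of the eight neighbor gray values; whether the center pixel is included in the mean is an assumption the text leaves open.

```python
import numpy as np

def clbp_g_code(block: np.ndarray) -> int:
    """CLBP_G code of a 3 x 3 block (equation (6)): each neighbour is
    thresholded against c, taken here as the mean of the eight
    neighbour gray values (an assumption; the text does not state
    whether the centre pixel is included in the mean)."""
    g = [block[0, 0], block[0, 1], block[0, 2], block[1, 2],
         block[2, 2], block[2, 1], block[2, 0], block[1, 0]]
    c = float(np.mean(g))                      # adaptive threshold
    bits = ["1" if v >= c else "0" for v in g]
    return int("".join(bits), 2)
```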

Based on the idea of choosing an effective threshold $c$ for CLBP_G in facial expression recognition, we tested the following thresholds:

(i) The mean value of $g_p$ over the whole face image
(ii) The mean value of $g_p$ over the region
(iii) The mean value of $g_p$ over the CLBP_G operator

Experimental results on four datasets indicated that the threshold in the last case obtains the best accuracy in facial expression recognition. This can be explained as follows:

(i) One main weakness of the LBP operator is that it is very sensitive to noise: the label of a pattern changes even if only one noisy pixel occurs.
(ii) Facial emotions are mainly expressed in some specific areas of the face (such as the eyes, mouth, and eyebrows), as shown in the divided regions in Figure 4.

Thus, choosing the threshold as the mean value of $g_p$ over the CLBP_G operator not only makes the CLBP_S_G operator more tolerant to noise but also incorporates the magnitude difference component.

Both the CLBP_S operator and the CLBP_G operator have the same binary string format, so they can be used to create the CLBP_S_G or CLBP_S/G descriptors by the concatenation or joint combination method, respectively.

4. Experiments and Results

4.1. Experimental Process

The proposed approach was evaluated on four typical image datasets. The first is the Japanese Female Facial Expression (JAFFE) dataset with 213 images; the second is the Cohn-Kanade+ (CK+) dataset with 2,040 images; the third is the FG-NET dataset (or FEEDTUM) containing 2,268 images; and the last is the MUG dataset consisting of 5,132 images. Cropped face images are normalized to 64 × 64 pixels and then divided into regions of 8 × 8 pixels. To evaluate the accuracy of the proposed method, 3-fold cross-validation is used. Figure 9 illustrates the experimental process.
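The evaluation step might be sketched as follows with scikit-learn, assuming one descriptor per cropped 64 × 64 face image; the SVM kernel and parameters are assumptions, as the text does not specify the SVM configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate(X: np.ndarray, y: np.ndarray) -> float:
    """3-fold cross-validated accuracy of an SVM on the descriptors.

    X holds one CGLBP descriptor per cropped 64 x 64 face image and
    y the integer emotion labels. The linear kernel is an assumption;
    the paper does not specify the SVM configuration.
    """
    clf = SVC(kernel="linear")
    scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    return float(scores.mean())
```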

4.2. The Japanese Female Facial Expression (JAFFE) Dataset

The JAFFE dataset, as in [24] or [25], contains 213 gray images of facial expressions from ten Japanese women, each of whom displays seven different facial expressions: anger, disgust, fear, joy, neutral, sadness, and surprise. Most subjects have three images per expression; in particular, three cases have two images and six cases have four images. The original images have a resolution of 256 × 256 pixels with grayscale values. In this work, all 213 images are used as experimental samples. Examples of facial expression images from the JAFFE dataset cannot be shown in this paper in order to comply with the dataset's conditions of use, as in [26].

Figure 10 shows that, on the JAFFE dataset, the accuracy rates of non-cropped images (full images acquired from the robust real-time face detection algorithm, i.e., a percentage of w2 to w1 of 100%) for all three kinds of features (LBP, CLBP_S_M, CLBP_S_G) are almost always lower than the accuracy rates of cropped images (from 80% to 90% of w2/w1). The CLBP_S_G feature achieves the highest accuracy, 97.21%, with a cropping percentage of w2 to w1 of 86%, and is almost always better than the LBP and CLBP_S_M features at the remaining percentages of w2 to w1.

Table 1 shows the confusion matrix of the JAFFE dataset based on the CLBP_S_G feature with the cropping percentage of w2 to w1 being 86%.

4.3. The Cohn-Kanade (CK+) Dataset

The Cohn-Kanade dataset is one of the most comprehensive datasets in the current facial expression research community, as in [27]. The CK+ dataset comprises a variety of subjects: 100 university students aged between 18 and 30, of whom 65% are female, 15% are African-American, and 3% are Asian or Latino. Subjects were instructed to perform a series of 23 facial displays, six of which were based on descriptions of basic emotions: anger, disgust, fear, joy, sadness, and surprise. Image sequences, expressed from neutral to strong emotion, have a resolution of 640 × 490 pixels with 8-bit grayscale values. A few facial expression images (subject S55) from the CK+ dataset are shown in Figure 11.

Figure 12 shows an anger emotion image (subject S52) from the CK+ dataset and its CLBP_S_G feature histogram.

Figure 13 shows that, on the CK+ dataset, the accuracy rates of non-cropped images for all three kinds of features (LBP, CLBP_S_M, CLBP_S_G) are almost always lower than the accuracy rates of cropped images (from 80% to 90% of w2/w1). The CLBP_S_G feature achieves the highest accuracy, 99.95%, with the cropping percentage of w2 to w1 being 85%, and is almost always better than the LBP and CLBP_S_M features at the remaining percentages of w2 to w1.

Table 2 shows the confusion matrix of the CK+ dataset based on the CLBP_S_G feature with the cropping percentage of w2 to w1 being 85%.

4.4. The FEEDTUM Dataset

The FEEDTUM or FG-NET dataset [28], Facial Expressions and Emotions from the Technical University Munich, is an image dataset containing face images of a number of subjects performing the six basic emotions defined by Ekman and Friesen, as in [29]. The dataset contains material gathered from 18 different individuals, each of whom performed all seven expressions (the six basic emotions plus neutral) three times. Sequences start from the neutral state and pass into the emotional state. The images were saved in 8-bit JPEG format with a resolution of 320 × 240 pixels.

This dataset presents many challenges: the emotional expressions of many subjects are weak and easily confused with other expressions; many subjects have beards, hair covering the forehead, or glasses; and some subjects turn their heads sideways in different directions. In this work, all 18 subjects (9 men and 9 women), with a total of 2,268 images, are selected. Figure 14 displays a few examples of facial expression images of subject 0002 in the FEEDTUM dataset.

Figure 15 shows a sad emotion image of subject 0016 in the FEEDTUM dataset and its CLBP_S_G feature histogram.

Figure 16 shows that, on the FEEDTUM dataset, the accuracy rates of non-cropped images for all three kinds of features (LBP, CLBP_S_M, CLBP_S_G) are almost always lower than the accuracy rates of cropped images (from 80% to 90% of w2/w1). The CLBP_S_G feature achieves the highest accuracy, 95.06%, with the cropping percentage of w2 to w1 being 85%, and is consistently better than the LBP and CLBP_S_M features at the remaining percentages of w2 to w1.

Table 3 shows the confusion matrix of the FEEDTUM dataset based on the CLBP_S_G feature with a cropping percentage of w2 to w1 being 85%.

4.5. The MUG Dataset

The MUG dataset was created by the Multimedia Understanding Group, as in [30], to overcome some limitations of similar preexisting datasets, such as low resolution, uniform lighting, a small number of subjects, and few takes per subject. The Internet-user version of the dataset includes 52 Caucasian subjects aged between 20 and 35: 22 females and 30 males (with or without beards). The original images are saved in JPEG format with a resolution of 896 × 896 pixels.

The experiments used 50 subjects: 22 females and 28 males. Fifteen images per expression, progressing from weak to strong, were selected for each subject. Because some subjects did not express all seven facial expressions, 5,130 face images were selected in total: 750 anger, 735 disgust, 705 fear, 735 joy, 750 neutral, 705 sadness, and 750 surprise images. A few facial expression images from the MUG dataset are shown in Figure 17.

Figure 18 shows a surprise emotion image from the MUG dataset and its CLBP_S_G feature histogram.

Figure 19 shows that, on the MUG dataset, the accuracy rates of non-cropped images for all three kinds of features (LBP, CLBP_S_M, CLBP_S_G) are almost always lower than the accuracy rates of cropped images (from 80% to 90% of w2/w1). The CLBP_S_G feature achieves the highest accuracy, 99.14%, with a cropping percentage of w2 to w1 of 90%, and is almost always better than the LBP and CLBP_S_M features at the remaining percentages of w2 to w1.

Table 4 shows the confusion matrix of the MUG dataset based on the CLBP_S_G feature with a cropping percentage of w2 to w1 being 90%.

5. Conclusions

This research presents facial expression recognition using a novel Combined Gray Local Binary Pattern (CGLBP or CLBP_S_G) modeling, with a face-cropping technique in the preprocessing stage and a method of dividing face images into nonoverlapping square regions in the feature extraction stage. The work also uses uniform patterns to reduce descriptor length. Experimental results on the typical JAFFE, CK+, FEEDTUM, and MUG datasets show that the highest accuracy rate of the proposed feature modeling is better than those of the LBP and CLBP feature modelings.

The proposed method uses a small image size (64 × 64 pixels) to reduce processing and data transmission time while keeping the accuracy rate quite high. The computation time of CGLBP is also shorter than that of CLBP since it does not calculate $m_p$ (the magnitude of $d_p$). These advantages make the method suitable for ordinary terminal devices and systems.

Data Availability

The datasets used to support the findings of this study are included in the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to express their great appreciation to Professor Michael J. Lyons for authorizing the use of the JAFFE dataset, to Professor Jeffrey Cohn for permitting the use of the Cohn-Kanade+ dataset, to Professor Frank Wallhoff for giving permission to use the FEEDTUM dataset, and to Professor Anastasios Delopoulos for allowing the use of the MUG dataset in this work. This research was funded by the University of Finance and Marketing (UFM) under Grant no. CS-27-20.