Abstract

Gender classification from human face images has attracted researchers over the past decade. It has a great impact on diverse fields, including defense, human-computer interaction, the surveillance industry, and mobile applications. Many methods and techniques have been proposed, but most depend on clear digital images and complex feature extraction preprocessing. However, many recent critical real-world systems use thermal cameras. The novelty of this paper lies in utilizing thermal images for gender classification. It proposes a unique approach, called IRT_ResNet, that adopts the residual network (ResNet) model with different layer configurations: 18, 50, and 101. Two different datasets of thermal images have been leveraged to train and test these models. The proposed approach has been compared with a convolutional neural network (CNN), principal component analysis (PCA), local binary pattern (LBP), and scale invariant feature transform (SIFT). The experimental results show that the proposed model achieves higher overall classification accuracy, precision, and F-score than the other techniques.

1. Introduction

Extracting human traits and personalities automatically has attracted researchers for decades, as it helps in many fields of life. The proliferation of the Internet and smart devices has allowed developers and engineers to embed different sensors around users. For example, smartphones and smartwatches are equipped with various sensors, such as gyroscopes, accelerometers, cameras, and temperature sensors. The output of these sensors can reveal hidden information about users' traits and habits [1].

It has been shown that the human face reveals traits, ethnicity, gender, age, and feelings [2], and extracting them is a challenging task for computer vision researchers. Gender detection from facial images has applications in surveillance and human-computer interaction (HCI) systems. Human faces supply rich visual information for perceiving gender [3]. This information can be utilized in different fields, such as online marketing and advertisement [4], user authentication [5], security surveillance, language translation, online family protection, and image searching [6]. In these systems, a camera is required to capture human faces and analyze them to extract useful information. This operation can be performed online or offline, depending on the computational capability of the devices [7]. To start the analysis, face detection algorithms are required to extract faces from videos or images. Many techniques and algorithms have been proposed for the face detection process [4, 8]. Subsequently, the output of this step is used for further analysis to extract, classify, or predict different information from these faces. Nevertheless, the imaging process faces several issues: at nighttime, imaging devices produce low-quality images from which little information can be detected, and illumination in the daytime also impacts the detection process. Thermal imaging has emerged to tackle these issues.

Two main types of thermal imaging have spread in the past few years: near infrared (NIR) imaging and far infrared (FIR), or thermal, imaging. In both types, the camera detects and records the thermal distribution over heat-producing objects, such as the human body. This distribution is then recorded in the image, with different colors mapping to different temperature values. This technology has been utilized for face detection and recognition [9, 10] and has shown great potential for these tasks in nighttime and dark environments [11]. Moreover, it is possible to extract other traits and personalities from faces detected in these images. Such information enhances the security and surveillance applications of NIR cameras.

In this work, a new gender classifier for thermal face images (IRT_ResNet) is proposed. The model utilizes the ResNet 101 CNN, which consists of 101 layers. A total of 3366 thermal images have been leveraged for training and testing purposes. The model has been compared to a plain CNN, PCA, local binary pattern, and scale invariant feature transform, and it outperforms them in the accuracy of gender classification of faces in thermal images.

The rest of this paper is organized as follows: Section 2 overviews related work on gender classification and age detection from images using machine learning algorithms and techniques. Section 3 introduces the proposed IRT_ResNet model. Section 4 describes the conducted experiments and the comparison results. Conclusions and future suggestions are given in Section 5.

2. Related Work

Gender detection and recognition from face images have gained great interest from researchers over the past decade. Advances in machine learning algorithms and their applications in image processing have improved the detection of different human properties and personal traits from images [12–14]. One of the important traits is the gender of the faces detected in the images [15]. Many machine learning methods and techniques have been leveraged to classify faces by gender; the following subsections overview these techniques.

2.1. Machine Learning in Gender Classification

Using supervised machine learning algorithms for image classification can be divided into two main classes: feature extraction as preprocessing, and raw data usage for classification. In the first class, the developer is required to extract features from the face images and feed them into a machine learning classifier. For example, the authors in [16] utilized a combination of shifted filter responses (COSFIRE) [17] to extract features from points of interest in the face images. In their method, a collection of Gabor filters was used, and the outputs of the trained COSFIRE filters were fed into an SVM model for classification. The GENDER-FERET dataset [18] has been used, with approximately 470 images for training and testing, and an accuracy of 93.7% has been reported. In [19], the authors proposed an age-gender classifier that can be trained with a small number of images. The proposed method extracted texture and shape information from faces based on the Canny edge detection method; different areas of the face, such as the nose and mouth, were subsequently delineated with the detected edges. A neural network model has been trained for the classification process, and an accuracy of 94% has been reported. In [20], the authors extracted several features from face images, such as rectangular features, local binary patterns, and wavelet coefficients. An AdaBoost [21] classifier was then trained and compared to SVM and PCA algorithms. A combination of three datasets has been used to obtain a total of 4245 images, and an accuracy of more than 99% has been reached.

Wavelet and local binary pattern features have also been utilized for feature extraction in [22]. A minimum distance classifier has been trained on the FERET dataset, and an accuracy exceeding 99% has been reported. In [23], the authors combined fuzzy rules with face shapes and textures to train their model; an accuracy of 85% has been reported on the FERET dataset. In [24], the author extracted geometrical features from the images and combined them with PCA of the facial features. Subsequently, a nearest neighbor classifier has been trained to reduce the complexity of the system; the author attempted to predict the ages and classify the gender of the faces in the images. In [25], facial features have been utilized for age and gender classification: the lips in each image were extracted and used for the classification process, and a multistage SVM model was trained with the extracted features to classify the image into child, adult, and old classes. In [26], the authors attempted to segment the faces into six different segments: hair, background, lips, eyes, skin, and mouth. Probability maps were then assigned to these segments and leveraged as features to train a random forest classifier. The authors trained and tested the model on four different public datasets: Adience, LFW [27], FEI [28], and FERET, with reported accuracies of 91.4%, 93.9%, 93.7%, and 100%, respectively. All methods proposed in this class require an image preprocessing step of feature extraction and selection. This process can fail with glasses and hats that cover facial features.

In the second class of machine-learning-based gender classification, features are not extracted beforehand. Face images are fed into the classifiers as raw pixel data, and the classifier model attempts to extract features in its own layers, with the dimensions of the raw data reduced in each layer of the model. In [29], the authors developed a method to classify and detect the gender of faces in images utilizing a CNN. Five different layers of convolution filters, pooling, and flattening have been implemented. A public dataset, UTKFace [30], has been used for training, testing, and validation, with 16K images for training and 2K images for testing. Augmented images with altered face orientations have been leveraged to reduce the impact of overfitting. An accuracy of 90% has been reported. In [31], a CNN has been utilized with three convolutional layers, each followed by a rectifying and pooling layer. At the end, two fully connected networks with 512 nodes have been used for age detection and gender classification. The Adience dataset, which contains 26K images of more than 2K subjects, has been leveraged for training and testing the model. A CNN model with this number of images requires massive computation for training; to train it, Amazon GPUs with more than 1.5K cores have been used. The model obtained 87% accuracy in gender classification. This method generated a complex model that requires massive computing for the training process. Another method that generates a complex model and requires massive computing for training is introduced in [32]. However, after training such models, no further image preprocessing is required.

Recently, human identification in smartphone applications has played an important role in contexts such as login permissions and sign-up certificates, so accurate gender classification algorithms may increase the accuracy of smartphone applications and reduce their complexity. In [33], the researchers proposed a new rotation-invariant approach for classifying gender from human face images based on an improved local binary pattern (ILBP). This is motivated by the disadvantages of LBP in extracting spatial structure information and local contrast. ILBP addresses factors such as sensitivity to noise and rotation, in addition to low discriminative power, by using a modern theory for binary pattern categorization. A feature vector is extracted from each image based on ILBP, and then a Kullback–Leibler divergence classifier is used for gender classification.

2.2. Gender Classification from Thermal Images

Thermal infrared images capture the temperature distribution over the muscles and vessels of the human body. This temperature distribution can be utilized as a facial feature in face images to detect faces, classify gender, and extract other personal traits [34]. To classify gender in thermal infrared images, machine learning can be leveraged as with normal face images; however, the features in these images are harder to extract and classify. A comparative study of Haar wavelets and local binary patterns for facial texture feature extraction from thermal face images has been carried out in [35]. The thermal images have been preprocessed and cropped to reduce their sizes [36]. Subsequently, a vector of wavelet coefficients has been extracted and combined with the local binary pattern of the image pixels. The output of this process has been fed into two machine learning classifiers: an artificial neural network model and a minimum distance classifier. An accuracy of 95% has been recorded for this method. In [37], the authors attempted to detect textural facial features in thermal images based on AdaBoost and Haar algorithms; subsequently, a complex Gaussian distribution has been utilized to model the relation between these features. Results have shown that facial features can be detected easily in thermal images. Reference [38] proposed a machine learning algorithm for face detection in thermal infrared images that leverages Haar features for feature extraction. In [39], a hybrid model for face gender classification based on both normal and thermal images has been proposed: features from normal images and temperature texture have been combined to create the feature vector for the classifier. In [40], a combination of visible normal face images and thermal images has been used for gender classification. These methods were complex since two types of images were required.

Although visible-light gender classification is used in many real applications, thermal cameras are now widely deployed due to their vast potential. Current critical systems operate on thermal images for surveillance, skin temperature screening, security, and military applications [40, 41]. To reduce the complexity of feature extraction and selection, CNNs have recently been applied to thermal infrared images. Reference [42] applied a CNN for gender classification on the RGB-D-T dataset; three experimental scenarios have been implemented, and the CNN's accuracy has exceeded that of LBP, HOG, and moment invariants. In [43], a CNN with thermal images has been used for liveness detection of faces in images, and it has been compared to a neural network and an SVM. The authors of [5] leveraged thermal infrared images with a CNN algorithm for security authentication applications; they claimed that the proposed algorithm outperforms other authentication algorithms in dark places.

This work differs from the above surveyed works in two main aspects. First, the proposed method utilizes thermal infrared images from different sources with variable sizes. Second, it utilizes the ResNet 101 CNN, which consists of 101 layers, to enhance the feature extraction process. To the best of our knowledge, this is the first work that utilizes a ResNet model in the area of thermal images. ResNet won the ImageNet 2015 image recognition competition [44] and has been a breakthrough in image processing since.

3. The Proposed Model

The proposed model (IRT_ResNet) adopts a ResNet deep convolutional neural network for infrared thermal images. In a neural network, adding new layers to the model can increase its accuracy. However, with each new layer, the training process (the backpropagation step) becomes harder, and the accuracy becomes saturated or even degrades. ResNet tackles this issue with skip connections between the stacked convolutional layers of the model. Figure 1(a) shows a normal stacked layer of a CNN, where the output of a layer is the input of the next layer, as in a connected chain. In Figure 1(b), by contrast, the input to the next stacked layer is the summation of the output of the convolutional layers and the original input that bypassed them. This skip connection reduces the impact of the vanishing gradient problem, which degrades the accuracy of deep networks [45]. Moreover, it allows stacking a large number of layers in CNN models while reducing training time and computation. The block diagram of the proposed model is depicted in Figure 2.
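For concreteness, the following is a minimal sketch of such a residual block in PyTorch; the channel count and kernel sizes are illustrative and do not reproduce the paper's exact configuration.

```python
# Minimal sketch of a residual (skip-connection) block in the spirit of
# ResNet [45]; channel sizes and kernel sizes here are illustrative.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Two stacked 3x3 convolutions, each followed by batch normalization.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection: add the block's input to its output before the
        # final nonlinearity, so gradients can bypass the stacked layers.
        return self.relu(out + x)
```

Because the identity path carries the input unchanged, the stacked layers only need to learn a residual correction, which is what keeps very deep networks trainable.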

3.1. The Convolutional Layer

The most important component of any convolutional neural network architecture is the convolutional layer [46]. It contains a set of convolutional filters (also called kernels) that are convolved with the image (an N-dimensional matrix) to produce an output feature map.

A kernel is a grid of discrete numbers or values, each representing one of the kernel's weights. All these weights are assigned random values at the start of the training process of a CNN model. Then, in each training epoch, the weights are updated, and the kernel learns to extract meaningful features. The convolution operation is what allows a CNN to process its input directly: in classical neural networks, the input is in vector format, whereas in a CNN, the input is a multichannel image (e.g., three channels for an RGB image, a single channel for a grayscale image).

The following example explains how the feature map is constructed using the convolution operation. Let the input be an image of dimension 4 × 4 (Figure 3(a)) and the kernel a 2 × 2 grid with randomly initialized weights (Figure 3(b)). The convolution operation slides the kernel over the image horizontally as well as vertically. At each position, the dot product between the kernel and the underlying image patch is taken by multiplying their corresponding values and summing them up, generating one scalar value in the output feature map. This process stops when the kernel can no longer slide further.

Figure 4 illustrates the stages of the process more clearly. The 2 × 2 kernel values (shown in light blue color) are multiplied by those in the same-sized region (shown in yellow color) within the 4 × 4 input image. The resulting values are summed up to obtain a corresponding entry (shown in deep blue) in the output feature map at each convolution step.

The final output feature map, after completing nine convolution steps, is therefore a 3 × 3 grid of these scalar values: a 2 × 2 kernel sliding over a 4 × 4 input with stride 1 and no padding visits (4 − 2 + 1) × (4 − 2 + 1) = 3 × 3 positions.
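As a sketch of this sliding-window procedure, the following NumPy snippet convolves an illustrative 4 × 4 input with a 2 × 2 kernel (stride 1, no padding); the numeric values are placeholders, not those of Figure 3.

```python
# From-scratch sketch of the sliding-window convolution described above:
# a 2x2 kernel over a 4x4 input (stride 1, no padding) yields a 3x3 map.
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the kernel with the current image patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4x4 input
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # illustrative 2x2 kernel
print(convolve2d(image, kernel).shape)             # (3, 3): nine steps
```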

3.2. The Batch Normalization Layer

In order to reduce sensitivity to network initialization and to speed up convolutional neural network training, batch normalization layers are used between convolutional layers and nonlinearities, as in [47]. The batch normalization operation normalizes the elements xi of the input by first calculating the mean μB and the variance σB² over the spatial, temporal, and observation dimensions for each channel independently. Then, it calculates the normalized activation as

x̂i = (xi − μB) / √(σB² + ϵ),

where ϵ is a constant that improves numerical stability when the variance is very small.

To allow for the possibility that inputs with zero mean and unit variance are not optimal for the operations that follow batch normalization, the batch normalization operation further shifts and scales the activation using the transformation

yi = γ x̂i + β,

where the offset β and the scale factor γ are learnable parameters that are updated during network training.
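A compact NumPy sketch of this normalize-shift-scale sequence follows; the tensor shapes and the ϵ, γ, and β values are illustrative.

```python
# Sketch of the batch normalization transform defined above, computed per
# channel over the batch and spatial dimensions; values are illustrative.
import numpy as np

def batch_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
               eps: float = 1e-5) -> np.ndarray:
    # x has shape (batch, channels, height, width); statistics are taken
    # independently for each channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activation
    return gamma * x_hat + beta             # learnable shift and scale

x = np.random.randn(8, 3, 4, 4)
gamma = np.ones((1, 3, 1, 1))
beta = np.zeros((1, 3, 1, 1))
y = batch_norm(x, gamma, beta)
y = np.maximum(0.0, y)   # the ReLU nonlinearity of Section 3.3 would follow
```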

3.3. Rectified Linear Unit Layer

A Rectified Linear Unit (ReLU) layer performs a threshold operation for each element of the input, where any value less than zero is set to zero.

This operation is evaluated by the following formula:

f(x) = max(0, x).

The proposed ResNet 101 model consists of 101 layers, as shown in Figure 5. Each layer consists of two convolution filters stacked together, and a skip connection is added after every two layers. Max pooling is applied in the first layer, and average pooling in the last layer. A fully connected network is utilized at the end of these 101 layers.

It is worth mentioning that the final fully connected stage has two binary outputs to classify the input images as male or female. Figure 6 presents a flowchart illustrating the processes of both the training and the recognition stages.
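As an illustration of this configuration, the sketch below builds a standard 101-layer residual network with a two-output head using torchvision; this is an assumption-laden stand-in for the paper's implementation, not its exact code.

```python
# Hedged sketch of the classifier head: a standard torchvision ResNet-101
# backbone with a two-output fully connected layer (male/female).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet101(weights=None)         # 101-layer residual network
model.fc = nn.Linear(model.fc.in_features, 2)  # binary male/female output

# Thermal images are often single-channel; replicating the channel three
# times (an assumption here) matches the backbone's expected input shape.
x = torch.randn(1, 3, 224, 224)
logits = model(x)                              # shape (1, 2)
```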

4. Experiments and Results

To train and test the proposed IRT_ResNet classifier for thermal images, infrared thermal image datasets were required. Many researchers have attempted to create thermal image datasets for machine learning applications [46]. In this work, two datasets have been utilized to evaluate the accuracy of the proposed model. The first dataset (D1), found in [48], contains 461 images. The second dataset (D2) is a larger one consisting of 2907 thermal images and can be found in [49]. The description of the used datasets is shown in Table 1, and Figure 7 shows samples of male and female images from both datasets. Three different IRT_ResNet networks have been constructed and trained on these datasets. All the networks consist of the same steps in each layer; only the number of layers differs: the first network consists of 18 layers, the second of 50, and the third of 101.

In the preprocessing phase, we examine each image manually and exclude unsuitable ones. Then, we classify the selected images as males and females. Finally, we assign a unique identifier to each image, using odd numbers for females and even numbers for males in order to facilitate the recognition stage.
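A toy sketch of this odd/even identifier scheme might look as follows; the surrounding file handling is an assumption.

```python
# Toy sketch of the labeling scheme described above: odd identifiers are
# reserved for female images, even identifiers for male images.
from itertools import count

female_ids = count(start=1, step=2)   # 1, 3, 5, ...
male_ids = count(start=2, step=2)     # 2, 4, 6, ...

def assign_id(label: str) -> int:
    return next(female_ids) if label == "female" else next(male_ids)

print(assign_id("female"), assign_id("male"), assign_id("female"))  # 1 2 3
```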

For the three networks, 10% of the dataset images have been devoted to the recognition stage. The other 90% of the images have been fed into the models to create a dataset of features used in both the training and validation stages; these images have been divided into 40% for training and 60% for testing. MATLAB has been used for coding the three versions of the proposed IRT_ResNet. The training time and the accuracy of these models have been recorded for each dataset separately.
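This split could be reproduced along the following lines; the sketch assumes a simple list of (image, label) pairs and uses scikit-learn rather than the paper's MATLAB tooling.

```python
# Sketch of the split described above: 10% held out for the recognition
# stage, then the remaining 90% divided 40%/60% into training and testing.
from sklearn.model_selection import train_test_split

def split_dataset(samples, seed=0):
    rest, recognition = train_test_split(samples, test_size=0.10,
                                         random_state=seed)
    train, test = train_test_split(rest, test_size=0.60,
                                   random_state=seed)
    return train, test, recognition

# Hypothetical file names standing in for the 3366 thermal images.
samples = [(f"img_{i:04d}.png", i % 2) for i in range(3366)]
train, test, recognition = split_dataset(samples)
print(len(train), len(test), len(recognition))
```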

Moreover, other models have been implemented for comparison with the proposed IRT_ResNet model: a CNN model with five layers, and three plain neural network models with feature extraction based on the PCA, scale invariant feature transform, and local binary pattern algorithms, respectively. These algorithms have shown high accuracy in feature extraction from normal visual images, as mentioned in the related work. The neural network model used for classification is a fully connected network with approximately 86K input features, one hidden layer, and an output layer of one neuron.

Our performance experiments go through two phases. In the first phase, the three IRT_ResNet models are compared in terms of their accuracy and training time on both datasets. In the second phase, a comparison between IRT_ResNet and the other three models is made in terms of their accuracy. The following subsections describe these two experimental phases.

4.1. IRT_ResNet Performance Measure

Figure 8 shows the time required to train the three IRT_ResNet models. As the number of layers in a model increases, so does the training time, due to the growing number of variables that require tuning with each added layer. In addition, the training time increases with the addition of more training data, since the size of the training loops grows.

Table 2 shows the accuracy comparison between the three studied models. It can be seen that increasing the number of layers from 50 to 101 raises the accuracy of the model to 99%, whereas the average accuracy is almost unchanged when increasing the layers from 18 to 50. This motivated us to select ResNet 101 as the main model for gender classification in this work. Finally, enlarging the training data further enhanced the accuracy of the proposed model.

Furthermore, four other performance metrics, namely, precision, recall, F-score, and overall accuracy, have been applied. Denoting true positives, false positives, true negatives, and false negatives by TP, FP, TN, and FN, they are evaluated by the following formulas [33, 47]:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F-score = 2 × Precision × Recall / (Precision + Recall),
Accuracy = (TP + TN) / (TP + FP + TN + FN).
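For reference, a small sketch computing these four metrics from illustrative confusion-matrix counts:

```python
# Sketch of the four metrics defined above, computed from a binary
# confusion matrix; the counts passed in are illustrative only.
def metrics(tp: int, fp: int, tn: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f_score, accuracy

print(metrics(tp=95, fp=2, tn=97, fn=5))
```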

Table 3 presents the results of the metrics for the IRT_ResNet 18, 50, and 101.

4.2. Other Models’ Comparison

IRT_ResNet has been compared with other models that have shown high accuracy in feature extraction from normal visual images, as surveyed in the related work: the CNN model with five layers and three plain neural network models with feature extraction based on the PCA, scale invariant feature transform (SIFT), and local binary pattern (LBP) algorithms. To unify the experimental environment and the programming language, we have coded all of the compared methods ourselves.

Table 4 shows the classification comparison of these techniques for male and female images. From the recorded results, it is clear that the IRT_ResNet model obtained 100% accuracy for male classification and 94.11% for female classification, surpassing the accuracy of the other classifiers. The female classification accuracy is lower than the male one because the facial structure of some males, with long hair and other traits, may look female even to human eyes. Figure 9 presents examples of the classification process of the proposed model, with results reflecting the ambiguity of the input images.

5. Conclusion

Utilizing thermal images in gender classification is a new direction in computer vision research. In this paper, an infrared imaging gender classifier called IRT_ResNet has been proposed. Three models with different numbers of 2D convolutional filtering layers (18, 50, and 101) but the same structure have been programmed and tested. Two different datasets have been leveraged in this work: the first consists of 461 infrared thermal images and the second of 2907 images, and both have been utilized to train the models. The comparison between the three models has shown that the classification accuracy increases from 96% to 99% when raising the number of layers from 18 to 101, whereas no enhancement has been recorded when increasing the layers from 18 to 50. We conclude that ResNet 101 is sufficient for the classification process. In addition, four other efficient machine learning classifiers with feature extraction preprocessing have been coded, trained, tested, and compared to the IRT_ResNet classifier. The results show that the proposed model outperforms the others and is more accurate for males than for females, reaching 100% for males, while its precision and F-score exceed 97%. Another conclusion is that IRT_ResNet achieves the same accuracy for both datasets regardless of their different sizes.

In the future, we will attempt to utilize the IRT_ResNet model for age detection of faces in infrared thermal images, to be employed in security and surveillance applications in dark and nighttime environments. Furthermore, we plan to study the effect of increasing the size of the tested datasets. Moreover, infrared imaging extensions can be added to smartphone cameras to upgrade their imaging capabilities to this type of imaging.

Data Availability

The data used to support the findings of this study were supplied by (1) the Tufts IR face dataset, which is freely available and can be accessed at http://tdface.ece.tufts.edu/downloads/TD_IR_E/, and (2) the thermal face database on Sciebo, which is freely available and can be accessed at https://rwth-aachen.sciebo.de/s/AoSNdkGBRCtWIzX.

Conflicts of Interest

The authors declare that they have no conflicts of interest.