Abstract

In this research work, we propose a method for human action recognition based on the combination of structural and temporal features. The pose sequence in the video is considered to identify the action type. The structural variation features are obtained by detecting the angle made between the joints during the action, where the angle binning is performed using multiple thresholds. The displacement vector of joint locations is used to compute the temporal features. The structural variation features and the temporal variation features are fused using a neural network to perform action classification. We conducted the experiments on different categories of datasets, namely, KTH, UTKinect, and MSR Action3D datasets. The experimental results exhibit the superiority of the proposed method over some of the existing state-of-the-art techniques.

1. Introduction

The rapid growth in hardware and software technologies has resulted in the continuous generation of a huge amount of video data through video capturing devices such as smartphones and CCTV cameras. Also, a large amount of video content is being uploaded to YouTube every minute. Therefore, it is very important to extract useful information from these huge video databases and to recognize high-level activities for various applications such as automated surveillance systems, human-computer interaction, sports video analysis, real-time patient/children monitoring, shopping-behavior analysis, and dynamical systems [1]. Hence, human action recognition (HAR) from videos is an active area of research that has attracted the attention of several researchers in recent years.

Human action recognition focuses on detecting and tracking people and, in particular, on understanding human behaviors from a video sequence. The research in this area focuses mainly on the development of techniques for automated visual surveillance systems and requires a combination of computer vision and pattern recognition algorithms. In the literature, the terms activity, behavior, action, gesture, and ‘primitive/complex event’ are frequently used to describe essentially the same concepts. HAR is challenging because of intraclass variation and interclass similarity. The same activity may vary from subject to subject, which is known as intraclass variation. Without contextual information, different activities may look similar, which leads to interclass similarity, for example, playing and running. There are further challenges in HAR, such as multisubject interactions, group activities, and complex visual backgrounds.

The two main approaches used for HAR are based on global descriptors and local descriptors. The local descriptors are robust to noise and can be applied to a wide range of action recognition problems. However, in recent years, skeleton-based approaches have been widely used due to the availability of depth sensors. Several datasets are available for the evaluation of action recognition algorithms; they vary in terms of the number of classes, the sensors used, the duration of actions, the viewpoint, the complexity of the actions performed, and so on. In this work, we address the problem of action recognition using a skeleton-based approach.

Contributions: (a) We propose a method for human action recognition based on encoded joint angle information and the joint displacement vector. (b) A neural network-based method to perform score-level fusion for action classification is proposed. (c) We experimentally show that the proposed method can be applied to datasets containing skeletal joint information acquired using Kinect sensors and also to datasets where explicit pose estimation needs to be performed. Thus, the proposed method can be used with either a vision-based sensor or a Kinect sensor.

The rest of the paper is organized as follows. Section 2 gives an overview of the existing techniques for human action recognition. Section 3 describes the proposed approach. The experimental results are demonstrated in Section 4. The conclusions and discussions are given in Section 5.

2. Review of Existing Techniques

Human activities can be broadly classified into four categories: gestures, actions, interactions (with objects and other people), and group activities. Early approaches developed in the 1990s mainly focused on identifying gestures and simple actions based on motion analysis. A detailed review of motion analysis-based techniques is presented by Aggarwal and Cai [2]. However, the motion analysis-based methodologies were found to be less robust, as they were insufficient to describe human activities with complex structures. Therefore, an improved approach was discussed by Aggarwal and Ryoo [3], who focused on methodologies for high-level activity recognition designed for the analysis of human actions, interactions, and group activities.

Ben-Arie et al. [4] have proposed a technique to perform human action recognition by computing a set of pose and velocity vectors for body parts such as hands, legs, and torso. These features are stored in a multidimensional hash table to achieve indexing- and sequence-based voting. Kellokumpu et al. [5] proposed another approach based on a texture descriptor by combining motion and appearance cues. The movement dynamics are captured using temporal templates, and the observed movements are characterized using texture features. A spatiotemporal space is considered, and the human movements are described with dynamic texture features. Also, the use of motion energy features for human activity analysis is presented by Gao et al. [6]. The motion energy template is constructed for the video using a filter bank, and the actions are classified using SVM. Xu et al. [7] have proposed a hierarchical spatiotemporal model for human activity recognition. The model consists of a two-layer hidden conditional random field (HCRF), where the bottom layer is used to describe the spatial relations in each frame, and the top layer uses high-level features for characterizing the temporal relations throughout the video sequence. The bottom layer also provides high-level semantic representations to the top layer, and the learned model is used to identify human activities. To improve the robustness of the action recognition task, a combination of dense trajectories and motion boundary histogram descriptors has been used by Wang et al. [8]. The descriptor captures different kinds of information such as shape, appearance, and motion to address the problem of camera motion.

The deep learning models gained popularity because of their superior performance in the field of pattern recognition and computer vision research. A review by Guo et al. [9] highlights the important developments in deep neural models. Ji et al. [10] proposed a 3D CNN model for human action recognition. The features are extracted from both spatial and temporal dimensions using 3D convolutions, thus capturing discriminative features. In another work, Wang et al. [11] proposed a technique where the spatiotemporal information obtained from 3D skeleton sequences is encoded into multiple 2D images forming Joint Trajectory Maps (JTMs), and ConvNets are applied to accomplish the action recognition task. As Joint Distance Maps (JDMs) describe texture features which are less sensitive to view variations, Li et al. [12] have developed an approach for action recognition by encoding spatiotemporal information of skeleton sequences into color texture images. Then, using convolutional neural networks, the discriminative features are obtained from the JDMs for achieving both single-view and cross-view action recognition. Hou et al. [13] have proposed a method for effective action recognition based on skeleton optical spectra (SOS), where discriminative features are learned using convolutional neural networks (ConvNets). The spatiotemporal information of a skeleton sequence is effectively captured using skeleton optical spectra. This method is more suitable in case of limited annotated training video data. Wang et al. [14] have presented a detailed survey of recent advances in RGB-D based motion recognition using deep learning techniques. In another approach, Rahmani et al. [15] have developed an improved version of deep learning model based on nonlinear knowledge transfer model learning, achieving invariance to viewpoint change. A general codebook is generated using k-means to encode the action trajectories, and then the same codebook is used for encoding action trajectories of real videos. Li et al. [16] have used multiple deep neural networks to achieve multiview learning for three-dimensional human action recognition. These multiple networks help to effectively learn the discriminative features and also capture spatial and temporal information. The recognition scores of all views are combined using multiply fusion. Xiao et al. [17] have introduced an end-to-end trainable architecture-based model for human action recognition. The model consists of deep neural networks and attention models for learning spatiotemporal features from the skeleton data. Li et al. [18] have proposed an approach for skeleton-based human action recognition. A deep model, namely, 3DConvLSTM, is used to learn spatiotemporal features from the video sequences, and an attention-based dynamic map is built for action classification.

An approach for online action recognition has been proposed by Tang et al. [19] based on a weighted covariance descriptor, which considers the importance of frame sequences with respect to their temporal order and discriminativeness. The combination of nearest neighbour search and Log-Euclidean kernel–based SVM is used for classification. In another work, an optical acceleration-based descriptor has been used by Edison and Jiji [20] for human action recognition. Two descriptors have been computed for effectively capturing the motion information, namely, the histogram of optical acceleration and the histogram of spatial gradient acceleration. An approach based on the rank pooling method was introduced by Fernando et al. [21] for action recognition, which is capable of capturing both the appearance and the temporal dynamics of the video. A ranking function generated by the ranking machine provides important information about actions. In another work, Wang et al. [22] have presented a technique for action recognition based on order-aware convolutional pooling, focusing mainly on effectively capturing the dynamic information present in the video. After extracting features from each video frame, a convolutional filter bank is applied to each feature dimension, and then the filter responses are aggregated. Hu et al. [23] introduced a new approach for early action prediction based on soft regression applied on RGB-D channels. Here, the depth information is considered to achieve more robustness and discriminative power. Finally, a Multiple Soft labels Recurrent Neural Network (MSRNN) model is constructed, where feature extraction is based on the Local Accumulative Frame Feature (LAFF). Further approaches for action recognition have been developed based on sparse coding, Yang and Tian [24]; exemplar modeling, Hu et al. [25]; max-margin learning, Zhu et al. [26]; Fisher vectors, Wang and Schmid [27]; and block-level dense connections, Hao and Zhang [28]. The literature survey shows that several techniques have been proposed for human action recognition. Detailed reviews of action recognition research are reported by Ramanathan et al. [29], Gowsikhaa et al. [30], and Fu [31].

Many approaches are available in the literature for human action recognition. Most of the existing techniques use either local features extracted temporally or a skeleton representation of the human pose in the temporal sequence. However, combining temporal and structural features provides a better recognition rate. In this direction, we propose a method to recognize human actions based on the combination of structural and temporal features fused at the classifier score level.

3. Proposed Work

In this work, we propose a method for human action recognition by considering the structural variation feature and the temporal displacement feature. The proposed method extracts features from the pose sequence in a given video. Figure 1 depicts the methodology of the proposed system. We extract the structural variation feature by detecting the angles between the joints during an action. Several methods are available to estimate the pose: some of the pose estimation techniques found in the literature are based on sensor readings, while others are vision-based.

3.1. Pose Estimation for Action Recognition

The OpenPose library [32, 33] is one of the well-known vision-based libraries used to extract the skeletal joints. The accuracy of the joint locations detected by OpenPose is limited when compared to sensor-based methods. It uses a VGG-19-based deep neural network model to estimate the pose. The COCO model [34] consists of 18 skeletal joints, whereas the BODY_25 model gives 25 skeletal joint locations. In our experiments, we have used OpenPose to estimate the pose for the KTH dataset; for the other datasets, the pose information is taken from sensor readings. In the following section, we present the idea of structural feature extraction.
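For readers who want to reproduce this pose estimation step, the sketch below shows one way to run the OpenPose demo on a video and read back the BODY_25 keypoints from its JSON output. The input file name is a placeholder, and the flags and JSON field names follow the OpenPose demo documentation but may differ slightly between versions.

```python
# Sketch only (not the authors' exact pipeline). First run the OpenPose demo, e.g.:
#   ./build/examples/openpose/openpose.bin --video some_kth_clip.avi \
#       --model_pose BODY_25 --write_json out_json/ --display 0 --render_pose 0
# and then load the per-frame keypoints from the JSON files it writes.
import json
from pathlib import Path

import numpy as np


def load_keypoints(json_dir):
    """Return an array of shape (num_frames, 25, 3) with (x, y, confidence) per joint."""
    frames = []
    for f in sorted(Path(json_dir).glob("*_keypoints.json")):
        data = json.loads(f.read_text())
        if data["people"]:  # keep the first detected person; zeros if nobody is detected
            kp = np.array(data["people"][0]["pose_keypoints_2d"]).reshape(-1, 3)
        else:
            kp = np.zeros((25, 3))
        frames.append(kp)
    return np.stack(frames)
```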

3.2. Structural Variation Feature Extraction

Let the skeleton be represented by a set of $N$ joints $S = \{J_1, J_2, \ldots, J_N\}$, where $J_i = (x_i, y_i)$ indicates the estimated location of the $i$-th joint in the image. Our goal is to obtain the angle between a joint $J_i$ and a set of joints that contributes to the structural variation in the skeleton. In a video having $F$ frames, the angle $\theta^{(t)}_{ij}$ is found, where $\theta^{(t)}_{ij}$ represents the angle between the joints $J_i$ and $J_j$ in the $t$-th frame, $1 \le t \le F$. For each joint $J_i$, a binary vector $B_i$ is computed using

$B_i = \big[b^{(1)}_{ij}, b^{(2)}_{ij}, \ldots, b^{(F)}_{ij}\big], \qquad (1)$

where $b^{(t)}_{ij}$ is given by

$b^{(t)}_{ij} = \begin{cases} 1, & \text{if } \theta^{(t)}_{ij} \ge T, \\ 0, & \text{otherwise}. \end{cases} \qquad (2)$

The procedure followed to fix the threshold $T$ is given in Section 3.2.1. The feature vectors $B_i$, for $i = 1, \ldots, N$, are concatenated to obtain the structural feature vector $B$; its dimension therefore grows with the number of joints $N$ and the number of frames $F$.
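A minimal sketch of the structural feature computation in (1) and (2) is given below. The interpretation of the inter-joint angle as the orientation of the vector from joint $J_i$ to joint $J_j$, the choice of joint pairs, and the default threshold are illustrative assumptions rather than the exact choices made in the paper.

```python
import numpy as np


def structural_feature(joints, pairs, T=90.0):
    """Binarize inter-joint angles over a video; a sketch of (1)-(2).

    joints : array of shape (F, N, 2) with the (x, y) joint locations per frame
    pairs  : list of (i, j) joint index pairs whose angle is measured
    T      : single angle threshold in degrees (assumed value)
    """
    bits = []
    for i, j in pairs:
        d = joints[:, j, :] - joints[:, i, :]                     # vector from joint i to joint j
        theta = np.degrees(np.arctan2(d[:, 1], d[:, 0])) % 360.0  # per-frame angle of the pair
        bits.append((theta >= T).astype(np.uint8))                # thresholded binary vector B_i
    return np.concatenate(bits)                                   # structural feature vector B
```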

3.2.1. Feature Extraction Based on Angle Binning

The vector $B_i$ does not capture the variation in the angle at a finer level, as it is binarized with a single threshold value. Accordingly, we perform angle binning, where multiple thresholds are used in (3) to quantize the angle to a $b$-bit number, by modifying (2). This captures the angle between the joints at a finer level; at the same time, the quantization helps to suppress minute variations in the angle during an action. The process of feature extraction is shown in Figure 2.

$q^{(t)}_{ij} = k, \quad \text{if } T_k \le \theta^{(t)}_{ij} < T_{k+1}, \qquad (3)$

where $k \in \{0, 1, \ldots, 2^b - 1\}$. The terms $T_k$ and $T_{k+1}$ denote consecutive thresholds bounding the $k$-th quantization bin.
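The following sketch illustrates the angle binning step. Uniformly spaced thresholds over $[0^\circ, 360^\circ)$ are assumed here purely for illustration; the paper only states that multiple thresholds are used.

```python
import numpy as np


def quantize_angles(theta, b=8, thresholds=None):
    """Quantize angles (in degrees) into b-bit bin indices; a sketch of the binning in (3).

    By default the thresholds split [0, 360) uniformly into 2**b bins,
    which is an assumption made here for illustration.
    """
    if thresholds is None:
        thresholds = np.linspace(0.0, 360.0, 2 ** b, endpoint=False)[1:]
    return np.digitize(theta, thresholds)  # integer bin index in [0, 2**b - 1]
```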

The temporal variation feature captures the dynamics of the individual joints by tracking them through the frames. This process is explained in the following section.

3.3. Temporal Feature Extraction

The temporal feature extraction looks at the change in the location of a joint $J_i$ from frame $t$ to frame $t+1$. We consider the location of the joint in two successive frames to find the relative position of the joint; this is effectively a tracking of the joint location. A histogram of the 2D displacement orientations of the joint locations in the XY plane is constructed to capture the temporal dynamics. The displacement vector of each joint is computed over the sequence of video frames. For each joint displacement vector, we obtain the orientation pair $(\phi^{(t)}_i, m^{(t)}_i)$ consisting of the orientation angle $\phi^{(t)}_i$ and the magnitude $m^{(t)}_i$. The orientation angles $\phi^{(t)}_i$, $1 \le t \le F-1$, for all the joints are used as the temporal features. The angle $\phi^{(t)}_i$ for joint $J_i$ at time instance $t$ is computed using

$\phi^{(t)}_i = \arctan\!\left(\dfrac{y^{(t+1)}_i - y^{(t)}_i}{x^{(t+1)}_i - x^{(t)}_i}\right).$

The displacement vector $d^{(t)}_i$ of joint $J_i$ at time instance $t$ is calculated using

$d^{(t)}_i = \big(x^{(t+1)}_i - x^{(t)}_i,\; y^{(t+1)}_i - y^{(t)}_i\big).$

The feature vector for every joint location $J_i$ is given by

$\Phi_i = \big[\phi^{(1)}_i, \phi^{(2)}_i, \ldots, \phi^{(F-1)}_i\big].$

A $k$-bin histogram is created for every joint from the feature vector $\Phi_i$. The histograms of all joints are concatenated to form the temporal feature vector representing an action. Since the joint locations are sparse compared to the dense fields used by traditional optical flow-based methods, the feature extraction process is computationally more efficient.
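A compact sketch of the temporal feature computation is shown below, assuming the joint locations are available as a NumPy array; the bin count and array layout are illustrative choices rather than the exact implementation.

```python
import numpy as np


def temporal_feature(joints, k=9):
    """Histogram of per-joint displacement orientations; a sketch, not the authors' code.

    joints : array of shape (F, N, 2) with the (x, y) joint locations per frame
    k      : number of orientation bins per joint (assumed value)
    """
    disp = np.diff(joints, axis=0)                   # displacement vectors d_i^(t), shape (F-1, N, 2)
    phi = np.arctan2(disp[..., 1], disp[..., 0])     # orientation angle of each displacement
    hists = [np.histogram(phi[:, i], bins=k, range=(-np.pi, np.pi))[0]
             for i in range(joints.shape[1])]        # k-bin histogram per joint
    return np.concatenate(hists)                     # temporal feature vector
```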

3.4. Score-Level Fusion Using Neural Network

We combine the structural features and the temporal features at the score level. For every sample $x$, the classifier assigns each class $c$ a score $s_c(x)$, which is the signed distance of the observation from the decision boundary. A positive score indicates that the sample belongs to class $c$, while a negative score indicates that it does not; in both cases, the magnitude gives the distance of $x$ from the decision boundary. The score-level fusion is performed using a neural network, which assigns significance to the classifiers based on the structural and temporal features. The structural features alone are less discriminative for actions with similar body part movements, such as walking, running, and jogging, so an optimal fusion of the temporal and structural features helps to achieve better recognition.

To generalize the classifier fusion, we consider a multiclass classification problem with $C$ classes and $M$ classifiers. In our case, the scores from two SVM classifiers ($M = 2$) are used for fusion. The class prediction scores for a sample $x$ from classifier $m$ are

$s^{(m)}(x) = \big[s^{(m)}_1(x), s^{(m)}_2(x), \ldots, s^{(m)}_C(x)\big],$

where each $s^{(m)}_c(x)$ is a prediction score corresponding to the class $c$. The input to the neural network for the sample $x$ is given by

$z(x) = \big[s^{(1)}(x), s^{(2)}(x), \ldots, s^{(M)}(x)\big],$

where $z(x) \in \mathbb{R}^{CM}$.

The predicted label at the output layer of the neural network is denoted by $\hat{y}(x)$. To obtain the optimal fusion, we solve the objective function given in (11) over the $P$ training samples in the action recognition dataset:

$\min_{W} \sum_{p=1}^{P} \mathcal{L}\big(y(x_p), \hat{y}(x_p)\big), \qquad (11)$

where $\mathcal{L}$ is the cross-entropy loss, $W$ denotes the network parameters, and $y(x_p)$ is the actual label at the output layer for the sample $x_p$.

For a neuron $j$ in the hidden layer $l$, the output of the neuron is given by

$h^{(l)}_j = \sigma\big(w^{(l)\top}_j h^{(l-1)} + b^{(l)}_j\big),$

where $w^{(l)}_j$ represents the synaptic weights from the previous layer to the neuron $j$, and $\sigma(\cdot)$ is the sigmoid activation function. For a neuron $c$ at the output layer, the prediction is given by

$o_c(x) = \operatorname{softmax}_c\big(w^{\top}_c h^{(L)} + b_c\big),$

where $w_c$ represents the synaptic weights from the last hidden layer to the output neuron $c$, $h^{(L)}$ is the input from the last hidden layer, and $\operatorname{softmax}(\cdot)$ is the softmax function. The output of this layer for a sample $x$ is the vector of class probabilities $o(x) = [o_1(x), \ldots, o_C(x)]$, and the predicted label is given by

$\hat{y}(x) = \arg\max_{c}\, o_c(x).$
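The forward pass of the fusion network described by the equations above can be sketched as follows; the weight matrices and biases are placeholders that would, in practice, be learned with backpropagation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def fusion_forward(scores, hidden_weights, hidden_bias, out_weights, out_bias):
    """One forward pass of the score-fusion network (illustrative only).

    scores : concatenated class scores from the two SVM classifiers, i.e. the vector z(x)
    """
    h = sigmoid(hidden_weights @ scores + hidden_bias)  # hidden layer with sigmoid activation
    o = softmax(out_weights @ h + out_bias)             # class probabilities from the softmax layer
    return int(np.argmax(o))                            # predicted action label
```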

The neural network uses the backpropagation algorithm to learn the network parameters. An example of the neural network architecture used in the proposed model is shown in Figure 3.

4. Experiments and Results

To demonstrate the performance of the proposed model, we carried out experiments on three publicly available datasets, namely, the KTH [35], UTKinect [36], and MSR Action3D [37] datasets. The KTH dataset requires explicit pose estimation, whereas the UTKinect and MSR Action3D datasets already contain pose information captured using Kinect-like depth sensors. The source code of our implementation is available at https://github.com/muralikrishnasn/HARJointDynamics.git.

4.1. Datasets

The KTH dataset contains six action types performed by 25 subjects under four different conditions. Unlike the other datasets used in the experiments, it does not include skeletal joint information. The UTKinect dataset is acquired using a Kinect sensor and contains skeletal joint information for 10 types of actions performed by 10 subjects, with each action repeated twice. The MSR Action3D dataset contains skeleton data for 20 action types performed by 10 subjects, where each action is performed 2 to 3 times; 20 joint locations are provided per frame, captured using a sensor similar to the Kinect device.

4.2. Experimental Setup and Results

In our experiments, the OpenPose library [33] is used to estimate the pose for the KTH dataset. A pretrained network with the BODY_25 model is used, and the parameters of the experiment have been set as described in [35]. The deep neural network to detect the joints is executed on a Tesla P100 GPU. Two Support Vector Machine (SVM) classifiers with radial basis function kernels are trained, one on the structural features and one on the temporal features. The predicted scores from these SVM classifiers are combined using a neural network: a simple feed-forward network with sigmoid activations at the hidden layers and softmax output neurons is used to solve (11). In the experiments, the neural network has been trained for 50 epochs. A plot of epochs versus cross-entropy loss is shown in Figure 4. The results of the experiments are shown in Figures 5(a)–5(c), which summarize the confusion matrices for the structural features, the temporal features, and the score-level fusion, respectively. It can be seen that the misclassifications are between highly similar actions such as running and jogging. The proposed model has achieved an accuracy of on the KTH dataset.
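A minimal sketch of this classifier setup, using scikit-learn as a stand-in for whatever toolbox was actually used, is given below; the feature matrices are random placeholders, the hidden-layer size is an assumption, and max_iter loosely mirrors the 50 training epochs.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder features standing in for the structural and temporal descriptors of Section 3.
X_struct_train, X_temp_train = rng.normal(size=(60, 100)), rng.normal(size=(60, 60))
X_struct_test, X_temp_test = rng.normal(size=(20, 100)), rng.normal(size=(20, 60))
y_train, y_test = rng.integers(0, 6, 60), rng.integers(0, 6, 20)

# One RBF-kernel SVM per feature type, as in the paper.
svm_struct = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_struct_train, y_train)
svm_temp = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_temp_train, y_train)

# Per-class scores (signed distances to the decision boundaries) from both classifiers
# are concatenated and fed to a small feed-forward network for score-level fusion.
Z_train = np.hstack([svm_struct.decision_function(X_struct_train),
                     svm_temp.decision_function(X_temp_train)])
Z_test = np.hstack([svm_struct.decision_function(X_struct_test),
                    svm_temp.decision_function(X_temp_test)])

fusion = MLPClassifier(hidden_layer_sizes=(32,), activation="logistic", max_iter=50)
fusion.fit(Z_train, y_train)
print("fused accuracy:", fusion.score(Z_test, y_test))
```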

We have conducted experiments on the UTKinect dataset in a manner similar to that in [36, 44]. The confusion matrix for the structural features is presented in Figure 5(d), and the results for the temporal features and the score-level fusion using the neural network are shown in Figures 6(a) and 6(b). The accuracy of the proposed method on the UTKinect dataset is with a deviation of .

The experiment on the MSR Action3D dataset has been conducted using the cross-subject test described in [37], rather than the leave-one-subject-out cross-validation (LOOCV) protocol given in [40]. The actions are grouped into three subsets: AS1, AS2, and AS3. AS1 and AS2 have less interclass variation, whereas AS3 contains complex actions. The obtained results are listed in Table 1, and a summary of the results on all three datasets is reported in Table 2. We used similar classifier settings for all the datasets in the experiments. From Table 2, it is observed that the proposed method outperforms some of the existing state-of-the-art techniques on all three datasets. For the MSR Action3D dataset, our method gives an accuracy of with a deviation of , which is better than the listed methods in Table 2 by more than . The fusion of the classifiers shows better performance than a single classifier.

4.3. Influence of Quantization Parameter b and Histogram Bins k on Accuracy

The performance of SVM classifier-1 shown in Figure 1 is analyzed by varying the quantization parameter b. The number of bits b used in the quantization versus the accuracy is plotted in Figure 7. It is observed that the parameter b has no influence on the results beyond b = 8 for the KTH and MSR Action3D datasets. However, the optimal value of b for the UTKinect dataset is 16. This is due to the variations in the range of data values for the location coordinates.

A plot of the number of bins k in the joint displacement feature versus the accuracy is shown in Figures 8–12. The displacement vectors provide complementary information to the joint angles. Most pose estimation algorithms fail to detect joints that are hidden due to occlusion or self-occlusion and typically return a zero value for such joint locations. These hidden joint locations act as noise and may degrade the performance of the action recognition algorithm.

4.4. Analysis of Most Significant Joints

In the KTH dataset, the hand-waving action is mainly due to the movement of the arm joints; the other joints do not contribute to the action. The most important joints involved in an action are depicted in Figure 13. It can be observed that the actions walking, running, and jogging have similar characteristics in terms of angular movements. This is very useful in identifying outliers while detecting abnormalities in actions. (Dominant joints with respect to angular movement for the other datasets are included in Figures 14–17.)

The accuracy of the proposed system has been analyzed using two types of combiners: a trainable combiner using a neural network and a fixed combiner using score averaging [45]. The comparison is shown in Figure 18. The neural network is the better combiner, as it is able to find the optimal weights for the fusion, whereas score averaging acts as a fixed combiner that gives equal importance to both classifiers and therefore yields lower accuracy. The neural network-based fusion thus enhances the performance in terms of accuracy. It can be seen from Figures 19–23 that the fusion technique results in better performance.

A correlation analysis is performed on the outputs of the two SVM classifiers, and the result is listed in Table 3. The analysis shows that the average correlation is less than 0.5, indicating that the two classifiers only moderately agree on the classification and therefore provide complementary information. Consequently, the fusion of these scores leads to an improvement in the overall accuracy of the system.
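One way to perform such a correlation check, assuming the per-class scores of the two classifiers are available as arrays, is sketched below; averaging the per-class Pearson correlations is an assumed interpretation of the analysis.

```python
import numpy as np


def mean_score_correlation(scores_a, scores_b):
    """Average per-class Pearson correlation between two classifiers' score matrices.

    scores_a, scores_b : arrays of shape (num_samples, num_classes)
    """
    corrs = [np.corrcoef(scores_a[:, c], scores_b[:, c])[0, 1]
             for c in range(scores_a.shape[1])]
    return float(np.mean(corrs))
```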

5. Conclusions

We have developed a method for human action recognition based on skeletal joints. The proposed method extracts structural and temporal features: the structural variations are captured using joint angles, and the temporal variations are represented using joint displacement vectors. The proposed approach is simple, as it uses single-view joint locations, and yet it outperforms some of the state-of-the-art techniques. We also showed that, in the absence of a Kinect sensor, a pose estimation algorithm can be used as a preliminary step. The proposed method shows promising results when the temporal and structural features are fused at the score level, and it is therefore suitable for robust human action recognition tasks.

Data Availability

The references of the datasets used in the experiment are provided in the reference list.

Conflicts of Interest

The authors have no conflicts of interest regarding the publication of this paper.