References | Year | Representation (G: global; L: local; D: depth) | Classification | Modality | Level | Dataset | Performance result |
--- | --- | --- | --- | --- | --- | --- | --- |
Yamato et al. [94] | 1992 | Symbols converted from mesh feature vector and encoded by vector quantization (G) | HMM | RGB | Action/activity | Collected dataset: 3 subjects × 300 combinations | 96% accuracy |
Darrell and Pentland [92] | 1993 | View model sets (G) | Dynamic time warping | RGB | Action primitive | Collected instances of 4 gestures. | 96% accuracy (“Hello” gesture) |
Brand et al. [102] | 1997 | 2D blob feature (G) | Coupled HMM (CHMM) | RGB | Action primitive | Collected dataset: 52 instances. 3 gestures × 17 times. | 94.2% accuracy |
Oliver et al. [97] | 2000 | 2D blob feature (G) | (i) CHMM; (ii) HMM | RGB | Interaction | Collected dataset: 11–75 training sequences + 20 testing sequences. Organized as 5-level hierarchical interactions. | (i) 84.68% accuracy (average); (ii) 98.43% accuracy (average) |
Bobick and Davis [17] | 2001 | Motion energy image & motion history image (G) | Template matching by measuring Mahalanobis distance | RGB | Action/activity | Collected dataset: 18 aerobic exercises × 7 views. | (a) 12/18 (single view); (b) 15/18 (multiple views) |
Efros et al. [10] | 2003 | Optical flow (G) | K-nearest neighbor | RGB | Action/activity | (a) Ballet dataset; (b) tennis dataset; (c) football dataset | (a) 87.4% accuracy; (b) 64.3% accuracy; (c) 65.4% accuracy |
Park and Aggarwal [103] | 2004 | Body model by combining an ellipse representation and a convex hull-based polygonal representation (G) | Dynamic Bayesian network | RGB | Interaction | Collected dataset: 56 instances. 9 interactions × 6 pairs of people. | 78% accuracy |
Schüldt et al. [105] | 2004 | Space-time interest points (L) | SVM | RGB | Action/activity | KTH dataset | 71.7% accuracy |
Blank et al. [5] | 2005 | Space-time shape (G) | Spectral clustering algorithm | RGB | Action/activity | Weizmann dataset | 99.63% accuracy |
Oikonomopoulos et al. [36] | 2005 | Spatiotemporal salient points (L) | RVM | RGB | Action/activity | Collected dataset: 152 instances. 19 activities × 4 subjects × 2 times. | 77.63% recall |
Dollar et al. [37] | 2005 | Space-time interest points (L) | (i) 1-nearest neighbor (1NN); (ii) SVM | RGB | Action/activity | KTH dataset | (i) 78.5% accuracy (1NN); (ii) 81.17% accuracy (SVM) |
Ke et al. [38] | 2005 | Integral videos (L) | Adaboost | RGB | Action/activity | KTH dataset | 62.97% accuracy |
Veeraraghavan et al. [93] | 2005 | Space-time shape (G) | Nonparametric methods by extending DTW | RGB | Action/activity | (a) USF dataset [154]; (b) CMU dataset [155]; (c) MOCAP dataset | No accuracy data presented. |
Duong et al. [98] | 2005 | High-level activities are represented as sequences of atomic activities; atomic activities are represented only by their durations (−). | Switching hidden semi-Markov model (S-HSMM) | RGB | Interaction | Collected dataset: 80 video sequences. 6 high-level activities. | 97.5% accuracy (average; Coxian model) |
Weinland et al. [20] | 2006 | Motion history volumes (G) | Principal component analysis (PCA) + Mahalanobis distance | RGB | Action/activity | IXMAS dataset [20] | 93.33% accuracy |
Lu et al. [49] | 2006 | PCA-HOG (L) | HMM | RGB | Action/activity | (a) Soccer sequences dataset [10]; (b) Hockey sequences dataset [156] | The implemented system can track subjects in videos and recognize their activities robustly. No accuracy data presented. |
Ikizler and Duygulu [18] | 2007 | Histogram of oriented rectangles encoded with BoVW (G) | (i) Frame-by-frame voting; (ii) global histogramming; (iii) SVM classification; (iv) dynamic time warping | RGB | Action/activity | Weizmann dataset | 100% accuracy (DTW) |
Huang and Xu [19] | 2007 | Envelope shape acquired from silhouettes (G) | HMM | RGB | Action/activity; action primitive | Collected dataset: 9 activities × 7 subjects × 3 times × 3 views. | Subject dependent + view independent: 97.3% accuracy; subject independent + view independent: 95.0% accuracy; subject independent + view dependent: 94.4% accuracy |
Scovanner et al. [46] | 2007 | 3D SIFT (L) | SVM | RGB | Action/activity | Weizmann dataset | 82.6% accuracy |
Vail et al. [106] | 2007 | — | (i) HMM; (ii) conditional random field (CRF) | — | Interaction | Data from the hourglass and the unconstrained tag domains generated by robot simulator. | 98.1% accuracy (CRF, hourglass); 98.5% accuracy (CRF, unconstrained tag domains) |
Cherla et al. [21] | 2008 | Width feature of normalized silhouette box (G) | Dynamic time warping | RGB | Action/activity | IXMAS dataset [20] | 80.05% accuracy; 76.28% accuracy (cross view) |
Tran and Sorokin [25] | 2008 | Silhouette and optical flow (G) | (i) Naïve Bayes (NB); (ii) 1-nearest neighbor (1NN); (iii) 1-nearest neighbor with rejection (1NN-R); (iv) 1-nearest neighbor with metric learning (1NN-M) | RGB | Interaction; Action/activity | (a) Weizmann dataset; (b) UMD dataset [15]; (c) IXMAS dataset [20]; (d) collected dataset: 532 instances. 10 activities × 8 subjects. | (a) 100% accuracy; (b) 100% accuracy; (c) 81% accuracy; (d) 99.06% accuracy (1NN-M & L1SO) |
Achard et al. [26] | 2008 | Semi-global features extracted from space-time micro volumes (L) | HMM | RGB | Action/activity | Collected dataset: 1614 instances. 8 activities × 7 subjects × 5 views. | 87.39% accuracy (average) |
Rodriguez et al. [91] | 2008 | Action MACH (maximum average correlation height) (G) | Maximum average correlation height filter | RGB | Interaction; Action/activity | (a) KTH dataset; (b) collected feature films dataset: 92 kissing + 112 hitting/slapping; (c) UCF dataset; (d) Weizmann dataset | (a) 80.9% accuracy; (b) 66.4% for kissing & 67.2% for hitting/slapping; (c) 69.2% accuracy; (d) reported a significant increase in algorithm efficiency, with no overall accuracy data presented |
Kläser et al. [30] | 2008 | Histograms of oriented 3D spatiotemporal gradients (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) Weizmann dataset; (c) Hollywood dataset | (a) 91.4% (±0.4) accuracy; (b) 84.3% (±2.9) accuracy; (c) 24.7% precision |
Willems et al. [39] | 2008 | Hessian-based STIP detector & SURF3D (L) | SVM | RGB | Action/activity | KTH dataset | 84.26% accuracy |
Laptev et al. [50] | 2008 | STIP with HOG and HOF, encoded with BoVW (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) Hollywood dataset | (a) 91.8% accuracy; (b) 38.39% accuracy (average) |
Natarajan and Nevatia [95] | 2008 | 23 degrees body model (G) | Hierarchical variable transition HMM (HVT-HMM) | RGB | Action/activity; Action primitive | (a) Weizmann dataset; (b) gesture dataset in [157] | (a) 100% accuracy; (b) 90.6% accuracy |
Natarajan and Nevatia [107] | 2008 | 2-layer graphical model: top layer corresponds to actions in particular viewpoint; lower layer corresponds to individual poses (G) | Shape, flow, duration-conditional random field (SFD-CRF) | RGB | Action/activity | Collected dataset: 400 instances. 6 activities × 4 subjects × 16 views (×6 backgrounds). | 78.9% accuracy |
Ning et al. [108] | 2008 | Appearance and position context (APC) descriptor encoded by BoVW (L) | Latent pose conditional random fields (LPCRF) | RGB | Action/activity; Action primitive | HumanEva dataset | 95.0% accuracy (LPCRFinit) |
Marszalek et al. [158] | 2009 | SIFT, HOG, HOF encoded by BoVW (L) | SVM | RGB | Interaction | Hollywood2 dataset | 35.5% accuracy |
Li et al. [76] | 2010 | Action graph of salient postures (D) | Non-Euclidean relational fuzzy (NERF) C-means & Hausdorff distance-based dissimilarity measure | Depth | Action/activity | MSR Action3D dataset | 91.6% accuracy (train/test = 1/2); 94.2% accuracy (train/test = 2/1); 74.7% accuracy (train/test = 1/1 & cross subject) |
Suk et al. [101] | 2010 | YIQ color model for skin pixels; histogram-based color model for face region; optical flow for tracking of hand motion (L) | Dynamic Bayesian network | RGB | Action primitive | Collected dataset: 498 instances. (a) 10 gestures × 7 subjects × 7 times (isolated gesture); (b) 8 longer videos contain 50 gestures (continuous gestures) | (a) 99.59% accuracy; (b) 84% recall & 80.77% precision |
Baccouche et al. [124] | 2010 | SIFT descriptor encoded by BoVW (L) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | RGB | Interaction | MICC-Soccer-Actions-4 dataset [159] | 92% accuracy |
Kumari and Mitra [29] | 2011 | Discrete Fourier transform on silhouettes (G) | K-nearest neighbor | RGB | Action/activity | (a) MuHaVi dataset; (b) DA-IICT dataset | (a) 96% accuracy; (b) 82.6667% accuracy |
Wang et al. [51] | 2011 | Dense trajectory with HOG, HOF, MBH (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) YouTube dataset; (c) Hollywood2 dataset; (d) UCF Sport dataset | (a) 94.2% accuracy; (b) 84.2% accuracy; (c) 58.3% accuracy; (d) 88.2% accuracy |
Wang et al. [56] | 2012 | STIP with HOG and HOF, encoded with various encoding methods (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) HMDB51 dataset | (a) 92.13% accuracy (Fisher vector); (b) 29.22% accuracy (Fisher vector) |
Zhao et al. [77] | 2012 | Combined representations: (a) RGB: HOG & HOF upon space-time interest points (L); (b) depth: local depth pattern at each interest point (D) | SVM | RGB-D | Interaction | RGBD-HuDaAct dataset | 89.1% accuracy |
Yang et al. [78] | 2012 | DMM-HOG (D) | SVM | Depth | Action/activity | MSR Action3D dataset | 95.83% accuracy (train/test = 1/2); 97.37% accuracy (train/test = 2/1); 91.63% accuracy (train/test = 1/1 & cross subject) |
Xia et al. [84] | 2012 | Histograms of 3D joint locations (D) | HMM | Depth | Action/activity | (a) collected dataset: 6220 frames, 200 samples. 10 activities × 10 subjects × 2 times. (b) MSR Action3D dataset | (a) 90.92% accuracy; (b) 97.15% accuracy (highest); 78.97% accuracy (cross subject) |
Yang and Tian [85] | 2012 | EigenJoints (D) | Naïve-Bayes-Nearest-Neighbor (NBNN) | Depth | Action/activity | MSR Action3D dataset | 96.8% accuracy; 81.4% accuracy (cross subject) |
Wang et al. [160] | 2012 | Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D) | SVM | Depth | Interaction; Action/activity | (a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) CMU MOCAP dataset | (a) 88.2% accuracy; (b) 85.75% accuracy; (c) 98.13% accuracy |
Wang et al. [53] | 2013 | Improved dense trajectory with HOG, HOF, MBH (L) | SVM | RGB | Interaction | (a) Hollywood2 dataset; (b) HMDB51 dataset; (c) Olympic Sports dataset [161]; (d) UCF50 dataset [162] | (a) 64.3% accuracy; (b) 57.2% accuracy; (c) 91.1% accuracy; (d) 91.2% accuracy |
Oreifej and Liu [74] | 2013 | Histogram of oriented 4D surface normals (D) | SVM | Depth | Action/activity; Action primitive | (a) MSR Action3D dataset; (b) MSR Gesture3D dataset; (c) Collected 3D Action Pairs dataset | (a) 88.89% accuracy; (b) 92.45% accuracy; (c) 96.67% accuracy |
Chaaraoui [88] | 2013 | Combined representations: (a) RGB: silhouette (G); (b) depth: skeleton joints (D) | Dynamic time warping | RGB-D | Action/activity | MSR Action3D dataset | 91.80% accuracy |
Ren et al. [152] | 2013 | Time-series curve of hand shape (G) | Dissimilarity measure based on Finger-Earth Mover’s Distance (FEMD) | RGB | Action primitive | Collected dataset: 1000 instances. 10 gestures × 10 subjects × 10 times. | 93.9% accuracy |
Ni et al. [163] | 2013 | Depth-Layered Multi-Channel STIPs (L) | SVM | RGB-D | Interaction | RGBD-HuDaAct database | 81.48% accuracy (codebook size = 512 & SPM kernel) |
Grushin et al. [123] | 2013 | STIP with HOF (L) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | RGB | Action/activity | KTH dataset | 90.7% accuracy |
Peng et al. [31] | 2014 | (i) STIP with HOG and HOF, encoded by various encoding methods (L); (ii) iDT with HOG, HOF, MBHx, MBHy, encoded by various encoding methods (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; (b) UCF50 dataset; (c) UCF101 dataset | Hybrid representation: (a) 61.1% accuracy; (b) 92.3% accuracy; (c) 87.9% accuracy |
Peng et al. [32] | 2014 | Improved dense trajectory encoded with stacked Fisher kernel (L) | SVM | RGB | Interaction; Action/activity | (a) YouTube dataset; (b) HMDB51 dataset; (c) J-HMDB dataset | (a) 93.38% accuracy; (b) 66.79% accuracy; (c) 67.77% accuracy |
Wang et al. [82] | 2014 | Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D) | SVM | Depth | Interaction; Action/activity | (a) MSR Action3D dataset; (b) MSR DailyActivity3D dataset; (c) Multiview 3D event dataset; (d) Cornell Activity Dataset [164] | (a) 88.2% accuracy; (b) 85.75% accuracy; (c) 88.34% accuracy (cross subject); 86.76% accuracy (cross view); (d) 97.06% accuracy (same person); 74.70% accuracy (cross person) |
Simonyan and Zisserman [115] | 2014 | Spatial stream ConvNets & optical flow based temporal stream ConvNets (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; (b) UCF101 dataset | (a) 59.4% accuracy; (b) 88.0% accuracy |
Lan et al. [33] | 2015 | Improved dense trajectory with HOG, HOF, MBHx, MBHy enhanced with multiskip feature tracking (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; (b) Hollywood2 dataset; (c) UCF101 dataset; (d) UCF50 dataset; (e) Olympic Sports dataset | (a) 65.1% accuracy (L = 3); (b) 68.0% accuracy (L = 3); (c) 89.1% accuracy (L = 3); (d) 94.4% accuracy (L = 3); (e) 91.4% accuracy (L = 3) |
Shahroudy et al. [83] | 2015 | Combined representations: (a) RGB: dense trajectories with HOG, HOF, MBH (L); (b) depth: skeleton joints (D) | SVM | RGB-D | Interaction | MSR DailyActivity3D dataset | 81.9% accuracy |
Wang et al. [114] | 2015 | Weighted hierarchical depth motion maps (D) | Three-channel deep convolutional neural networks (3ConvNets) | Depth | Interaction; Action/activity | (a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) UTKinect Action dataset [84]; (d) MSR DailyActivity3D dataset; (e) Combined dataset of above | (a) 100% accuracy; (b) 100% accuracy; (c) 90.91% accuracy; (d) 85% accuracy; (e) 91.56% accuracy |
Wang et al. [165] | 2015 | Pseudo-color images converted from DMMs (D) | Three-channel deep convolutional neural networks (3ConvNets) | Depth | Interaction; Action/activity | (a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) UTKinect Action dataset [84] | (a) 100% accuracy; (b) 100% accuracy; (c) 90.91% accuracy |
Wang et al. [117] | 2015 | Trajectory-pooled deep-convolutional descriptor encoded by Fisher kernel (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; (b) UCF101 dataset | (a) 65.9% accuracy; (b) 91.5% accuracy |
Veeriah et al. [125] | 2015 | (i) HOG3D on the KTH 2D action dataset (L); (ii) skeleton-based features including skeleton positions, normalized pairwise angles, offset of joint positions, histogram of the velocity, and pairwise joint distances (D) | Differential recurrent neural network (dRNN) | RGB-D | Action/activity | (a) KTH dataset; (b) MSR Action3D dataset | (a) 93.96% accuracy (KTH-1); 92.12% accuracy (KTH-2); (b) 92.03% accuracy |
Du et al. [126] | 2015 | Representations of skeleton data extracted by subnets (D) | Hierarchical bidirectional recurrent neural network (HBRNN) | RGB-D | Action/activity | (a) MSR Action3D dataset; (b) Berkeley MHAD Action dataset [166]; (c) HDM05 dataset [167] | (a) 94.49% accuracy; (b) 100% accuracy; (c) 96.92% (±0.50) accuracy |
Zhen et al. [58] | 2016 | STIP with HOG3D, encoded with various encoding methods (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) UCF YouTube dataset; (c) HMDB51 dataset | (a) 94.1% (Local NBNN); (b) 63.0% (improved Fisher kernel); (c) 30.5% (improved Fisher kernel) |
Chen et al. [81] | 2016 | Action graph of skeleton-based features (D) | Maximum likelihood estimation | Depth | Action/activity | (a) MSR Action3D dataset; (b) UTKinect Action dataset | (a) 95.56% accuracy (cross subject); 96.1% accuracy (three subset evaluation); (b) 95.96% accuracy |
Zhu et al. [87] | 2016 | Co-occurrence features of skeleton joints (D) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | Depth | Interaction; Action/activity | (a) SBU Kinect interaction dataset [168]; (b) HDM05 dataset; (c) CMU dataset; (d) Berkeley MHAD Action dataset | (a) 90.41% accuracy; (b) 97.25% accuracy; (c) 81.04% accuracy; (d) 100% accuracy |
Li et al. [116] | 2016 | VLAD for deep dynamics (G) | Deep convolutional neural networks (ConvNets) | RGB | Interaction; Action/activity | (a) UCF101 dataset; (b) Olympic Sports dataset; (c) THUMOS15 dataset [116] | (a) 84.65% accuracy; (b) 90.81% accuracy; (c) 78.15% accuracy |
Berlin & John [119] | 2016 | Harris corner-based interest points and histogram-based features (L) | Deep neural networks (DNNs) | RGB | Interaction | UT Interaction dataset [169] | 95% accuracy on set1; 88% accuracy on set2 |
Huang et al. [120] | 2016 | Lie group features (L) | Lie Group Network (LieNet) | Depth | Interaction; Action/activity | (a) G3D-Gaming dataset [170]; (b) HDM05 dataset; (c) NTU RGBD dataset [171] | (a) 89.10% accuracy; (b) 75.78% ± 2.26 accuracy; (c) 66.95% accuracy |
Mo et al. [113] | 2016 | Automatically extracted features from skeleton data (D) | Convolutional neural networks (ConvNets) + multilayer perceptron | Depth | Interaction | CAD-60 dataset | 81.8% accuracy |
Shi et al. [55] | 2016 | Three stream sequential deep trajectory descriptor (L) | Recurrent neural networks (RNN) and deep convolutional neural networks (ConvNets) | RGB | Interaction; Action/activity | (a) KTH dataset; (b) HMDB51 dataset; (c) UCF101 dataset [172] | (a) 96.8% accuracy; (b) 65.2% accuracy; (c) 92.2% accuracy |
Yang et al. [79] | 2017 | Low-level polynormals assembled from local neighboring hypersurface normals and then aggregated by Super Normal Vector (D) | Linear classifier | Depth | Interaction; Action/activity; Action primitive | (a) MSR Action3D dataset; (b) MSR Gesture3D dataset; (c) MSR ActionPairs3D dataset [173]; (d) MSR DailyActivity3D dataset | (a) 93.45% accuracy; (b) 94.74% accuracy; (c) 100% accuracy; (d) 86.25% accuracy |
Jalal et al. [80] | 2017 | Multifeatures extracted from human body silhouettes and joints information (D) | HMM | Depth | Interaction; Action/activity | (a) Online self-annotated dataset [174]; (b) MSR DailyActivity3D dataset; (c) MSR Action3D dataset | (a) 71.6% accuracy; (b) 92.2% accuracy; (c) 93.1% accuracy |