Abstract

In this work, a nonfiducial electrocardiogram (ECG) identification algorithm based on statistical features and a random forest classifier is presented. Two feature extraction approaches are investigated: direct and band-based. In the former, eleven simple statistical features are extracted directly from a single-lead ECG signal segment. In the latter, the single-lead ECG signal is first decomposed into bands, and the statistical features are extracted from each segment of a given band and concatenated to form the feature vector. Nonoverlapping segments of different lengths (i.e., 1, 3, 5, 7, 10, or 15 sec) are examined. The extracted feature vectors are fed to a random forest classifier for the purpose of identification. This study considers 290 reference subjects from the ECG database of the Physikalisch-Technische Bundesanstalt (PTB). The proposed identification algorithm achieved an accuracy rate of 99.61% utilizing the single limb lead (I) with the band-based approach. A single chest lead (V1), augmented limb lead (aVF), and Frank’s lead (Vx) achieved accuracy rates of 99.37%, 99.76%, and 99.76%, respectively, using the same approach.

1. Introduction

The aim of a biometric system is to uniquely identify or authenticate persons based on one or more behavioral and/or physiological characteristics, including the retina, fingerprint, or gait [1, 2]. Subject recognition is essential for many modern applications that touch different aspects of our daily lives, such as financial transactions, data protection, access control, entertainment, cars, and smartphones [3–5]. However, the biometric traits currently in use have different operational trade-offs in terms of performance, robustness, measurability, and liveness detection [6–10]. Around three decades ago, Forsen et al. suggested the use of the electrocardiogram (ECG) as a biometric trait [11]. The works of Biel et al. [12, 13] are considered the first attempt to use ECGs for biometric purposes, considering the biometric characteristics of measurability (ease with which the characteristic is obtained), permanence (no change over time), universality (possession of the characteristic by the individual), and uniqueness (no two individuals share the same characteristic) [14–17]. Since then, many researchers have proposed various ECG-based identification approaches [1, 4, 18–27] using private and/or public databases [28, 29].

A biometric identification system involves three main phases: signal denoising, feature extraction, and classification. Signal denoising [30–34] is an important task because the ECG signal is susceptible to noise from many sources, such as power-line interference and electrode movement [35, 36]. Feature extraction is needed to provide unique biomarkers for a given ECG signal. Feature extraction methods can be grouped into three main categories: fiducial-based approaches, which extract features while preserving the characteristics of the ECG signal, e.g., the amplitudes and intervals of heartbeats [20, 31, 37–43]; non-fiducial-based approaches, which do not require such precise knowledge of ECG characteristics [44–53]; and hybrid approaches [54, 55].

The classifier is the last stage of a biometric identification system. Different classifiers have been used in the literature, such as the neural network (NN), k-nearest neighbors algorithm (k-NN), support vector machine (SVM), and random forest [30, 31, 33, 49, 54–56]. Recently, deep learning has also been proposed for ECG biometric identification systems [57, 58].

In this study, we propose a new nonfiducial method for subject identification based on statistical features and a random forest classifier. For feature extraction, we propose two approaches: direct and band-based. In the direct approach, eleven statistical features are extracted directly from the single-lead ECG signal and fed to a random forest classifier. In the band-based approach, the single-lead ECG signal is first decomposed into bands, and the statistical features are extracted from each band and concatenated to form the feature vector, which is then fed to the random forest classifier.

This study uses the Physikalisch-Technische Bundesanstalt (PTB) dataset, which is a publicly available database. This database is compiled by the National Metrology Institute of Germany. It contains combinations of digitized ECGs of both normal and abnormal subjects’ recordings, which are provided for research via the link https://PhysioNet.org [29]. Fifteen concurrently measured signals are included in each record: three limb leads (I, II, and III), three augmented limb leads (aVR, aVL, and aVF), six chest leads (V1, V2, V3, V4, V5, and V6), and three Frank leads (Vx, Vy, and Vz).

The present study offers several advantages over existing methods:

(1) It uses simple statistics for feature extraction, including the mean, standard deviation, median, maximum value, minimum value, range, interquartile range, interquartile first quarter (Q1), interquartile third quarter (Q3), kurtosis, and skewness of the ECG signal. We show by the t-distribution stochastic neighbor embedding (t-SNE) algorithm that subjects’ features based on these statistics are separable, which leads to a high subject identification rate. The t-SNE is a nonlinear dimensionality reduction technique, which is utilized to visualize an N-dimensional feature space using a two-dimensional space [59].

(2) It provides extensive investigations using a reference population of 290 subjects (238 nonhealthy subjects and 52 healthy subjects) from the PTB ECG database. To the best of our knowledge, this is the largest number of subjects considered in the literature to produce results in the context of subject identification using ECG signals. Further, this study is the first to show identification results using 290 subjects from the signals of each of the 15 previously mentioned leads; see Tables 1 and 2.

(3) It reports high identification accuracy results for 290 (healthy and nonhealthy) subjects using features extracted from simple statistics. Specifically, it has been found that a data segment length of 7 seconds from a single limb lead (I) gives an average accuracy of 99.61% using the band-based approach, while a single chest lead (V1), augmented limb lead (aVF), and Frank’s lead (Vx) give an average accuracy of 99.73%, 99.76%, and 99.76%, respectively, using the same approach.

The rest of the paper is organized as follows. Section 2 describes the proposed identification method. Section 3 presents the performance evaluation results for the proposed approaches and compares them to state-of-the-art identification systems. Finally, Section 4 gives concluding remarks.

2. Method

The proposed method comprises two phases: enrollment and identification. Each phase consists of ECG signal acquisition, preprocessing, and feature extraction. After all the subjects are enrolled, the registered ECG signals are used to train the random forest classifier. In the identification phase, the trained model is used to identify the subjects. Figure 1 shows the process of the proposed method. The details of each stage are presented in the following subsections.

2.1. Data Acquisition and Preprocessing

The PTB database is constructed utilizing 15 leads, each of which measures a specific electrical potential difference. Each signal is sampled at 1000 samples/sec with 16-bit resolution. The length of the recording session for each subject was between 31 and 120 sec. The PTB signals undergo two main preprocessing operations: detrending and inverting. The first operation is required because some linear trend is present in the database signals, possibly originating from different sources (e.g., voltage fluctuations in the recording device and the subject’s muscle movements); this trend can hinder the data analysis and is therefore removed before further processing. Detrending is achieved by subtracting from each lead the least-squares-fit straight line of the data. ECG signals are upside down in some cases and thus require inversion. Figures 2 and 3 show the time domain of processed 5 sec I, aVR, V1, and Vx lead signals and the frequency domain for the same leads of a healthy subject (S104). The Frank lead Vx signal has the highest amplitude, as shown in the time domain, while in the frequency domain most of the energy is concentrated below 35 Hz in all leads.
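The detrending step described above can be sketched as follows. This is a minimal illustration on a synthetic signal, not the authors' code: the function names and the `inverted` flag are hypothetical, and the text does not say how upside-down recordings are detected, so the flag is simply supplied by the caller.

```python
import numpy as np

def detrend_linear(x):
    """Subtract the least-squares straight-line fit from a 1-D signal."""
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size)
    slope, intercept = np.polyfit(t, x, deg=1)
    return x - (slope * t + intercept)

def invert_if_needed(x, inverted):
    """Flip polarity when the caller has flagged the lead as upside down.

    How a recording is flagged is an assumption left to the caller.
    """
    x = np.asarray(x, dtype=float)
    return -x if inverted else x

# Synthetic example: a 2 Hz sine riding on a linear drift, sampled at 1000 Hz.
fs = 1000
t = np.arange(5 * fs) / fs
raw = np.sin(2 * np.pi * 2 * t) + 0.3 * t + 1.0
clean = detrend_linear(raw)
```

After detrending, the residual has zero mean and no remaining linear component, which is the property the subsequent statistics rely on.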

2.2. Feature Extraction

We propose two approaches to extract features from the ECG signal: direct and band-based. In the direct approach, the preprocessed ECG signal is segmented, and statistical features are extracted from each segment to form the feature vector. In the band-based approach, the preprocessed ECG signal is decomposed into bands, and each band signal is segmented. The statistical features are then extracted from each segment, and the feature vector is formed by concatenating the statistical features of the corresponding segment from all bands. Figure 4 presents the two approaches.

The normal ECG signal’s frequency spectrum ranges from 0.01 to 100 Hz, where 90% of the energy lies in the range of 0.25 Hz to 35 Hz [60]. Therefore, the identification accuracy of the direct single-lead approach can be improved by considering multiple spectral components. Here, the single-lead ECG signal is decomposed into seven subbands by employing a filter bank of seven finite impulse response band-pass filters. Each filter has a bandwidth of 5 Hz, as follows: 0.1-5, 5-10, …, 30-35 Hz. Figure 5 shows the frequency responses of the filters employed to perform signal decomposition.
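A filter bank of this kind can be sketched with SciPy's `firwin`. The filter length (`NUM_TAPS`), the default window, and the use of zero-phase filtering via `filtfilt` are assumptions made for illustration; the text only specifies seven FIR band-pass filters with the band edges listed above.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 1000        # PTB sampling rate in samples/sec (from the text)
NUM_TAPS = 1001  # assumed filter length; the text does not state it

# Seven contiguous 5 Hz-wide bands covering 0.1 to 35 Hz.
BAND_EDGES = [(0.1, 5.0)] + [(float(lo), float(lo + 5)) for lo in range(5, 35, 5)]

def design_filter_bank(fs=FS, num_taps=NUM_TAPS):
    """Return one FIR band-pass filter (array of taps) per band."""
    return [firwin(num_taps, [lo, hi], pass_zero=False, fs=fs)
            for lo, hi in BAND_EDGES]

def decompose(signal, bank):
    """Filter the signal with every band-pass filter.

    The result has shape (number of bands, number of samples).
    filtfilt runs each filter forward and backward (zero phase),
    which is an illustrative choice rather than the paper's.
    """
    return np.stack([filtfilt(taps, [1.0], signal) for taps in bank])

# Synthetic check: a 12.5 Hz tone should land in the 10-15 Hz band (index 2).
t = np.arange(10 * FS) / FS
tone = np.sin(2 * np.pi * 12.5 * t)
bands = decompose(tone, design_filter_bank())
energies = (bands ** 2).sum(axis=1)
```

Note that `filtfilt` needs an input longer than three times the filter length, so with the assumed 1001 taps the signal is filtered before it is cut into short segments.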

A nonoverlapping sliding window (1, 3, 5, 7, 10, or 15 sec) is applied for partitioning the ECG data into segments. Different window sizes are used to examine the effect of segment length on the identification system, irrespective of the individual heartbeats or specific characteristics of ECG waves.
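The segmentation step can be sketched as a simple reshape. The function name is hypothetical, and dropping the trailing samples that do not fill a whole window is an assumption; the text does not say how leftover samples are handled.

```python
import numpy as np

def segment_signal(x, fs, window_sec):
    """Cut a 1-D signal into nonoverlapping windows of window_sec seconds.

    Trailing samples that do not fill a complete window are dropped
    (an assumption). Returns an array of shape (n_segments, window_len).
    """
    x = np.asarray(x)
    window_len = int(round(window_sec * fs))
    n_segments = x.size // window_len
    return x[: n_segments * window_len].reshape(n_segments, window_len)

# A 31 sec record at 1000 samples/sec yields four complete 7 sec segments.
fs = 1000
record = np.arange(31 * fs)
segments = segment_signal(record, fs, window_sec=7)
```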

Eleven statistical features are extracted from each segment, as listed in Section 1. These features are selected to measure certain ECG signal characteristics. The mean and median measure the central tendency of the ECG signal, while the standard deviation, range, and interquartile range measure its statistical dispersion. The kurtosis and skewness measure the sharpness of the peak and the asymmetry of the ECG signal distribution, respectively. The other statistics (the minimum value, maximum value, interquartile first quarter, and interquartile third quarter) are self-explanatory. The definitions of these statistics and their estimation from a finite-length data record are well known and can be found in [61]. Figure 6 shows their histograms for a data segment of length 7 sec.
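The eleven statistics can be computed with NumPy as in the sketch below. The estimator details are assumptions, since the text defers the definitions to [61]: the sample standard deviation uses `ddof=1`, quartiles use NumPy's default linear interpolation, and kurtosis is the non-excess (Pearson) form. The function names are hypothetical.

```python
import numpy as np

def statistical_features(segment):
    """Return the eleven statistics of one segment as a length-11 vector."""
    s = np.asarray(segment, dtype=float)
    q1, median, q3 = np.percentile(s, [25, 50, 75])
    centered = s - s.mean()
    m2 = np.mean(centered ** 2)        # second central moment
    skewness = np.mean(centered ** 3) / m2 ** 1.5
    kurtosis = np.mean(centered ** 4) / m2 ** 2  # equals 3 for a Gaussian
    return np.array([
        s.mean(), s.std(ddof=1), median,
        s.max(), s.min(), s.max() - s.min(),
        q3 - q1, q1, q3,
        kurtosis, skewness,
    ])

def band_feature_vector(band_segments):
    """Concatenate the statistics of the same segment taken from each band."""
    return np.concatenate([statistical_features(b) for b in band_segments])

direct = statistical_features([1, 2, 3, 4, 5])

# Seven band signals for one segment give a 7 x 11 = 77-dimensional vector.
rng = np.random.default_rng(0)
banded = band_feature_vector(rng.normal(size=(7, 7000)))
```

With seven bands, the concatenated vector has 77 entries, which matches the dimensionality quoted later for the band-based feature space.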

2.3. The Random Forest Classifier

The random forest (RF) is an ensemble learning method developed by Breiman [62] and used for classification and regression. It includes a large number of decision tree classifiers. The classification process in a decision tree can be thought of as asking a series of questions about the available data until a decision is reached. Each tree in the forest is constructed from a subset of the training dataset randomly selected with replacement and grows without pruning. A tree consists of nodes, which are either branches (have child nodes) or leaves (terminal nodes). The best split at each node in a tree is found by employing random feature selection methods [51–53]. Figure 7 presents an illustrative example of splitting a node. The node has balanced samples, 20 red and 20 blue. The aim is to find the split that generates child nodes with the least diversity, which leads to a more certain decision. The figure shows three candidate splits A, B, and C that are generated by randomly selecting a set of features and a threshold value. We can see that split C is the best, with feature set number 3 and a threshold value of 0.23, since it produces branches with the highest certainty. The first branch has a probability of 0.77 (17 over 22) for the red class. The second branch has a probability of 0.83 (15 over 18) for the blue class. The next step of the decision tree creation process is to find the best split on both child nodes. The random forest makes decisions based on the average of the probabilities predicted by the trees. The major advantages of random forest are that it is robust to overfitting [62], produces high classification accuracy, and provides feature importance analysis [63].
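To make the notion of "least diversity" concrete, the sketch below scores the split C example with the Gini impurity. Gini is an assumed impurity measure; the text does not name the criterion used. The class counts follow from the figures quoted above: 17 red and 5 blue in the first branch, 3 red and 15 blue in the second.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

parent = (20, 20)             # 20 red, 20 blue
split_c = [(17, 5), (3, 15)]  # (red, blue) counts in each branch

parent_impurity = gini(parent)
split_impurity = weighted_gini(split_c)
impurity_decrease = parent_impurity - split_impurity
```

Under this measure the parent node has impurity 0.5 and split C reduces it to 7/22 (about 0.318); a competing split would be preferred only if it produced a lower weighted impurity.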

The classifier undergoes two stages: training and testing. In the training phase, each tree is constructed using a sample with replacement of the training dataset. In the testing phase, each tree classifies the testing instance and a majority voting technique is used to classify the instance. Random forest has been used in various domains such as astronomy [64] and medicine [65–68]. In this work, 100 decision tree classifiers are employed.
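A forest of 100 trees can be set up with scikit-learn as sketched below. The library choice, the remaining hyperparameters (left at their defaults), and the synthetic feature data are all assumptions; only the number of trees comes from the text. scikit-learn's forest combines trees by averaging their predicted class probabilities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_subjects, n_segments, n_features = 5, 20, 11

# Synthetic stand-in for extracted features: one well-separated cluster per subject.
centers = rng.normal(scale=10.0, size=(n_subjects, n_features))
X = np.repeat(centers, n_segments, axis=0)
X = X + rng.normal(size=X.shape)
y = np.repeat(np.arange(n_subjects), n_segments)

# Enrollment: train a forest of 100 decision trees on the registered features.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Identification: classify the feature vector of a new segment from subject 3.
probe = (centers[3] + rng.normal(size=n_features)).reshape(1, -1)
probabilities = forest.predict_proba(probe)[0]
predicted_subject = int(forest.predict(probe)[0])
```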

3. Results and Discussion

In this section, performance evaluation results of the proposed approaches are presented. We also compare the performance of the proposed approaches with state-of-the-art PTB-based identification systems. The results are obtained using the PTB dataset, which includes 290 subjects. Six segment lengths (1, 3, 5, 7, 10, or 15 sec) were considered to study the effect of segment length on the identification process. For each subject, the feature vectors are extracted and split into two sets: training and testing. The first set consists of 70% of the feature vectors and is used to train a random forest model; the remaining 30% are used in the testing step. We used three widely used metrics to evaluate the performance of the proposed approach: accuracy, sensitivity, and specificity [69], denoted by Avg. Acc., Avg. Sen., and Avg. Spe., respectively.
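The evaluation protocol can be sketched as a per-subject 70/30 split followed by metrics derived from the confusion matrix. The exact averaging used for Avg. Acc., Avg. Sen., and Avg. Spe. is defined in [69] and is not reproduced in the text, so the one-vs-rest sensitivity and specificity averaged over subjects, and the overall accuracy used here, are assumptions. All function names are hypothetical.

```python
import numpy as np

def split_per_subject(labels, train_frac=0.7, seed=0):
    """Randomly split each subject's segment indices into train and test parts."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for subject in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == subject))
        n_train = int(round(train_frac * idx.size))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

def summary_metrics(cm):
    """Overall accuracy plus subject-averaged sensitivity and specificity.

    cm[i, j] counts test segments of subject i that were assigned to subject j.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = total - tp - fn - fp
    return {
        "accuracy": tp.sum() / total,
        "avg_sensitivity": float(np.mean(tp / (tp + fn))),
        "avg_specificity": float(np.mean(tn / (tn + fp))),
    }

labels = np.repeat(np.arange(3), 10)  # 3 subjects, 10 segments each
train, test = split_per_subject(labels)

# Toy confusion matrix: subjects 0 and 2 each lose one segment to subject 1.
metrics = summary_metrics([[4, 1, 0], [0, 5, 0], [0, 1, 4]])
```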

Table 1 presents the identification performance of the direct feature extraction approach using different segment lengths, averaged over all 290 subjects. For each segment length, 15 models were created, one model for each ECG lead. From Table 1, we observe that lead I achieved its best accuracy of 92.59% using a 15-second segment length. Lead II and lead III achieved their best accuracies of 87.2% and 87.2% using a 7-second segment length. Augmented limb lead aVL achieved its best accuracy of 90.52% using a 7-second segment length. Chest leads V1 to V6 achieved an average accuracy of more than 90% when the segment length is greater than 3 seconds. Lead V3 achieved the best accuracy of 96.26% using a 7-second segment length. Frank’s leads Vx and Vz achieved an accuracy of more than 92% using a segment length greater than one second. Figure 8 presents the average accuracy of the direct feature extraction approach using different segment lengths. It is worth noting that the training phase using the 7-second segment length took 24.7 sec on a machine equipped with a 3.3 GHz Intel Core i7 processor, while the identification process of 290 subjects took 3.2 sec on the same machine.

Table 2 presents the identification performance of the band-based feature extraction approach. All limb leads achieved an accuracy rate of more than 97.44% using a 1-second segment length and an accuracy rate greater than 99% using the 3- to 7-second segment lengths. The augmented limb leads achieved an accuracy rate [...] using a 1-second segment length and an accuracy rate [...] using a segment length greater than three seconds. Among the augmented limb leads, lead aVR achieved the best accuracy of 99.77% using a 10-second segment length. The chest leads achieved an accuracy rate [...] using a 1-second segment length. Lead V1 achieved its best accuracy rate of 99.76% using a 7-second segment length. Leads V2 to V6 achieved their best accuracy rate of 99.61% using the 5- and 7-second segment lengths. Frank’s leads achieved an accuracy rate [...] using a segment length greater than 1 second. Lead Vx achieved an accuracy of 99.76% using a 7-second segment length. Figure 9 presents the average accuracy of the band-based feature extraction approach using different segment lengths. The training phase in this approach using the 7-second segment length took 88.1 sec on a machine equipped with a 3.3 GHz Intel Core i7 processor, while the identification process of 290 subjects took 4.5 sec on the same machine.

Figure 10 shows the confusion matrix of 290 subjects using the band-based approach with a limb lead I signal of length 7 sec. We plot the confusion matrix in the form of a 3D figure to make it easier to visualize where the confusion and correct identification appear. Specifically, we observe from Figure 10 that using the band-based approach with limb lead I, all the subjects achieved 100% sensitivity except four subjects: S109, S141, S184, and S262. Twenty percent of the testing segments of subjects S109, S141, and S262 were misclassified as subjects S35, S97, and S219, respectively, while twenty percent of the testing segments of subject S184 were misclassified as subject S103 and a further twenty percent of the testing segments of the same subject (S184) were misclassified as S119.

The results of Figure 10 can be confirmed by investigating the separability of subjects using the t-SNE algorithm. Figure 11 shows the results of the t-SNE algorithm when it is applied to the dataset of the following ten subjects with 16 segments each: S35, S50, S97, S100, S109, S119, S141, S184, S219, and S262. The t-SNE algorithm visualizes the 77-dimensional feature space of the band-based approach using a two-dimensional (2D) space. The algorithm therefore represents the feature vector of each segment by a single point in a 2D space.

Figure 11 shows the clusters of subjects’ segments. Note that the clusters of subjects S50 and S100 are well separated from other subjects’ clusters. However, the cluster of S35 has overlap with S109. Similar observations can be seen for the subjects S97, S109, S119, S141, S184, S219, and S262, which explains the misclassification revealed previously by the confusion matrix.
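A projection of this kind can be sketched with scikit-learn's `TSNE`. The features below are synthetic stand-ins with the same shape as the data described above (ten subjects, 16 segments each, 77 dimensions); the library choice, the perplexity value, and the PCA initialization are assumptions, since the text does not report the t-SNE settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_subjects, n_segments, n_features = 10, 16, 77

# Synthetic stand-in for band-based feature vectors, one cluster per subject.
centers = rng.normal(scale=5.0, size=(n_subjects, n_features))
features = np.repeat(centers, n_segments, axis=0)
features = features + rng.normal(size=features.shape)
labels = np.repeat(np.arange(n_subjects), n_segments)

# Map each 77-dimensional feature vector to one point in a 2D space.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne.fit_transform(features)
```

Each row of `embedding` can then be drawn as a scatter point colored by its entry in `labels`; overlapping clusters in such a plot indicate subjects whose feature vectors are hard to separate.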

Table 3 shows the performance of the proposed approaches in comparison to the results of state-of-the-art subject identification methods that are available in the literature and utilize the PTB dataset. In the table, we list the reference, year of publication, number of subjects considered for identification, the segment length (if available), the sensitivity, and the method of identification used. Referring to Table 3, it is worth noting that the proposed approaches have been evaluated using 290 subjects, which is the largest number considered in the literature to date. Further, the band-based approach, which is evaluated using such a large number of subjects and utilizes simple statistical features, has demonstrated performance greater than 99%, which makes it very attractive for practical applications. Note that the method of Wang et al. [76] is the closest in performance to our proposed method but considered only 100 subjects for identification. Further, it adopts sparse coding, which requires a norm-based optimization that is an NP-hard problem.

4. Conclusion

This paper presents an ECG-based identification system that relies on statistical features and a random forest classifier. Two feature extraction approaches are investigated: direct and band-based. In the direct approach, the ECG signal is segmented and eleven statistical features are extracted from each segment to form the feature vector. In the band-based approach, the ECG signal is decomposed into seven bands, and the feature vector is formed by concatenating the statistical features extracted from each band’s segment. Six segment lengths are examined: 1, 3, 5, 7, 10, and 15 sec. The data is split into training and testing datasets. The feature vectors of the former are used to train the random forest classifier; during the identification stage, the trained classifier is tasked with identifying the subject using the testing data. The proposed method was evaluated using 290 reference subjects in the PTB database. Using the band-based feature extraction approach, the identification system achieved an accuracy rate of 99.61% utilizing a single limb lead (I), while a single chest lead (V1), augmented limb lead (aVF), and Frank’s lead (Vx) achieved accuracy rates of 99.37%, 99.76%, and 99.76%, respectively. It is known that variance in physical, mental, or emotional stimulation levels affects heart rate. However, the ECG signals in the PTB dataset are recorded under the same conditions. Therefore, evaluating the proposed identification system under the effect of these stimulations will be the topic of our future work.

Data Availability

The data used to support the findings of this study are available on PhysioNet.org [29].

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by King Saud University through the Researchers Supporting Project number RSP-2019/46.