Abstract

A new approach is proposed to improve traditional background subtraction (BGS) techniques by integrating a gradient-based edge detector, the second derivative in gradient direction (SDGD) filter, with the BGS output. The four fundamental BGS techniques, namely, frame difference (FD), approximate median (AM), running average (RA), and running Gaussian average (RGA), produced imperfect foreground pixels, particularly at object boundaries. There, the pixel intensity was lower than the preset threshold value, and the resulting blob was smaller than the true object. The SDGD filter was applied after each basic BGS technique to enhance edge detection and to fill in the missing pixels. The results showed that fusing the SDGD filter with each elementary BGS technique increased segmentation performance and suited post-recording video applications. The analysis using F-score and average accuracy percentage supports the conclusion that this new hybrid BGS technique improves upon existing techniques.

1. Introduction

Object extraction is a technique used to suppress the background of a video scene in order to detect subjects that appear in the frame. The technique involves comparing the current frame with, or subtracting it from, the background frame and treating the remaining pixels as foreground [1]. Prior research on background subtraction (BGS) used several parametric BGS techniques, such as running average [2–4], running Gaussian average [5–7], approximate median filter [7, 8], and Gaussian Mixture Model [9–11]. These parametric techniques determine the foreground and update the subsequent background based on the distribution of intensity values [12]. Aside from these techniques, other studies have introduced nonparametric models that detect foreground and background based on the statistical properties of the intensity [13]. Other nonparametric models include a kernel density estimator [14] and mean shift estimation [15].

This work focuses on basic BGS techniques, namely, frame differencing (FD), approximate median (AM), running average (RA), and running Gaussian average (RGA). The motivation of this work lies in the fact that most edge pixels remain undetected after performing object extraction based on FD, AM, RA, and RGA. In this study, we address this limitation by detecting the edge pixels, so that a more complete blob can be retrieved through morphological procedures.

This is done by applying an SDGD filter on the results of background suppression and combining the foreground pixels generated by BGS techniques with the detected edge as our extracted object. The edge pixels are expected to fill in the boundary gap, which creates better connections among pixels in the boundary. This leads to better foreground frame detection.

This paper is organized as follows. Section 2 presents an overview of several basic BGS techniques and the SDGD filter. Section 3 describes the methodology. Section 4 discusses the results. Finally, Section 5 concludes our paper.

2. Literature Review

This section provides a review of the literature on the four BGS techniques evaluated in this study, namely, frame differencing, approximate median, running average, and running Gaussian average. SDGD filter studies are also presented.

2.1. Frame Differencing

Frame differencing (FD) is the most fundamental technique in BGS. FD involves finding the absolute difference between the current frame and a previous or background frame [1]. The absolute difference is then compared with an appropriate threshold value, $T$, to detect the object as shown in (1), where $I_t(x,y)$ is the current frame intensity value, $B(x,y)$ is the background intensity value, and $F_t(x,y)$ is the foreground intensity value. This technique uses the same background frame for all video sequences:

$$F_t(x,y) = \begin{cases} 1, & |I_t(x,y) - B(x,y)| > T \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
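The thresholding step described above can be sketched in Python with NumPy. This is a minimal illustration only: the function name `frame_difference`, the toy 4×4 frames, and the threshold of 30 are assumptions made for the example and are not values from this study.

```python
import numpy as np

def frame_difference(frame, background, threshold):
    """Return a binary mask that is 1 where |frame - background| > threshold."""
    # Cast to a signed type so subtracting uint8 images cannot wrap around.
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return (diff > threshold).astype(np.uint8)

# Toy data (hypothetical): a flat background and a bright 2x2 object.
background = np.full((4, 4), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 180
mask = frame_difference(frame, background, threshold=30)
```

Because the background is fixed, the same `background` array would be reused for every frame of the sequence.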

2.2. Approximate Median Filter

The approximate median (AM) algorithm is adaptive, dynamic, nonprobabilistic, and intuitive [8]. AM calculates the difference between the current video frame and the background frame and uses this difference to decide how the background is updated. AM is considered one of the most widely accepted methods because it provides accurate pixel identification.

Several studies have evaluated the efficacy of the AM algorithm. He et al. [7] tested the effectiveness of the AM algorithm as part of their optimized algorithm for vehicle detection in an embedded system. Their approach yields highly accurate information with less computational time when detecting and tracking vehicles in a traffic scene. Equation (2) presents how AM updates the reference frame of every video sequence. The succeeding background frame, $B_{t+1}$, is dependent on the intensity value of both the present frame, $I_t$, and background frame, $B_t$:

$$B_{t+1}(x,y) = \begin{cases} B_t(x,y) + 1, & I_t(x,y) > B_t(x,y) \\ B_t(x,y) - 1, & I_t(x,y) < B_t(x,y) \\ B_t(x,y), & \text{otherwise} \end{cases} \tag{2}$$
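A minimal sketch of the commonly used approximate median update is given below, assuming 8-bit greyscale frames and a step of one intensity level per frame. The function name `approximate_median_update` is hypothetical.

```python
import numpy as np

def approximate_median_update(background, frame):
    """Move each background pixel one intensity level towards the current frame."""
    b = background.astype(np.int32)
    f = frame.astype(np.int32)
    # np.sign is +1 where the frame is brighter, -1 where darker, 0 where equal.
    updated = b + np.sign(f - b)
    return np.clip(updated, 0, 255).astype(np.uint8)

background = np.array([[100, 100, 100]], dtype=np.uint8)
frame = np.array([[120, 80, 100]], dtype=np.uint8)
new_background = approximate_median_update(background, frame)
```

Repeating this update over many frames makes the background converge towards a value that roughly half of the observed intensities exceed, which is why it approximates the median.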

2.3. Running Average

Running average (RA) is another technique for updating a background image. A pixel is classified as background when the pixel value belongs to the corresponding distribution of the background model; otherwise, the mean of the distribution is updated [4]. The updated image is then used in the changing scene. The computational cost of an RA background is low because only the weighted sum of two images is computed, which yields low computational and space complexities [3]. Moreover, several researchers have utilized this method to detect moving objects in video captured by a static camera.

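Several studies were conducted to enhance the efficiency of BGS based on the RA method [2–4]. Park et al. [4] showed that applying a hierarchical data structure significantly increased the processing speed while maintaining accurate motion detection. This outcome can be attributed to the updating of the background frame by the RA method. Equation (3) shows the background update, which weights the current frame $I_t$ against the previous background frame $B_t$ using a specified learning rate, where $\alpha$ is the learning rate, $T$ is the threshold value, and $F_t$ is the foreground mask:

$$B_{t+1}(x,y) = \alpha I_t(x,y) + (1 - \alpha) B_t(x,y), \qquad F_t(x,y) = \begin{cases} 1, & |I_t(x,y) - B_t(x,y)| > T \\ 0, & \text{otherwise} \end{cases} \tag{3}$$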
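The weighted-sum update can be sketched as follows. This is an illustrative version that updates every pixel on every frame; whether the study restricts the update to selected pixels is not specified here. The function name `running_average_step` and the default learning rate and threshold are assumptions.

```python
import numpy as np

def running_average_step(background, frame, alpha=0.05, threshold=30.0):
    """Classify foreground against the current background, then blend in the frame."""
    background = background.astype(np.float64)
    frame = frame.astype(np.float64)
    # Foreground: pixels that differ from the background by more than the threshold.
    foreground = (np.abs(frame - background) > threshold).astype(np.uint8)
    # Background update: weighted sum of the current frame and the old background.
    new_background = alpha * frame + (1.0 - alpha) * background
    return foreground, new_background

bg = np.full((2, 2), 100.0)
fr = np.array([[200.0, 100.0], [100.0, 110.0]])
fg, bg_next = running_average_step(bg, fr, alpha=0.1, threshold=30.0)
```

Only one background image is stored and each update is a single weighted sum, which is consistent with the low computational and space complexity noted above.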

2.4. Running Gaussian Average

This method combines both the Gaussian function and RA. Overall, the running Gaussian average (RGA) method has a significant advantage over other approaches because it requires less processing time and utilizes less memory compared with methods such as mixture of Gaussians (MoG) and kernel density estimation (KDE) [9]. Equation (4) shows how the reference frame, which is represented by the mean $\mu_t$, is updated in each video sequence using this method, where $\alpha$ is the learning rate and $I_t$ is the current frame intensity value. Unlike AM and RA, this method uses a multiple of the standard deviation, $k\sigma_t$, as a threshold value:

$$\mu_{t+1}(x,y) = \alpha I_t(x,y) + (1 - \alpha)\mu_t(x,y) \tag{4}$$
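A hedged sketch of a running Gaussian average step follows. It assumes the widely used per-pixel formulation in which a mean and a variance are updated recursively and a pixel is foreground when it lies more than `k` standard deviations from the mean. The function name, the variance update, and the values of `alpha` and `k` are assumptions for illustration, not details confirmed by this study.

```python
import numpy as np

def running_gaussian_step(mean, var, frame, alpha=0.05, k=2.5):
    """One per-pixel update of a single-Gaussian background model."""
    frame = frame.astype(np.float64)
    diff = frame - mean
    # Foreground test: the threshold scales with the per-pixel standard deviation.
    foreground = (np.abs(diff) > k * np.sqrt(var)).astype(np.uint8)
    # Recursive updates of the mean and variance (running averages).
    new_mean = alpha * frame + (1.0 - alpha) * mean
    new_var = alpha * diff ** 2 + (1.0 - alpha) * var
    return foreground, new_mean, new_var

mean0 = np.full((1, 2), 100.0)
var0 = np.full((1, 2), 16.0)          # standard deviation of 4
frame0 = np.array([[105.0, 130.0]])
fg, mean1, var1 = running_gaussian_step(mean0, var0, frame0, alpha=0.1, k=2.5)
```

Because the threshold adapts to each pixel's own variance, noisy regions tolerate larger deviations than stable regions before being flagged as foreground.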

2.5. Second Derivative in a Gradient Direction (SDGD) Filter

In image processing studies, researchers use first- and second-order derivatives to detect the edge of an object based on its gradient. With the first derivative, the edge location is defined at the position of maximum slope, that is, the steepest ascent or descent [16]. Traditional edge detection methods, such as those by Prewitt, Sobel, and Roberts, convolve the image with a specific kernel [16, 17]. However, these techniques were reported to be sensitive to noise and inaccurate [17]. In 1986, the Canny edge detector was introduced, which represented an improvement over the traditional methods [17, 18]. The detector applies Gaussian smoothing to reduce noise, unwanted details, and textures, followed by nonmaximum suppression and hysteresis thresholding to find the edges [19].

The second-order derivative approach defines the edge pixels based on changes in brightness or zero crossing in the image area [19, 20]. SDGD is a nonlinear operator that can be expressed in the first and second derivatives. Additionally, similar to Canny, SDGD is combined with a Gaussian low pass filter for smoothing purposes [21]. Moreover, a Laplace operator is used to simplify the SDGD operation [22].

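The Laplacian is defined as

$$\nabla^2 a = \frac{\partial^2 a}{\partial x^2} + \frac{\partial^2 a}{\partial y^2} = (h_{2x} \otimes a) + (h_{2y} \otimes a), \tag{5}$$

where $h_{2x}$ and $h_{2y}$ are the second derivative filters.

The basic versions of the second derivative filters are given by

$$h_{2x} = h_{2y}^{T} = \begin{bmatrix} 1 & -2 & 1 \end{bmatrix}. \tag{6}$$

Associating (5) with the Gaussian filter yields

$$\nabla^2 (g \otimes a) = (h_{2x} \otimes g \otimes a) + (h_{2y} \otimes g \otimes a), \tag{7}$$

where $g$ is the Gaussian low pass filter.

Five partial derivatives are used in the SDGD filter, as follows:

$$A_{xx} = \frac{\partial^2 a}{\partial x^2}, \quad A_{xy} = \frac{\partial^2 a}{\partial x\, \partial y}, \quad A_{yy} = \frac{\partial^2 a}{\partial y^2}, \quad A_x = \frac{\partial a}{\partial x}, \quad A_y = \frac{\partial a}{\partial y}. \tag{8}$$

Therefore,

$$\mathrm{SDGD}(a) = \frac{A_{xx} A_x^2 + 2 A_{xy} A_x A_y + A_{yy} A_y^2}{A_x^2 + A_y^2}. \tag{9}$$

A detailed explanation of SDGD can be found in [20, 21, 23].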
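The SDGD operator built from the five partial derivatives can be sketched with NumPy alone. This is a simplified illustration under stated assumptions: the image is first smoothed with a separable Gaussian (edge-replicated borders), and the derivatives are then approximated with central finite differences via `np.gradient` rather than with Gaussian derivative kernels. The function names, `sigma`, and the small `eps` guard against division by zero are all choices made for the example.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D normalised Gaussian kernel truncated at about three sigma."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(image, sigma):
    """Separable Gaussian low pass filter with edge-replicated borders."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(image.astype(np.float64), pad, mode="edge")
    # Filter along rows, then along columns; 'valid' removes the padding again.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def sdgd(image, sigma=1.0, eps=1e-12):
    """Second derivative in the gradient direction of a smoothed image."""
    a = smooth(image, sigma)
    ay, ax = np.gradient(a)      # first derivatives along rows (y) and columns (x)
    ayy, _ = np.gradient(ay)     # second derivative along y
    axy, axx = np.gradient(ax)   # mixed derivative and second derivative along x
    num = axx * ax ** 2 + 2.0 * axy * ax * ay + ayy * ay ** 2
    den = ax ** 2 + ay ** 2
    return num / (den + eps)     # eps avoids 0/0 in flat regions

# Toy check (hypothetical): for a(x, y) = x^2 the gradient points along x,
# so the second derivative in the gradient direction is d2a/dx2.
x = np.arange(21, dtype=np.float64)
ramp = np.tile(x ** 2, (21, 1))
response = sdgd(ramp, sigma=1.0)
```

Edge locations correspond to zero crossings of this response; how those crossings are converted into a binary edge map is a separate design decision not covered by this sketch.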

In [19, 24], SDGD was presented as a filter used in finding edges and measuring objects. Several studies have utilized the SDGD filter. For example, Aarnick et al. [25] used this filter in analyzing ultrasound images of male kidneys and prostate in their study on preprocessing algorithms for edge detection at multiple resolution scales. Their study reported that detecting the contour of objects in grey medical images could be improved by applying an adaptive filter size in SDGD. Nader El-Glaly [24] used SDGD as part of her work on a digital inpainting algorithm. Hagara and Moravcik [23] introduced the PLUS operator, a combination of the SDGD filter and the Laplace operator for edge detection. PLUS and SDGD gave similar results for kernel sizes of nine or lower, whereas PLUS yielded better results when the kernel size was greater than nine and was suitable for locating the edges of small objects. Similarly, Verbeek and Van Vliet [22] compared Laplace, SDGD, and PLUS on 2D and 3D images. Their research confirmed the findings in [23].

The idea of combining two methods in one algorithm was inspired by a study by Zheng and Fan [3], in which RA and temporal difference were combined to detect the moving object. Another example of hybrid research in BGS was conducted by Lu and Wang [26], who combined optical flow and double background filtering to detect the moving object. Zaki et al. [27] combined frame differencing with a scale invariant feature detector to detect the moving object in various environments.

Based on the study of Persoon et al. [19], the SDGD filter gave better surface localization, especially in highly curved areas, compared with the Canny edge detection technique. Thus, we adopted this filter in our present work. In addition, Persoon et al. showed that SDGD guaranteed minimal detail smoothing that led to better visualization of polyps in computed tomography (CT) scan data. This finding is aligned with our results reported in [28]. Further, the study by Nader El-Glaly [24] used the SDGD filter in developing an enhanced partial-differential equation-based digital inpainting algorithm to find the missing data in digital images.

To the best of our knowledge, this study is new because no prior work integrates the SDGD filter with BGS techniques. Although Al-Garni and Abdennour used edge detection and the FD technique to find moving vehicles, no information was provided on the edge detection technique they utilized [29]. We used an SDGD filter to enhance the performance of existing background subtraction techniques by combining foreground pixels generated by BGS techniques with the detected edge as our extracted object. The edge pixels are expected to fill in the boundary gap, which will create better connections among pixels in the boundary. This research extends our previous work published in [28] by using more data from a variety of sources and by providing a more detailed analysis.

3. Methodology

This section discusses the database used and the proposed method.

3.1. Dataset

This study utilized datasets that were acquired from selected prerecorded video collections of several online databases.

(a) Smart Engineering System Research Group (SESRG) UKM Collections. This video collection consists of various human actions and activities recorded by students involved in SESRG studies on smart surveillance systems. Besides humans, this database also has a collection of moving cars that are used as nonhuman samples for classification, which is explained further in Section 3.4.

(b) CMU Graphic Lab Motion Capture (MoCap) Database [30]. This database, which is owned by Carnegie Mellon University, contains 2,506 trials in 6 categories and 23 subcategories. The videos were recorded in an indoor environment.

(c) CMU Motion of Body (MoBo) Database [31]. This database, which is also owned by Carnegie Mellon University, consists of videos showing six different angles of a subject who is walking on a treadmill.

(d) Multicamera Human Action Video Data (MuHAVi) [32]. This database is owned by the Digital Imaging Research Centre at Kingston University. It presents 17 action classes with 8 camera views.

(e) Human Motion Database (HMDB51) [33]. This database consists of collections of edited videos from digitized movies and YouTube. The collection contains 51 action categories with 7,000 manually annotated clips.

The initial background for videos taken from databases (a), (b), (d), and (e) was modeled using the median value from a selected frame interval. We used five frames from the video sequences, spaced 10 frames apart. Next, the median value of these selected frames was used as the reference image. Details of this technique are presented in [34]. Meanwhile, database (c) provided the background reference frame. All datasets except the MuHAVi videos were manually segmented to obtain the ground truth.
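The median-based background initialization can be sketched as below. The starting index of 0 is a hypothetical choice, since only the number of frames (five) and their spacing (10) are stated; the function name `median_background` is also an assumption.

```python
import numpy as np

def median_background(frames, start=0, step=10, count=5):
    """Per-pixel median of `count` frames sampled every `step` frames."""
    indices = [start + i * step for i in range(count)]
    stack = np.stack([np.asarray(frames[i], dtype=np.float64) for i in indices], axis=0)
    return np.median(stack, axis=0)

# Toy sequence (hypothetical): a constant scene with an object in one sampled frame.
frames = [np.full((2, 2), 50.0) for _ in range(41)]
frames[20][0, 0] = 255.0
reference = median_background(frames)
```

Because a moving subject occupies any given pixel in only a minority of the sampled frames, the per-pixel median tends to recover the empty scene.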

3.2. The Proposed Algorithm

The methodology of our proposed technique is described in the flowchart shown in Figure 1.

First, the dataset was tested using the following basic parametric BGS techniques: FD, RA, AM, and RGA. Next, the SDGD filter was fused with each technique by combining the output of a background technique with the SDGD filter output. The SDGD filter was selected as a segmentation tool because it produced better results compared to other edge detection techniques (Sobel, Canny, and Roberts) [19, 28].
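The fusion step can be illustrated with a short sketch. The exact combination rule is not spelled out in the text, so a pixel-wise logical OR of the BGS foreground mask and a binary SDGD edge mask is assumed here; the function name `fuse_masks` and the toy masks are hypothetical.

```python
import numpy as np

def fuse_masks(bgs_mask, edge_mask):
    """Mark a pixel as foreground if either the BGS mask or the edge mask marks it."""
    fused = np.logical_or(np.asarray(bgs_mask) > 0, np.asarray(edge_mask) > 0)
    return fused.astype(np.uint8)

# Toy masks (hypothetical): BGS misses the boundary pixels that the edge map finds.
bgs = np.array([[0, 0, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 0, 0]], dtype=np.uint8)
edges = np.array([[0, 0, 0, 0],
                  [1, 0, 0, 1],
                  [0, 0, 0, 0]], dtype=np.uint8)
fused = fuse_masks(bgs, edges)
```

Under this assumption, edge pixels can only add foreground, which matches the stated goal of filling boundary gaps but also means any spurious edge response is carried into the result.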

3.3. Evaluation Method

To evaluate the performance of each BGS technique, we calculated the average recall, precision, F-score, and accuracy. In the following equations, TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative pixels, respectively.

(a) Recall (Rcl) refers to the detection rate, which is calculated by comparing the total number of detected true positive pixels with the total number of positive pixels in the ground truth frame [35]. This is also known as sensitivity. The following equation shows how recall is calculated:

$$\mathrm{Rcl} = \frac{TP}{TP + FN} \tag{10}$$

(b) Precision (Prcsn) is the ratio between the detected true positive pixels and the total number of positive pixels detected by the method [35, 36]. This is also known as positive predictive value:

$$\mathrm{Prcsn} = \frac{TP}{TP + FP} \tag{11}$$

(c) F-measure or balanced F-score is the harmonic mean of recall and precision, with both weighted equally. It is used as a single measurement for comparing different methods [35, 36]:

$$F = \frac{2 \cdot \mathrm{Rcl} \cdot \mathrm{Prcsn}}{\mathrm{Rcl} + \mathrm{Prcsn}} \tag{12}$$

(d) Accuracy is the percentage of correct data retrieval. It is calculated by dividing the number of true positive pixels plus the number of true negative pixels by the total number of pixels in the frame. The following equation displays the calculation of accuracy [36]:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{13}$$
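The four measures can be computed from a predicted mask and a ground truth mask as in the sketch below. The function names and the convention of returning 0.0 when a denominator is zero are assumptions for illustration.

```python
import numpy as np

def confusion_counts(predicted, truth):
    """Count TP, TN, FP, FN pixels for two binary masks of equal shape."""
    p = np.asarray(predicted).astype(bool)
    t = np.asarray(truth).astype(bool)
    tp = int(np.sum(p & t))
    tn = int(np.sum(~p & ~t))
    fp = int(np.sum(p & ~t))
    fn = int(np.sum(~p & t))
    return tp, tn, fp, fn

def segmentation_scores(tp, tn, fp, fn):
    """Recall, precision, balanced F-score, and accuracy (as a percentage)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = recall + precision
    f_score = 2.0 * recall * precision / denom if denom else 0.0
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return {"recall": recall, "precision": precision,
            "f_score": f_score, "accuracy": accuracy}

predicted = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
counts = confusion_counts(predicted, truth)
scores = segmentation_scores(*counts)
```

For a video, these per-frame values would be averaged over all frames to obtain the figures compared between techniques.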

This study utilized videos with multiple frames. Hence, we present a comparison of the average F-score and average accuracy percentage as the overall performance benchmark for each BGS technique with and without the SDGD filter. The percentage of improvement between the two was then calculated for each technique.
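One common way to express a percentage of improvement is the relative change with respect to the baseline. The sketch below assumes that definition; the study may define improvement differently (for example, as an absolute difference), and the function names and sample values are hypothetical.

```python
def average(values):
    """Arithmetic mean of per-frame scores."""
    return sum(values) / len(values)

def percent_improvement(without_sdgd, with_sdgd):
    """Relative improvement of the SDGD-fused result over the baseline, in percent.

    Assumes a relative-change definition; this is an illustrative choice.
    """
    return (with_sdgd - without_sdgd) / without_sdgd * 100.0

# Hypothetical per-frame F-scores for one technique, without and with the filter.
baseline = average([0.78, 0.80, 0.82])
fused = average([0.82, 0.84, 0.86])
gain = percent_improvement(baseline, fused)
```

Averaging per-frame scores first and then computing a single improvement figure keeps one value per technique and video, which is how the comparison is summarized in the results.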

3.4. Classification

Using an artificial neural network (ANN) as the classifier, the segmented images from the proposed technique were subjected to classification testing. The training input of the ANN was extracted from 1,500 randomly chosen segmented frames/images. Of these, 750 human blob images represented human samples, and another 750 car blob images represented non-human samples.

Scaled conjugate gradient and back propagation rules were chosen to train the classifier. The ANN was designed with one hidden layer containing ten hidden neurons and an output layer containing two output neurons. Both layers used the sigmoid as the activation function and the mean squared error as the performance function. Images were classified as either human or non-human. Next, another set of 1,000 frames/images was chosen for testing. We applied leave-one-out cross validation in our study. Ten experiments were run, and the average classification rate was used to evaluate the recognition performance for the human and non-human classes.

The evaluation was based on the segmented frames/images generated from the enhanced FD technique, with the SDGD filter added to the algorithm. In this experiment, we used only FD rather than the other BGS techniques because it has the fastest processing time. We also computed statistical measures such as recall, precision, and F-score on the obtained classification results, treating human as the positive class and non-human as the negative class.

4. Results and Discussion

This section discusses the robustness of the proposed technique based on videos taken from five different databases. To confirm the robustness of the proposed algorithm, we tested using video that presented a different environment or camera angle. Specifically, the video sequences were recorded in multiple environments (indoors and outdoors) and multiple views.

Figures 2(a)–2(f) show some of the background frames used in this research. These background frames were generated by using the median value of selected frames in the video sequences, except for the videos obtained from the MoBo database. Rather than stating the filename, we assigned letters, from A to E, to identify the videos representing the data: A refers to SESRG, B to MoCap, C to MoBo, D to MuHAVi, and E to HMDB51. The words in brackets indicate whether the video environment was indoor or outdoor.

Next, we present the subjective results of our object extraction. Since this study involves video data with multiple frames, we only depict the results obtained for frame number 10 in data A1 and frame number 29 in data B1 to represent the outdoor and indoor scene samples. Figures 3(a) and 4(a) illustrate the original images of a frame in videos A1 and B1, respectively. Figures 3(b) and 4(b) depict the ground truth images.

Figures 5 and 6 present the extracted subjects in both indoor and outdoor environments using FD, AM, RA, and RGA, with and without the SDGD filter.

The first column of Figures 5 and 6 shows that all the basic BGS techniques were capable of detecting the object in the scene of interest based on the tested videos. However, many pixels were missing, which resulted in a smaller blob compared with the ground truth image. Our proposed technique addresses the problem of missing pixels and reduced blob size because it combines the SDGD filter with FD, AM, RA, and RGA, as shown in column 2 of Figures 5 and 6. The extracted object is slightly enlarged and becomes more complete. Foreground detection showed significant improvement after the proposed technique was applied to all datasets.

Tables 1 and 2 show the number of TP, TN, FP, and FN for the selected indoor and outdoor samples with and without the addition of the SDGD filter in the BGS techniques. Then the values of recall, precision, and F-score were calculated based on mathematical equations (10)–(12).

Based on the findings shown in Tables 1 and 2, the number of TPs increased significantly, which indicates that our technique detects more of the complete blob than the original methods. This finding is also in line with the increase in precision values. Tables 1 and 2 also show that the F-score increased for both samples when the SDGD filter was added.

The graphs in Figures 7 and 8 show the F-score trends in both A1 and B1 for the four BGS techniques, namely, FD, AM, RA, and RGA. The solid lines represent the F-score results using our proposed technique, that is, with the SDGD filter, whereas the dashed lines represent the F-score results without using the SDGD filter. Based on Figures 7 and 8, higher F-score values were noted for the A1 and B1 videos when using the proposed hybrid technique compared with those when using the basic BGS techniques. Thus, our proposed technique improves upon traditional BGS techniques.

To confirm the effectiveness of the proposed technique, we tested the algorithm in six different videos with six different backgrounds, as obtained from the five different databases. Table 3 shows the performance of FD, AM, RA, and RGA with and without the SDGD filter, in terms of the F-score and average accuracy percentage of all six video samples. Columns 5 and 8 show the percentage of improvement for both F-score and average accuracy percentage. Based on Table 3, the use of an SDGD filter improved the average F-score values for all data compared with the values produced by the methods without SDGD. Column 5 of Table 3 shows that the F-score values improved by 1% to 9%. Therefore, the proposed technique, compared with existing techniques, enhances object extraction.

Additionally, Table 3 depicts an increment in average accuracy percentage for each tested technique except for videos A2 and E1. A2 had poor video quality because the footage was taken in a corridor without proper lighting. Because of the poor video quality and bad lighting condition, the SDGD filter was unable to segment the foreground subjects properly and produced an unwanted shadow in the foreground. Nevertheless, our proposed technique is capable of detecting foreground pixels with over 90% accuracy on all videos tested.

Meanwhile, Table 4 presents the results of human and non-human classification based on the segmented frames generated by the proposed technique. As Table 4 shows, the ANN successfully recognized human and non-human images from the frames generated using the improved FD technique. Incorrect classifications were minimal in all ten experiments.

Figure 9 presents a matrix describing the overall results of classification testing. The average recognition rates for the human and non-human categories were 98.78% and 98.72%, respectively. The rate is high for both categories because our algorithm provided good human and non-human blob images, which allowed the ANN to distinguish between the two categories. Further, the overall performance in both classes was 98.75%. The findings confirmed that the proposed algorithm, which combines FD with the SDGD filter, could generate better silhouette images, thereby facilitating recognition of human and non-human images in the segmented images/frames.

From the matrix, the information of TP, TN, FP, and FN can be extracted; hence Table 5 shows results of the statistical analysis done on the classification findings by ANN.

5. Conclusion

We presented a new hybrid approach that incorporates the SDGD filter with four basic BGS techniques, namely, FD, AM, RA, and RGA. This hybrid technique enhanced segmentation performance, as indicated by the F-score values and average accuracy percentages. The technique was tested on six different videos from five different databases, and each video was taken either indoors or outdoors and showed different scenes. An ANN classifier was used to classify human and non-human images appearing in the segmented images generated by our algorithm. As the algorithm was capable of providing good blob images, ANN recognition of human and non-human images in the silhouette images was facilitated.

Although computational time increased, this cost is acceptable considering the enhancement and the characteristics of second-order derivatives. Therefore, the technique is suitable for implementation in non-real-time applications. The proposed hybrid technique can improve upon traditional BGS techniques, as indicated by the improved F-score, accuracy, and ANN recognition values after testing on various data sources and data environments. The technique can also be considered for detecting moving objects in non-real-time applications, such as investigations of human actions or traffic conditions.

Acknowledgments

This work was supported by Universiti Kebangsaan Malaysia (UKM) research Grant DPP-2013-003 and Ministry of Higher Education (MoHE) research Grant LRGS/TD/2011/ICT/04/02.