Abstract

Estimating the number of pedestrians from surveillance videos and images is a critical task when implementing intelligent signal control at intersections. However, the task is difficult because the pedestrian waiting area is an outdoor scene with a complex, time-varying environment. In this study, a method for estimating pedestrian counts based on multisource video data is proposed. First, a partial least squares regression (PLSR) model is developed to estimate the number of pedestrians from single-source video (either visible light video or infrared video). Meanwhile, the temporal feature of the scene (daytime or nighttime) is identified from the visible light video. According to the recognized time period, the pedestrian count detection results from the visible light and infrared video data are combined using preset confidence levels. Empirical experiments showed that this environment-perception-based fusion method enables 24-hour monitoring of outdoor pedestrian waiting areas and substantially improves the accuracy of pedestrian counting.

1. Introduction

Estimating the number of pedestrians is critical in intelligent transportation systems. Pedestrian counts are a vital input for intersection signal control [1], passenger flow guidance, and early warning of large-scale crowd gathering [2, 3]. However, estimating pedestrian counts in outdoor scenarios, such as the pedestrian waiting area, remains an unsolved challenge.

Generally, there are two main approaches to estimating the number of pedestrians. The first relies on reliable tracking of individual pedestrians and counts them by identifying each individual in the image data [4–7]. However, this approach is suitable only when pedestrian density is low. When density is high and pedestrians overlap severely, its performance deteriorates. The second approach extracts features from the image data and applies regression analysis to estimate the pedestrian count, rather than trying to identify each pedestrian in the image. This approach is considered more flexible because there is no need to track each pedestrian in the image.

Surveillance video data, which can be divided into visible light video and infrared video, have frequently been adopted to estimate the number of pedestrians. Infrared video is mostly used to determine whether people are present at a scene or whether a target is a human being [8–11], but it has rarely been used to estimate pedestrian counts. In contrast, considerable effort has been devoted to estimating pedestrian counts from visible light video. For instance, Davies et al. [12] used geometric features such as areas and perimeters to estimate the number of pedestrians in the image. He [13] proposed a two-region learning algorithm, applying improved aggregate channel feature detection and Gaussian process regression to estimate the number of pedestrians. Chan [14] segmented the image, extracted the features of each segmentation region, and then used Gaussian process regression to learn the correspondence between the features and the number of pedestrians in each segment. Zhang [15] applied dimensionality reduction techniques to process high-dimensional image features and performed regression analysis. Li [16] proposed a feature description operator combining the wavelet transform and the gray-level co-occurrence matrix and used an SVM to obtain the pedestrian density model. Yan [17] used the simile classifier to optimize the subimage and then used a regression analysis model to establish the relationship between subimage blocks and the number of pedestrians. However, the abovementioned studies are based on visible light video, which is sensitive to lighting conditions and therefore cannot be used to monitor the pedestrian waiting area throughout the whole day.

In this study, we propose a pedestrian count estimation method that fuses visible light video and infrared video based on environment perception, in order to realize 24-hour pedestrian count detection for the pedestrian waiting area. First, partial least squares regression (PLSR) is employed to obtain the number of pedestrians in the image from the visible light video and the infrared video, respectively. Then, based on the environmental feature obtained from the visible light video, an information fusion model is established to obtain the final number of pedestrians in the image. The schematic diagram is shown in Figure 1.

The remainder of this paper is organized as follows. Section 2 describes the image processing and feature extraction. Section 3 describes the pedestrian count estimation model and how the visible light detection result is fused with the infrared detection result. Section 4 reports and analyzes the experimental results, and Section 5 summarizes the work and discusses future directions.

2. Image Processing

This section introduces the visible light and infrared image processing procedures in turn.

2.1. Visible Light Image Processing

The most important task in image processing is to extract the motion foreground from the image. For visible light images, the background difference method is adopted to obtain the motion foreground.

Because the background gradually changes over time in a real scene, the background image needs to be updated in real time. A Kalman filter is used to update the background here. Specifically, the background image at time $t$ is determined by the background image at time $t-1$ and the real-time image at time $t$, through a prediction step and an update step. The forecast formula is as follows:

$$\hat{B}_{t|t-1} = \hat{B}_{t-1}, \qquad P_{t|t-1} = P_{t-1} + Q,$$

where $\hat{B}_{t-1}$ is the background optimal value at time $t-1$, $\hat{B}_{t|t-1}$ is the background prediction value at time $t$, $P_{t-1}$ is the covariance at time $t-1$, $P_{t|t-1}$ is the prediction of covariance at time $t$, and $Q$ is the systematic process error.

The background update formula for time $t$ is as follows:

$$K_t = \frac{P_{t|t-1}}{P_{t|t-1} + R}, \qquad \hat{B}_t = \hat{B}_{t|t-1} + K_t\,(I_t - \hat{B}_{t|t-1}), \qquad P_t = (1 - K_t)\,P_{t|t-1},$$

where $R$ is the system measurement error, $K_t$ is the system gain, $I_t$ is the gray image acquired by the visible light camera at time $t$, $\hat{B}_t$ is the background optimal value at time $t$, and $P_t$ is the covariance at time $t$.
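
The prediction and update steps can be illustrated with a minimal per-pixel sketch. It assumes the standard scalar Kalman form with a static background model; the function name and the default values of `q` and `r` are hypothetical and are not taken from the experiments.

```python
def kalman_background_step(bg_prev, p_prev, frame, q=1.0, r=25.0):
    """One predict/update step per pixel. Images are lists of rows of floats.

    q (process error) and r (measurement error) are illustrative values only.
    """
    bg_new, p_new = [], []
    for bg_row, p_row, f_row in zip(bg_prev, p_prev, frame):
        bg_out, p_out = [], []
        for b, p, z in zip(bg_row, p_row, f_row):
            b_pred = b                      # prediction: background assumed static
            p_pred = p + q                  # predicted covariance grows by q
            k = p_pred / (p_pred + r)       # gain: trust in the new frame
            bg_out.append(b_pred + k * (z - b_pred))  # move toward the frame
            p_out.append((1.0 - k) * p_pred)          # updated covariance
        bg_new.append(bg_out)
        p_new.append(p_out)
    return bg_new, p_new
```

A larger `r` relative to `q` makes the background adapt more slowly, which keeps briefly standing pedestrians from being absorbed into the background.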

The visible light image is differentiated from the corresponding background image . The background difference result is

Then, a binary region-of-interest (ROI) mask proposed by Chan [18] was applied to , which not only reduces the amount of subsequent calculations, but also prevents some interference in noninterest areas. After applying the ROI mask, the binary foreground of visible light images is calculated bywhere is the threshold used for binary processing. In our experiments, we set .
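
As a rough illustration of the differencing, masking, and thresholding steps, the sketch below assumes an absolute difference and a strict "greater than" comparison. The function name and the threshold passed by the caller are hypothetical.

```python
def binary_foreground(frame, background, roi_mask, threshold):
    """Return a 0/1 foreground image from a gray frame and its background.

    A pixel is foreground when it lies inside the ROI mask and its absolute
    difference from the background exceeds the threshold (assumed rule).
    """
    foreground = []
    for f_row, b_row, m_row in zip(frame, background, roi_mask):
        foreground.append([
            1 if m and abs(f - b) > threshold else 0
            for f, b, m in zip(f_row, b_row, m_row)
        ])
    return foreground
```

Applying the mask inside the comparison means pixels outside the region of interest are never counted, regardless of how much they differ from the background.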

For the image , the closed operation (dilation followed by erosion operation) is to fill the small holes in the connected domain, connect adjacent objects, and smooth the boundary [19]. Then, it analyzes the connected domain and eliminates the connected domain with smaller area to remove noise [20]. The final result of visible light image processing is the image . Then the set of blobs in iswhere is the -th blob in the image and is the total number of blobs in the image .
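
The closing operation and the small-blob removal can be sketched in plain Python as follows. This is a minimal sketch assuming a 3 × 3 structuring element, 4-connectivity, and a caller-chosen `min_area`; none of these settings are stated in the text, and the function names are hypothetical.

```python
from collections import deque

def _morph(img, want_any):
    """3x3 dilation (want_any=True) or erosion (want_any=False).

    Windows are clipped at the image border, so out-of-image neighbours
    are simply ignored.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = int(any(vals) if want_any else all(vals))
    return out

def close_image(img):
    """Closing: dilation followed by erosion."""
    return _morph(_morph(img, True), False)

def find_blobs(img, min_area):
    """Return 4-connected blobs, as lists of (y, x), with >= min_area pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, blob = deque([(y, x)]), []
            while queue:                      # breadth-first flood fill
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) >= min_area:         # drop small blobs as noise
                blobs.append(blob)
    return blobs
```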

For example, Figures 2(a) and 2(b) are the original and background images, respectively. Figure 2(c) shows the background difference result . Figure 2(d) is the ROI mask. Figure 2(f) is the final result .

2.2. Infrared Image Processing

The infrared video data are imaged by thermal radiation, which is not sensitive to ambient light. Since the pedestrian generally appears as a highlighted area in the infrared image, we extract the foreground of the image by the gray value of the image. First, the projection images of the infrared images on the R, G, and B color channels are analyzed to find the projection image which has the greatest difference between pedestrians and the surrounding environment. Figure 3 illustrates that, in the projection image on the G color channel, the characteristics of the pedestrian are the most prominent and easier to distinguish. This projection image is defined as the grayscale image .

With the application of the ROI mask, the binary foreground of infrared images is calculated bywhere is the threshold used for binary processing. In our experiments, we set .
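
The channel selection and thresholding described above might look like the following sketch. It assumes that a hand-labelled pedestrian mask is available for measuring contrast and that contrast is the absolute difference of mean intensities; both are assumptions, and all function names are hypothetical.

```python
def channel_contrast(rgb_image, pedestrian_mask, channel):
    """Absolute difference of mean intensity inside vs. outside the mask.

    Assumes the mask contains at least one pedestrian and one background pixel.
    """
    inside, outside = [], []
    for row, m_row in zip(rgb_image, pedestrian_mask):
        for px, m in zip(row, m_row):
            (inside if m else outside).append(px[channel])
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))

def best_channel(rgb_image, pedestrian_mask):
    """Index (0=R, 1=G, 2=B) of the channel with the largest contrast."""
    return max(range(3), key=lambda c: channel_contrast(rgb_image, pedestrian_mask, c))

def infrared_foreground(rgb_image, roi_mask, threshold, channel=1):
    """Binary foreground: bright pixels of the chosen channel inside the ROI."""
    return [[1 if m and px[channel] > threshold else 0
             for px, m in zip(row, m_row)]
            for row, m_row in zip(rgb_image, roi_mask)]
```

The default `channel=1` reflects the observation that the G channel separates pedestrians best in this scenario; other cameras or color palettes may favor a different channel.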

For the image , the closed operation and connected domain analysis are also performed to remove the noise and ensure the integrity of the pedestrian. The final result of the infrared image is . Then the set of blobs in iswhere is the -th blob in the image and is the total number of blobs in the image .

For example, Figure 4(a) is the image . Figure 4(b) is the ROI mask and Figure 4(d) is the final result .

2.3. Feature Extraction

Here the visible light image feature extraction procedure was taken as an example, while the feature extraction of infrared images is similar. The contained features of blobs and the inferred number of pedestrians were further extracted. Take the blob as an example to calculate its geometric features and positional features using the following steps:

(1) Area , which is the weighted sum of all pixels in the blob,

(2) Number of edge points , which is the weighted sum of pixels on the boundaries of the blob, where denotes the edge image that is generated by the Sobel edge detector on the image .

(3) Length of the spot , which is the maximum number of pixels in the horizontal direction of the blob,

(4) Height of the spot , which is the maximum number of pixels in the vertical direction of the blob,

(5) Horizontal position , which is the horizontal position of the center pixel of the blob in image (for infrared images it is ),where denotes a horizontal position set of the pixels of the spot in the image .

(6) Vertical position , which is the vertical position of the center pixel of the blob in image (for infrared images it is ),where denotes a vertical position set of the pixels of the spot in the image .

Features , , , and have strong correlations with pedestrian crowd density. In general, at the same position of the image, the larger the values of , , , and , the more the number of pedestrians included in the blob. And the further away a pedestrian is from the camera lens, the smaller he is in the image. Therefore, we use position features and to record the positional relationship between pedestrians and the camera lens to ensure the accuracy of pedestrian counting.
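
The six features can be illustrated with a simplified sketch. It makes several assumptions that should be stated plainly: every pixel has weight 1, the edge count uses a simple 4-neighbour boundary test in place of the Sobel edge image, length and height are the largest per-row and per-column pixel counts, and the position is the mean pixel coordinate. The function name and dictionary keys are hypothetical.

```python
def blob_features(blob):
    """Compute simplified geometric and positional features of one blob.

    blob is a list of (y, x) pixel coordinates.
    """
    pixels = set(blob)
    area = len(pixels)                                   # unit weight per pixel
    edge_points = sum(
        1 for (y, x) in pixels
        if any(n not in pixels
               for n in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)))
    )                                                    # boundary pixels
    per_row, per_col = {}, {}
    for y, x in pixels:
        per_row[y] = per_row.get(y, 0) + 1
        per_col[x] = per_col.get(x, 0) + 1
    return {
        "area": area,
        "edge_points": edge_points,
        "length": max(per_row.values()),                 # widest row
        "height": max(per_col.values()),                 # tallest column
        "x_position": sum(x for _, x in pixels) / area,  # mean column
        "y_position": sum(y for y, _ in pixels) / area,  # mean row
    }
```

For a solid 3 × 3 blob this gives an area of 9, 8 edge points, and a length and height of 3.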

Because the final decision is based on both the visible light detection result and the infrared detection result, an indicator is further introduced to decide which detection method to trust in different situations. In this study, the ambient brightness from the visible light image is the indicator.where denotes the ambient brightness at time .

3. Model Establishment

This section focuses on how to infer the number of pedestrians from the extracted features. Two main tasks are carried out: (1) a pedestrian count estimation model is developed based on the features of single-source video; (2) an information fusion model is then established based on the detection results of multisource video.

3.1. Pedestrian Count Estimation Model

To estimate the number of pedestrians in a blob while avoiding the problem that overaggregated data might fail to reveal the true correlation between variables, we apply partial least squares regression (PLSR) [21, 22]. PLSR is a multivariate statistical analysis method. It borrows from principal component regression the idea of extracting information from the explanatory variables and can effectively handle multicollinearity among the variables.

The independent variable contains elements and the dependent variable contains elements. In order to study the statistical relationships between the dependent variable and the independent variables, assume there are sample observations, which constitute the independent variable set and the dependent variable set . The normalization results of and are and .

First, the main components are extracted in and . and are the first component of and . Then and need to meet the following conditions:where denotes the covariance between and .

After the first components and are extracted, the regression of versus and the regression of versus are performed, respectively. If the regression equation has reached a satisfactory accuracy, the algorithm terminates; otherwise, a second round of component extraction is performed using the residual information of and . This process repeats until a satisfactory accuracy is achieved. If we finally extract a total of components , PLSR is carried out by regressing on these components, and the result is then expressed as regression equations for the original variables ().
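
The iterative component extraction can be sketched with the standard NIPALS algorithm for a single dependent variable (PLS1). This is a generic textbook variant, not necessarily the exact procedure used here: it extracts a fixed number of components instead of checking accuracy after each round, and all function and key names are hypothetical.

```python
import math

def _dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def pls1_fit(X, y, n_components):
    """Fit PLS1 with NIPALS on standardized data (X: list of rows, y: list)."""
    n, p = len(X), len(X[0])
    mx = [sum(r[j] for r in X) / n for j in range(p)]
    sx = [math.sqrt(sum((r[j] - mx[j]) ** 2 for r in X) / (n - 1)) or 1.0
          for j in range(p)]
    my = sum(y) / n
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1)) or 1.0
    E = [[(r[j] - mx[j]) / sx[j] for j in range(p)] for r in X]  # normalized X
    f = [(v - my) / sy for v in y]                               # normalized y
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [_dot([row[j] for row in E], f) for j in range(p)]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:
            break                      # no covariance left to explain
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in E]            # component scores
        tt = _dot(t, t)
        load = [_dot([row[j] for row in E], t) / tt for j in range(p)]
        q = _dot(f, t) / tt                        # regression of y on t
        # Deflate: keep only the residual information for the next round.
        E = [[row[j] - ti * load[j] for j in range(p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        comps.append((w, load, q))
    return {"mx": mx, "sx": sx, "my": my, "sy": sy, "comps": comps}

def pls1_predict(model, x):
    """Predict y for one feature vector by replaying the deflation steps."""
    e = [(v - m) / s for v, m, s in zip(x, model["mx"], model["sx"])]
    y_std = 0.0
    for w, load, q in model["comps"]:
        t = _dot(e, w)
        y_std += q * t
        e = [ei - t * lj for ei, lj in zip(e, load)]
    return model["my"] + model["sy"] * y_std
```

In this setting each row of `X` would hold the six features of one blob and `y` the manually counted pedestrians in that blob. When the number of components equals the rank of `X`, the fit coincides with ordinary least squares; fewer components trade some fit for robustness to correlated features.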

Take the visible light image as an example. Based on PLSR, we establish the pedestrian estimation model, in which the feature set of the blob is the input and the number of pedestrians in the blob is the output.

Based on the above model, the number of pedestrians included in each blob in the image is calculated. The total number of pedestrians in the image iswhere denotes rounding of .

is the detection result of visible light. Using the same method, we can get the infrared detection result .
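
The aggregation from per-blob estimates to an image-level count can be sketched as below. It assumes that each blob estimate is rounded before summing and that negative regression outputs are clamped to zero; both choices, and the function name, are assumptions.

```python
import math

def total_pedestrians(per_blob_estimates):
    """Sum per-blob estimates after rounding each to a whole number."""
    total = 0
    for n in per_blob_estimates:
        # Round half up; clamp at zero because a regression model can
        # return small negative values for tiny blobs.
        total += max(0, math.floor(n + 0.5))
    return total
```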

3.2. Information Fusion Model

The environment of outdoor scenes such as the pedestrian waiting area varies substantially over the day because of lighting conditions, temperature, and other factors. To ensure the accuracy of the pedestrian count estimates, the visible light detection result is combined with the infrared detection result so that the method remains applicable in different scenarios. First, the current scenario (day or night) is identified from the ambient brightness obtained above. Then, according to the recognized scenario, a corresponding confidence level is set for the visible light detection result and for the infrared detection result. In good daylight, we trust the visible light detection result; otherwise, we trust the infrared detection result. Therefore, the information fusion result at time iswhere is the confidence level of visible light detection result and is the confidence level of infrared detection result. is the environment segmentation threshold and we set it .
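
A minimal sketch of this decision rule follows. It assumes that ambient brightness is the mean gray value of the visible light image and that the confidence levels are (1, 0) by day and (0, 1) by night; the function names, the weight defaults, and any threshold passed by the caller are hypothetical.

```python
def ambient_brightness(gray_image):
    """Mean gray value of the image (assumed definition of brightness)."""
    total = sum(sum(row) for row in gray_image)
    return total / (len(gray_image) * len(gray_image[0]))

def fuse_counts(n_visible, n_infrared, brightness, threshold,
                day_weights=(1.0, 0.0), night_weights=(0.0, 1.0)):
    """Weight the two detection results according to the recognized scenario."""
    # Bright scene: treat as daytime and favor the visible light result.
    w_visible, w_infrared = day_weights if brightness >= threshold else night_weights
    return round(w_visible * n_visible + w_infrared * n_infrared)
```

Keeping the weights as parameters allows softer confidence levels, for example at dusk, without changing the rule itself.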

4. Empirical Analysis

The empirical analysis was conducted on the campus of Tongji University. A total of 106 groups of daytime images and 18 groups of night images (as shown in Figure 5) were collected. The visible light image is 640 × 480 pixels, and the infrared image is 320 × 240 pixels. This section uses 8-fold cross validation to divide the image set into training and test sets and then checks the accuracy of the proposed method.

4.1. Daytime Scenario

For the subset of daytime images, the visible light detection results are shown in Table 1 and the infrared detection results are shown in Table 2. Figure 6 is a schematic diagram of information fusion in a daytime scenario. It can be seen from Figure 6 that the visible light image is clearer and that its processing result contains less noise. This is because the resolution of the visible light image is higher than that of the infrared image. Therefore, the information fusion trusts the visible light detection result, which is consistent with the actual situation.

4.2. Nighttime Scenario

For the group of night images, there are no street lights near the experimental site, which causes visible light detection to fail completely. Therefore, the visible light detection result is 0. The infrared detection results are shown in Table 3. Figure 7 is a schematic diagram of information fusion in a night scenario. Because the ambient brightness at this time is very low, the information fusion selects the infrared detection result, which is consistent with the actual situation.

4.3. Influence of Thresholds and

The thresholds and are key parameters in this study, which were used to distinguish pedestrians from the background in the image. If and are too large, a large number of pixels representing the pedestrians in the image will be misjudged as the background, which will result in incomplete motion foreground. As a consequence, the final pedestrian count result will be small. If and are too small, a large number of pixels representing the background in the image will be misjudged as pedestrians, so that the foreground of the motion will contain a lot of noise. And the final pedestrian count result will be large.

Here different thresholds were performed as an example, and the threshold is similar. For the same visible image, we set the threshold to 30, 45, and 70, respectively. The results of the extraction of the motion foreground are shown in Figure 8 and the results of the pedestrian detection are shown in Table 4. According to Figure 8 and Table 4, we can find that when the threshold is too small (=30), the motion foreground contains more noise, and the final pedestrian count result is too large. When threshold is too large (=70), the motion foreground is incomplete and the final pedestrian count results are small. Therefore, the thresholds and need to be set according to the characteristics of the data and the actual situation.
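
The effect of the threshold on the extracted foreground can be illustrated with a toy example. The difference values below are made up for illustration and are not experimental data; the function name is hypothetical.

```python
def foreground_pixel_count(diff_image, threshold):
    """Number of pixels whose background difference exceeds the threshold."""
    return sum(1 for row in diff_image for d in row if d > threshold)

# Made-up background difference values for a 3x3 patch.
diff = [[10, 35, 50],
        [80, 40, 20],
        [60, 75, 5]]

# Sweep the three thresholds discussed in the text.
counts = {t: foreground_pixel_count(diff, t) for t in (30, 45, 70)}
```

The foreground shrinks monotonically as the threshold grows, which is why a threshold that is too small admits noise and one that is too large truncates pedestrians.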

4.4. Contribution of the Features

The method in this paper is based on six features (Section 2.3 Feature Extraction). In order to evaluate the contribution of these features to the final result, the average elastic coefficient is introduced. The bigger the average elastic coefficient of the feature, the greater contribution to the final result. And the average elastic coefficient iswhere is the average of the independent variables and is the average of the dependent variables.
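
One common definition of the average elastic coefficient evaluates the elasticity at the sample means, that is, the regression coefficient of a feature multiplied by the ratio of the feature mean to the mean of the dependent variable. The sketch below assumes this definition; the function name and the coefficient vector `coefs` are hypothetical.

```python
def average_elasticity(coefs, X, y):
    """Elasticity of y with respect to each feature, evaluated at the means.

    coefs: regression coefficients on the original (unnormalized) variables.
    X: list of feature rows; y: list of observed counts with nonzero mean.
    """
    n = len(X)
    y_mean = sum(y) / n
    return [b * (sum(row[j] for row in X) / n) / y_mean
            for j, b in enumerate(coefs)]
```

Because the result is unit free, features measured in pixels and features measured in pixel counts can be compared directly.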

In the visible light model and the infrared model, the average elastic coefficient of each feature is calculated separately. The calculation results are shown in Figure 9. We have found that the feature is the most influential feature of the final result in both the visible light model and the infrared model, because is the most important parameter to represent the distance from the pedestrian to the lens in the testing scenario of this paper. On the other hand, the features , , and are reasonable predictors of crowd density, which reflects the number of pedestrians from different angles. One possible explanation for the low contribution of features and is that the camera’s field of view is parallel to the road, not vertical or oblique in the testing scenario of this paper.

4.5. The Efficiency of Background Update

Visible light video detection relies on background differencing to obtain the motion foreground. Since the environment around the pedestrian waiting area varies greatly over a day, real-time background updating is essential. Here, the efficiency of the Kalman-filter-based background update method is tested. Three rounds of tests were performed on 102 images. The results are shown in Figure 10 and Table 5. (Note: this test was performed on a laptop using MATLAB 2017b.)

According to Table 5 and Figure 10, the average background update time is about 0.0168 s per image, which is fast enough to meet the needs of practical applications.

4.6. Accuracy Verification

The accuracy of the proposed method is verified using 8-fold cross validation. The 124 images are randomly divided into 8 groups. Each group is used in turn as the test set, and the remaining 7 groups serve as the training set. The experimental results are shown in Figure 11 and Table 6. As can be seen from Figure 11, the accuracy of visible light detection alone is sometimes higher than that of infrared detection alone, and sometimes lower. However, the accuracy of information fusion detection is always the highest. Combined with Table 6, the average accuracy of information fusion detection is higher than that of either visible light or infrared detection alone when both daytime and nighttime scenarios are considered.
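
The random 8-fold partition can be sketched as follows. The shuffling seed and the function name are hypothetical, and the split actually used in the experiments is not known.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # near-equal fold sizes
    all_indices = set(indices)
    return [(sorted(all_indices - set(fold)), sorted(fold)) for fold in folds]

# 124 images in 8 folds: four test sets of 16 images and four of 15.
splits = k_fold_indices(124, 8)
```

Each image appears in exactly one test set, so every image contributes once to the reported accuracy.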

Moreover, since there is no public dataset containing both infrared and visible light images, we test other methods on the dataset of this paper to show the advantage of the proposed method. Table 7 lists the prediction accuracy comparisons. It can be seen that the fusion method provides better performance, with lower MSE and higher accuracy, than the existing methods. Therefore, for 24-hour pedestrian counting in outdoor scenarios, fusing visible light video and infrared video from the perspective of environment perception is more effective than using a single video source (visible light video or infrared video).

5. Conclusion

In this study, a method that fuses visible light video and infrared video based on environment perception has been proposed for estimating the number of pedestrians. The method combines visible light information with infrared information to enable pedestrian counting in complex outdoor scenarios. The proposed approach consists of two parts: the estimation of the number of pedestrians based on single-source video and the information fusion based on multisource detection results. First, PLSR was applied to combine dimensionality reduction with regression analysis and establish the pedestrian count estimation model based on single-source video. This reduces the redundancy of the data in the feature set and effectively handles multicollinearity among the variables. Meanwhile, the ambient brightness was employed to identify the scene of the images and integrate the visible light detection result with the infrared detection result. The empirical analyses showed that, for 24-hour pedestrian counting in outdoor scenarios, the proposed method performs better than methods using a single information source, which expands the application scenarios of pedestrian counting and provides a reference for related research.

In future work, the sample size of the empirical analyses needs to be expanded, and the feasibility of using deep learning networks to identify different scenarios (day, night, rain, fog, etc.) should be tested. In addition, heavy fog or rain substantially increases the noise in the video, and reducing the interference of this noise on pedestrian counting is a challenging issue to be investigated. Finally, continued improvements of the information fusion model and the feasibility of employing new sensing equipment (such as laser scanners) to estimate the number of pedestrians will be tested.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is supported by the National Key R&D Program of China (2016YFB1200402), the National Natural Science Foundation of China (61703308; 71771174), and the Fundamental Research Funds for the Central Universities.