Abstract

The subway is an important means of transportation for residents because it usually runs on schedule. However, temporary management policies or unexpected events may change passenger flow and thereby affect passengers' requirements for punctuality. Thus, detecting anomalous events, mining their propagation laws, and revealing their potential impact are important for improving management strategy; e.g., subway emergency management can predict flow changes given a specific policy and estimate the traffic impact of big events such as concerts and ball games. In this paper, we propose a novel anomaly detection method for subway passenger flow. In this method, an improved robust principal component analysis model is presented to detect anomalies; then the ST-DBSCAN algorithm is used to group the station-level anomaly data along the space-time dimensions to reveal the propagation law and potential impact of different anomaly events. Real flow data from the Beijing subway are used for experiments. The experimental results show that the proposed method is effective for detecting anomalies of subway passenger flow in practice.

1. Introduction

Owing to its high efficiency and comfort, the subway has become the first choice for citizens’ daily travel, and it directly facilitates the city’s economic development and people’s quality of life. For example, as one of the busiest subway systems in the world, the Beijing subway has the world’s largest annual ridership, with 3.03 billion trips delivered in 2016, averaging 8.26 million per day, and peak single-day ridership reaching 10.52 million. Public transportation in Beijing accounts for 45% of total traffic, of which subway ridership makes up nearly 40%.

Although it brings great convenience to residents, the subway system is also vulnerable, as it is a large and complicated network running on a tight schedule. For example, there are 22 lines and 370 stations in the Beijing subway, and more than 500 trains run on the network with a minimum peak headway of 90 seconds. This vulnerability becomes more critical when exceptional events occur, such as station accidents, major activities, and bad weather. Once a station has an anomaly event, such as an operation failure or chaos in the station, passengers are stranded, which can bring great loss and high security risks. Moreover, the bad situation propagates through the urban subway system since it is a relatively closed and connected network. So the impact of an anomaly event is not restricted to a specific station; it may affect the traffic system over a large region, and the influence of abnormal events usually follows a certain space-time law. Thus, it is necessary to detect anomalies in the urban subway transportation system and figure out their spreading rules, which can provide valuable evidence for managers making strategies to deal with abnormal events.

In the traditional road transportation system, many methods have been proposed for detecting transit anomalies, such as the Automatic Incident Detection (AID) algorithms [1-3] based on comparison, statistical techniques, traffic flow models, and so on. These methods are mostly applied to freeways and urban roads, and they link the main regions of a city and try to find unexpected traffic flow between any two regions [4, 5]. As the subway is a different traffic system from traditional road transportation, the above methods are difficult to apply to the subway system. At present, few studies focus on anomaly detection of passenger flow in subway transportation.

Since subway anomaly events are always uncertain and sporadic, the anomalies are obviously sparse within the whole subway traffic data. Based on this observation and on the subway passenger flow data collected by the AFC (Automatic Fare Collection) system, we propose a novel anomaly detection method based on the Robust Principal Component Analysis (RPCA) model [6], which represents the temporal-spatial distribution of the data and the sparsity of the anomalies by low-rank and sparse regularization. Additionally, in order to reveal the propagation law and potential impact of different anomaly events, the ST-DBSCAN algorithm is adopted to group the station-level anomaly data along the temporal-spatial dimensions. Thus the proposed method can not only detect the anomaly of a single station but also find the relations among anomalies. Figure 1 shows the main structure of our model.

The main contributions of this paper are summarized as follows:
(i) A novel anomaly detection method for subway passenger flow based on RPCA is proposed, which utilizes the low-rank nature of the passenger flow data and the sparsity of the anomaly data. Experimental results demonstrate that our approach can achieve accurate anomaly detection.
(ii) The ST-DBSCAN clustering algorithm is adopted to explore the temporal-spatial propagation law of anomalies, and the obtained results are verified by tweet data. The distribution law of anomalous flow caused by different anomaly events can provide prior information for coping with possible anomalies.

The rest of this paper is organized as follows. Related works are summarized in Section 2. Section 3 gives the methodology. Section 4 reports experimental results on real data and their visualization analysis. Finally, we conclude this paper in Section 5.

2. Related Works

In this section, we review the commonly used methods for anomaly detection in traffic systems and introduce the RPCA model related to our work.

2.1. Anomalies Detection Methods of Passenger Flow

Most existing anomaly detection methods for traffic flow target highway or urban road scenarios, and traffic data from fixed detectors are usually used for analysis. The typical approaches include statistical methods, comparison methods, and traffic flow model based methods. The best-known comparison algorithms are the California algorithm and its derivatives [1], which identify an anomaly event by comparing traffic parameters between adjacent detectors. But they are not suitable for subway passenger flow because the relation between neighboring stations is not the same as the relation between neighboring detectors. The statistical methods (like SND [2]) detect traffic anomalies by judging the change rate of the traffic parameters, and they adopt the threshold method [7] to identify unreasonable data values based on historical data. For these methods, suitable thresholds are difficult to choose. The traffic flow model based methods (like the McMaster algorithm [3]) define a boundary between crowded and noncrowded traffic flow to determine a speed threshold for distinguishing them, which does not work well for the subway because the flow pattern differs as time and space scales change. In addition, wavelet analysis [8] is used for detecting anomalous samples by separating the high-frequency and low-frequency components of traffic data.

For city-scale transportation systems, current studies on anomaly detection are mostly region-based. Pang et al. [4, 5] partition the city into uniform grids and report anomalies if traffic volumes in neighboring cells are different, while Shekhar et al. [9] focus on detecting spatial outliers in graph-structured datasets. Similarly, Liu et al. [10] and Chawla et al. [11] partition the city into disjoint regions linked by major roads and then find unexpected traffic flow between any two regions. However, the above methods are either road-based or region-based: the former cannot accurately identify the location of events, and the latter may lose information because of the coarse region partition.

Few works concentrate on anomaly detection of subway passenger flow. Some anomaly detection studies in subway systems focus on abnormal pedestrian activity inside the station [12, 13], and they generally adopt visual recognition techniques based on the video surveillance system in the station; thus they are valid only within the view of the cameras. Other studies on subway passenger flow data mainly focus on passenger flow prediction and analysis [14, 15]. In contrast, in this paper we conduct anomaly detection of subway passenger flow and explore the temporal-spatial impact of anomaly events.

2.2. Robust Principal Component Analysis

Recently, owing to their power to reveal the intrinsic structure underlying data, low-rank and sparse theories have been successfully applied in numerous areas such as image recovery and denoising [16], background modeling, and foreground object detection in video [17]. RPCA is a typical model that uses low-rank and sparse matrix decomposition for data restoration and denoising. The basic idea is that the original data, in the form of a numerical matrix, can be decomposed into a low-rank matrix and a sparse matrix as follows:

$$\min_{X,S}\ \operatorname{rank}(X) + \lambda \|S\|_0 \quad \text{s.t.} \quad D = X + S, \tag{1}$$

where $D$ is the raw data, usually containing noise, $X$ represents the expected clean data, which is assumed to be low-rank, $S$ represents the noise or outliers, which are considered sparse, and $\lambda$ weights the sparse term. The target of RPCA in (1) is to estimate the unknown $X$ and $S$ given $D$.

However, the optimization problem in (1) is NP-hard [18] due to its nonconvexity and discontinuity. On one hand, the low-rank term should be handled properly. For this purpose, a widely used scheme replaces $\operatorname{rank}(X)$ by its convex envelope, the nuclear norm [6, 19], as nuclear norm minimization performs stably without knowing the target rank of the recovered matrix in advance. On the other hand, the nonconvexity and discreteness of the $\ell_0$ penalty make it impractical. Considering that the $\ell_1$ norm is also good at modeling sparse noise [6] and can be solved efficiently, the $\|S\|_0$ term in (1) is replaced with $\|S\|_1$. Thus, (1) can be written as

$$\min_{X,S}\ \|X\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad D = X + S, \tag{2}$$

where $\|X\|_* = \sum_i \sigma_i(X)$ denotes the nuclear norm, $\sigma_i(X)$ is the $i$th largest singular value of matrix $X$, and $\|S\|_1 = \sum_{i,j} |s_{ij}|$, where $s_{ij}$ is the $(i,j)$ element of $S$.

In this paper, we introduce RPCA into the anomaly detection of subway passenger flow. The passenger flow data matrix has a low-rank structure because it shows regular cycles with respect to day, week, month, and year. In addition, real-world data are usually polluted by noise or outliers, and the outliers are treated as the anomalies to detect. So we adopt RPCA to represent the subway passenger flow and detect the anomalies through the sparse outliers. Additionally, we consider the temporal correlation among the data and propose an improved RPCA. The next section gives the improved RPCA in detail.
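The low-rank plus sparse split described above can be illustrated with a minimal sketch of standard RPCA, solved by an inexact augmented Lagrangian iteration. This is the baseline convex model only, not the improved model of Section 3; the function names, the default weight `1 / sqrt(max(m, n))`, and the penalty schedule (`mu`, `rho`) are assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(M, tau):
    """Elementwise shrinkage: proximal operator of tau * (l1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)


def singular_value_threshold(M, tau):
    """Shrink singular values by tau: proximal operator of tau * (nuclear norm)."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(sigma - tau, 0.0)) @ Vt


def rpca(D, lam=None, mu=None, rho=1.5, tol=1e-7, max_iter=500):
    """Split D into a low-rank part X and a sparse part S with D = X + S."""
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam  # assumed default weight
    mu = 1.25 / np.linalg.norm(D, 2) if mu is None else mu  # assumed initial penalty
    mu_max = mu * 1e7
    norm_D = np.linalg.norm(D)
    S = np.zeros_like(D)
    Y = np.zeros_like(D)  # Lagrange multiplier for the constraint D = X + S
    for _ in range(max_iter):
        X = singular_value_threshold(D - S + Y / mu, 1.0 / mu)
        S = soft_threshold(D - X + Y / mu, lam / mu)
        residual = D - X - S
        Y = Y + mu * residual
        mu = min(rho * mu, mu_max)
        if np.linalg.norm(residual) <= tol * norm_D:
            break
    return X, S
```

Applied to a flow matrix, the large entries of the returned `S` are the candidate anomalies, while `X` approximates the regular pattern.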

3. Methodology

In this section, we first represent the subway passenger flow as a matrix and give its decomposition for anomaly detection. Then the improved RPCA is applied to obtain preliminary abnormal flow information. Finally, the detected anomalies are grouped into several clusters to reveal the temporal-spatial laws.

3.1. Subway Passenger Flow Representation and Decomposition

The raw subway passenger riding data are collected from the subway AFC system; they include the boarding or alighting time at a station, the boarding or alighting line ID, and the boarding or alighting station ID. Based on the raw riding data, the subway passenger flow is aggregated in one-hour intervals, and we obtain the subway passenger flow matrix $D$, whose rows and columns correspond to the date and the time interval of each day, respectively. Therefore, each element in the matrix represents the passenger flow of a station at a certain time interval of a certain day.
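The aggregation described above can be sketched as follows. The record layout, a `(station_id, "YYYY-MM-DD HH:MM:SS")` pair per fare-gate tap, is a hypothetical simplification of the AFC data, and `build_flow_matrix` is an illustrative name rather than part of any AFC system.

```python
from collections import defaultdict
from datetime import datetime


def build_flow_matrix(records, station_id):
    """Count taps per (date, hour) for one station.

    Returns the sorted dates and a list of 24-element rows, so rows are
    dates and columns are one-hour intervals of the day.
    """
    counts = defaultdict(lambda: [0] * 24)
    for sid, timestamp in records:
        if sid != station_id:
            continue
        t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        counts[t.date()][t.hour] += 1
    dates = sorted(counts)
    return dates, [counts[d] for d in dates]
```

Dates without any record for the station produce no row in this sketch, so a real pipeline would need to insert zero rows for them to keep the calendar aligned.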

As the passenger flow of a subway station varies in a similar way when a year, month, week, hour, or minute is taken as a cycle, the passenger flow matrix is typically low-rank [20]. Besides, the passenger flows of adjacent stations also show a certain similarity, which further supports the low-rank property of the passenger flow matrix. However, when anomaly events happen, the low-rank property of the flow is destroyed by the outliers. So the matrix $D$ can be considered a combination of normal flow and outliers. Let $x_{ij}$ and $s_{ij}$ represent the expected flow component and the outlier interference of a station at time $j$ on date $i$, so the measured passenger flow can be expressed as $d_{ij} = x_{ij} + s_{ij}$. Collecting the measurements into the matrices $X = [x_{ij}]$ and $S = [s_{ij}]$, the passenger flow matrix can be decomposed as

$$D = X + S. \tag{3}$$

By this decomposition, the subway passenger flow is represented by two components: the expected flow $X$ and the anomalous part $S$. The anomalous part is explained by special events or activities around the station; it is sporadic over time and may last for short periods relative to the (possibly long) measurement period, which means that only a small fraction of the elements in the observed traffic flow matrix $D$ is supposed to be anomalous. Therefore, the anomaly matrix $S$ is sparse in both rows and columns.

From the above analysis, the subway passenger flow fits the RPCA model in (2) with its low-rank and sparse terms. In the following, we further exploit a temporal constraint for the model and propose our improved RPCA model for anomaly detection in subway passenger flow data.

3.2. The Improved RPCA

For the subway passenger flow matrix $D$, two adjacent rows for the same weekdays in different weeks are often approximately equal except for some outliers, owing to the obvious daily cycle of the passenger flow measurements. This property should also hold for the corresponding expected flow $X$, but the current RPCA model has no specific description of it. So we propose a constraint that keeps the rows of $X$ consistent by adding a term to the current RPCA model. The matrix $T$ is defined as follows:

$$T = \begin{bmatrix} 1 & -1 & & \\ & 1 & -1 & \\ & & \ddots & \ddots \\ & & & 1 \end{bmatrix}. \tag{4}$$

The above temporal differential matrix $T$ is a Toeplitz matrix in which the main diagonal is defined as ones and the first upper diagonal is defined as negative ones. The temporal constraint matrix intuitively expresses the fact that nominal passenger flows at the same time intervals on the same weekdays are usually similar. Actually, $TX$ captures the consistency between two adjacent rows of $X$. Moreover, compared with the $\ell_2$ norm, the $\ell_1$ norm is more inclusive and robust with respect to abrupt temporal changes [6]. Thus we choose the $\ell_1$ norm to minimize $TX$, as it encourages the matrix to be temporally stable [21]. Hence, we revise the RPCA model in (2) and obtain the following improved RPCA model:

$$\min_{X,S}\ \|X\|_* + \lambda \|S\|_1 + \gamma \|TX\|_1 \quad \text{s.t.} \quad D = X + S, \tag{5}$$

where $\lambda$ and $\gamma$ control the weights of the sparse term and the temporal term $\|TX\|_1$, respectively.
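To make the temporal constraint concrete, the following sketch builds a difference matrix with ones on the main diagonal and negative ones on the first upper diagonal, and evaluates the entrywise l1 norm of its product with a flow matrix. Treating the matrix as square with one row per row of the flow matrix is an assumption of this sketch, and the function names are illustrative.

```python
import numpy as np


def temporal_difference_matrix(m):
    """m x m matrix: ones on the main diagonal, -1 on the first upper diagonal."""
    return np.eye(m) - np.eye(m, k=1)


def temporal_penalty(X):
    """Entrywise l1 norm of T @ X.

    Row i of T @ X is X[i] - X[i + 1]; with a square T the last row of the
    product is simply the last row of X, because it has no successor.
    """
    X = np.asarray(X, dtype=float)
    T = temporal_difference_matrix(X.shape[0])
    return float(np.abs(T @ X).sum())
```

A matrix whose adjacent rows are nearly identical gives a small penalty, while an abrupt change between two rows contributes its full magnitude.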

To solve the improved model, we adopt the Alternating Direction Method of Multipliers (ADMM) [22], which is a popular algorithm for solving convex optimization problems. For this purpose, three auxiliary variables , and are introduced; let and , where r is the decomposition rank of . Therefore (5) is rewritten as

Removing the linear equality constraints in (6) with the augmented Lagrangian method, we have the following objective function: where , , and are Lagrange multipliers, is the adaptive penalty parameter, and represents the standard trace inner product. We adopt alternating iterations to solve this optimization as follows.

Update . When , , , and are fixed, (7) degenerates into a function with respect to . So we solve by the following optimization problem:

Taking derivative of the objective function in (8) and setting it to 0, the closed-form solution is given by

Update . When others are fixed, in order to update , one needs to solve the following minimization problem:whose solution is given by [23]:where for and is zero otherwise.

Update . In a similar way with updating , the closed-form solution of S is given by

Update . In a similar way with updating , the closed-form solutions of are given by

Update , , , and . The Lagrangian multipliers , , and and penalty parameter could be updated as follows:where is a constant and is the upper bound of .

Convergence Conditions. The stopping criterion is measured by the following problem:where is tolerance error. If the convergence condition is met, the iteration terminates. The overall algorithm is summarized in Algorithm 1.

Input: Data matrix , the parameters ,
Initialize:  ,
, ,
,
, , , , .
1:while  not converged and   do
2:Update via (9)
3:Update via (11)
4:Update via (12)
5:Update via (13)
6:Update via (14)
7:Update the multipliers: via (15)
8:.
Output: Expected matrix , Sparse matrix .

After solving the improved RPCA, we obtain the expected flow $X$ and the anomalous part $S$. To eliminate the interference of noise, we use the three-sigma rule of thumb [24] to filter the elements of $S$. Let $\sigma$ be the standard deviation of $S$; an element $s_{ij}$ with $|s_{ij}| \le 3\sigma$ is considered an allowable deviation and set to 0, which gives the anomaly flow matrix. Each element of this matrix represents the abnormal amplitude at a space-time position; it may be positive or negative. A positive value indicates that the passenger flow is higher than the expected flow, and a negative value indicates that it is lower.
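The three-sigma filtering step can be sketched as below. Computing the standard deviation over all elements of the sparse part at once, and comparing absolute values against it without subtracting the mean, are assumptions of this sketch; `three_sigma_filter` is an illustrative name.

```python
import numpy as np


def three_sigma_filter(S, k=3.0):
    """Zero out entries of S whose magnitude is within k standard deviations."""
    S = np.asarray(S, dtype=float)
    sigma = S.std()  # standard deviation over all elements of S
    return np.where(np.abs(S) > k * sigma, S, 0.0)
```

The surviving entries keep their sign, so a positive value still marks flow above the expected level and a negative value marks flow below it.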

3.3. Discovering the Potential Temporal-Spatial Laws among Anomalies

Based on the improved RPCA, the anomalies of subway passenger flow are detected. To explore the potential laws of anomaly propagation, we group the anomalies into several clusters to identify regional anomalies and their propagation laws. Because the detected anomalies have similar temporal-spatial characteristics, we use the ST-DBSCAN algorithm [25] to cluster the station-level anomalies and find the anomalies in a region. We regard $(x, y, t)$ as the feature of an anomaly data object, where $x$ and $y$ are the longitude and the latitude of a station, and $t$ denotes the time interval. The ST-DBSCAN algorithm requires three parameters: space radius $Eps_1$, time window $Eps_2$, and density threshold $MinPts$; the first two parameters determine the neighborhood in the temporal-spatial dimensions.

The algorithm starts with the earliest anomaly data object $p$ and retrieves all neighbors of point $p$ within its spatiotemporal neighborhood. If the number of neighbors is greater than $MinPts$, a new cluster is created with $p$ as its core. Then, the algorithm iteratively collects neighbors, beginning with another core point. The above procedure continues until all points have been processed.
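The clustering procedure above can be sketched in plain Python as a density-based expansion over a joint space and time neighborhood. This is a simplified illustration rather than the published ST-DBSCAN: it omits the additional check on non-spatial attribute values, counts neighbors excluding the point itself, and treats a point as a core point when it has at least `min_pts` neighbors. All names are hypothetical.

```python
from math import hypot

NOISE = -1


def st_neighbors(points, i, eps_space, eps_time):
    """Indices of points within eps_space (planar distance) and eps_time of point i."""
    x, y, t = points[i]
    return [
        j
        for j, (xj, yj, tj) in enumerate(points)
        if j != i and hypot(x - xj, y - yj) <= eps_space and abs(t - tj) <= eps_time
    ]


def st_dbscan(points, eps_space, eps_time, min_pts):
    """Label each (x, y, t) point with a cluster id, or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    # Visit points from the earliest time interval onward.
    for i in sorted(range(len(points)), key=lambda k: points[k][2]):
        if labels[i] is not None:
            continue
        neighbors = st_neighbors(points, i, eps_space, eps_time)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = st_neighbors(points, j, eps_space, eps_time)
            if len(more) >= min_pts:  # j is itself a core point, keep expanding
                queue.extend(more)
        cluster += 1
    return labels
```

Each resulting cluster id groups station-level anomalies that are close in both location and time, which is what the text calls an anomaly event.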

4. Experiments

In this section, we evaluate the robustness of the improved RPCA by adding noise to a set of real subway passenger flow data and comparing it with RPCA [6], the wavelet transform method [8], and the threshold method [7]. Then we apply our proposed framework to the real subway passenger flow data for anomaly detection and analysis; meanwhile, the results are verified with traffic-related tweet data.

4.1. Robustness Evaluation

The improved RPCA model is characterized by its robustness to noise, so we first validate the performance of our methods on noisy passenger flow data compared with the related methods.

First, we construct three real-world passenger flow datasets from three different geographical positions, shown in Figure 2. By exploiting the strong weekly seasonality observed in the data, we convert the hourly flow within one week into a row vector and stack the vectors of 12 weeks to form the data matrix, which contains much noise. The verification experiment requires the ground truth. Since the ground truth is unavailable, we have to estimate a relatively accurate one. Here, we use 4 layers of wavelet to filter the small white noise and then take the average as the ground truth value. As a result, we get three relatively clean and ideal ground truth datasets, denoted by .

Next, we add sparse noise on the ground truth matrices to simulate the corresponding noise matrices . The randomly corrupted proportion of these matrices varies from to ; the fluctuation range is of the average of . So we obtain the noisy passenger flow matrices by mixing the ground truth matrices and the produced noise matrices by . These datasets will be used as the test datasets for anomalies detection and evaluating the robustness of the proposed method. The properties of the constructed datasets are summarized in Table 1.
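The corruption step can be sketched as follows. The corrupted proportion and the fluctuation range are not recoverable from the text above, so `proportion` and `amplitude` are placeholder parameters, and drawing the noise uniformly around zero is an assumption of this sketch.

```python
import numpy as np


def add_sparse_noise(G, proportion, amplitude, seed=0):
    """Corrupt a random fraction of the entries of the ground truth matrix G.

    Each corrupted entry receives a perturbation drawn uniformly from
    [-amplitude, amplitude] times the mean of G. Returns (noisy matrix, noise).
    """
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=float)
    n_corrupt = int(round(proportion * G.size))
    idx = rng.choice(G.size, size=n_corrupt, replace=False)
    E = np.zeros(G.size)
    E[idx] = rng.uniform(-amplitude, amplitude, size=n_corrupt) * G.mean()
    E = E.reshape(G.shape)
    return G + E, E
```

Because the noise matrix is returned alongside the noisy data, the positions of the planted anomalies are known exactly and can serve as labels when scoring a detector.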

Evaluation Criteria. To evaluate the performance of the improved RPCA algorithm, we use the precision rate in [21] to evaluate the recognition accuracy of anomalies, which are defined as follows:where ,  , and denote the number of anomalies recognized by our model and the number of true anomalies among them, respectively, and represents the actual number of anomalies. is calculated by averaging results over 10 runs.
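As an illustration of the counts named above, the sketch below scores a set of detected positions against the planted ones. It assumes the usual definitions, precision as correctly detected anomalies over all detected anomalies and recall as correctly detected anomalies over all actual anomalies; whether these match the exact formula of [21] is not confirmed by the text, and the function name is hypothetical.

```python
def precision_recall(detected, actual):
    """Score detected anomaly positions against the actual anomaly positions."""
    detected, actual = set(detected), set(actual)
    hits = len(detected & actual)  # true anomalies among the detected ones
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall
```

Positions can be any hashable identifiers, for example `(row, column)` index pairs of the flow matrix.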

Parameters Setting. The improved RPCA has three parameters , , and decomposition rank , and they are important for the performance. Rank needs to be as small as possible to minimize matrix sparsity and low-rank error. Here, we use singular value decomposition (SVD) [26] to estimate a superior rank for these three datasets. Figure 3 shows the distribution of the singular value of the three ground truth datasets. The x-axis presents the th singular values and the y-axis presents the cumulative ratio of the first singular values to the sum of all singular values. It can be found that the first singular values almost dominate nearly energy in all three datasets. To simplify, the rank is set as for all datasets.
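The rank selection via cumulative singular values can be sketched as below. The energy threshold used in the experiments is not recoverable from the text, so the default of 0.9 is only a placeholder, and `estimate_rank` is an illustrative name.

```python
import numpy as np


def estimate_rank(D, energy=0.9):
    """Smallest k whose first k singular values reach `energy` of their total sum."""
    sigma = np.linalg.svd(np.asarray(D, dtype=float), compute_uv=False)
    cumulative = np.cumsum(sigma) / sigma.sum()
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(sigma))
```

Plotting `cumulative` against the singular value index gives the kind of curve described for Figure 3.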

For and , we first change one parameter while fixing the other parameter in the model, and the parameter is gradually taken as . Then we achieve the relatively superior parameters. Next, we tune these parameters in a narrow range from to by step of . Finally we obtain the relatively optimal and . The setting of experimental parameters is shown in Table 2.

In our experiments, we apply 5 layers of discrete wavelet transform based on the wavelet of DB4. For the threshold method, the threshold value of flow at different time interval is different and we compute the mean value and standard deviation of and set the confidence interval as ; it is judged to be an anomaly if is beyond the confidence interval.
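The threshold baseline can be sketched as a per-time-slot interval test. The width of the confidence interval is not recoverable from the text, so `k = 3` standard deviations is an assumed value, and computing the statistics per column (one column per time interval) is likewise an assumption.

```python
import numpy as np


def threshold_anomalies(D, k=3.0):
    """Flag entries lying outside mean +/- k * std of their own time-slot column."""
    D = np.asarray(D, dtype=float)
    mu = D.mean(axis=0)  # mean flow of each time interval across rows
    sd = D.std(axis=0)   # standard deviation of each time interval
    return np.abs(D - mu) > k * sd
```

Since the mean and standard deviation are computed from the same noisy data they are meant to screen, heavy corruption widens the interval and hides anomalies, which is consistent with the weaker results reported for this baseline.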

Experiment Result Analysis. Figure 4 compares the results of the four methods. As can be seen, the precision shows a downward trend as the corrupted proportion of the data increases. The improved RPCA is superior to the other methods, followed by RPCA. Notice that the threshold method performs worse, because the noisy data reduce the calculation accuracy of the threshold range. When the corrupted proportion is low, the wavelet transform detects anomalies well, but at a high corrupted proportion the local stationarity is destroyed, which results in a steady decline of the detection accuracy. The improved RPCA is more robust than RPCA even at a high corrupted proportion, because the constraint term can capture the abrupt changes of the time series data when the sparseness of anomalies becomes weak. Additionally, the improved RPCA performs well on stations from different geographical positions. In short, the improved RPCA is more suitable for anomaly detection of subway passenger flow.

4.2. Anomalies Detection and Verification

In order to demonstrate the practicability and the authenticity of the improved RPCA, we conduct anomalies detection experiments on real-world datasets and verify the results with tweet data. Figure 5 shows the decomposition results of the exit flow of Xidan station. The low-rank expected flow matrix represents the weekly pattern and the anomaly matrix successfully captures multiple outliers.

To further analyze and verify the anomalies, we collect tweet data, which contain a wide variety of information, and retrieve event information through natural language processing. There are four explainable anomaly regions, highlighted by ellipses in Figure 5. They correspond to the following events extracted from the tweet data, as shown in Figure 6, and are analyzed as follows:
(i) Event1: Ellipse region 1 in Figure 5 shows an increase of the exit flow lasting about three hours in the evening. This is because many large shopping malls near Xidan station held sales for Chinese Valentine’s Day, which attracted large numbers of customers and led to a rise in exit flow.
(ii) Event2: In ellipse regions 2 and 3, the flow declined. This is because Xidan station was closed to facilitate the celebration parade for the anniversary of the victory in the anti-Japanese war.
(iii) Event3: In ellipse region 4, the exit flow was higher than usual. Because the day was a working day owing to the statutory holiday swap, the flow increased and was consistent with the flow of a working day.

The improved RPCA identifies anomalies at the station level and locates them accurately in time. These anomalies could serve as a reference for real-time alerting.

4.3. Discovering the Potential Temporal-Spatial Laws among Anomalies

An isolated anomaly may affect neighboring stations consecutively, so anomalies among some stations have strong temporal-spatial correlations. Grouping several anomalies along the temporal-spatial dimensions may reveal the evolution or the impact of the isolated anomaly; hence we adopt the ST-DBSCAN clustering algorithm to group the anomalies and analyze their propagation features.

In experiments, space radius (Euclidean distance of latitude and longitude between two adjacent stations), time window hour interval, and density threshold work well. We cluster the anomalies of all stations in one week and name each cluster as an anomaly event.

In Figure 7, we use ellipses to highlight four anomaly events grouped by ST-DBSCAN; these clustered results are easier to verify with tweet data and to analyze visually. Ellipse region 1 in Figure 7(a) shows the entry flow decrease of the stations on the same line. It lasted for five hours and was induced by the closure strategy. Meanwhile, it also led to a flow increase at the nearby stations. In particular, transfer stations such as Dongsi station and Chaoyangmen station had an obvious flow increase. Ellipse region 2 in Figure 7(a) shows the surges of exit flow as attendees traveled to the Bird’s Nest stadium for the opening ceremony of the IAAF World Championships. In Figure 7(b), ellipse region 3 shows the spreading of the anomaly caused by an hour-long train breakdown at Sihuidong station. In Figure 7(c), ellipse region 4 shows the entry flow increase of many stations for one day, since the traffic control of city roads led more people to choose the subway.

After the clustering and verification process, we can discover the potential temporal-spatial laws among anomalies from the following three aspects:
(i) Distribution and spreading of anomalies along time and space: how many stations are affected and for how long, where the center of the anomalies is, and how far they spread from a given anomaly. As shown in Figure 7(c), ellipse region 4 shows the stations affected by traffic control measures. Ellipse region 3 shows the anomaly at Sihuidong station spreading to the adjacent stations.
(ii) The severity of anomalies: values in the anomaly flow matrix reflect the severity of anomalies. In Figure 7, affected stations are labeled in red at different levels to reveal the degree of impact. The darker the color, the more severe the anomaly impact.
(iii) The potential impact caused by events: some anomaly events not only affect the corresponding stations but also have a potential impact on the surrounding stations. As shown in ellipse region 1 of Figure 7(a), the closure strategy resulted in a potential flow increase at the surrounding stations.

Furthermore, we apply statistical analysis to obtain some rules, shown in Figure 8. The detected events are classified into two categories: some are predefined, such as traffic control and concerts, and some are emergencies, such as subway device failure and a sudden heavy rain. For both categories of events, the above analysis with our method can provide useful suggestions for subway managers:
(i) For emergency events, our framework provides distribution laws of anomaly events, and these can be used for estimating the propagation of anomalies and their impact on adjacent stations. As shown in Figure 8(a), a sudden heavy rain caused a delay of the evening rush hour; with this knowledge, managers can push timely announcements to remind passengers and take emergency measures. This would prevent chaos and hazards from spreading in subway stations and also save passengers' travel time.
(ii) For predefined events, our framework indicates detailed rules along the spatial and temporal dimensions, so that subway managers can obtain prior information and make sufficient preparations to cope with possible anomalies. As shown in Figure 8(b), the exit flow of Olympic Sports Center Station surged in the two hours before the beginning of one game, and the entry flow surged in the two hours after the end of this game. These anomaly rules can help to estimate the impact of anomalous flow related to major urban events and then take mitigation strategies in advance.

5. Conclusion

In this paper, an improved RPCA is proposed to detect station-level anomalies in the subway, and the ST-DBSCAN algorithm is used to group the detected station-level anomalies into clusters, named anomaly events. This framework can not only precisely locate anomalies in the temporal dimension but also find their distribution and spreading in the temporal and spatial dimensions. With the detection results and the impact analysis of events, subway managers can estimate the traffic flow impact of predicted events and then take corresponding measures. Besides, they can push timely announcements for unpredicted events by decomposing the real-time data.

In the future, we shall improve our work in three aspects. First, we shall extend our model to anomaly prediction as well as the anomaly propagation process. Second, we shall consider the temporal-spatial distribution by extracting comprehensive temporal and spatial information, e.g., OD flow data. Third, we shall develop a more efficient ADMM algorithm for solving the proposed model and provide a convergence analysis of the algorithm.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by Beijing Municipal Science & Technology Commission (Grant nos. Z151100002115040, Z171100000517003, and Z171100000517004), Project of Beijing Municipal Education Commission (Grant nos. KM201510005025, KM201610005033), and Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (Grant no. IDHT20150504).