Abstract

The development of automatic fare collection (AFC) systems provides significant support for predicting passenger flow on urban rail transit. This paper extracts passenger travel patterns from one month of AFC data on urban rail transit in Chengdu, China. Passengers are divided into two categories based on their travel habits and data mining models, and multinomial logit (MNL) models are used separately to predict their destinations. Furthermore, a two-way search algorithm is developed to find the optimal paths between origin-destination (OD) pairs under interchange constraints: the path search starts from both the origin point and the destination point and continues until the shortest path is found. The effectiveness of a path is measured by travel time, interchange time, and the number of interchanges between the OD pair. Finally, the validity of the proposed passenger flow path prediction method is verified using the AFC data of Chengdu metropolitan rail transit from April 2018.

1. Introduction

By March 2022, 49 cities in mainland China had constructed urban rail lines, totaling 8,837 km, making China one of the fastest-growing countries in terms of urban rail transportation. As people’s living standards continue to improve, higher demands are being placed on the safety, efficiency, and service levels of urban rail systems. The AFC system’s ridership data provide information for passenger flow-related analysis and station/line status assessment. The accuracy of model predictions can be verified by simulating and testing the established station passenger flow or passenger flow OD prediction model using historical AFC data. For example, Guo et al. [1] and Tang et al. [2] predicted station passenger flows and validated them using historical AFC data; Yang et al. [3], Cao et al. [4], and Yao et al. [5] built OD matrix prediction models and compared the prediction results with real data to verify their validity. However, people’s travel patterns are heterogeneous [6] and may change over time. The models need to be updated with new indicators or calibration parameters when the operation of the urban transportation system changes (e.g., changes in urban land use types or the introduction of new routes); otherwise, using the previous model may lead to relatively inaccurate predictions.

To obtain relatively accurate prediction results of urban rail traffic, an increasing number of studies have used data mining on AFC data to extract information, such as station cross-sectional passenger flow and inbound and outbound passenger flow [7], or to obtain travel preferences of established cardholders to build prediction models with stronger generalization capability. Figure 1 shows the timeline of a search in the Web of Science core database with the search formula “(rail or metro or subway or underground) and (forecast or predict) and passenger and (AFC or OD)” (97 search results as of March 31, 2023); the search results were imported into CiteSpace for visualization. In both panels of Figure 1, time increases in years along the timeline from left to right, and the rows represent the clusters, ordered by decreasing size from top to bottom. The clustering results of Figure 1(a) and Figure 1(b) are consistent; the difference is that Figure 1(a) labels the literature with keywords, whereas Figure 1(b) labels it with titles. As the two panels show, most research addresses short-term passenger flow forecasting, followed by work that considers forecasting methods in terms of spatiotemporal correlation, while clusters #2 and #3 both forecast OD passenger flows and mostly use deep learning to provide algorithms for mining AFC data (more spanning lines between #2 and #3 imply more common literature in both categories). Moreover, in the last two years, passenger flow forecasting, especially OD passenger flow forecasting, has been studied further, which indicates that there are urgent open problems in this area. Therefore, current research mainly tends to employ data mining algorithms for OD passenger flow prediction in urban rail transit, considering spatiotemporal correlation factors [1, 2, 8–13].

Urban rail transit OD passenger flow forecasting can be divided into two steps: D-point forecasting (also called OD matrix forecasting) and inter-OD flow allocation (also called inter-OD path selection). On the one hand, based on the passenger flow characteristics, passenger flow distribution patterns [14], and passenger travel preferences [7], combined with the urban rail topology network [15], it is possible to build a generalized model to measure the OD matrix of urban rail passenger flow for the prediction of passenger points of interest (POI) [16]. For example, the improved LSTM algorithm [15, 17, 18] is a widely used method for predicting the OD matrix, and there are also nonlinear models [3], HW-DMD [19], etc. On the other hand, in the transportation domain, a trip is generally described by an OD pair, and there are usually many paths between each OD pair that can be chosen by the traveler. Initially, people may choose the path that costs the least amount of time, money, etc., i.e., the “shortest path.” However, because of the combination of different factors such as the passenger’s travel purpose, travel time [14], and the attractiveness of the destination station [20–22], passengers tend to choose the path with the least cost in a broad sense, which is called the path with the greatest effectiveness in transportation science. As the path with the greatest effectiveness is continuously chosen, an increasing number of people will use it, resulting in increased congestion and time costs, and the effectiveness values of the shortest path and the second shortest path will gradually converge, to the point that the original shortest path may no longer be the shortest. Therefore, path selection probability prediction and OD demand prediction are studied as branches of research on inter-OD traffic assignment. For example, some studies have predicted the paths chosen by groups by constructing probabilistic models of path selection [23] or by matching travel time clustering to OD routes [24]; others have predicted the OD demand by constructing improved LSTM models [4, 25] or improved CNN models [26, 27], or by addressing emergency [28] or COVID-19 periods [29].

In terms of data mining depth, current research is mainly divided into the extraction of overall indicators from AFC data, such as the direct extraction of inbound and outbound passenger flow and time-of-day passenger flow from AFC data [9, 30], and the mining of travel habits of specific passengers from AFC data together with the use of set counting models to conduct research at the category level [4, 7, 22, 30, 31]. However, the current models cannot match the efficient response needs of real-time systems due to the sheer volume of their parameter systems, and the information obtained from real-time AFC data is likely to be incomplete due to data transmission lag, so it cannot be mined for historical travel preferences to obtain prediction results. The multinomial logit (MNL) model [32] is based on each passenger’s choice and simulates the process by which passengers decide among travel options. Once passengers’ travel habits have formed, the predicted choice will be closer to the actual situation because their travel perceptions will not change extensively in a short period, which makes the model more suitable for forecasting passenger flow destinations at stations in urban rail transit. Therefore, the logit model [33] and its improved forms [34–37] are more interpretable and provide a more significant representation of passenger travel preferences and behaviors than the deep learning algorithm-based prediction approaches described above.

In this study, we combine data mining techniques with a logit model to predict passenger behavior for different passenger types. Using massive historical automatic fare collection (AFC) data, we analyze the travel patterns of two distinct passenger groups: specific cardholders and those without prior travel data. In addition, we introduce area attractiveness to predict the origin–destination (OD) matrix and identify effective travel routes. Our proposed method can be used for real-time passenger flow prediction in an online environment.

The paper is structured as follows. Section 2 provides a comprehensive description of the database used in the study. In Section 3, we construct the road network passenger flow OD dynamic estimation and passenger flow path assignment model. In Section 4, we demonstrate the numerical analysis approach to predict passenger flow paths and related issues. Finally, we summarize our research findings and propose future research directions in Section 5.

2. AFC Data of Network Passengers

China’s rail transit system has basically implemented the automatic collection of passenger entry and exit information for AFC systems. We take an AFC dataset of Chengdu Metro Line 2 in China as an example to describe the structure of the AFC data, as shown in Table 1.

In the current situation of urban rail transit operation, AFC data usually have problems such as missing key information and abnormal data, resulting in poor data integrity and accuracy. To improve the accuracy of data mining, the “dirty data” in the historical AFC data should also be filtered, such as ticket card data lacking key information, data with duplicate records, data with identical OD points, illogical entry and exit times, data with numerous rides in a short period, or data with long travel times that do not conform to normal travel patterns.
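The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the preprocessing actually used in the study: the field names (`card_id`, `origin`, `destination`, `entry_time`, `exit_time`) and the three-hour cap on trip duration are hypothetical, and the rule on numerous rides within a short period is omitted for brevity.

```python
from datetime import datetime, timedelta

# Hypothetical record layout; the actual AFC schema (Table 1) may differ.
REQUIRED = ("card_id", "origin", "destination", "entry_time", "exit_time")
MAX_TRIP = timedelta(hours=3)  # assumed upper bound for a normal trip


def clean_afc(records):
    """Return the records that pass the basic 'dirty data' checks."""
    seen = set()
    kept = []
    for rec in records:
        # 1. Key information must be present.
        if any(rec.get(field) is None for field in REQUIRED):
            continue
        # 2. Drop exact duplicates of an earlier record.
        key = tuple(rec[field] for field in REQUIRED)
        if key in seen:
            continue
        seen.add(key)
        # 3. Origin and destination must differ.
        if rec["origin"] == rec["destination"]:
            continue
        # 4. Exit must follow entry, within a plausible duration.
        duration = rec["exit_time"] - rec["entry_time"]
        if duration <= timedelta(0) or duration > MAX_TRIP:
            continue
        kept.append(rec)
    return kept


t0 = datetime(2018, 4, 9, 8, 20)
sample = [
    {"card_id": "A", "origin": "Gaoxin", "destination": "Chunxi Road",
     "entry_time": t0, "exit_time": t0 + timedelta(minutes=25)},
    {"card_id": "A", "origin": "Gaoxin", "destination": "Chunxi Road",
     "entry_time": t0, "exit_time": t0 + timedelta(minutes=25)},  # duplicate
    {"card_id": "B", "origin": "Gaoxin", "destination": "Gaoxin",
     "entry_time": t0, "exit_time": t0 + timedelta(minutes=5)},   # same OD
    {"card_id": "C", "origin": "Gaoxin", "destination": None,
     "entry_time": t0, "exit_time": t0 + timedelta(minutes=5)},   # missing
    {"card_id": "D", "origin": "Xipu", "destination": "Gaoxin",
     "entry_time": t0, "exit_time": t0 - timedelta(minutes=5)},   # illogical
]
cleaned = clean_afc(sample)
```

Only the first sample record survives the checks; the other four each trip exactly one rule.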

2.1. Site Type

To conduct an OD point analysis and identify rail stations with more intensive commuter traffic, it is essential to classify the stations. However, subdividing each of the 156 rail transit stations in Chengdu into multiple factors would require significant human resources, time, and effort. As an alternative approach, we classified Chengdu rail transit stations into seven distinct types based on the distribution of incoming and outgoing traffic over time. These classifications include: residential-concentrated, office-concentrated, residence-dominated residential-office, office-dominated office-residential, commercial-concentrated, hub, and other types, which can be found in Table 2.

From the above classification, it can be seen that the size of the inbound and outbound passenger flow of a station has a certain relationship with the attractiveness of the area around the station [22]. For example, the purpose of passenger trips in residential stations is mainly commuting to and from work, commuting to and from school, and shopping trips, while the purpose of passenger trips in office stations is mainly commuting to and from work. Therefore, different trip purposes also lead to different spatial and temporal distributions of OD between different sites.

3. Methodology

In this section, we construct an AFC-data-based passenger flow path prediction model for urban rail transit. As shown in Figure 2, the model consists of two parts: dynamic estimation of network passenger flow OD and passenger flow path assignment based on AFC data. The dynamic OD estimation model divides passenger travel data into two categories, passengers who have formed travel habits and passengers who have not, and performs D-point prediction on the urban rail transit network according to the category. The passenger flow path assignment model determines the effective path set between OD pairs with a two-way search algorithm and uses travel time, number of interchanges, and transfer time as the influencing factors, combined with the AFC data, to determine the final prediction results between the OD pairs.

3.1. Dynamic Estimation of Passenger OD Flow

To estimate the real-time passenger flow, a pattern analysis of passengers is needed to quickly find the chosen outbound station (point D) for all passengers entering the station at point O. However, in the AFC data of the urban rail passenger flow, the following four situations may occur, resulting in the unavailability of outbound station results:
(1) The amount of ridership data for a passenger is too small to reach the baseline value and is judged insufficient to form a travel habit.
(2) The base value is too high, resulting in the inability to filter out suitable outbound stations.
(3) The same entry information cannot be found in the history data of a passenger.
(4) The number of predicted D points in the output is greater than one.

We divide all AFC data into two categories: passengers who have formed travel habits and passengers who have not formed travel habits. The first three cases are grouped into the second group, and a group-level data mining strategy is carried out. For the fourth case, we narrow down the historical data matches by using count period segmentation to filter the similarity data in the time dimension.

The notations of the variables used in this section are given in Table 3.

3.1.1. Data Mining Algorithm

Among all historical records, it is clearly unreasonable to judge and classify a passenger’s travel habits from a single card swipe. To determine the baseline value, we use the data of Chengdu metropolitan rail transit in April as sample data. Out of the total data, 5 million samples were taken, and all rides with the “Tianfutong Stored Value Ticket” (a long-term card whose holders depend heavily on rail travel) were selected and grouped by card number. The number of trips made with the same ID card number is used as the base number; i.e., if the base number is set to 1, all trip records of card numbers with more than 1 trip within the data are screened, and the amount of data conforming to the base number is output and designated the “Regular Number.” In addition, the data were regressed by the name of the outbound station and compared to the last day of April to calculate the accuracy rate; the results are shown in Table 4.

It can be seen from Table 4 that as the baseline value increases, the number of rides that conform to the regular number gradually decreases, but the accuracy rate gradually increases. At baseline values of three and four, the number of samples does not decrease significantly, but the accuracy rate increases substantially. To balance the accuracy rate against the baseline value, we take four as the baseline. In addition, the threshold for the travel habit determination value ε is set empirically to 35% based on the actual situation of the Chengdu subway system. The idea of the mining algorithm is as follows:
Step 1: Obtain the passenger entry information uploaded from the real-time AFC data and match the ID card number in the historical ride record database. If there is no matching result, handle the record with the D-point prediction process for passengers who have not formed travel habits (the method stated in Section 3.1.2).
Step 2: Count the number of rides corresponding to the ID card number and judge whether it is lower than the baseline value. If it is, handle the data with the method stated in Section 3.1.2. If it is greater than the baseline value, execute Step 3.
Step 3: Filter the historical records by the passenger’s inbound station in the current AFC data, and output all outbound stations $X_1, \ldots, X_i, \ldots, X_n$ in the specified counting period (such as month, day, etc.). Obtain $N_1, \ldots, N_i, \ldots, N_n$ by counting the number of times the passenger exits at each station, and calculate the travel habit determination value of each station, i.e., the share of the passenger’s exits at that station, $N_i / \sum_{j=1}^{n} N_j$. Determine whether this value is greater than 35%. If it is, enter Step 4; if it is less than 35%, the passenger is judged to have no travel habit, and the data of this ID card number are handled with the method stated in Section 3.1.2.
Step 4: Count the stations whose determination value is greater than 35%. If there are multiple such stations, refine the counting period, then find the historical outbound stations and return to Step 3. If there is only one, execute Step 5.
Step 5: Output the outbound station corresponding to that determination value as the predicted outbound station for the passenger.
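The five steps can be sketched as a single function. This is a simplified illustration rather than the implementation used in the study: the history layout `(origin, destination, entry_hour)` is hypothetical, the counting period is refined only once (to the entry hour), and treating a ride count equal to the baseline as insufficient is an assumption, since the steps leave that case open.

```python
from collections import Counter

BASELINE = 4       # baseline number of historical rides (Section 3.1.1)
THRESHOLD = 0.35   # empirical travel-habit threshold (35%)


def dominant_stations(trips):
    """Exit stations whose share of the given trips exceeds THRESHOLD."""
    counts = Counter(dest for _, dest, _ in trips)
    total = sum(counts.values())
    if total == 0:
        return []
    return [dest for dest, n in counts.items() if n / total > THRESHOLD]


def predict_destination(history, origin, entry_hour):
    """Predict the exit station for one card entering at `origin`.

    `history` is a list of (origin, destination, entry_hour) tuples.
    Returns None when the passenger should instead be handled by the
    group-level method for passengers without travel habits.
    """
    # Steps 1-2: no history, or too few rides to form a habit.
    if len(history) <= BASELINE:
        return None
    # Step 3: keep trips from the current entry station and compute shares.
    trips = [t for t in history if t[0] == origin]
    candidates = dominant_stations(trips)
    # Step 4: several candidates, so refine the counting period once.
    if len(candidates) > 1:
        trips = [t for t in trips if t[2] == entry_hour]
        candidates = dominant_stations(trips)
    # Step 5: exactly one dominant station is the prediction.
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical history: two stations each receive 50% of the exits overall,
# but morning and evening trips favor different stations.
history = (
    [("Gaoxin", "Gaopeng Avenue", 8)] * 8
    + [("Gaoxin", "Gaopeng Avenue", 18)] * 2
    + [("Gaoxin", "Hongxing Bridge", 8)] * 2
    + [("Gaoxin", "Hongxing Bridge", 18)] * 8
)
morning = predict_destination(history, "Gaoxin", 8)
evening = predict_destination(history, "Gaoxin", 18)
```

Returning `None` rather than a fallback station keeps the hand-off to the group-level method explicit.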

3.1.2. Spatiotemporal MNL Prediction Method Based on Unformed Travel Habits

According to the classification of passengers’ travel habits, the travel data that do not reach above the baseline value indicate that the passengers corresponding to such data have not yet been explored for travel habits and cannot be pinpointed. For passengers who have not yet formed travel habits, we treat them as a group and perform a group probability distribution study because we cannot analyze the historical preference data of individual passengers.

The probability function of the MNL model describes the probability that an alternative in the choice set (in our study, the choice set consists of the outbound stations that a passenger may choose) will be chosen.

In general, the greater the attractiveness of the area, the shorter the travel time, and the greater the volume of passengers, the greater the probability of passengers exiting the station. Therefore, we define the effectiveness function as a linear function based on the MNL model with the following mathematical expression:

All in the above equation are fixed terms, so that in the probability function. Therefore, the spatiotemporal MNL model is constructed jointly with equations (5) and (6) to predict the outbound station (D) corresponding to a given inbound station (O).
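As an illustration of how a spatiotemporal MNL of this kind produces destination probabilities, the sketch below combines the standard MNL probability, exp(V_i) / Σ_j exp(V_j), with a linear effectiveness function of area attractiveness, travel time, and station scale. The coefficient values, the station labels `D1` to `D3`, and their feature values are hypothetical placeholders, not the calibrated parameters or data of this study.

```python
import math


def utility(attractiveness, travel_time_min, scale_grade, theta):
    """Linear effectiveness: weighted sum of the three factors."""
    w_attr, w_time, w_scale = theta
    return (w_attr * attractiveness
            + w_time * travel_time_min
            + w_scale * scale_grade)


def mnl_probabilities(utilities):
    """Standard MNL choice probabilities for a dict {alternative: V}."""
    v_max = max(utilities.values())  # shift by the max for numerical stability
    exp_v = {k: math.exp(v - v_max) for k, v in utilities.items()}
    total = sum(exp_v.values())
    return {k: e / total for k, e in exp_v.items()}


# Hypothetical coefficients: attractiveness and scale raise the utility,
# travel time lowers it.
THETA = (0.8, -0.05, 0.3)

# Hypothetical candidate exit stations for one entry station:
# (attractiveness indicator 0/1, mean travel time in minutes, scale grade).
candidates = {
    "D1": (1, 20.0, 8),
    "D2": (0, 45.0, 3),
    "D3": (1, 25.0, 5),
}
utilities = {k: utility(*features, THETA) for k, features in candidates.items()}
probabilities = mnl_probabilities(utilities)
predicted = max(probabilities, key=probabilities.get)
```

Subtracting the largest utility before exponentiating leaves the probabilities unchanged and avoids overflow for large utility values.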

Using the April 2018 Chengdu city rail transit data as sample data for regression analysis, it is possible to determine the parameter estimates generated by each factor affecting the effectiveness function on the choice of passenger D points, thus calibrating the effectiveness function equation (6) parameters of the D point prediction logit model.

(1) Travel Time Clustering. The travel time from Gaoxin station to each station is taken as the case, and because individual samples vary, the average travel time is used as the value of the travel time variable. Stations with small outbound passenger flow are filtered out, and the resulting travel time estimates from Gaoxin station to each station are shown in Figure 3.

(2) Quantification of Regional Attractiveness. To facilitate the analysis, we select the most representative station of each type, count its interstation traffic in April, and compare it with the average inter-OD traffic of all stations. If the actual traffic is greater than the average value, the result is recorded as 1; otherwise, it is recorded as 0. The results of the inter-OD traffic calculation for the six representative stations are shown in Table 5.

For the statistical classification of the OD passenger flow between different types of sites, the OD volume between each pair of site types is compared with the average OD volume over all sites (12,297 passengers); if the actual OD volume is greater than the average value, it is recorded as 1; otherwise, it is recorded as 0. The results are shown in Table 6.

(3) Quantification of the Scale Variable. In the AFC data of Chengdu city rail transit used in the study, the average value of inbound and outbound station traffic for all stations was calculated as 884,565, and the maximum inbound and outbound station traffic was 6,165,033 at Chunxi Road. According to the grade progression, the grade increases by one for each million increase in traffic after grade 5; before grade 5, the grade increases by one for each 170,000 increase in traffic. Thus, the total inbound and outbound passenger flow at the major stations of the Chengdu Metro from April 1 to 30 is shown in Figure 4.

Based on the maximum likelihood method for estimation [38], the values of each parameter of the calibrated effectiveness function equation (6) are taken, and the results are , , and .

3.2. Passenger Flow Path Assignment Based on the AFC Data

In this section, the effective path topology model is first established and solved to obtain the effective path set, and then the path selection model is used to calculate the selection probabilities of different paths to realize the refined passenger flow allocation.

The symbols of the models covered in this section and their interpretations are shown in Table 7.

3.2.1. Effective Path Topology Model and Solving Algorithm

Considering that the algorithm needs to conform to the actual travel habits of passengers travelling normally, the following assumptions are made:
(1) Each station and each interval section of the urban railway can be passed only once.
(2) The number of interchanges for passengers using urban rail transit is limited, i.e., the number of interchange stations in the effective path is limited. According to experience, the number of interchanges from the origin station to the destination station is generally not more than three.
(3) Passengers who use urban rail transit will generally not transfer back into a line after transferring out of it. That is, the paths in the effective route are continuous on each rail line.
(4) If the passenger’s O and D points are on the same line, the passenger will travel only on that line, i.e., the valid path also lies on that line (assumptions (3) and (4) are complementary to each other).

In turn, the rail network is transformed into a directed connectivity graph G = <, E, T> to describe the rail network model by hierarchical sequencing of the network, where is the set of stations, E is the set of intervals, and T is the set of interchange virtual intervals [38, 39].

Given any OD points, let the set of ordered intervals contained in the valid path be , where the actual interval ordered set is , and the virtual interval ordered set is . Then, the valid path should satisfy the following conditions:where and denote the lines belonging to the virtual intervals connecting the OD stations and , respectively. Equation (12) represents the two real intervals and adjacent to each other in the ordered set . The of the former is the same as the of the latter to ensure the continuity of the effective paths on the same line. Similarly, equations (13) and (14) ensure the continuity between the actual and virtual intervals of the effective paths during the interchange process. Equation (15) represents any two different virtual commutation intervals and , in the effective path, with of the former being different from of the latter, so that the effective path satisfies the basic assumption (3). Equation (16), limiting the number of interchanges satisfies the basic assumption (2).
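The basic assumptions of this subsection can be expressed as a validity check on a candidate path. The sketch below is illustrative only: it assumes a hypothetical representation in which a path is a list of `(line, stations)` ride segments and consecutive segments meet at an interchange station. Assumption (4) is not checked because it requires knowledge of the whole network.

```python
MAX_INTERCHANGES = 3


def is_valid_path(segments):
    """Check a candidate path against assumptions (1) to (3).

    `segments` is a list of (line, stations) pairs in travel order, where
    `stations` is the ordered tuple of stations ridden on that line.
    """
    if not segments:
        return False
    lines = [line for line, _ in segments]
    # Assumption (2): at most three interchanges.
    if len(lines) - 1 > MAX_INTERCHANGES:
        return False
    # Assumption (3): a line that has been left is never re-entered.
    if len(set(lines)) != len(lines):
        return False
    visited = []
    for i, (_, stations) in enumerate(segments):
        if i > 0:
            # Continuity: the next segment starts where the previous ended.
            if stations[0] != segments[i - 1][1][-1]:
                return False
            stations = stations[1:]  # count the interchange station once
        visited.extend(stations)
    # Assumption (1): no station is passed twice; this also rules out
    # passing the same interval twice.
    return len(set(visited)) == len(visited)


ok = [("L1", ("A", "B", "C")), ("L2", ("C", "D"))]
reentry = [("L1", ("A", "B")), ("L2", ("B", "C")), ("L1", ("C", "D"))]
loop = [("L1", ("A", "B", "C")), ("L2", ("C", "B"))]
too_many = [("L1", ("A", "B")), ("L2", ("B", "C")), ("L3", ("C", "D")),
            ("L4", ("D", "E")), ("L5", ("E", "F"))]
results = [is_valid_path(p) for p in (ok, reentry, loop, too_many)]
```

Only the first candidate passes; the others violate assumption (3), assumption (1), and assumption (2), respectively.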

To reduce the complexity of the algorithm and solve the above effective path topology model, we store the road network information in the station number in advance, omit the step of introducing the adjacency matrix, reasonably use the feature that the number of interchanges does not exceed 3 times, and adopt the “two-way search algorithm” with both O and D points as the starting points as the effective path set solving algorithm. The steps of the “two-way search algorithm” are as follows:
Step 1: Initialize the effective path set , original station O, destination station D, ( is the total number of stations), , , , , and .
Step 2: Determine the adjacent interchange stations , , , and according to the original station O and the destination station D. If one end is the end station or is itself an interchange station, only one adjacent interchange station needs to be determined. Based on the line where the two adjacent stations are located, determine the line where the station is located for comparison. Then, determine whether the OD points are on the same line (based on our line number); if yes, then go to Step 5; if not, then go to Step 3.
Step 3: Cross-determine whether adjacent interchange stations are on the same line, discriminate up to four groups in total: (, ), (, ), (, ), and (, ), and denote their order by . Initialize . If on same line, a valid path is found. For example, if (, ) is discriminated on the same line, then a valid path expressed by interchange can be determined, and the path is stored in the set of valid paths . Meanwhile, . If not on a line, let . When , go to Step 4.
Step 4: Since the algorithm specifies that the maximum number of interchanges is three, when adjacent interchange stations are not on the same line with each other, to determine a valid path, one must find a station that satisfies the following requirements: the station is simultaneously on the same line with one of the adjacent stations at point O and on the same line with one of the adjacent stations at station D. Therefore, search for each of the four groups of adjacent stations, reinitialize , starting from (, ): search for line where station is located, search for line where station is located, and then search for stations that belong to both and , i.e., interchange stations of the two lines. If exists, a valid path represented by interchange stations can be determined, and the path is stored in the set of valid paths while ; if it does not exist, let . When , go to Step 5.
Step 5: Initialize , starting from the first path in the set of valid paths and determine each station and interval passed along the way. First, determine the specific route from the original point O to the adjacent interchange stations in the valid path, which shall be marked by two stations, determine the up and down direction, and retrieve the stations between the two stations together with the two stations deposited in the station set . Second, retrieve the stations between the next two interchange stations by the same method until the end point is reached, and store the stations in the station set . Third, by the order of stations in the station set , according to the line interval set , retrieve the square and conforming interval numbers between two stations in turn and deposit them in the interval set . Let , and when , go to Step 6. Otherwise, repeat the above steps.
Step 6: There are still many invalid paths obtained by the above algorithm because they do not satisfy the assumption (1) that a station or interval can be passed only once. Therefore, initialize again and determine whether there are duplicate items in the set of stations and the set of intervals. If there is, delete the th path in the set of valid paths (where ). Let ; when , go to Step 7. Otherwise, repeat the above steps.
Step 7: Output the final set of valid paths , and the algorithm ends.
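The idea of searching from both ends can be illustrated with a simplified, line-level variant. The sketch below is not the station-level procedure of Steps 1 to 7: it assumes a hypothetical network given as a dict from line name to ordered station list, expands at most one interchange from the O side and one from the D side, and joins the two halves where they meet, which yields line sequences with at most three interchanges. Expanding each sequence into stations and intervals and removing paths with repeated stations (Steps 5 and 6) is left out.

```python
from itertools import product


def lines_through(network, station):
    """Lines that serve the given station."""
    return [line for line, stops in network.items() if station in stops]


def connected(network, a, b):
    """True if two different lines share at least one interchange station."""
    return a != b and bool(set(network[a]) & set(network[b]))


def expand(network, start_lines):
    """Line sequences of length 1 or 2 that begin on one of `start_lines`."""
    sequences = [(line,) for line in start_lines]
    for line in start_lines:
        for other in network:
            if connected(network, line, other):
                sequences.append((line, other))
    return sequences


def two_way_search(network, origin, destination):
    """Candidate line sequences from O to D with at most 3 interchanges."""
    o_lines = lines_through(network, origin)
    d_lines = lines_through(network, destination)
    # If O and D share a line, the path stays on that line (assumption 4).
    common = sorted(set(o_lines) & set(d_lines))
    if common:
        return [(line,) for line in common]
    forward = expand(network, o_lines)
    # Sequences grown from D are reversed into travel order.
    backward = [seq[::-1] for seq in expand(network, d_lines)]
    found = set()
    for f, b in product(forward, backward):
        if f[-1] == b[0]:
            merged = f + b[1:]      # both halves meet on the same line
        elif connected(network, f[-1], b[0]):
            merged = f + b          # one more interchange joins the halves
        else:
            continue
        if len(set(merged)) == len(merged):  # never re-enter a line
            found.add(merged)
    return sorted(found, key=lambda seq: (len(seq), seq))


# Hypothetical three-line network with interchanges at B, F, and D.
network = {
    "L1": ["A", "B", "C", "D"],
    "L2": ["E", "B", "F"],
    "L3": ["G", "F", "H", "D"],
}
paths = two_way_search(network, "A", "H")
same_line = two_way_search(network, "A", "C")
```

Because each side expands by at most one interchange, the number of joined candidates stays small even on larger networks, which is the practical benefit of starting from both O and D.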

3.2.2. Path Selection Model and Path Effectiveness Function

The passenger flow assignment problem is also commonly described as a path matching problem, where the probability of a passenger choosing a particular path reflects the degree of matching between the passenger flow and the path and can be expressed as the percentage of passengers choosing this path among all passengers. On the other hand, in transportation field research, the effectiveness function generally refers to the broad cost of a certain transportation mode or a certain path, which represents the functional relationship between the travel impedance perceived by the traveler and the travel influencing factors. Therefore, the path selection model is constructed with the basic logit formula as follows:

Clearly, the selection probability has the following properties:

The probability of path selection is related to the distribution of the random error term and the path cost. To reduce the irrationality of network traffic distribution, the relative cost can be used to calculate the selection probability. Assuming that the random error terms are independent of each other and obey the Gumbel distribution, the path selection probability can be expressed in the following logit form by substituting the above equation into equation (5) in Section 3.1.2:
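A minimal sketch of a relative-cost logit is given below for illustration. It assumes one common normalization, dividing each path cost by the minimum cost in the path set, and a hypothetical sensitivity parameter `theta`; the exact form and the calibrated values used in this study may differ.

```python
import math


def path_choice_probabilities(costs, theta=3.0):
    """Logit probabilities based on cost relative to the cheapest path."""
    c_min = min(costs)
    weights = [math.exp(-theta * c / c_min) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical generalized costs (in seconds) of three paths of one OD pair.
probs = path_choice_probabilities([1800.0, 1950.0, 2400.0])
# Doubling every cost leaves the relative costs, and hence the
# probabilities, unchanged.
probs_scaled = path_choice_probabilities([3600.0, 3900.0, 4800.0])
```

The example shows the motivation for relative costs: the choice probabilities depend on the ratio between path costs, not on the absolute length of the trip.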

Based on the travel characteristics of urban rail transit, the main influencing factors considered by passengers in the process of path perception and selection are travel time, ride time, and the number of transfers. Since the current urban rail transit control system basically achieves a certain control accuracy and can ensure that the trains run according to the interval running map and train schedule, the ride time is regarded as a fixed constant that can be obtained from the train running map or train schedule. Therefore, the fixed term of random effectiveness in the path effectiveness function can be measured by three indicators: travel time, transfer time, and the number of transfers.

(1) Calculation of Travel Time. We estimate the passenger flow distribution of multiple paths by determining the single-path passenger flow distribution. We used the travel times and passenger counts from 8:00 am to 10:00 am on April 10 to April 12 for trips from the Chengdu subway station “Xipu” to “Chunxi Road” as the data sample and determined the travel time distribution function. The parameters of the distribution function were defined. The length of the statistical interval was set at 30 s, and the results are shown in Figure 5, using the “98th percentile” theory to eliminate the extreme values at both ends.

Using hypothesis testing and the maximum likelihood estimation method, we determine that the travel time of the single path OD obeys a log-normal distribution within the interval and has a parameter value . Thus, the mathematical expectation of the travel time from Xipu to Chunxi Road is s.
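For a log-normal distribution, the maximum likelihood estimates are the mean and the population standard deviation of the logarithms of the samples, and the expectation is exp(μ + σ²/2). The sketch below illustrates this calculation; the travel times are hypothetical values in seconds, not the Xipu to Chunxi Road sample.

```python
import math
import statistics


def fit_lognormal(samples):
    """Maximum likelihood estimates (mu, sigma) of a log-normal fit."""
    logs = [math.log(x) for x in samples]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs, mu)  # MLE uses the population form
    return mu, sigma


def lognormal_mean(mu, sigma):
    """Expectation of a log-normal distribution: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


# Hypothetical single-path travel times in seconds.
travel_times = [1500, 1620, 1680, 1740, 1800, 1950]
mu, sigma = fit_lognormal(travel_times)
expected_travel_time = lognormal_mean(mu, sigma)
```

The expectation is always at least exp(μ), the median of the fitted distribution, because the log-normal distribution is right-skewed.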

The travel time probability distribution of multipath OD is the accumulation of different parameters of the normal distribution. Taking the OD from Xipu to South Railway Station as an example, there are two valid paths between this OD point pair. We establish a system of quadratic equations by using the data between the extreme value points of the frequency of Figure 6 (the red bar graph in the figure) as the data for the calculation of the system of equations, solving for the parameters of the normal distribution of the two paths, and solving for . Therefore, the travel time expectation of path 1 is obtained: s; the travel time expectation of path 2: s.

(2) Calculation of Interchange Time. Since the moment at which a passenger arrives at the platform is completely random, it is assumed that passenger arrivals follow a uniform distribution over the interval [0, ]. Thus, the mathematical expectation of the passenger transfer waiting time at the platform is .

Then, the interchange time calculation formula can be expressed as follows:
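The waiting-time reasoning above can be illustrated as follows. The sketch rests on two assumptions that should be checked against the original formula: the upper bound of the uniform interval is taken to be the train headway on the line being transferred to, and the interchange time is taken to be the walking time between platforms plus the expected platform wait. The function and parameter names are hypothetical.

```python
def expected_wait(headway_s):
    """Mean of a uniform distribution on [0, headway]: half the headway."""
    return headway_s / 2.0


def interchange_time(walk_time_s, headway_s):
    """Assumed form: walking time plus expected waiting time, in seconds."""
    return walk_time_s + expected_wait(headway_s)


# Hypothetical interchange: a 120 s walk and a 300 s headway.
example = interchange_time(walk_time_s=120, headway_s=300)
```

With these illustrative inputs, the expected wait is 150 s, so the interchange time is 270 s.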

(3) Calculation of the Number of Interchanges. The number of interchanges can be determined directly from the calculation results of the effective path search algorithm, and the algorithm is not described here. Note that if the path contains a virtual interchange arc, the interchange station is not counted in the number of interchanges.

In summary, the effectiveness function of path for the broad cost, measured in terms of travel time, transfer time, and number of transfers, is as follows:

Equations (19) and (21) together form the route selection model where is the interchange cost and and are parameters to be determined. Since the negative perception of passengers increases exponentially with each increase in the number of interchanges, is an exponential parameter.

When there are multiple paths between an OD pair, it is necessary to study the selection behavior of passengers based on the elements of the path set. When passengers choose travel paths, they usually do not consider all paths in the network but choose from a subset of them. Although we search the effective path set by a two-way search algorithm, the path set still contains too many paths, whereas in actual passenger selection, one to three paths are usually the limit. To find a subset of the valid path set, a stretch factor is attached to all paths. Then, the subset of valid paths satisfies the following conditions [39, 40]:
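One way to apply a stretch factor is sketched below, under the assumption that a path stays in the choice subset when its cost does not exceed the minimum cost multiplied by the stretch factor, with at most three paths retained. Both the multiplicative form and the default value 1.5 are illustrative assumptions rather than the calibrated condition of this study.

```python
def choice_subset(path_costs, stretch=1.5, max_paths=3):
    """Paths whose cost is within `stretch` times the minimum cost.

    `path_costs` maps a path identifier to its generalized cost. The
    result is ordered from cheapest to most expensive and truncated to
    `max_paths` entries.
    """
    c_min = min(path_costs.values())
    kept = sorted(
        (cost, path) for path, cost in path_costs.items()
        if cost <= stretch * c_min
    )
    return [path for _, path in kept[:max_paths]]


# Hypothetical generalized costs (in seconds) of four candidate paths.
subset = choice_subset({"p1": 1800, "p2": 2000, "p3": 3000, "p4": 2500})
```

Here the cutoff is 1.5 × 1800 = 2700 s, so `p3` is excluded while the other three paths remain.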

Therefore, after substituting the effectiveness function into the path selection function, the path selection model has four pending parameters: , , , and . Using Chengdu City’s data for calibration, we obtain  = 1.2720,  = 1.8623,  = 0.25, and  = 1.840.

4. Results and Discussion

Since our model is intended to estimate the distribution of commuter traffic within the rail network during the commuter peak period, the data used in this section should come from a station with high commuter traffic and cover ten minutes of inbound passenger flow during the peak. Therefore, we chose all inbound card-swipe records from Gaoxin Station during 8:20 am–8:30 am on April 9, 2018, as the simulated real-time AFC upload data. In addition, to make regularities in the data easier to observe, we selected the first four types of stations with a high number of outgoing stations in Table 2.

4.1. Outbound Station Prediction for Type I Passengers Based on Historical Travel Habits

During 8:20 am–8:30 am on April 9, 2018, there were 99 card-swipe entry records at Gaoxin Station. Distinguishing the records by card ID number shows that 99 passengers entered the station. After removing the unrecorded card data and filtering out the passengers with a travel factor greater than 4, 67 records remain. Due to space limitations, we cannot present all ridership records here, so we chose two passengers as representatives to compare the results.

Passenger A and passenger B have historical AFC records, as shown in Table 8. Passenger A made 69 trips in a month, including 29 trips at Gaoxin Station; Passenger B made 49 trips in April, including 25 trips at Gaoxin Station.

Calculate the travel habit determination value for the two passengers at their respective outbound stations in April, as shown in Table 8. For Passenger A, the only station with a value > 35% is People’s Park, so Passenger A is predicted to leave the station at People’s Park. For Passenger B, two stations, Gaopeng Avenue and Hongxing Bridge, have values > 35%, so the historical AFC data must be subdivided by time: all of this card’s historical records with entry times from 8:00 to 9:00 a.m. are filtered, as shown in Table 8, and Passenger B’s travel habit determination value is calculated again. The only station that then exceeds 35% is Gaopeng Avenue, so Passenger B’s predicted outbound station is Gaopeng Avenue.
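The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the travel habit determination value is the share of a card's historical trips from the entry station that end at each exit station, and the function names, the trip tuple layout, and the station labels "X", "Y", "Z" are hypothetical. The trip counts in the usage lines are made-up illustrative numbers and are not the values in Table 8.

```python
from collections import Counter

HABIT_THRESHOLD = 0.35  # the 35% cut-off quoted in the text


def exit_shares(trips):
    """Share of trips ending at each exit station (assumed habit value)."""
    counts = Counter(exit_station for _, exit_station in trips)
    total = sum(counts.values())
    return {station: n / total for station, n in counts.items()}


def predict_exit(trips, window=(8, 9)):
    """Predict one card's exit station from its history at one entry station.

    trips: list of (entry_hour, exit_station) tuples (hypothetical layout).
    Returns the predicted station, or None when no single station stands out.
    """
    candidates = [s for s, p in exit_shares(trips).items() if p > HABIT_THRESHOLD]
    if len(candidates) == 1:
        return candidates[0]
    if len(candidates) > 1:
        # Several stations pass the threshold: keep only trips whose entry
        # hour falls inside the window and recompute the shares.
        start, end = window
        narrowed = [t for t in trips if start <= t[0] < end]
        if narrowed:
            again = [s for s, p in exit_shares(narrowed).items()
                     if p > HABIT_THRESHOLD]
            if len(again) == 1:
                return again[0]
    return None


# Hypothetical histories (illustrative counts only).
history_a = [(8, "X")] * 20 + [(18, "Y")] * 5 + [(12, "Z")] * 4
history_b = [(8, "X")] * 10 + [(18, "Y")] * 10 + [(8, "Y")] * 2 + [(12, "Z")] * 3
print(predict_exit(history_a))
print(predict_exit(history_b))
```

In the second history, two stations pass the threshold on the full record, and only the morning window separates them, which mirrors the Passenger B case. Returning `None` when the rule is inconclusive is a design choice of this sketch; such passengers could instead be handed to the logit-based prediction.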

The real ridership records of Passenger A and Passenger B on April 9, 2018, are shown in Table 9. From the exit information in Table 9, the predicted outbound station of each passenger is the same as the actual outbound station.

Calculating the predicted outbound stations for the 67 records above, only 4 predicted stations do not match the stations actually selected by passengers, giving a prediction accuracy rate of λ = 63/67 = 94.03%. The mining algorithm is therefore accurate and reliable in predicting outbound station selection for commuter flow.

4.2. Outbound Station Prediction for Type II Passengers Based on a Spatiotemporal MNL Model

Passengers who have not yet formed a travel habit are mainly passengers with one-way tickets and passengers with Tianfutong stored value tickets, Tianfutong cash cards, or Tianfutong regular CPU cards who have fewer than four total trips in the historical AFC data.

Substituting the parameter values into the effectiveness function (6) of the spatiotemporal MNL model gives:

Figure 3 gives the travel time from Gaoxin Station to each station, and equation (26) shows that the effectiveness function is negatively related to passenger travel time: the larger the travel time, the smaller the probability that passengers choose the station. Stations with travel time ≥ 30 min are therefore screened out first, because the probability of passengers choosing them is greatly reduced. In addition, when the travel time between two stations is too short, passengers become much more likely to choose other travel modes, including bus, walking, or bike-sharing, so stations whose travel time is too short are also filtered out.
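The screening step and the logit choice probability can be illustrated with a short Python sketch. It uses the standard multinomial logit form, P_j = exp(V_j) / Σ_k exp(V_k), and takes the utilities as inputs rather than reproducing the paper's effectiveness function. The function names, the station labels, the lower travel-time bound of 5 min, and the linear utility with coefficient -0.1 are all hypothetical choices made only for this example; the 30 min upper bound is the one stated in the text.

```python
import math


def screen_stations(travel_time_min, lower_min, upper_min=30.0):
    """Keep stations whose travel time lies in [lower_min, upper_min).

    upper_min follows the 30 min rule in the text; lower_min must be
    supplied by the caller (its value here is an assumption).
    """
    return [s for s, t in travel_time_min.items() if lower_min <= t < upper_min]


def mnl_probabilities(utility):
    """Standard multinomial logit: P_j = exp(V_j) / sum_k exp(V_k)."""
    v_max = max(utility.values())  # subtract the max for numerical stability
    weights = {s: math.exp(v - v_max) for s, v in utility.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Hypothetical travel times in minutes from the entry station.
travel = {"P": 3.0, "Q": 12.0, "R": 20.0, "S": 35.0}
kept = screen_stations(travel, lower_min=5.0)
# Hypothetical utility that only decreases with travel time.
utility = {s: -0.1 * travel[s] for s in kept}
print(mnl_probabilities(utility))
```

With these illustrative inputs, stations "P" and "S" are screened out and the nearer of the two remaining stations receives the larger probability, which is the qualitative behavior described above.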

Meanwhile, referring to Tables 2 and 6, Gaoxin Station is an office-concentrated station, so we calculate the selection probability and effectiveness function value of each outbound station corresponding to Gaoxin Station, as shown in Figure 7:

From Figure 7, we can see that with Gaoxin Station as the inbound station, the vast majority of passengers will choose stations with larger probability values and effectiveness function values as the outbound station. In this example, the stations with the largest numbers of outbound passengers should be Chunxi Road, Chengdu East Passenger Station, North Train Station, Third Tianfu Street, and Provincial Stadium.

Extracting the real card entry information for Gaoxin Station on the corresponding date, screening out the stations with fewer than 90 exiting passengers, and sorting the outbound stations according to the probability distribution gives Figure 8. Figure 8 shows that Chunxi Road, Chengdu East Passenger Station, and North Train Station have the highest numbers of exits, while Third Tianfu Street and Provincial Stadium rank slightly differently from the prediction. However, the difference in traffic between Third Tianfu Street and Provincial Stadium is small, so overall, the forecast results are broadly in line with expectations.

From the statistics, we can see that after a certain period of time, Chunxi Road, Chengdu East Passenger Station, and North Train Station will experience a small peak of passenger flow. Staff at these stations can plan passenger routes in advance and divert passenger flow at the right time to help passengers disperse quickly and avoid congestion.

4.3. Passenger Final Route Prediction

Most of the OD points in the preceding example are on the same urban rail line, and the distance is relatively short, which is not enough to illustrate the problem of multiple path selection. For the path prediction example in this section, we therefore select the “Chadianzi” station of Line 7 as the O point and the “Yinghui Road” station of Line 7 as the D point, as shown by the black inverted triangles in Figure 9.

Figure 9 shows the Chengdu subway network after our numbering process, where the line numbers follow the operating line numbers except for Line 7, and the black circles represent interchange stations. The numbering is discontinuous because this example focuses on the lines within the loop of Line 7 while omitting Line 10 and the branch of Line 1 from Sihe to Wugensong, which have no line crossings. In Figure 9, Line 7 is a loop and therefore offers several valid paths from point O to point D, both direct and detour. One of the bypass paths violates the above valid path assumption (4), so we break the line containing arcs or loops into several branches at appropriate places for passenger path distribution. In Figure 9, the interrupted stations are numbered 1, 3, 5, and 7, corresponding to the operating stations of Yipin World, Yima Bridge, Chengdu East Passenger Station, and Taiping Park, forming lines 7A, 7B, 7C, and 7D, respectively.

The following is the search process for the set of valid paths based on our proposed “two-way search algorithm.”

Step 1: Determine the adjacent interchanges or terminal stations at points O and D, respectively. Based on the subordinate relationship between lines and stations, the adjacent interchange stations at the origin point O can be determined as ① and ②, and the adjacent interchange stations at the destination point D as ④ and ⑤.

Step 2: Determine the lines on which each adjacent interchange is located. Interchange ① is subordinate to Line 2, Line 7A, and Line 7D; Interchange ② is subordinate to Line 1 and Line 7A; Interchange ④ is subordinate to Line 4 and Line 7B; Interchange ⑤ is subordinate to Line 2, Line 7B, and Line 7C.

Step 3: Cross-check whether each pair of adjacent interchange stations is on the same line. Stations ① and ④, stations ② and ④, and stations ② and ⑤ are not on the same line, so a further search for interchange stations is needed. Both stations ① and ⑤ are located on Line 2, so the route O → ① → ⑤ → D can be determined.

Step 4: Search for a valid path containing three interchanges. The search process is described in Table 10.

Step 5: According to Steps 3 and 4, remove the paths containing duplicate segments to obtain the final set of valid paths and complete the search.

The final set of valid paths contains 7 entries. Paths 3, 5, and 7 contain virtual interchange arcs (7D, 7B, etc.), thus reducing the number of interchanges compared to the representation of paths. Substituting the above metrics into the route selection model gives the broad cost values in Table 11.
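The cross-check in Step 3 can be reproduced with a small Python sketch. The line memberships and the adjacent interchange stations are copied from Steps 1 and 2 above; the function name `direct_links` and the dictionary layout are hypothetical, and the sketch covers only Step 3, not the full search in Steps 4 and 5.

```python
from itertools import product

# Lines serving each interchange station, as listed in Step 2.
lines_at = {
    "1": {"Line 2", "Line 7A", "Line 7D"},
    "2": {"Line 1", "Line 7A"},
    "4": {"Line 4", "Line 7B"},
    "5": {"Line 2", "Line 7B", "Line 7C"},
}
adjacent_to_origin = ["1", "2"]       # interchanges next to O (Step 1)
adjacent_to_destination = ["4", "5"]  # interchanges next to D (Step 1)


def direct_links(origin_side, destination_side, lines_at):
    """Split interchange pairs into those sharing a line and those that do not."""
    found, unresolved = [], []
    for a, b in product(origin_side, destination_side):
        shared = lines_at[a] & lines_at[b]
        if shared:
            found.append((a, b, sorted(shared)))
        else:
            unresolved.append((a, b))  # needs a further interchange search
    return found, unresolved


found, unresolved = direct_links(adjacent_to_origin, adjacent_to_destination, lines_at)
print(found)       # pairs that give a route O -> a -> b -> D
print(unresolved)  # pairs passed on to the next search step
```

The only pair that shares a line is ① and ⑤ (Line 2), matching the route identified in Step 3; the three remaining pairs are the ones that require the further search of Step 4.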

Since the effective path set found by the above algorithm is still relatively large and urban rail passengers often consider only 1∼3 paths, the set is reduced further with the stretch factor once the travel time, transfer time, and number of transfers of each path have been calculated. When the broad cost of a path exceeds the threshold that the stretch factor sets relative to the minimum broad cost, the path will not be considered by the traveler and should be removed from the set of valid paths. When the stretch factor takes the calibrated value of 0.25, the path with the smallest broad cost is valid path 5, and all other paths are eliminated. After substituting into the path selection model, the passenger flow matching probability of path O → ② → ③ → ④ → D is 100%, i.e., according to our algorithm, all passengers travelling between Chadianzi Station and Yinghui Road Station will take Line 7 directly as the only path.
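The pruning and assignment logic can be sketched as follows. Two assumptions are made and should be read as such: a path is kept when its broad cost is at most (1 + stretch factor) times the minimum broad cost, and the remaining paths share the flow through a plain logit split on cost. Neither form is taken from the original equations, and the function names, the `scale` parameter, and the cost values are hypothetical.

```python
import math


def prune_by_stretch(costs, stretch):
    """Keep paths whose cost is at most (1 + stretch) * minimum cost (assumed form)."""
    limit = (1.0 + stretch) * min(costs.values())
    return {p: c for p, c in costs.items() if c <= limit}


def logit_shares(costs, scale=1.0):
    """Logit split over the given paths; a lower cost gives a larger share."""
    c_min = min(costs.values())  # shift by the minimum for numerical stability
    weights = {p: math.exp(-scale * (c - c_min)) for p, c in costs.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


# Hypothetical broad costs, not the values in Table 11.
costs = {"path A": 100.0, "path B": 130.0, "path C": 180.0}
kept = prune_by_stretch(costs, stretch=0.25)
print(logit_shares(kept))
```

With a stretch factor of 0.25 and these illustrative costs, only the cheapest path survives and receives the whole flow, which is the same qualitative outcome as in the Line 7 example. When two paths have close costs, both survive the pruning and each receives a share below 1.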

Further analysis reveals that path 5 contains three virtual transfer arcs (line 7A and switching line 7B) and has the shortest travel time, zero transfer time, and zero transfers, so it matches travelers’ actual path preference. In other OD cases, where multiple effective paths have similar broad costs, our proposed method yields several selected paths, each with an allocation ratio of less than 1. In summary, our passenger path assignment algorithm largely proves to be accurate and effective.

5. Conclusions

Our research aims to predict urban rail traffic, specifically the destination stations and travel routes that commuters will choose. We focused on commuter traffic as our research object because it accounts for a high proportion of ridership and has strong travel regularity. We used passenger entry information from rail transit stations with a high proportion of commuter traffic, and our contributions are outlined below.

First, we divided passenger flow into two categories based on the formation of travel habits and performed OD prediction using a combination of data mining and logit modeling. As passenger flow can be unstable, we split the flow into passengers who have formed travel habits and those who have not. For the first group, we used a mining algorithm based on historical travel habits to predict travel destinations from historical AFC data. For the second group, we mainly used a modified MNL model to predict the most likely outbound station a passenger will choose when entering a station, considering spatiotemporal influences such as travel time, regional attractiveness, and OD size.

Second, determining a passenger’s chosen path between two points of an OD pair is a key step that requires efficient algorithms to find complete and effective paths. To do this, we assumed an upper limit on the number of interchanges when passengers choose a route and assumed that an effective route exists. Using a “two-way search algorithm,” we searched adjacent interchange stations and line interchange stations from the origin and destination of the OD pair at the same time, making full use of interchange stations to implement network topology modeling. This approach allowed us to quickly search for a complete and effective route, which we verified through experiments.

Last, our algorithm exhibits good generality and can be applied to rail transportation networks in different cities.

The forecasting model that we developed is a service for urban rail operators and for passengers who travel by urban rail. It aims to provide a holistic forecast of commuter flow by predicting travel stations and by tracking and analyzing the travel destination of each passenger. In addition, it provides detour information that helps traffic participants avoid congested stations, and it informs decision-makers about current and next-period passenger flow conditions so that they can respond to unexpected situations. While our research has made important contributions, some problems remain unsolved. For example, in the final example analysis, we used a dummy variable to mark the influence factor of regional attractiveness. In future studies, a distribution function could be introduced to quantify regional attractiveness.

Data Availability

The data used to support the findings of this study are included in the article. Should further data or information be required, these are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Science Foundation of Liaoning Province, China, under Grant No. 2023-MS-273.