Abstract

Secondary crashes (SCs) are typically defined as crashes that occur within the spatiotemporal boundaries of the impact area of primary crashes (PCs); they intensify traffic congestion and induce a series of road safety issues. Predicting and analyzing the time and distance gaps between SCs and PCs helps to prevent the occurrence of SCs. In this paper, a data-driven method combining static and dynamic approaches is applied to identify SCs. Then, the random forests (RF) method is implemented to predict the two gaps using temporal, primary crash, roadway, and real-time traffic characteristics data collected from 2016 to 2019 on California interstate freeways. Subsequently, the SHapley Additive exPlanations (SHAP) approach is employed to interpret the RF outputs. The results show that traffic volume, speed, lighting, and population are the most significant factors for both gaps. Furthermore, the main and interaction effects of the factors are quantified. High volume tends to increase the time and distance gaps, while low volume reduces them, and volume has little effect on the distance gap when it falls between 300 and 400 veh/5 min. Traffic conditions with high speed and low volume are strongly associated with short time and distance gaps. Darker surroundings probably accelerate the occurrence of SCs. Moreover, crashes involving the violation categories of improper turns or unsafe lane changes likely result in long time and distance gaps. These results have important implications for traffic management and road safety improvement.

1. Introduction

Road traffic crashes pose a threat to normal traffic operations and safety and can cause property damage or even serious injuries. According to the World Health Organization [1], approximately 1.3 million people die each year as a result of road traffic crashes. Between 20 and 50 million more people suffer nonfatal injuries, with many incurring a disability. Furthermore, road traffic crashes cost most countries 3% of their gross domestic product [1]. Secondary crashes (SCs), which happen in the spatiotemporal impact area of primary crashes (PCs), commonly result in an additional impact on traffic and extra personal injury [2, 3]. According to [4], SCs can account for 20% of all crashes and 18% of all fatalities on freeways in the United States. In this context, SC prevention has become a major consideration in the traffic safety field.

In the past decades, a large body of literature has been devoted to the identification of SCs and to modeling the risk of SC occurrence [5–13]. Various statistical and machine learning (ML) methods were applied to explore these two aspects of SCs [9–12]. However, the time gap (i.e., the time difference) and distance gap (i.e., the spatial separation) between an SC and the corresponding PC have received less attention, which might hinder a better understanding of the possible time and location of SCs. Among the few methods applied to study these two gaps, statistical approaches were subject to the possibility of predicting infinitely large gaps [14, 15], while ML methods failed to provide satisfactory prediction performance on the distance gap [16]. Moreover, black-box models need further explanation before the effects of contributing factors can be discussed in detail [16]. Therefore, more promising methods and data experiments are required.

To better capture the characteristics of SCs, we first developed a hybrid method (i.e., combining static spatiotemporal threshold-based and speed contour map-based methods) to identify SCs and obtain the time and distance gaps. Subsequently, random forests (RFs), which offer high prediction performance and diversity, were used to predict the time and distance gaps. An interpretation technique, namely the SHapley Additive exPlanations (SHAP) approach, was then applied to examine the model outputs and estimate the global and local effects of the influencing factors. Understanding time and distance gaps and their influencing factors can inform management strategies for transportation agencies and improve traffic operations and road safety.

2. Literature Review

2.1. Secondary Crash Identification

Overall, two types of methods, static and dynamic, have been widely used to identify SCs. Static methods identify SCs by setting fixed spatiotemporal thresholds, which means crashes are identified as SCs if they fall within the spatiotemporal thresholds of another crash. [17] first introduced this method and defined the thresholds as one mile upstream of a PC and 15 minutes after clearance time. Following this study, further research on static methods has been carried out [5–7, 18]. For example, some studies proposed a spatial threshold of 2 miles and time thresholds of 2, 1, and 2 hours, respectively, to identify California secondary crashes [19–21]. SCs can be selected quickly and effectively from massive crash records according to spatiotemporal thresholds [2, 16]. However, static methods suffer from subjective judgment: the thresholds may be overestimated or underestimated [2, 22]. As an improvement, [7] introduced three sets of spatiotemporal thresholds to identify SCs on Florida interstates. The spatial thresholds for all three sets were 2 miles, and the time thresholds were 2 h, 15 minutes, and 30 minutes after the PCs’ clearance time. Their results confirmed that the identification ratio of SCs varied across the sets.

With the support of various sensor technologies, dynamic methods are becoming increasingly popular because they reduce the misclassification of SCs [22]. There are three main dynamic methods: (a) queuing theory-based methods [23, 24]; (b) shockwave-based approaches [25, 26]; and (c) speed contour map-based methods [11, 13, 18]. In practical application, due to the data quality and quantity requirements of methods (a) and (b), the models are often simplified and rely on assumptions, failing to reflect actual conditions in the real world. By contrast, the speed contour map-based method has performed well without such simplifications or assumptions since it can accurately capture the impact area of PCs [13, 27, 28]. For example, [18] compared the crash state speed with the historical average speed to delineate the impact area. Likewise, [11, 13] applied this method to identify SCs and considered recurrent congestion.

In summary, static methods are easy to implement and quickly obtain identification results, while dynamic methods achieve better performance but consume a lot of computational time. Combining these two methods for SC identification can improve efficiency and accuracy [16, 25]. This paper proposes a two-stage strategy to identify SCs by incorporating the fixed spatiotemporal threshold-based and speed contour map-based methods.

2.2. Secondary Crash Risk Modeling and Predicting

Several statistical and ML models have been applied to explore the relationship between SC occurrences and contributing factors [9–12]. For example, [10] proposed a logit model to predict SC likelihood, and their results revealed that rear-end crashes could increase the SC likelihood. [11] developed a random effects logit model to link the probability of SCs with real-time traffic volume conditions, primary crash characteristics, environmental conditions, and geometric characteristics. Similarly, [29] used the Bayesian complementary log-log model to predict the likelihood of SCs and examine their relationship with several variables.

However, previous studies focused less on the time and distance gaps between SCs and PCs. Several studies have made attempts using regression approaches. For example, [14] selected ordinary least-squares (OLS) regression to model the two gaps separately. Their results showed that time and distance gaps were closely associated with collision type and the duration of the primary crash. Likewise, [15] applied OLS regression to evaluate the relationship between the time and distance gaps and individual crash characteristics. They found that the number of lanes, total vehicles involved in the crash, morning time, and AADT were the most significant factors affecting time and distance gaps. Although most independent variables were highly significant, traditional statistical models usually impose more prior assumptions on input variables, and they are subject to the possibility of predicting excessively large gaps. Moreover, [14, 15] built separate regression models for the time gap and the distance gap, ignoring the potential correlation between the two gaps, which arise from the same event. Therefore, it is necessary to consider an alternative model that investigates both gaps simultaneously.

By contrast, ML methods have gained increasing attention due to their high prediction power and few restrictions on data [30]. Multiple ML methods have been employed in traffic safety studies [8, 13, 16, 29], such as neural network models, genetic algorithms, random forests, and XGBoost. In one of the few studies on the time and distance gaps [16], the authors utilized a linear regression model and two ML algorithms, a back-propagation neural network (BPNN) and the least-squares support vector machine (LSSVM), to build three prediction models. The results indicated that the BPNN and LSSVM models outperformed the linear regression model, but these two ML models also failed to provide adequate performance on distance gap prediction. Among ML models, many other promising approaches exist, such as ensemble algorithms, which combine several base learners to enhance the prediction performance [31–33].

Besides, relatively few studies have focused on SC prevention. As [2] summarized, limited data availability and high costs have constrained relevant investigations, so continued endeavors are still needed. The main objective of this study is to develop a reliable model to predict the time and distance gaps and analyze the associated influencing factors, which can help with proactive prevention and improve safety. In doing so, this study addresses several of the research gaps identified above.

3. Data Preparation

In this study, crash data were collected from the Statewide Integrated Traffic Records System (SWITRS), which records detailed descriptions of crash-related information, such as the unique case identifier, location (state route, postmile, latitude and longitude), collision year and time, collision severity and type, lighting, weather, etc. A total of 24,643 crashes were collected from freeways I-10, I-5, US-101, I-210, and I-110 in Los Angeles County, California, over four years, from June 2016 to December 2019 [34]. Through a detailed examination, we removed redundant attributes and records with missing values from the crash data.

To incorporate real-time traffic data into the analysis of crashes, volume and speed were extracted from the Caltrans Performance Measurement System (PeMS) [35]. In PeMS, data are gathered from a set of loop detectors on the road and transmitted to the management center for storage. The configuration information of each detector, including its location and unique identification number, was also integrated. A two-step matching strategy was devised to obtain traffic volume and average speed for each crash. The first step matches each crash to the nearest upstream detector based on the latitude and longitude of the crashes and the loop detectors. The second step extracts the volume and speed for the 5 minutes before the crash.
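The two-step matching strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Detector` fields, the use of postmile to decide which detectors are upstream, the flow direction (toward increasing postmile), and the layout of `readings` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Detector:
    detector_id: int
    lat: float
    lon: float
    postmile: float  # position along the route (hypothetical field)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_upstream_detector(crash_lat, crash_lon, crash_postmile, detectors):
    """Step 1: among upstream detectors, pick the geographically nearest one."""
    # Assumption: traffic flows toward increasing postmile,
    # so "upstream" means a postmile not larger than the crash postmile.
    upstream = [d for d in detectors if d.postmile <= crash_postmile]
    if not upstream:
        return None
    return min(upstream, key=lambda d: haversine_miles(crash_lat, crash_lon, d.lat, d.lon))


def traffic_before_crash(readings, detector_id, crash_minute):
    """Step 2: return (volume, speed) for the last full 5-min interval before the crash."""
    # readings maps (detector_id, interval_start_minute) -> (volume, speed)
    interval_start = (crash_minute // 5) * 5 - 5
    return readings.get((detector_id, interval_start))


# Toy data, invented for illustration only
detectors = [
    Detector(1, 34.00, -118.30, 10.0),
    Detector(2, 34.02, -118.28, 12.0),
    Detector(3, 34.05, -118.25, 15.0),
]
readings = {(2, 480): (310, 58.2)}

matched = nearest_upstream_detector(34.03, -118.27, 13.0, detectors)
traffic = traffic_before_crash(readings, matched.detector_id, crash_minute=487)
print(matched.detector_id, traffic)
```

In this toy case the crash at minute 487 is matched to the interval starting at minute 480, that is, the last complete 5-minute interval that ended before the crash.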

Referring to previous studies on SCs [14, 16], 17 variables were selected from 4 dimensions. Specifically, temporal characteristics consist of 5 variables, namely, peak, weekend, weather, lighting, and population, which reflect the state of the environment. Population density is related to vehicle trips [36, 37]. Primary crash factors include 8 variables: collision severity, collision type, violation category, party count, etc. These variables describe the information associated with a crash. Road condition and road surface reflect the roadway characteristics, including whether the pavement is a maintenance area or free from abnormal conditions and whether the pavement is dry or wet. Traffic volume (veh/5 min) and speed (mile/h) represent the traffic characteristics. Detailed descriptions and statistical information are given in Table 1. Additionally, Pearson correlation coefficients (PCCs) were applied to examine the multicollinearity between the 17 variables. Figure 1 shows the computed results. As shown, all the absolute values of the PCCs are less than 0.8, indicating a low linear correlation between variables.
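The multicollinearity check can be illustrated with a short sketch that computes pairwise Pearson correlation coefficients and flags any pair whose absolute value reaches 0.8. The column names follow the variables named above, but the values are invented for illustration and the helper names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def high_correlation_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Invented sample values, not taken from the study's dataset
columns = {
    "volume": [120, 250, 310, 400, 480, 520],
    "speed": [66, 64, 70, 52, 61, 45],
    "lighting": [0, 3, 0, 1, 2, 0],
}
flagged = high_correlation_pairs(columns)
print(flagged)  # an empty list means no pair reaches the 0.8 threshold
```

An empty result corresponds to the situation reported in the text, where no pair of variables is strongly linearly correlated.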

4. Methodology

4.1. SC Identification

The identification of SCs is the basis for conducting SC modeling and analysis. The static spatiotemporal threshold-based estimation is the first stage to identify SCs roughly, and it can be defined in the following equation:

$$SC_{A,B} = \begin{cases} 1, & 0 < t_B - t_A \le T \ \text{and} \ |s_B - s_A| \le S, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

where $(s_A, t_A)$ denotes the location and occurrence time of the crash A, $(s_B, t_B)$ denotes the location and occurrence time of another crash B that needs to be examined, $(T, S)$ denotes the defined time threshold and spatial threshold, and the value of 1 means that crash B is identified as a secondary crash corresponding to crash A and 0 otherwise.

The speed contour map-based method estimates the impact area of the PC based on the change in traffic speed, and an SC is identified when it falls within this area. The speed contour map comprises grid cells split by defined time intervals and the mileposts of sensor stations [2]. The impact area can be ascertained by checking the speed of each cell near the crash. In general, it can be written as the following equation:

$$I_c = \begin{cases} 1, & v_c < \bar{v}_c, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where $v_c$ and $\bar{v}_c$ denote the current and the reference speed of one cell; $I_c = 1$ denotes that the cell is affected; and $I_c = 0$ denotes that the cell is not affected. The size of the impact area is determined by the reference speed $\bar{v}_c$. The detailed procedures of the identification method are as follows:
(i) Apply the fixed spatiotemporal thresholds to identify the candidate SCs. Referring to previous studies on SC analysis in California [19–21], 2 miles and 2 hours were selected as the thresholds in this study. The initial identification on 24,643 crashes yielded 563 possible SCs.
(ii) Extract the 5-min speed data to develop a speed contour map for a potential PC. More specifically, given the fixed spatiotemporal thresholds that have been determined, the time window for extracting speed data is between 2 hours before and 2 hours after the PC, and the spatial range is 2 miles upstream and 2 miles downstream of the PC location. To eliminate the effects of recurrent congestion, the historical average speed was calculated by collecting speed data from the PC-free days in a year [13, 18].
(iii) Estimate the impact area of a potential PC using equation (2). The crashes that occur in the impact area of the PC are identified as SCs.
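The two-stage procedure can be sketched in a few lines. Only the thresholds (2 miles and 2 hours) come from the text; the `Crash` fields, the rule that a cell is affected when its current speed is below its reference speed, and the grid layout in `cell_of` (5-min time bins and a hypothetical 0.5-mile station spacing) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Crash:
    crash_id: int
    postmile: float  # location along the route, in miles
    minute: float    # occurrence time, in minutes from a common origin


TIME_THRESHOLD_MIN = 120  # 2 hours
SPACE_THRESHOLD_MI = 2.0  # 2 miles


def is_candidate(primary, other):
    """Stage 1: static spatiotemporal threshold check."""
    dt = other.minute - primary.minute
    ds = abs(other.postmile - primary.postmile)
    return 0 < dt <= TIME_THRESHOLD_MIN and ds <= SPACE_THRESHOLD_MI


def affected_cells(current_speed, reference_speed):
    """Stage 2a: cells whose current speed is below the reference speed."""
    return {cell for cell, v in current_speed.items() if v < reference_speed[cell]}


def is_secondary(primary, other, affected, cell_of):
    """Stage 2b: a candidate is an SC if its grid cell lies in the impact area."""
    return is_candidate(primary, other) and cell_of(other) in affected


def cell_of(crash):
    """Map a crash to a (time bin, station index) cell; the spacing is hypothetical."""
    return (int(crash.minute // 5), int(crash.postmile // 0.5))


# Toy crashes and speeds, invented for illustration
primary = Crash(1, 10.0, 600)
b = Crash(2, 9.2, 625)   # within thresholds, in a slowed-down cell
c = Crash(3, 9.0, 700)   # within thresholds, but traffic has recovered
d = Crash(4, 5.0, 610)   # too far away to be a candidate

current_speed = {(125, 18): 32.0, (140, 18): 63.0}
reference_speed = {(125, 18): 60.0, (140, 18): 61.0}
affected = affected_cells(current_speed, reference_speed)

results = {x.crash_id: is_secondary(primary, x, affected, cell_of) for x in (b, c, d)}
print(results)
```

The static stage is cheap and prunes most crash pairs, so the more expensive speed lookup only runs for the remaining candidates.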

Following the two-stage identification method, 368 SCs are identified in this study. The ratio of the number of SCs to the number of all crashes is 1.49%, which is consistent with the findings of the references in this area that this ratio is around 1–1.6% [11–13, 18, 25, 38–40].

4.2. Random Forests

This study used RF, which has been widely used in the transportation field [41–46], to predict the time and distance gaps. RF uses a bootstrap sampling method to vary the training set and build an ensemble of regression trees [47]. Such a mechanism helps the model gain higher performance. Furthermore, RF can perform multiple output modeling [48, 49], which is suitable for simultaneously predicting the time and distance gaps.

The input samples for the RF model are represented as $D = \{(x_i, y_i^{t}, y_i^{d})\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^{M}$. $M$ and $N$ are the number of features and samples, and $y_i^{t}$ and $y_i^{d}$ indicate the time gap and the distance gap of sample $i$, respectively. Figure 2 shows the structural framework of RF, which consists of the following three parts: (1) Sample set selection: resampling the original dataset with replacement $N$ times to generate a sample set. In other words, some samples are likely to be chosen multiple times, while others will not be selected at all. After $k$ rounds of extraction, $k$ new sample sets are obtained. (2) Decision tree generation: training $k$ decision trees using the $k$ sample sets. During each round of generating trees, $m$ variables from the $M$ features are selected for training. The randomness of the training data and variable combinations improves the prediction performance of the model and helps prevent overfitting. (3) Result combination: since the decision trees are generated independently, they contribute equally to the predicted result. Therefore, the final result is obtained by averaging the predicted results of all trees. For multioutput problems, the following changes are required in the decision trees: first, each leaf stores several output values instead of one; second, the splitting criterion calculates the average reduction across all outputs.
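A minimal sketch of multi-output random forest regression with scikit-learn is given below. The data are synthetic stand-ins for the time and distance gaps, and the hyperparameter values are illustrative assumptions, not the tuned values of this study. Note that in scikit-learn `max_features` limits the variables considered at each split rather than per tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5
X = rng.random((n_samples, n_features))

# Two synthetic targets standing in for the time gap and the distance gap
time_gap = 2.0 * X[:, 0] + X[:, 1] + 0.05 * rng.standard_normal(n_samples)
distance_gap = X[:, 0] - 0.5 * X[:, 2] + 0.05 * rng.standard_normal(n_samples)
Y = np.column_stack([time_gap, distance_gap])  # shape (n_samples, 2)

# bootstrap=True draws a resampled training set for each tree;
# max_features=3 restricts the variables examined at each split.
model = RandomForestRegressor(
    n_estimators=100, max_features=3, bootstrap=True, random_state=0
)
model.fit(X, Y)  # a 2-D target makes every tree predict both gaps jointly

pred = model.predict(X[:3])  # shape (3, 2): one column per gap

# The forest output is the plain average of the individual tree outputs
per_tree = np.stack([tree.predict(X[:3]) for tree in model.estimators_])
manual_mean = per_tree.mean(axis=0)
print(pred.shape, np.allclose(pred, manual_mean))
```

Passing a two-column target is what makes the model multi-output: each leaf then stores one value per gap, and splits are scored on the error reduction averaged over both outputs.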

4.3. SHAP Method

ML methods commonly demonstrate outstanding prediction performance, but their usefulness is limited by their low interpretability. Although the RF model can provide global explanations (i.e., the relative importance), it cannot quantify local explanations for individual predictions, and local explanations provide more detailed information than global ones [50, 51]. SHapley Additive exPlanations (SHAP), proposed by [52], is a representative local interpretation method that can explain the main local effects and interaction effects of independent variables on dependent variables. Furthermore, [53] improved SHAP to explain tree-based ML models, such as random forests and gradient boosted trees, better and faster.

The SHAP value is the core of the method. It is computed based on a game-theoretic approach and represents the average marginal contribution of one variable to a single prediction. The SHAP value is defined as the following equation:

$$\phi_i = \frac{1}{M!} \sum_{R \in \mathcal{R}} \left[ f_x\left(P_i^R \cup \{i\}\right) - f_x\left(P_i^R\right) \right], \tag{3}$$

where $\mathcal{R}$ indicates the set of all variable orderings, $P_i^R$ represents the set of all variables that rank before the variable $i$ in the ordering $R$, $M$ is the number of variables, $x$ means the values of explanatory variables, and $f(x)$ refers to the single prediction, which can be written by the following equation:

$$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i, \tag{4}$$

where $\phi_0$ means the base value, i.e., the average value of overall predictions.

Additionally, the global importance of a variable summarizes its contributions to all predictions, and it is calculated by averaging absolute SHAP values as shown in the following equation:

$$I_i = \frac{1}{N} \sum_{k=1}^{N} \left| \phi_i^{(k)} \right|, \tag{5}$$

where $I_i$ represents the importance of variable $i$, $\phi_i^{(k)}$ indicates the SHAP value for variable $i$ in the single prediction $k$, and $N$ is the number of all predictions.
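To make the ordering-based definition concrete, the sketch below computes exact Shapley values for a tiny toy model by enumerating all variable orderings, then averages their absolute values to obtain a global importance. It is not the tree-specific algorithm of [53]. The toy model, the single background point used for "absent" variables, and all function names are assumptions made for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, n_vars):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = [0.0] * n_vars
    for order in permutations(range(n_vars)):
        before = set()  # variables ranked before the current one in this ordering
        for i in order:
            phi[i] += value_fn(before | {i}) - value_fn(before)
            before.add(i)
    return [p / factorial(n_vars) for p in phi]


def make_value_fn(model, x, background):
    """Assumption: variables outside the coalition take a fixed background value."""
    def value_fn(coalition):
        z = [x[i] if i in coalition else background[i] for i in range(len(x))]
        return model(z)
    return value_fn


def toy_model(z):
    """Toy model with one additive term and one interaction term."""
    return 2.0 * z[0] + z[1] * z[2]


background = [0.0, 0.0, 0.0]
base_value = toy_model(background)

x = [1.0, 2.0, 3.0]
phi = shapley_values(make_value_fn(toy_model, x, background), 3)
# Additivity: base value plus all contributions reproduces the prediction
reconstructed = base_value + sum(phi)

# Global importance: mean absolute Shapley value per variable over several samples
samples = [[1.0, 2.0, 3.0], [-1.0, 1.0, 1.0], [0.0, 2.0, -2.0]]
all_phi = [shapley_values(make_value_fn(toy_model, s, background), 3) for s in samples]
importance = [sum(abs(p[i]) for p in all_phi) / len(all_phi) for i in range(3)]
print(phi, reconstructed, importance)
```

Enumerating all orderings costs M! model evaluations per variable, so this form is only practical for a handful of variables; tree-specific algorithms avoid that cost.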

The proposed RF model and SHAP method were mainly implemented in Python (3.8.8) using scikit-learn (0.24.1) and shap (0.40.0). The SHAP package contains three applications: force plot, summary plot and dependence plot. In this study, we apply the summary plot to describe the importance of each variable and the dependence plot to reflect the main effects and the interaction effects of all variables.

5. Results and Discussion

5.1. Results

In this study, grid search with 5-fold cross-validation (i.e., GridSearchCV) was used to determine the core parameters of the RF model. Table 2 reports the optimal values of the parameters. The proposed RF model is compared with two conventional multi-output models: the K-nearest neighbor (KNN) model and the multilayer perceptron regression (MPR) model. All the models were trained and validated on the same dataset to guarantee the reliability of the comparison results. Specifically, the raw samples were split into a training set and a testing set at a ratio of 7 : 3. Two classical regression evaluation measures, namely, mean absolute error (MAE) and mean squared error (MSE), were used to assess model performance. The final evaluation results are presented in Table 3. As shown, the RF model mostly outperformed the other two models on both the training and testing sets in terms of predicting the time and distance gaps.
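The tuning and evaluation workflow can be sketched with scikit-learn as follows. The data are synthetic, and the parameter grid and scoring metric are illustrative assumptions; the actual grid and optimal values are those reported in Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)
X = rng.random((300, 6))
# Two synthetic targets standing in for the time gap and the distance gap
Y = np.column_stack([
    X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(300),
    X[:, 0] * X[:, 2] + 0.05 * rng.standard_normal(300),
])

# 7 : 3 split into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

# Hypothetical grid; the scoring choice is also an assumption
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation on the training set only
    scoring="neg_mean_squared_error",
)
search.fit(X_train, Y_train)

# Evaluate the refitted best model on the held-out testing set, per output
Y_pred = search.best_estimator_.predict(X_test)
mae = mean_absolute_error(Y_test, Y_pred, multioutput="raw_values")
mse = mean_squared_error(Y_test, Y_pred, multioutput="raw_values")
print(search.best_params_, mae, mse)
```

Using `multioutput="raw_values"` returns one error per target, which allows the time gap and the distance gap to be reported separately.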

5.2. Global Importance of Variables

Figure 3 visualizes the global importance of variables on the time gap. In the left part, variables are sorted in descending order according to their global importance, computed by averaging the absolute SHAP values per variable. The left x-axis indicates the mean absolute SHAP value. As shown, lighting is the most dominant variable for the time gap, and its average effect on the predicted value is 0.11, followed closely by volume and speed, which change the predicted value by 0.093 and 0.056, respectively, on average. This suggests that the traffic characteristics significantly affect the time gap. This finding is not surprising: traffic characteristics directly reflect the traffic state, which largely influences the travel surroundings and driver status. As [11] indicated, traffic characteristics could affect the SC likelihood more significantly than geometric characteristics and primary crash characteristics. Subsequently, population has a greater contribution than party count and collision severity, indicating that the temporal characteristic of population impacts the time gap more than the primary crash characteristics. By contrast, the roadway characteristics of road surface and condition have a substantially minor effect on the time gap, with mean absolute SHAP values of less than 0.005.

In the right part, the diagram consists of points representing the samples, and the color reveals the magnitude of each variable (red means a high value, while blue means a low value). The right x-axis indicates the SHAP value, which refers to the effect of a variable on a single model output (i.e., the local effect). This diagram roughly illustrates how the effects vary as each variable changes. Taking lighting as an example, the left side of the vertical axis is covered with red points (indicating dark conditions) and the right side is stacked with blue points (indicating daylight). This demonstrates that night may decrease the time gap, while daytime probably increases it. In addition, high volume (red points) mostly has a positive SHAP value and low volume (blue points) mainly has a negative one, revealing that high volume increases the time gap while low volume reduces it.

Figure 4 presents the global importance of variables on the distance gap. As shown in the left part, volume is the most significant contributor and has an overwhelming effect on the distance gap, changing the predicted value by 0.136 on average. This is plausible because volume directly influences the length of the vehicle queue and, thus, the distance gap between the PC and SC. Lighting, speed, and population also rank at the top of the importance list. Road surface and road condition rank third and second from the bottom, respectively. Generally, the importance ranking of variables differs between the two gaps, but there are overall similarities. Traffic features are always the most important. Crash and temporal characteristics are distributed throughout the importance list, and road traits contribute relatively little to both time and distance gaps. The right part shows that high volume, daylight, high speed, and a dense population have positive SHAP values, possibly increasing the distance gap.

5.3. Local Effects of Variables

In previous studies, the local effects of a particular variable on the predicted outcome are often observed assuming that other variables are constant. The drawback of this approach is that it ignores the fact that changes in a specific variable likely cause variations in other variables. The local dependence plot obtained with the SHAP method can quantify the variables’ effects while avoiding this disadvantage. The main effects were calculated for each variable. In addition, considering the nontrivial effects of traffic characteristics on the time and distance gaps (see Figures 3 and 4 in the previous section), their interaction effects with the rest of the variables were also estimated. In this section, we select variables with strong effects for analysis.

Figure 5 shows the local dependence plots for volume on the time and distance gaps. Specifically, the first two plots reveal the main effects of volume, and the last two reflect the interaction effects between volume and speed. Moreover, the left column is for the time gap, while the right column is for the distance gap. In each plot, every point corresponds to a sample. The x-axis represents the volume value; the left y-axis indicates the SHAP value (i.e., the local effect); the right y-axis and the different colored points in the last two plots describe the speed value. As shown in Figures 5(a) and 5(b), the plots for volume reveal an overall upward trend. When volume is around 100 veh/5 min in the two plots, its local effects are at their most negative level, suggesting that low volume may lead to a sharp decline in the time and distance gaps. One possible explanation is that low volume allows for such long distances between vehicles that drivers tend to relax their vigilance. When faced with a sudden crash, they are likely to react slowly and are unable to stop in time at high speed (as shown in the lower-left corner of Figures 5(c) and 5(d), the corresponding speed is around 65 mph). Another reasonable interpretation is that low volume does not contribute to the formation of long queues, thus creating a short distance gap. As volume grows to 500 veh/5 min, its local effects reach their most positive level, indicating that high volume is likely to rapidly increase the time gap and distance gap. This finding is consistent with existing works [15]. The reason might be that high volume makes the traffic situation stressful, and drivers adopt a cautious driving style under this circumstance. When a PC occurs, drivers in the immediate vicinity upstream will not experience a large shock, so an SC does not occur as quickly. Moreover, high volume can prolong queue length and thus increase the distance gap. When volume is around 500 veh/5 min, its corresponding speed falls in a wide range of 24–76 mph.

Figures 6(a) and 6(b) show the main local effects of speed on the time and distance gaps, respectively. The trends in the two plots are similar in general (down then up), but the inflection points correspond to different speed values. In Figure 6(a), as speed increases from 0 to 50 mph, its local effects on the time gap decline from positive to negative. When speed falls between 50 and 75 mph, its local effects show a steep upward trend. As for Figure 6(b), when speed increases from 0 to 30 mph, its local effects decline from 0.05 to −0.22, indicating that this speed range reduces the distance gap. As the speed continues to increase, the local effects grow to be positive. Moreover, we found that when the speed ranges between 60 and 75 mph (the average volume for this speed range is 281 veh/5 min), the corresponding effects for both time and distance gaps are stable around 0, as observed from Figures 6(c) and 6(d). Such a finding demonstrates that this traffic state has only a minor increasing or reducing effect on both gaps.

Figures 7(a) and 7(b) show the main effects of lighting on the time and distance gaps; the two plots reveal an approximately concave trend. As shown in Figure 7(a), the local effects of daylight and dawn (i.e., lighting = 0 and 1) on the time gap fall in the range of 0–0.20, while streetlights and no streetlights (i.e., lighting = 2 and 3) have the most negative effects. Such variations in local effects indicate that a darker environment will accelerate the occurrence of SCs. This is probably because the driver’s sight distance in dark situations depends on the space illuminated by the streetlights and headlights, leading to a lack of timely and clear perception of the current road condition, which results in insufficient avoidance of PCs and thus reduces the time gap. Figures 7(c) and 7(d) display the interaction effects between lighting and volume. As observed, all points are approximately divided by their color into upper-right and lower-left parts, with most of the pale and dark blue points (i.e., representing daylight and dawn) lying above the horizontal line where the local effect is −0.1 and the red and orange points (i.e., denoting streetlights and no streetlights) lying below it. In other words, a bright environment is associated with a larger volume and positive local effects, while a dark condition is associated with a relatively smaller volume and negative local effects. It makes sense that there are more vehicle trips during the day than at night. Likewise, it is reasonable to consider that high volume likely prolongs queue length and therefore increases the distance gap.

Figures 8(a) and 8(b) present the main effects of violation category on the two gaps. As observed, improper turns (i.e., violation category = 4) have the maximum SHAP value. Specifically, its local effects on the time and distance gaps roughly fall between 0–0.10 and 0–0.15, respectively; such ranges indicate that this violation category increases the time and distance gaps to varying degrees. The reason might be that crashes in which the violation category is improper turns probably block turn lanes (usually on a one-way road), thus affecting the vehicles behind and causing a long queue. It is followed by the violation category of unsafe lane changes (i.e., violation category = 3), which shows positive correlations with both gaps. Likewise, crashes caused by unsafe lane changes likely block multiple lanes and involve several vehicles, thus decreasing the road capacity significantly and extending the queue length. Besides, this type of crash is more visible. That means drivers behind can notice the crash at a distance and drive more carefully, increasing the time and distance gaps. By contrast, the other four violation categories have more negligible local effects. Figures 8(c) and 8(d) show the interaction effects between violation category and speed. We find a strong association between crashes involving alcohol (i.e., violation category = 0) and high speed, because the points in the first vertical column are red. Another interesting finding is that the red points in the fifth vertical column (i.e., violation category = 4) are concentrated at the bottom, illustrating that those crashes, which occurred due to unsafe lane changes at high speeds, reduce the time and distance gaps.

Figures 9(a) and 9(b) present the main effects of collision severity. Fatal crashes, severe injury crashes, and light injury crashes (i.e., collision severity = 0, 1, and 2) increase the time and distance gaps, while only complaint-of-pain crashes (i.e., collision severity = 3) mainly reduce the two gaps. One possible reason is that serious crashes attract more attention, such as rapid rescue and intervention by traffic police, so that SCs do not occur at a close time and distance. Figures 9(c) and 9(d) show the interaction effects between collision severity and speed. As observed, most of the blue points (representing fatal crashes) occur in the speed range of 60–70 mph, suggesting that serious crashes frequently occur at high speeds.

The main and interaction effects of other variables are presented in Figures 10 and 11. As shown, the plots for population reveal a broadly upward trend, varying from negative to positive. A dense population (i.e., population = 3 and 4, indicating that the population is more than 250,000) increases the time gap and distance gap. One possible explanation is that car ownership and travel trips may be relatively high in these densely populated areas, leading to long queuing times and lengths. The local effects of most weekdays (i.e., weekend = 0) and peak periods (i.e., peak = 1) on the distance gap are greater than 0. It makes sense that weekdays and peak periods have many commuter trips, resulting in high volume on the road. The plot for collision type shows a trend on the time gap while a downward trend on the distance gap. The effects of clear days are around 0, while the effects of most cloudy days are less than 0. Such a comparison indicates that cloudy days reduce both gaps, i.e., SCs occur sooner and closer on cloudy days. Drinking (i.e., alcohol involved = 1) mostly has negative local effects, meaning that drinking reduces the time and distance gaps. This is consistent with reality. A wet surface (i.e., road surface = 1) reduces both gaps, which is consistent with existing knowledge [16]. It makes sense that a wet road surface harms the vehicle’s stability, for example through brake failure, thus accelerating the occurrence of SCs.

6. Conclusions

This study aimed to predict the time and distance gaps between SCs and PCs on highways and to analyze comprehensively how the influencing factors contribute to these gaps. First, a data-driven identification method combining the fixed spatiotemporal thresholds-based method and the speed contour map-based method was developed to identify SCs. A total of 368 SCs were identified among 24643 crashes. Then, the RF model was applied to predict the two gaps. The data samples were split into training and testing sets at a ratio of 7:3. The results showed that the RF model performed better than KNN and MPR. Additionally, the SHAP method was applied to explain the outputs of the RF model. Based on this local interpretation method, we revealed the variables' global importance and their main and interaction effects on the time and distance gaps.
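To make the link between local interpretation and global importance concrete, the following minimal sketch computes exact Shapley values for a small toy function and then averages their absolute values over a few samples. It is only an illustration under stated assumptions: `toy_model`, the feature names, the zero baseline, and the sample points are all hypothetical, and the brute-force enumeration stands in for the tree-based SHAP computation applied to the fitted RF model.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`, relative to `baseline`.

    Assumption: a feature outside the coalition is replaced by its
    baseline value. This is one simple way to define a coalition's value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model, NOT the fitted RF: two additive terms plus
# one volume-lighting interaction term.
def toy_model(z):
    volume, speed, lighting = z
    return 2.0 * volume + speed + volume * lighting


baseline = [0.0, 0.0, 0.0]
samples = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0], [-1.0, 0.5, 1.0]]

# Local effects: one attribution vector per sample
local = [shapley_values(toy_model, x, baseline) for x in samples]

# Global importance: mean absolute local effect per feature
global_importance = [
    sum(abs(row[i]) for row in local) / len(local) for i in range(3)
]
```

In this toy setting the attributions of each sample sum to the difference between its prediction and the baseline prediction, and the interaction term is shared equally between the two features involved.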

We found that traffic volume and speed are important contributors to the time and distance gaps; monitoring traffic conditions therefore helps implement timely and effective management to prevent SCs. Several temporal characteristics, such as lighting and population, contribute more to both gaps than primary crash features and road factors. Compared with road factors, the primary crash characteristics of violation category, party count, and collision severity show more significant effects. With these findings about factor priorities, traffic managers and policymakers can develop prevention plans and allocate resources more efficiently.

The local dependence plots quantify the effects of variables. Plots for the continuous variables, i.e., volume and speed, reveal overall trends and several inflection points. For example, the local effects of volume increase monotonically from −0.3 to 0.4 as the volume grows. Such variation indicates that low volume sharply reduces the time and distance gaps, while high volume increases them significantly. Additionally, the local effects on the distance gap are around 0 when volume falls between 300 and 400 veh/5 min, suggesting that the traffic state in this volume range has little effect on the gap. The plot for the main effects of speed on the distance gap shows an obvious inflection point. Such information is valuable for traffic safety managers. The plots for the discrete variables show the local effects and corresponding characteristics of the different categories of each variable. Take lighting as an example: the effects of daylight and dawn are positive, while those of streetlights and no streetlights are mostly negative. That is to say, a darker environment probably accelerates the occurrence of SCs. Where economic conditions allow, it is advantageous to increase the intensity of the lighting. Moreover, crashes involving the violation categories of improper turns or unsafe lane changes possibly lead to long time gaps and large distance gaps.
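A dependence plot of this kind can be summarized numerically by averaging the local effects within bins of the feature value and checking where the mean stays close to zero. The sketch below is a hedged illustration only: the volume and effect values are made-up numbers (not taken from the study's data), and the helper `binned_mean_effect`, the bin edges, and the 0.05 tolerance are assumptions introduced here.

```python
def binned_mean_effect(feature_values, local_effects, bin_edges):
    """Average local effect within each [lo, hi) bin; None for empty bins."""
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for v, e in zip(feature_values, local_effects):
        for k in range(n_bins):
            if bin_edges[k] <= v < bin_edges[k + 1]:
                sums[k] += e
                counts[k] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]


# Illustrative numbers only (hypothetical, not the study's observations)
volume = [120, 180, 250, 310, 360, 390, 450, 520]  # veh/5 min
effect = [-0.30, -0.22, -0.10, -0.02, 0.01, 0.02, 0.20, 0.38]
edges = [100, 200, 300, 400, 500, 600]

means = binned_mean_effect(volume, effect, edges)

# Bins whose mean local effect is within an assumed tolerance of zero
tolerance = 0.05
neutral = [
    (edges[k], edges[k + 1])
    for k, m in enumerate(means)
    if m is not None and abs(m) < tolerance
]
```

With these made-up values, the only bin flagged as neutral is the one from 300 to 400 veh/5 min; the outcome on real data depends on the chosen bin width and tolerance.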

The contributions of this study can be summarized in three aspects. (1) A two-stage SC identification method combining static and dynamic approaches was proposed; its identification results on the test data are consistent with existing works, providing a reliable basis for SC analysis. (2) Random forest was applied to predict the time and distance gaps simultaneously, which facilitated understanding of the relationship between the dependent and independent variables. Seventeen independent variables selected from temporal, primary crash, roadway, and traffic characteristics, together with two dependent variables (time gap and distance gap), were used to train and test the random forest model, which achieved better performance than the other models. (3) The SHAP interpretation technique was used to explain the RF model from both global and local perspectives. Several significant findings were obtained that should help traffic decision makers formulate strategies.

This research also raises issues that need further exploration. First, only 368 crashes were used in model training. Although we applied ML models that are advantageous for handling sparse data, small sample sizes may reduce model performance, and more data will likely be needed to improve it. Second, 17 variables were used, and future work will cover more types of factors. This study focused on temporal characteristics, primary crash factors, roadway conditions, and real-time traffic parameters. Other factors, such as shoulder width and truck proportion, which have shown correlations with the time gap and distance gap of SCs, will be considered in future research. The characteristics of the SCs themselves are also worth discussing; combining PC and SC characteristics to explore the time and distance gaps is a promising direction for future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This research was funded in part by the National Natural Science Foundation of China (grant no. 52172310), Humanities and Social Sciences Foundation of the Ministry of Education (grant no. 21YJCZH147), Innovation-Driven Project of Central South University (grant no. 2020CX041), and the Fundamental Research Funds for the Central Universities of Central South University (grant no. 2022ZZTS0717).