Abstract

Accurate, real-time knowledge of passenger flow on subway platforms supports fine-grained management in the era of networked operations. Passenger flow counting on narrow subway platforms is hindered by large differences in crowd scale and by complex backgrounds. In the proposed passenger flow counting algorithm, a feature-enhanced pyramid structure retains the channel information of deep features and eliminates the aliasing effect caused by fusion, which strengthens the feature representation of the original image and addresses the scale problem. A mixed attention mechanism suppresses background interference by capturing global context relationships and focusing on the target area. On the ShanghaiTech Part_A dataset, the mean absolute error (MAE) and mean square error (MSE) of the proposed algorithm are 2.3% and 1.4% lower, respectively, than those of the best comparison algorithm. On the self-built platform dataset, the MAE and MSE reach 3.1 and 5.7, respectively. The experimental results show that the proposed algorithm improves counting accuracy and can meet the counting requirements of the subway platform scene.

1. Introduction

Crowd counting aims to estimate the number and density distribution of people in images or videos and is used in fields such as crowd behavior analysis and public safety management. The surge in metro passenger flow poses a major challenge to traffic organization and safe operation, for example, the difficulty of organizing transportation during peak periods and the limited operability of emergency management. Real-time access to station passenger flow through crowd counting algorithms can provide data support for organizational management and safety alerts. For example, the departure interval can be optimized according to the real-time passenger flow on the subway platform, and the turn-back station can be determined accurately [1]. The distribution of passenger density on the platform can be conveyed through the Passenger Information System (PIS) and the Public Address System (PA) to guide passenger travel behavior [2] and reduce operational pressure during peak hours. Control strategies [3] such as closing stations and overtaking can also be implemented according to the platform passenger flow, reducing the safety hazards caused by congestion.

Traditional crowd-counting algorithms fall into three categories. Detection-based methods take the whole human body or body parts as the object of detection and count the number of people [4]. Regression-based methods treat the crowd as a whole and complete the counting by establishing a mapping between the extracted features and the number of people, for example with ridge regression [5] or Bayesian regression [6]. Density estimation-based methods count by learning a linear [7] or nonlinear [8] mapping between features and density maps. Traditional methods rely on manual feature extraction, which is less accurate and only applicable to sparse scenes.

At present, convolutional neural networks (CNNs) are widely used in crowd counting due to their excellent feature extraction and learning capabilities. According to the structure of the network model, these methods are generally divided into two categories: single-branch and multibranch structures. The early crowd-counting algorithms are all single-branch structures. Wang et al. [9] applied a CNN to crowd counting for the first time, using regression to produce the count. Due to the limited width and depth of the network, its counting accuracy in dense scenes is insufficient, and it cannot meet the requirements of cross-scene counting. To solve the cross-scene problem, Zhang et al. [10] proposed the cross-scene counting model (Crowd CNN), which fine-tunes the counting model according to the characteristics of the input scene so that it can accomplish cross-scene counting. The different distances between the crowd and the camera lead to different crowd scales in the image. To solve this multiscale problem, various multibranch networks have been proposed. The multicolumn convolutional neural network (MCNN) proposed by Zhang et al. [11] has three branches, which employ convolutional kernels of different sizes to extract features of targets at different scales. Sam et al. [12] proposed a multicolumn selection network (Switch-CNN), in which the input images are first cut into patches, patches with different density levels are fed into the corresponding branches, and each branch counts with its own regression network. The quality of the density map determines the counting accuracy. To obtain high-quality density maps, Sindagi and Patel [13] proposed the contextual pyramid model (CP-CNN), which applies the global and local contextual information extracted from different branches to density map generation. Although multibranch networks achieve better counting results, they suffer from a large number of parameters, training difficulties, and model redundancy. To solve these problems, dilated convolutions [14], deformable convolutions [15], and generative adversarial networks [16] have been introduced into crowd counting to reduce model complexity and improve counting accuracy.

For passenger flow counting in the subway scene, Sheng et al. [17] proposed a counting method with the head and shoulders of passengers as the detection object. This method performs well when passengers are sparse, but the counting accuracy decreases due to severe occlusion during peak hours. Zhang et al. [18] used a multiscale feature extraction module and transposed convolutional upsampling to enhance multiscale features but did not consider the effect of background interference on the counting task. Xiao et al. [19] conducted crowd counting in the target area of the subway based on the background difference method, but this method is mostly aimed at moving objects and is not suitable for platform scenes where passengers are mostly stationary or moving slowly. Hu et al. [20] used a hybrid Gaussian background modeling method to compensate for the deficiencies of background differencing, but the regression-based approach weakens the correlation between the features learned by the network and the number of people, and the accuracy needs to be improved. The double-region learning algorithm proposed by He et al. [21] divides the subway surveillance image into a near region and a far region and adopts different counting strategies for the two subregions to reduce the impact of perspective distortion. However, the method can only divide the image into two fixed regions and does not consider the variability of the scene. The MPCNet proposed by Zhang et al. [22] uses multicolumn dilated convolution to aggregate multiscale context information in crowded scenes, but the inference speed of the multicolumn structure is slow and cannot meet the requirements of real-time detection. Tiny MetroNet proposed by Guo et al. [23] adopts a micro-passenger feature extraction network as the backbone network to achieve a balance between counting accuracy and detection speed. In the MDP algorithm proposed by Liu et al. [24], MetroNext, which is based on a multiscale convolutional attention module, quickly obtains the location information of the train and passengers, and an optical flow algorithm predicts the direction of passenger movement. The combination of the two completes the detection of passengers getting on and off the train. Yang et al. [25] introduced CBAM into YOLOv4 to address inhomogeneous illumination in the station and improve the accuracy and robustness of the network. The MPDNet proposed by Yang et al. [26] uses the pyramid vision transformer to extract features and then uses an adaptive spatial feature fusion algorithm to compensate for the loss of spatial information in feature extraction, achieving higher accuracy while meeting real-time requirements.

Most current research targets outdoor open scenes, which differ considerably from the subway platform scene, and the existing passenger flow counting algorithms for platform scenes still need improvement. On a subway platform, the long, narrow layout produces pronounced differences in passenger scale across different areas of the monitoring image, and small-scale heads far from the camera may be missed. The variety of building facilities in the station leads to a complex background and makes crowd feature extraction difficult. In addition, most existing public datasets consist of images of open scenes, and there is no public dataset suited to subway platform scenes. Based on the above analysis, this paper first constructs a metro platform dataset by capturing images from Lanzhou metro platform surveillance video and then proposes a subway platform passenger flow counting algorithm based on a feature-enhanced pyramid and mixed attention. The pyramid structure fuses the semantic information of deep features with the spatial information of shallow features to address the problem of different crowd scales. A mixed attention module is constructed to aggregate global context information, and the problem of complex background is addressed by paying more attention to the target area.

2. Literature Review

The main difficulties of crowd counting in the platform scene are the large difference in head scale and the complex background of the platform. In this section, two types of networks related to the algorithm in this paper, i.e., multiscale feature fusion network and attention network, are reviewed.

2.1. Multiscale Feature Fusion Network

The different distances between people and the camera lead to inconsistent head scales in the image. The scale problem is one of the common problems in crowd counting, and multiscale feature fusion is an effective means of solving it. In the traditional method, the resolution of the input image is gradually reduced to construct an image pyramid, so that targets of the corresponding scale can be obtained from the image at each level. This method is effective, but extracting features from multiple inputs consumes a large amount of memory and time. The feature pyramid [27] instead takes feature maps from different layers as input and adds lateral connections and upsampling to fuse deep and shallow features, which reduces the computational complexity of the model. The MARNet [28] proposed by Xie et al. improves the feature pyramid structure by introducing dilated convolutions with different dilation rates to enhance multiscale features and obtain richer context information. The STNet [29] proposed by Wang et al. uses a tree structure to hierarchically analyze the head scale, which enriches the scale levels and handles large variations in head scale. SASNet [30] proposed by Song et al. learns the correspondence between scale and feature level and obtains the final density map by weighting the confidence maps of different feature levels. The MZNet [31] proposed by Ma et al. enlarges or reduces the initial features to the corresponding level in each zooming path for aggregation and then propagates and utilizes multilevel context information across multiple zooming paths. MSIANet [32] proposed by Zhang et al. uses four branches with different receptive fields for feature extraction and then lets the features of different branches interact to deal with continuous scale changes.

The above studies use different methods to solve the scale problem and have achieved certain results, but they still have shortcomings such as high model complexity and feature loss. The feature-enhanced pyramid structure proposed in this paper uses a channel conversion module to preserve the channel features as fully as possible and a semantic consistency learning module to simplify the model while eliminating the aliasing effect.

2.2. Attention Network

The main idea of the attention mechanism is to allocate limited information processing resources to the parts of the input that are useful for the task. The variants most widely used in crowd counting algorithms are channel attention, spatial attention, and pixel attention. The FANet [33] proposed by Niu et al. sets the weight of the background area to zero and weights the target area according to the area where the crowd is located and its density to exclude background interference. The MS-SPCANet [34] proposed by Wang et al. assigns different channel weights to different spatial positions of the channel feature map in order to highlight useful information and suppress useless information as much as possible. MGANet [35] proposed by Li et al. uses spatial attention to focus on the human head region to resolve the confusion between foreground and background and uses channel attention to enhance the dependence between features and improve semantic expression. In the coordinate attention module (CA) [36] proposed by Hou et al., the channel attention is decomposed into two one-dimensional feature coding processes, and the features are aggregated along two spatial directions. In this way, long-range dependencies can be captured in one spatial direction, while precise position information can be preserved in the other. In CAFNet [37] proposed by Wang et al., pixel attention and channel attention are used to integrate low-level features into high-level features, and density maps are then generated by combining the features of each layer, which adaptively aggregate local context.

The existing research on attention mechanisms is relatively rich, but it still has limitations. Some studies consider only channel attention or only spatial attention, which is not comprehensive enough, while the studies that consider both ignore the global relationships within feature maps. The mixed attention mechanism proposed in this paper uses the idea of the nonlocal operation to capture the long-distance dependencies of spatial and channel feature maps, making full use of context information to obtain high-quality density maps.

3. Algorithm

The main difficulty in counting passenger flow in the subway platform scene comes from the high density of crowds during peak hours. The camera angle on the platform is low, and the head scale tends to increase from far to near and the scale difference is large, which needs to be taken into consideration in the algorithm design. In addition, since there are many escalators and other building facilities on the platform, the complex background brings difficulties to crowd feature extraction, and the interference brought by the complex background needs to be minimized when designing the algorithm.

Figure 1 shows the network framework of the algorithm in this paper, which consists of a VGG-16 network with the fully connected layers removed, a feature-enhanced pyramid structure, and a mixed attention module. Taking the platform monitoring image as input, the first 13 layers (the convolutional layers) of VGG-16 are used to extract the image features. The original features are sent into the feature-enhanced pyramid structure, and the problems of different crowd scales and missed detection of small targets are addressed by aggregating features of different scales. Then, the fused features are sent to the mixed attention mechanism, which focuses on global information by capturing the long-distance dependencies between any two spatial positions or any two channels, which helps to reduce the effects of background interference and occlusion. Finally, the attention feature map is upsampled to the size of the input image to obtain the predicted density map, and summing (integrating) the density map over all pixels gives the number of passengers in the image.
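The final step of the pipeline, turning a predicted density map into a head count, can be illustrated with a minimal sketch. The function name `count_from_density_map` and the toy values are hypothetical and only show the summation; they are not taken from the paper's implementation.

```python
def count_from_density_map(density_map):
    """Return the estimated head count as the sum of all density values."""
    return sum(sum(row) for row in density_map)


# Hypothetical 3 x 4 density map: two heads, each spread over nearby pixels.
toy_map = [
    [0.0, 0.25, 0.25, 0.0],
    [0.0, 0.25, 0.25, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
print(count_from_density_map(toy_map))  # 2.0
```

Because each annotated head contributes a total mass of one to the density map, the sum over all pixels approximates the number of people regardless of how the mass is spread.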

3.1. Feature-Enhanced Pyramid Structure

Targets of different scales in subway platform surveillance images exhibit a semantic gap after the same proportion of downsampling, which manifests as the loss of small targets after multilayer convolution. The feature pyramid captures targets of different scales by fusing deep and shallow feature maps and reduces the missed detection of small targets. However, the traditional feature pyramid has the following disadvantages [38]. Firstly, the lateral connection uses a 1 × 1 convolution to reduce the number of channels of deep features so that the deep and shallow features can be fused, but this operation causes a large loss of channel information in the deep features. Secondly, a 3 × 3 convolution is used to eliminate the aliasing effect after feature fusion, which introduces redundant calculation. Therefore, this paper proposes an improved feature-enhanced pyramid structure that uses a channel conversion module (CCM) and an externally introduced semantic consistency learning module (SCLM) to solve these two problems. The specific feature-enhanced pyramid structure is shown in Figure 2.

The backbone network extracts features from the bottom up, and the feature maps output by the four convolutional layers Conv2_2, Conv3_3, Conv4_3, and Conv5_3 are taken as input, denoted C2-C5. Each input feature map is sent to the channel conversion module, which converts the channel information that would otherwise be discarded into pixel information; that is, the channel information is retained by expanding the width and height of the feature map. As shown in Figure 3, the channel conversion operation first reshapes the low-resolution feature map H × W × α²C into the high-resolution feature map αH × αW × C by upsampling. Since the backbone network uses 2 times downsampling, α is taken as 2 in the algorithm for the subsequent fusion of adjacent feature maps. At this point, the width and height of the feature map increase by a factor of 2, and the number of channels decreases to 1/4. Because the number of channels in each layer needs to be consistent with the feature map C2, a 1 × 1 convolution is used to enrich the channel information. Finally, a 3 × 3 convolution is used to downsample the feature map to the original size, which aggregates the original channel information at the pixel level. The deep feature map after CCM processing retains rich channel information for the subsequent fusion stages.
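The reshaping step of the channel conversion module can be sketched as follows. This is only an illustration: it assumes the rearrangement follows the common sub-pixel ("pixel shuffle") ordering, which the text does not specify, and it omits the 1 × 1 and 3 × 3 convolutions. The function name `channel_to_pixel` and the channels-first nested-list layout are hypothetical.

```python
def channel_to_pixel(feature, alpha=2):
    """Rearrange an (alpha^2 * C, H, W) nested list into (C, alpha*H, alpha*W).

    Assumed ordering (sub-pixel convention):
    out[c][h*alpha + i][w*alpha + j] = feature[c*alpha^2 + i*alpha + j][h][w]
    """
    in_channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    if in_channels % (alpha * alpha) != 0:
        raise ValueError("channel count must be divisible by alpha^2")
    out_channels = in_channels // (alpha * alpha)
    out = [[[0.0] * (width * alpha) for _ in range(height * alpha)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for i in range(alpha):
            for j in range(alpha):
                src = feature[c * alpha * alpha + i * alpha + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * alpha + i][w * alpha + j] = src[h][w]
    return out


# Four 1 x 2 channels become one 2 x 4 channel (alpha = 2).
x = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
y = channel_to_pixel(x)
# y == [[[1, 3, 2, 4], [5, 7, 6, 8]]]
```

With α = 2 the width and height double while the channel count drops to one quarter, and every input value is kept, which is the property the module relies on to avoid discarding channel information.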

Because the feature distributions are inconsistent, directly fusing the upsampled deep feature maps with the shallow feature maps leads to aliasing effects, and the continuity of features cannot be guaranteed. Therefore, after the CCM and upsampling operations and before fusion, the semantic consistency learning module is used to standardize the distribution of features. As shown in Figure 2, the SCLM consists of a 3 × 3 convolution and two 1 × 1 convolutions, after which the consistency features are output through the activation layer. After CCM and SCLM, the channel information of the original feature map is preserved and the aliasing effect brought by the fusion process is eliminated, and thus the features are enhanced. The fused feature maps P3-P5 are upsampled to the size of P2 and then concatenated along the channel dimension to obtain the feature map F, which preserves more feature information.

3.2. Mixed Attention Mechanism

In the convolution process, the receptive field is limited to a certain range, which leads to differences in the feature representation of pixels of the same category [39] and in turn to a decrease in counting accuracy. The idea of the nonlocal operation [40] is that the response at a given position is computed as a weighted sum of the features at all positions, so that the global contextual information can be fully utilized. Inspired by this, a mixed attention module is built to address the complex background of station monitoring images along two dimensions. The spatial attention mechanism captures global dependencies and suppresses background interference by focusing on target regions with high similarity. The channel attention mechanism weights each channel to highlight the channels useful for the counting task and suppress the useless ones.

Figure 4 shows the specific structure of the mixed attention mechanism, with the left-hand branch being the spatial attention mechanism and the right-hand branch being the channel attention mechanism. The two branches follow the same idea, except that the spatial attention branch performs a 1 × 1 convolution to reduce the dimensionality before reshaping and transposing the feature map. The input feature map of the mixed attention mechanism is $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the channel, height, and width, respectively. After convolution, reshaping, and transposition, the spatial feature maps $S_1$, $S_2$ and the channel feature maps $T_1$, $T_2$ are obtained; then the matrix multiplication operation is performed and normalized by Softmax to obtain the spatial and channel attention maps $M^s$ and $M^c$. The formulas are

$$m^s_{ji} = \frac{\exp\left(S_{1,i} \cdot S_{2,j}\right)}{\sum_{i=1}^{HW} \exp\left(S_{1,i} \cdot S_{2,j}\right)}, \qquad m^c_{ji} = \frac{\exp\left(T_{1,i} \cdot T_{2,j}\right)}{\sum_{i=1}^{C} \exp\left(T_{1,i} \cdot T_{2,j}\right)},$$

where $m^s_{ji}$ represents the spatial weight of the $j$-th spatial position weighted by all positions $i$, $m^c_{ji}$ represents the channel weight of the $j$-th channel weighted by all channels $i$, $S_{1,i}$ and $S_{2,j}$ represent the $i$-th and $j$-th positions of the spatial feature maps $S_1$ and $S_2$, respectively, and $T_{1,i}$ and $T_{2,j}$ represent the $i$-th and $j$-th channels of the channel feature maps $T_1$ and $T_2$, respectively. The output of the spatial and channel attention module is represented as

$$F^s_j = \gamma_s \sum_{i=1}^{HW} m^s_{ji}\, S_{3,i} + F_j, \qquad F^c_j = \gamma_c \sum_{i=1}^{C} m^c_{ji}\, T_{3,i} + F_j,$$

where $F^s$ and $F^c$ denote the spatial and channel attention feature maps, respectively. $S_{3,i}$ and $T_{3,i}$ denote the $i$-th position or channel of the spatial feature map $S_3$ and the channel feature map $T_3$, respectively, and the result of the matrix multiplication is reshaped into $\mathbb{R}^{C \times H \times W}$. The coefficients $\gamma_s$ and $\gamma_c$ are learnable parameters that are initially set to zero and adaptively assign weights to local features through network training. $F_j$ is the $j$-th position or channel of the input feature map. $F^s$ and $F^c$ are fused to obtain a mixed attentional feature map with the same dimensions as $F$.
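The attention computation described above (pairwise similarity, Softmax normalization, weighted aggregation, and a residual connection scaled by a learnable coefficient) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the 1 × 1 convolutions are omitted, so the same vectors serve as query, key, and value, and the names `position_attention` and `gamma` are hypothetical rather than taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def position_attention(features, gamma):
    """Nonlocal attention over N vectors (one per spatial position).

    For each position j, every position i is weighted by the softmax of
    its similarity to j, the weighted features are summed, scaled by
    gamma, and added back to the input (residual connection).
    """
    outputs = []
    for x_j in features:
        weights = softmax([dot(x_i, x_j) for x_i in features])
        aggregated = [
            sum(w * x_i[c] for w, x_i in zip(weights, features))
            for c in range(len(x_j))
        ]
        outputs.append([gamma * a + x for a, x in zip(aggregated, x_j)])
    return outputs


# With gamma = 0 (the stated initial value) the module is an identity map.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
assert position_attention(feats, 0.0) == feats
```

In this simplified form, passing one row per channel instead of one row per position gives the analogous channel branch. Starting the coefficient at zero means the network begins from the unmodified features and only gradually mixes in global context as training proceeds.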

3.3. Loss Function

The loss function is made up of two parts. The Euclidean distance loss function $L_E$ measures the pixel-level difference between the predicted density map and the true density map. The formula is

$$L_E = \frac{1}{2N} \sum_{i=1}^{N} \left\| D(X_i; \theta) - D_i^{GT} \right\|_2^2,$$

where $N$ is the number of images, $X_i$ is the $i$-th input image, and $\theta$ denotes the learnable network parameters. $D(X_i; \theta)$ and $D_i^{GT}$ are the predicted and true density maps for the $i$-th image. The Euclidean distance loss function is based on the premise that pixels are independent of each other and ignores the correlation between them. Averaging over all pixels without attention to structural information leads to blurred density maps and unclear details. To compensate for these shortcomings, the model introduces a structural similarity loss function $L_{SSIM}$, which uses three local statistics (mean, variance, and covariance) to calculate the similarity between the predicted density map and the true density map. The formula is

$$L_{SSIM} = 1 - \frac{1}{M} \sum_{x} \mathrm{SSIM}(x),$$

where $M$ is the number of pixels in the density map and $x$ is the image block corresponding to the same pixels in the predicted and true density maps. $\mathrm{SSIM}$ is the structural similarity index and is calculated as

$$\mathrm{SSIM} = \frac{\left(2 \mu_P \mu_G + c_1\right)\left(2 \sigma_{PG} + c_2\right)}{\left(\mu_P^2 + \mu_G^2 + c_1\right)\left(\sigma_P^2 + \sigma_G^2 + c_2\right)},$$

where $\mu_P$, $\mu_G$, $\sigma_P^2$, and $\sigma_G^2$ denote the mean and variance of the predicted and true density maps, respectively, and $\sigma_{PG}$ denotes the covariance between the predicted and true density maps. $c_1$ and $c_2$ are small constants set to prevent zeros in the denominator. $\mathrm{SSIM} \in [-1, 1]$, and a larger value of $\mathrm{SSIM}$ indicates greater image similarity.

The final loss function is obtained by weighting $L_E$ and $L_{SSIM}$:

$$L = L_E + \beta L_{SSIM},$$

where $\beta$ is the weighting coefficient used to balance the pixel-level loss with the structural loss and is set to 0.001 through experiments.
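The combination of a pixel-level loss and a structural loss can be sketched as below. Several choices here are assumptions for illustration only: density maps are flattened to lists, the Euclidean term uses the common 1/(2N) normalization, the SSIM statistics are computed over the whole map rather than over local blocks, and the constants `c1` and `c2` are hypothetical values. Only the weight 0.001 comes from the text.

```python
def euclidean_loss(pred_maps, true_maps):
    """Squared L2 distance between maps, averaged over images and halved."""
    total = 0.0
    for pred, true in zip(pred_maps, true_maps):
        total += sum((p - t) ** 2 for p, t in zip(pred, true))
    return total / (2 * len(pred_maps))


def ssim(pred, true, c1=1e-4, c2=9e-4):
    """Structural similarity from mean, variance, and covariance.

    Assumption: one global window over the flattened map; c1 and c2 are
    illustrative stabilizing constants.
    """
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(true) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in true) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, true)) / n
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


def total_loss(pred_maps, true_maps, weight=0.001):
    """Pixel-level loss plus weighted structural loss (1 - SSIM)."""
    structural = sum(
        1.0 - ssim(p, t) for p, t in zip(pred_maps, true_maps)
    ) / len(pred_maps)
    return euclidean_loss(pred_maps, true_maps) + weight * structural


# Identical maps give (approximately) zero loss.
same = [[0.0, 0.5, 0.5, 0.0]]
print(total_loss(same, same))
```

The small weight keeps the structural term from dominating the pixel-level term while still penalizing predictions whose local structure differs from the ground truth.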

4. Experimental Results and Analysis

The experiment was divided into two stages. The first was the training stage: taking the training set images as input, the predicted value obtained by forward propagation was compared with the true value to obtain the loss, and the parameters were updated during backward propagation to reduce the loss until it converged, completing the network training. In the second stage, the test set images were fed into the trained network to obtain the predicted values, and the accuracy and robustness of the network were evaluated by the MAE and MSE.

4.1. Environment and Parameter Settings

All comparison experiments in this paper were completed on a Windows 11 system equipped with an NVIDIA GeForce RTX 3050 graphics card. The environment configuration was CUDA 11.6 + Anaconda 4.13 + Python 3.7 + PyTorch 1.10. The convolutional layer parameters were initialized randomly from a Gaussian distribution, and the Adam algorithm was used to optimize the parameters. To balance the training speed and the loss, the initial learning rate was set to 1 × 10⁻⁵ and the learning rate decay parameter was set to 0.995. The training batch size was set to 16 and the number of iterations was set to 200. To better compare the performance of the algorithms, the experimental parameters of all the compared methods were set in the same way as those of the method in this paper.
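The effect of the learning rate settings can be illustrated with a short sketch. It assumes, hypothetically, that the decay parameter is applied multiplicatively once per iteration (an exponential schedule); the text gives only the initial rate and the decay value, not the schedule type.

```python
def learning_rate(step, initial_lr=1e-5, decay=0.995):
    """Learning rate after `step` decay steps, assuming exponential decay."""
    return initial_lr * decay ** step


# Rate at the start, midway, and after 200 decay steps.
for step in (0, 100, 200):
    print(step, learning_rate(step))
```

Under this assumption the rate after 200 steps would be roughly 3.7 × 10⁻⁶, a bit more than one third of its initial value.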

4.2. Evaluation Indicators

In this paper, mean absolute error (MAE) and mean square error (MSE) are used to evaluate the performance of the algorithm. MAE represents the error between the predicted and true values, reflecting accuracy, while MSE represents the degree of difference between the predicted and true values, reflecting robustness:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2},$$

where $N$ is the number of images and $C_i$ and $C_i^{GT}$ are the predicted and true number of people for the $i$-th image, respectively.
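A minimal sketch of the two metrics follows. It assumes the convention common in crowd counting, in which the quantity reported as MSE is the square root of the mean squared count error; the counts used below are hypothetical.

```python
import math


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true counts."""
    errors = [abs(p - t) for p, t in zip(pred_counts, true_counts)]
    return sum(errors) / len(errors)


def mse(pred_counts, true_counts):
    """Root of the mean squared count error (assumed convention)."""
    squared = [(p - t) ** 2 for p, t in zip(pred_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predicted and true counts for three images.
pred = [10, 20, 33]
true = [12, 20, 30]
print(mae(pred, true), mse(pred, true))
```

Because the squared term penalizes large per-image errors more heavily, the second metric is more sensitive to occasional large misses, which is why it is read as an indicator of robustness.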

4.3. Dataset Description

To verify the performance of the proposed algorithm, experiments were conducted on the public ShanghaiTech and UCF_CC_50 datasets and on the self-built station dataset.

The ShanghaiTech dataset contains 1198 images with a total of 330,165 annotated individuals. The dataset is divided into two parts. The images in Part_A are randomly obtained from the Internet, while the images in Part_B are obtained from street surveillance in Shanghai. Part_A is characterized by high crowd density and variable scenes, while Part_B is characterized by low crowd density but large differences in crowd scale. The dataset is challenging because it spans different scene types and densities.

The UCF_CC_50 dataset images cover a wide range of scenes such as marathons, stadiums, and concerts. The average number of people per image is as high as 1280, while the number of people in a single image ranges from 94 to 5453, so the density levels differ greatly between images, which makes the dataset challenging. The disadvantage of this dataset is its small size of only 50 images, and thus five-fold cross-validation was used in the experiments in this paper. The 50 images were randomly divided into five equal subsets; each subset was used in turn as the test set while the other four were combined as the training set, and the results of the five experiments were averaged as the final result.
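The five-fold protocol can be sketched as follows. The function name `five_fold_splits` and the fixed `seed` are hypothetical, chosen only to make the illustration reproducible; the paper does not state how the random partition was generated.

```python
import random


def five_fold_splits(num_images=50, folds=5, seed=0):
    """Yield (train_indices, test_indices) for each cross-validation fold."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    fold_size = num_images // folds  # 10 images per fold for 50 images
    parts = [
        indices[k * fold_size:(k + 1) * fold_size] for k in range(folds)
    ]
    for k in range(folds):
        test = parts[k]
        # The remaining four parts are merged into the training set.
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        yield train, test


for train, test in five_fold_splits():
    print(len(train), len(test))  # 40 10 on every fold
```

Each image appears in the test set exactly once, so averaging the five test results uses all 50 images for evaluation without ever testing on an image that was used to train the same model.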

For deep learning-based crowd counting, the quality of the dataset affects the counting performance of the model to a certain extent. The existing public datasets consist mostly of images of open scenes, whereas long, narrow subway platforms with numerous building facilities have cluttered backgrounds. Because of the limited height of the platform, the mounting height of its surveillance cameras also differs from that in the public datasets. To better evaluate the performance of the model in this paper, platform images were collected from the Lanzhou Metro to build a dataset. Five stations on Lanzhou Metro Line 1 with high passenger flow, namely Xizhanshizi, Xiguan, Dongfanghong Square, Wulipu, and Lanzhou University, were selected, and images were captured from the surveillance video at one end of the platform waiting area during the morning peak (e.g., 7:00–9:00), evening peak (e.g., 17:30−20:00), and flat peak periods (e.g., 10:00–16:00) of weekdays and weekends. The dataset contains a total of 2000 labelled images, of which 1500 are used as the training set and 500 as the test set. The size of each image is 1200 × 1024.

Typical images for each dataset are shown in Figure 5.

4.4. Experimental Result Analysis

Table 1 shows the experimental results of the proposed algorithm and five other classical or advanced comparison algorithms on the ShanghaiTech dataset. Comparing the results on the two parts of the dataset shows that counting is more accurate in sparse scenes than in dense scenes, indicating that dense scenes remain a key direction for future research on crowd counting. The proposed algorithm achieves the best results on this dataset among the compared algorithms. Compared with the best comparison model, MIA [43], the MAE and MSE on Part_A are reduced by 2.3% and 1.4%, respectively, and those on Part_B are reduced by 0.9% and 1.6%, respectively. This indicates the effectiveness of the feature-enhanced pyramid structure and the mixed attention mechanism, which perform the counting task well under higher crowd density and varying scales.

Table 2 shows the experimental results on the UCF_CC_50 dataset. Among the comparison algorithms, only the context-aware model (CAN) [41] outperforms the proposed algorithm, and the accuracy and robustness of the other algorithms are lower than those of the proposed algorithm. The CAN network, which uses spatial pyramid pooling to compute scale-aware features, is a multicolumn network that adaptively encodes contextual information. The multiscale enhanced network (MSEN) [42] and the multivariate information aggregation (MIA) [43], which also employ multicolumn structures, have also achieved good results, indicating that multicolumn models work better on this dataset. The algorithm in this paper is a single-column structure, which achieves competitive results with fewer parameters and simpler calculation and can also meet the counting requirements of various dense scenes. The last two columns of the table give the number of parameters and the inference time of each algorithm; because of its single-column structure, the model in this paper has fewer parameters and a shorter inference time.

The experimental results on the self-built platform dataset are shown in Table 3. The algorithm in this paper achieves the best results because it improves on the traditional feature pyramid. The application of CCM and SCLM retains the channel information of the original feature map, eliminates the aliasing effect caused by the fusion process, enhances the feature representation, and helps to solve the scale problem. In addition, the mixed attention mechanism in the algorithm uses the idea of the nonlocal operation. By focusing on the relationships between local features, the global context information is fully aggregated to generate a high-quality predicted density map.

To further verify the effectiveness of the algorithm in this paper, the platform of Xizhanshizi Station on Lanzhou Metro Line 1 was selected, and on April 26, 2023 (Wednesday), the passenger flow on the platform was counted every 10 minutes between 6:30 and 9:00. This yielded 16 groups of predicted and real platform passenger counts together with their relative errors, as shown in Figure 6. The figure shows that the number of passengers on the platform increases gradually with time, rises significantly after 7:30, and then remains at a high level, which is consistent with the morning-peak trend of passenger flow on weekdays. The relative errors of the 16 groups of data are all within 4.5%, and the mean absolute percentage error is 2.71%, which supports the effectiveness and accuracy of the passenger counting algorithm in this paper.
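The relative error and its average over the sampled time points can be computed as in the sketch below. The function names and the sample counts are hypothetical and are not the measurements shown in Figure 6.

```python
def relative_errors(pred_counts, true_counts):
    """Per-sample relative error |pred - true| / true."""
    return [abs(p - t) / t for p, t in zip(pred_counts, true_counts)]


def mean_absolute_percentage_error(pred_counts, true_counts):
    """Average of the relative errors, expressed in percent."""
    errors = relative_errors(pred_counts, true_counts)
    return 100.0 * sum(errors) / len(errors)


# Hypothetical predicted and real platform counts at three time points.
pred = [51, 98, 147]
real = [50, 100, 150]
print(mean_absolute_percentage_error(pred, real))
```

Reporting both the maximum relative error and the mean percentage error, as the text does, shows that no single time point is badly miscounted and that the typical error stays small.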

Figure 7 shows some of the density maps obtained by the proposed model on different datasets, with every two rows of result maps coming from the same dataset, arranged in the order of the ShanghaiTech, UCF_CC_50, and self-built station datasets. The results in the first four rows, from the public datasets, show that the counting error is greater for dense scenes than for sparser scenes, but in general, the enhanced feature fusion and the suppression of background interference by the attention mechanism allow the algorithm to achieve good counting results. The predicted values in the last two rows are greater than the true values, and observation of the density distribution shows that reflections of passengers in the platform screen doors cause repeated counts, which produce the slightly larger predicted values. The experimental results show that the model performs well on both the public and self-built station datasets and can make accurate predictions in scenes with very high crowd density, large variations in crowd scale, and severe background interference.

4.5. Ablation Experiments

To verify the effectiveness of the modules in the network, ablation experiments were conducted on Part_A of the ShanghaiTech dataset. The backbone network is denoted as Backbone, the traditional feature pyramid structure as FPN, the feature-enhanced pyramid structure as FEPN, and the mixed attention mechanism as MA. The experimental results are shown in Table 4. The comparison between the first two rows and the last two rows shows that embedding mixed attention improves the counting accuracy and robustness of the network, indicating that making full use of global contextual information is beneficial for crowd counting. The comparison between the first and third rows shows that the feature-enhanced pyramid structure, with channel transformation and semantic consistency learning, improves network performance over the traditional feature pyramid structure. The loss and accuracy convergence curves of the ablation experiment are shown in Figure 8. To keep the plots simple and readable, the training and test loss curves are presented in separate figures, as are the training and test accuracy curves.

The feature-enhanced pyramid structure proposed in this paper is an improvement on the traditional feature pyramid structure. Although the model achieves excellent performance, it is also necessary to check whether this improvement introduces redundant computation. The number of model parameters reflects, to a certain extent, the computational cost and running time of the model. Therefore, this paper analyzes the improvement of the feature pyramid structure in terms of the number of model parameters. As shown in the last column of Table 4, the comparison of the first and third rows shows that the improved feature pyramid increases the parameter size by less than 1 MB, which indicates that the feature-enhanced pyramid network improves the counting accuracy without introducing significant redundant computation.
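As a rough illustration of how a parameter count maps to a size in MB, the sketch below converts counts to storage size and compares two model variants. It assumes 32-bit floating-point parameters (4 bytes each) and 1 MB = 1024 × 1024 bytes. The paper does not state either convention, and the two parameter counts are hypothetical placeholders rather than the values in Table 4.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as 32-bit floats


def param_size_mb(num_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Convert a parameter count to storage size in MB (1 MB = 1024**2 bytes)."""
    return num_params * bytes_per_param / (1024 ** 2)


def size_increase_mb(baseline_params, improved_params):
    """Extra storage, in MB, of the improved model over the baseline."""
    return param_size_mb(improved_params) - param_size_mb(baseline_params)


# Hypothetical parameter counts (NOT the values reported in Table 4).
fpn_params = 16_000_000   # stand-in for the Backbone + FPN variant
fepn_params = 16_200_000  # stand-in for the Backbone + FEPN variant

increase = size_increase_mb(fpn_params, fepn_params)
print(f"parameter size increase: {increase:.3f} MB")
```

Under these assumptions, an increase below 1 MB corresponds to fewer than about 262,000 additional parameters.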

Figure 8 shows the loss and accuracy convergence curves of the model. In the early stage, the training loss fluctuates considerably, mainly because the network parameters have not yet been learned and the model is disturbed by irrelevant information. As training proceeds, the training loss curve stabilizes and converges, indicating that the model has learned effectively. The accuracy convergence curve indicates that the model parameters are well set and well learned and that the counting performance of the model is good.

5. Conclusion

To address the large variations in crowd scale and the strong background interference encountered in subway platform passenger flow counting, the algorithm proposed in this paper uses a feature-enhanced pyramid structure to retain channel information and eliminate aliasing effects. The enhanced feature representation is more conducive to solving the scale problem. A mixed attention module embedded in the algorithm applies the idea of nonlocal image processing to capture global context information and thus obtain a high-quality predicted density map. The algorithm achieves good results on the two public datasets and the self-built station dataset, which demonstrates its effectiveness. However, the study still has some shortcomings. For example, reflections of passengers in the platform screen doors may lead to repeated counts and therefore overestimated predictions. Preprocessing the images to cover or crop the sections of screen doors with severe reflections will be considered in the future. In addition, passengers who are completely occluded by pillars or by other passengers on the platform are missed, which leads to underestimated predictions. Future work may draw on ideas from object detection algorithms and reduce the impact of occlusion on crowd counting through the design of the loss function.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (no. 52262045) and Gansu Provincial Key R&D Program-Industrial Project Funding (23YFGA0045).