Abstract

This paper presents a methodology for discovering knowledge in transport hotline databases by analyzing complaints reported by citizens, aiming to assist transportation management departments in planning actions to investigate and improve service quality. The proposed model uses text mining techniques and applies latent Dirichlet allocation (LDA) to identify topics related to transportation services. We analyzed over 230,000 phone calls received in a certain province between 1st January and 31st December 2021. Specifically, we analyzed nearly 22,000 phone calls about the taxi industry within a selected city and identified six topics: lost and found (27.1%), car blocking (20.6%), attitude and behavior (17.1%), online car-hailing (12.8%), illegal operations (11.2%), and fare issues (11.2%). Drawing on past and ongoing best practices, we derive several policy implications. The proposed method thus transforms the service center record into a customer feedback-based assessment system for monitoring drivers’ professionalism while efficiently addressing customers’ complaints and concerns.

1. Introduction

China has experienced rapid urbanization in recent decades, resulting in a rise in urban problems. This has generated public discontent and complaints, placing substantial pressure on local authorities [1]. Improving transport management and public satisfaction requires a comprehensive understanding of citizens’ grievances. To address this, the Ministry of Transport of China has established a toll-free telephone hotline, 12328, for receiving and responding to public complaints and concerns. The call center serves as a direct channel for reporting grievances and provides an efficient medium for monitoring public sentiment without significantly increasing government resource allocation. Meanwhile, text mining has emerged as a powerful tool for retrieving and analyzing large-scale and complex data across various fields, disciplines, cultures, and languages [2]. This approach offers a promising avenue for improving public transport services.

Similar to the 311 system in the United States of America (U.S.A.) [3] and Brazil’s 190 hotline [4], the 12328 hotline in China is a crucial component of the country’s efforts to streamline administration and enhance services in the transportation sector. The 24-hour call center serves as a vital platform for citizens to access real-time information on highway conditions, lodge complaints, and report issues such as road damage. Based on official statistics from the Ministry of Transport, the hotline has assisted over 100 million users nationwide since its establishment. In addition, the hotline has an immediate response rate of approximately 99%, with an average wait time, including the voice prompt, not exceeding 26 seconds when encountering a busy signal [5]. However, the hotline faces a significant risk of data explosion since it receives hundreds of thousands of messages daily, necessitating effective time management strategies.

This study aims to develop a methodology for discovering knowledge in transportation response service databases by focusing on public complaints. The generated information serves as a valuable resource for city governments to identify public interest issues and strategically plan resource allocation for optimizing urban transport governance. The main contribution of this study is the development of a supportive method that enables the transportation department to promptly and accurately address public service demands. To achieve this objective, the methodology was implemented in collaboration with the local transport agency in the city of Hohhot, located in Inner Mongolia, China.

The remaining sections of this paper are structured as follows. Section 2 presents a literature review of relevant studies in text mining. Section 3 outlines the data collection process and methodology. Section 4 presents the results of the analysis and discusses the managerial implications. Finally, Section 5 concludes the paper by addressing the limitations of this study and suggesting potential areas for future research.

2. Literature Review

Numerous studies have explored text mining and its applications from various perspectives. Building on the findings of previous studies, we examine text mining at two distinct levels: text mining in governance and text mining in hotlines. Examining past research in these areas gives valuable insight into the importance of adopting a text mining approach for problem detection and clarifies the rationale behind utilizing text mining techniques.

2.1. Text Mining in Governance

Accurately identifying valid attributes that meet user needs and expectations is a crucial step in analyzing user satisfaction. Traditionally, surveys have been designed to evaluate user expectations, experiences, and satisfaction; however, they often pose challenges in terms of feasibility, time, and cost when deployed on a large scale [6]. According to Kumar and Ravi [7], customers are often reluctant to provide honest feedback to service providers, especially regarding their dissatisfaction. Moreover, using a large number of attributes in satisfaction surveys is likely to cause fatigue, potentially compromising both the validity and reliability of these studies [8]. Therefore, in recent years, governance and decision-making processes have increasingly relied on resident-reported data and data-driven approaches to enhance city operations and planning efficiency [9]. Furthermore, text mining technology has emerged as a widely applied method to study demands in various fields, including urban governance [1, 10], ecosystem assessment [11], water safety [12], solid waste sorting [13], and others.

Text mining encompasses various tasks such as document clustering, document classification, text summarization, sentiment analysis, topic detection, patent analysis, and decision-making [7]. Analyzing the underlying emotions, opinions, evaluations, and attitudes expressed in such texts proves valuable in monitoring public sentiment and helping local government officials to establish sound policies. Wang et al. [14] used an objective approach that combined big data collection and text mining to explore public attitudes toward off-site construction and to provide recommendations to the Chinese government. In the context of the 2019 Chennai water crisis, Xiong et al. [15] leveraged sentiment analysis and topic modeling to examine public opinion through Twitter. In a similar manner, Lee [11] proposed a method that combined text mining and factor analysis of stakeholders to analyze expert opinions on resident participatory ecosystem service assessment in South Korea. Furthermore, an automated mechanism called Web-complaint Quality Control (WebQC) was introduced, with a feature that can identify complaint messages and issue warning signals when the complaint volume surpasses the usual level.

That said, an extensive literature search found only a limited number of studies exploring text mining in the transport sector. Aman et al. [16] analyzed app store reviews in the context of transportation research to gain insights into e-scooter rider satisfaction. Roh and Jeon [17] presented an example where basic analysis results were applied to the decision-making process for introducing new transportation means or systems in Incheon, Korea. Askari et al. [18] applied principal component analysis (PCA) followed by partial least squares structural equation modeling (PLS-SEM) to investigate the main factors influencing users’ perception of fixed-route taxi services in Shiraz, Iran. Seo et al. [19] utilized text mining techniques and a clustering algorithm to identify key issues related to mobility services. Hsiao et al. [20] applied text mining to enhance cross-border logistics services (CBLS). Despite these studies, the literature still lacks a comprehensive research framework for text mining within the transport sector, which motivates this research.

2.2. Study on the Hotline

In the past, text mining primarily focused on examining short textual content from various social media platforms, such as Facebook, YouTube, Instagram, Flickr, and app stores [21]. According to White and Trump [3], in the context of the U.S.A., 311 calls serve as indicators of service needs. Kontokosta and Hong [9] investigated the impact of sociospatial disparities in 311 complaint behavior on the fairness of data-driven decision-making. Yildirim and Arefi [22] explored the relationship between noise complaints obtained from 311 nonemergency services and transportation-related inequality in Dallas, Texas.

Furthermore, Basilio et al. [4] developed a methodology aimed at uncovering knowledge within 190 emergency response service databases by analyzing police occurrence reports. The resulting information assisted law enforcement agencies in Rio de Janeiro in planning actions to investigate and combat criminal activities. Ridpath et al. [12] analyzed previously collected data from calls made to wellcare® by private well owners. In China, Peng et al. [1] demonstrated the effectiveness of the “12345” citizen service hotline in identifying and addressing urban problems. However, limited research has been conducted on mining the content of the 12328 service hotline. We therefore recommend that a policy approach focused on enhancing public satisfaction and delivering service improvements incorporate content mining of the transportation service hotline as a methodology.

3. Methods and Materials

3.1. Methods

LDA [23] is one of the most popular topic models. Unlike supervised learning or clustering models that rely on word-based loss or distance functions, LDA directly models the co-occurrence patterns of topic words. Previous results indicate that LDA is an efficient and valid content analysis tool for finding latent streams of thought expressed in e-petitions [24], online review sites [25], and other short, unstructured, and text-heavy fields. The topic model follows a three-tier Bayesian probabilistic structure consisting of words, topics, and texts. Figure 1 illustrates the probabilistic graphical model of LDA.

In this paper, we treat the “reflect content” field in the “12328” transportation supervision service hotline as the “text” in our model. We extract a total of $N$ words (the number of words) from each “reflect content” entry to obtain the words $w$. The two hyperparameters, $\alpha$ and $\beta$, are set in advance. We estimate the parameters by calculating the “text-topic probability distribution” and the “topic-word probability distribution,” denoted as $\theta$ and $\varphi$. Correspondingly, we set the number of topics as $K$ and assign a topic number $z$ to each word $w$. Finally, we summarize the themes based on their relevance, semantics, characteristics, and biases in freight logistics. We then used term frequency-inverse document frequency (TF-IDF), a weighting technique commonly used in information retrieval and data mining, as the classification scheme for this study. Its main idea is that if a word has a high TF in a specific text but appears rarely in other texts, it has good discriminative ability for categorization.
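For reference, the generative process assumed by standard LDA [23] can be written compactly as below. The notation is the conventional one and is an assumption about the symbols used here, not a reproduction of Figure 1: $\alpha$ and $\beta$ are the Dirichlet hyperparameters, $\theta_d$ is the text-topic distribution of text $d$, $\varphi_k$ is the topic-word distribution of topic $k$, and $z_{d,n}$ is the topic assigned to the $n$-th word $w_{d,n}$ of text $d$.

```latex
\begin{aligned}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K,\\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,|D|,\\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d,\\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}).
\end{aligned}
```

Fitting the model amounts to inferring $\theta$, $\varphi$, and the assignments $z$ from the observed words alone.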

TF represents the term frequency, which indicates how frequently a word appears in a text, as shown in equation (1). Typically, this value is normalized by dividing the word count by the total number of words in the document, which prevents a bias toward longer texts:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \tag{1}$$

where $n_{i,j}$ represents the number of occurrences of the word $t_i$ in the text $d_j$, and the denominator is the sum of the occurrences of all words in the text $d_j$.

IDF denotes the inverse document frequency, which is computed by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the resulting value, as shown in equation (2). A higher IDF value is assigned to words that appear in fewer documents; hence, larger IDFs indicate stronger category discrimination ability:

$$\mathrm{IDF}_{i} = \log \frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|}, \tag{2}$$

where $|D|$ is the total number of text entries in the corpus $D$, and $|\{\, j : t_i \in d_j \,\}|$ is the number of texts containing the word $t_i$. In cases where the word is not present in the corpus $D$, the denominator becomes zero; hence, $1 + |\{\, j : t_i \in d_j \,\}|$ is generally used.

The topic model is a machine learning model that operates in an unsupervised manner and can automatically analyze documents in a corpus to extract underlying topic information based on word co-occurrence. This model is widely used in natural language processing for recognizing topics in large-scale document sets or corpora [26]. For instance, if a document is related to freight logistics service, specific words such as “highway,” “vehicle,” and “driver” will occur more frequently. LDA is one of the most widely used topic models and is well suited for this task.
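The TF-IDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `tf_idf` and the toy records are hypothetical, the tokens are English stand-ins for segmented Chinese words, and the smoothed denominator (one plus the document count) is assumed.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF normalized by document length, IDF with a smoothed denominator.
            word: (count / total) * math.log(n_docs / (1 + df[word]))
            for word, count in counts.items()
        })
    return weights

# Hypothetical, already-segmented hotline records (not real 12328 data).
docs = [
    ["taxi", "lost", "phone"],
    ["taxi", "fare", "detour"],
    ["taxi", "blocked", "car"],
    ["taxi", "fare", "invoice"],
]
weights = tf_idf(docs)
```

In this toy corpus, "lost" appears in a single record and receives the largest weight, while "taxi" appears in every record and is pushed below zero by the smoothing, so it would not be selected as a keyword.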

3.2. Materials

In collaboration with the local transportation management department, we obtained daily records from the 12328 centers spanning 1st January to 31st December 2021. The 12328 hotline center was established by the end of 2017, reflecting the significance that government authorities place on enhancing hotline services, and a set of departmental performance indicators was designed accordingly. Over the years, there has been a steady increase in the measured input, accompanied by a growing quantity of telephone traffic (as shown in Figure 2) and a gradual improvement in data quality. To ensure that governance is based on more accurate and up-to-date data, this study specifically focused on hotline calls that occurred in 2021. We processed the data using Python.

The “12328” transportation service supervision telephone system is operated professionally. The operators of 12328 recorded detailed information on a standardized form, including the report date, closed date, demographic information, type of demand (such as complaints, consultation, and suggestions), type of service (such as highway, expressway, road transport, and urban traffic), and a description of the request or complaint. Table 1 presents a sample of the data.

For further analysis, we used a preliminary dataset of 223,599 records, obtained after removing system test calls, silent or empty records, and hang-ups.

3.2.1. Temporal Analysis

Figure 3 presents a temporal analysis that aligns with previous evidence indicating variations in call arrival rates by month, week, day, and hour within a day [27]. The annual analysis highlights September and December as the months with the highest call volumes, followed by October and July (Figure 3(a)). This can be attributed to increased information needs during national holidays and the period surrounding major holidays when people embark on journeys, as well as students returning to school around 1st March and 1st September. Surprisingly, the peak call volume occurs between the 7th and 9th of the month, which warrants further attention and research (Figure 3(b)). Within a week, Monday and Friday record the highest number of calls (Figure 3(c)). The call patterns also align with the government’s work schedule, as depicted in Figure 3(d), with the first peak period observed from 9:00 to 12:00 and the second from 15:00 to 17:00. Despite the hotline’s 24-hour service availability, immediate problem resolution is unlikely during times when supervisors are off duty, resulting in delayed resolutions for most callers at those times.
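The four views in Figure 3 correspond to simple counts over the report timestamp. The sketch below shows one way such profiles could be tallied; the function name, the timestamp format, and the sample values are assumptions for illustration and are not taken from the 12328 records.

```python
from collections import Counter
from datetime import datetime

def call_volume_profiles(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count calls by month, day of month, weekday (0 = Monday), and hour."""
    parsed = [datetime.strptime(ts, fmt) for ts in timestamps]
    return {
        "month": Counter(t.month for t in parsed),
        "day": Counter(t.day for t in parsed),
        "weekday": Counter(t.weekday() for t in parsed),
        "hour": Counter(t.hour for t in parsed),
    }

# Hypothetical report timestamps.
sample = [
    "2021-09-06 09:15",
    "2021-09-06 10:40",
    "2021-09-10 15:05",
    "2021-12-08 09:30",
]
profiles = call_volume_profiles(sample)
```

Each `Counter` can then be sorted or plotted to locate peaks such as the morning and afternoon windows described above.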

3.2.2. Spatial Analysis

Figure 4 visualizes the spatial distribution of callers whose telephone calls were received by the Inner Mongolia 12328 system. The nationwide distribution is represented by red colors, while the provincial distribution is indicated by green colors. The darker the color of a region, the higher the number of telephone calls received. The baseline data revealed that Hohhot, Baotou, Chifeng, and Ordos were the cities with the highest number of reports, collectively accounting for 66.1% of the total. These cities also contribute significantly to the GDP of Inner Mongolia, accounting for 63.9% of the provincial GDP based on National Bureau of Statistics data. This finding aligns with previous research suggesting a direct relationship between higher economic growth rates and transportation improvement [28]. The Inner Mongolia 12328 system also served individuals located outside the province. Further analysis revealed that 60% of these out-of-province calls were made by truck drivers, while 10% were made by tourists who lost items in taxis.

3.2.3. Classification Analysis

Figure 5 depicts the time-varying call volume over the past 12 months. The total number of records is represented by the gray line, while different request classifications are denoted by colored markers. For almost the entire year, the requests related to urban traffic (red) surpass those for expressways (green), road transport (yellow), and highways (blue), with a few exceptions on specific days, such as 6th–12th November, 8th–9th December, and 28th February. Upon further investigation, we found that these exceptional data points coincide with extreme weather conditions.

A previous study on climate effects concluded that fog, heavy rain, rainfall, and snowfall significantly impact intercity travel demand [29]. Indeed, on 28th February and 8th December, the Inner Mongolia Meteorological Center issued a yellow warning for road icing, while on 6th, 7th, and 8th November, it issued snowstorm warnings ranging from yellow to red. These extreme weather events resulted in expressway closures, prompting a surge in inquiries from drivers. This explains the abnormal peak period observed in the upper right panel of Figure 3.

3.2.4. Text Mining Area

Based on these findings, the main research areas identified are urban traffic and expressways. Given that state-owned companies oversee the supervision of provincial expressway operations, the aforementioned expressway issues can be addressed by implementing a centralized traffic information release system that ensures timely and effective communication. Offering more accurate information releases and traffic guidance services to drivers would enhance their overall experience.

The taxi industry specifically represents the urban traffic sector in this study, as indicated by the presence of the “12328” logo on every taxi vehicle. Meanwhile, citizens typically contact the bus company directly for information related to public transportation. Essentially, taxi services play a crucial role in facilitating urban mobility, contributing to the overall transport system [30]. Considering that local authorities regulate and monitor the quantity, pricing, and quality of taxi services through policy regulations and guidelines [31], we decided to focus our research on the city level rather than the entire province.

To evaluate the exposure level and identify a representative city, we introduced a new evaluation metric called the index of complaints per passenger. Figure 6 illustrates that approximately 23.0 thousand telephone calls were made to the 12328 centers, while the number of passengers transported by taxis in 2021 reached 82.2 million. Based on these figures, we conducted an empirical study focusing on the Hohhot taxi industry as our case.
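The exact definition of the index of complaints per passenger is not spelled out here. Assuming it is the simple ratio of hotline calls to taxi passengers, scaled to 10,000 passengers for readability, the figures quoted above give the following worked value. The function name and the scaling factor are illustrative choices.

```python
def complaints_per_10k_passengers(calls, passengers):
    """Ratio of hotline calls to passengers carried, per 10,000 passengers."""
    return calls / passengers * 10_000

# Figures quoted in the text: about 23.0 thousand calls and 82.2 million passengers.
index = complaints_per_10k_passengers(23_000, 82_200_000)
print(f"{index:.2f} calls per 10,000 passengers")
```

Under this assumed definition, the index is roughly 2.8 calls per 10,000 taxi passengers.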

Hohhot, the capital of Inner Mongolia, relies significantly on taxis within its transportation system. Based on available data from the Transport Department, there are 7,188 licensed taxis operated by 31 companies, with approximately 12,000 active taxi drivers providing personalized services.

3.2.5. Framework

We used the LDA model to automatically identify latent topics through a machine learning process. Following the approach taken in previous studies [32], our proposed methodology consists of two main steps: preprocessing and knowledge extraction. We provide a visual representation of these results in Figure 7. High-quality preprocessing is crucial for achieving superior outcomes [7]. We therefore constructed a dictionary and performed word segmentation with step-by-step cleaning. Subsequently, we conducted keyword extraction and calculated the TF-IDF weights.

TF-IDF is a weighting technique commonly used for information retrieval and data mining. Its main principle is that words or phrases with a high TF in a specific text and low occurrence in other texts possess strong category discrimination ability, making them suitable for classification purposes. In this study, the analysis of the contents from the 12328 service was conducted using the Python programming language, with different packages for different tasks. The “re” package was used to extract text information, “jieba” was used for segmenting Chinese words [14], “gensim” assisted in the cleaning and preparation of the text [33], “matplotlib” facilitated the graphical output of the analysis, and “pyLDAvis” was used for visualizing and refining the LDA model [34].

3.3. Preprocessing

Preprocessing involves the transformation of the original raw texts into a dataset that is ready for mining [35].

3.3.1. Dictionary Building

Previous studies overlooked the systematic extraction of service-specific keywords and instead relied on using all keywords. This approach failed to account for the significant internal variability in languages across different geographical and social contexts. To effectively extract hidden topics within the given context, we constructed three dictionaries: the add-words dictionary, the stop-words dictionary, and the synonym-words dictionary. These dictionaries were developed based on the transportation background and local terms ensuring a more comprehensive and accurate representation of the domain-specific language.

We used the stop-words dictionary to exclude certain words and phrases based on the analyzed morphemes. Specifically, words such as “hello,” “lady,” “gentleman,” “user,” “citizen,” “seats,” and “caller,” and terms like “about,” “related,” “aspects,” and “matters” were considered noise and omitted from the analysis. Additionally, due to the subjective nature of classifying incoming calls as complaints, consultations, or suggestions, we included these three words and related terms such as “ask,” “ask for,” and “want” in the stop-word list. Because the focus of the current study was to identify key issues in the taxi industry, factors such as person deixis (e.g., caller, user, passengers, and driver), place deixis (e.g., pick-up and drop-off points), and time deixis were disregarded.

We created a synonym-words dictionary to replace abbreviations with standard expressions. For instance, “general goods” was replaced with “general goods transportation,” “highway” remained unchanged, and “green pass” was replaced with “green channel,” among others. This was done to ensure consistency and clarity in the analysis.

Due to the presence of normative terms in the government hotline text, the initial dictionary was inadequate, resulting in inaccurate topic mining and impacting the effectiveness of problem identification. To address this issue, extensive testing was conducted, and the custom dictionary was continuously expanded based on the results of word segmentation to improve the accuracy of subject mining. This iterative process continued until the processing results met the identification requirements. Ultimately, the custom dictionary was expanded to include 274 words, the stop-words dictionary was expanded to include 30 words, and the synonym dictionary was expanded to include 28 words.
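The role of the stop-words and synonym-words dictionaries can be illustrated with a short sketch. The entries below are small excerpts taken from the examples given above, in English translation; the function name `normalize_tokens` and the sample token list are hypothetical.

```python
# Excerpts of the dictionaries described in the text (English stand-ins).
STOP_WORDS = {"hello", "caller", "citizen", "about", "related", "complaint", "ask"}
SYNONYMS = {
    "general goods": "general goods transportation",
    "green pass": "green channel",
}

def normalize_tokens(tokens, synonyms=SYNONYMS, stop_words=STOP_WORDS):
    """Map abbreviations to standard expressions, then drop stop words."""
    standardized = (synonyms.get(token, token) for token in tokens)
    return [token for token in standardized if token not in stop_words]

# Hypothetical segmented record.
tokens = ["hello", "caller", "ask", "about", "green pass", "highway"]
cleaned = normalize_tokens(tokens)
print(cleaned)  # ['green channel', 'highway']
```

With jieba, the add-words dictionary would typically be registered before segmentation through `jieba.load_userdict` or `jieba.add_word`, so that domain terms are kept as single tokens; the exact files and entries used in this study are not reproduced here.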

3.3.2. Data Cleaning

Before conducting unsupervised analysis using the LDA method, the collected texts needed to be prepared. The cleaning stage of the analysis focused on removing two types of noise. The first type was symbol noise, which included punctuation, special characters, web addresses, and new line characters. To eliminate non-Chinese components from the data set, the “re” regular expression package in Python was utilized. Additionally, all special characters and numbers were excluded from the text, as there was no requirement to analyze license plate numbers or other irrelevant text noise in the context of this research.
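A minimal sketch of this cleaning step with the `re` package is given below. The character range used (the basic CJK Unified Ideographs block) and the sample record are assumptions; the pattern actually used in the study is not reported.

```python
import re

# Matches any run of characters outside the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")

def keep_chinese(text):
    """Drop punctuation, digits, Latin letters, URLs, and line breaks."""
    return NON_CHINESE.sub("", text)

# Hypothetical record containing a plate number, punctuation, and a web address.
raw = "乘客手机丢失，车牌号蒙A12345！\nhttp://example.com"
cleaned = keep_chinese(raw)
print(cleaned)  # 乘客手机丢失车牌号蒙
```

Note that the Chinese character prefixing the plate number survives this filter, so residual fragments of that kind would still need to be handled by the stop-words dictionary or a more specific pattern.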

3.4. Knowledge Extraction

3.4.1. Word Segmentation

Chinese text does not have spaces between words, and Chinese words can be divided into single-character, bi-character, and multiple-character words based on the number of characters [36]. Therefore, Chinese word segmentation has inherent limitations in text mining [14]. To identify and extract keywords, we used the TF-IDF algorithm, which calculates the importance of a word in each document based on its frequency across multiple documents using different frequency types [19].
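To illustrate why segmentation quality depends on the dictionary, the sketch below uses forward maximum matching, a deliberately simple dictionary-based method. It is a stand-in for illustration only: jieba's actual algorithm is more sophisticated, and the dictionaries and the sample phrase are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical base dictionary and a version extended with domain terms.
base = {"出租车", "司机", "手机"}
custom = base | {"网约车", "绕路"}

text = "网约车司机绕路"  # "online car-hailing driver takes a detour"
print(forward_max_match(text, base))    # ['网', '约', '车', '司机', '绕', '路']
print(forward_max_match(text, custom))  # ['网约车', '司机', '绕路']
```

Without the domain terms, "online car-hailing" and "detour" are broken into single characters that carry little meaning for topic modeling, which is the motivation for expanding the custom dictionary iteratively.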

3.4.2. Perplexity Check

To identify the appropriate number of topics, we assessed the perplexity, which quantifies the ability of the model to predict new documents [37]; lower perplexity indicates better predictive performance. Figure 8 illustrates the fluctuation of perplexity as the number of topics increases. Elbow points are observed when the number of topics for model training is set to 3, 6, 13, and 15. By examining the keywords generated for each topic and gathering additional information, we determined that the optimal number of topics is 6.
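Perplexity is conventionally defined as the exponential of the negative average log-likelihood per word on held-out documents. The sketch below computes it from per-document log-likelihoods; the function name and the uniform toy model used to check it are assumptions for illustration.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of document log-likelihoods) / (total number of words))."""
    total_words = sum(doc_lengths)
    return math.exp(-sum(doc_log_likelihoods) / total_words)

# Sanity check with a hypothetical model that assigns every word in a
# 50-word vocabulary the same probability: perplexity equals 50.
vocab_size = 50
lengths = [10, 30]
log_liks = [n * math.log(1 / vocab_size) for n in lengths]
value = perplexity(log_liks, lengths)
```

In practice this quantity would be evaluated for each candidate number of topics, and the candidates at which the curve flattens are then inspected manually, as described above.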

3.4.3. Identifying Topics

The LDA model extracts topics and supports topic clustering or text sorting based on the topics of the analyzed documents [38].

To run the LDA model, the parameters were set as follows: n_features = 500 and n_topics = 6.
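To make the roles of the two parameters concrete, the following sketch fits LDA with a small collapsed Gibbs sampler written in plain Python. It is a stand-in under stated assumptions, not the model used in the study, which relied on existing packages: `n_features` is read here as a cap on vocabulary size, and the hyperparameter values, iteration count, and toy documents are all hypothetical.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics=6, n_features=500, alpha=0.1, beta=0.01,
              n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi)."""
    rng = random.Random(seed)
    # Keep only the n_features most frequent words (assumed meaning of n_features).
    freq = Counter(w for doc in docs for w in doc)
    vocab = [w for w, _ in freq.most_common(n_features)]
    keep = set(vocab)
    docs = [[w for w in doc if w in keep] for doc in docs]
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                    # words per (text, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # counts per (topic, word)
    topic_total = [0] * n_topics                                  # words per topic
    assignments = []
    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_d = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, z_d):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                probs = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + n_vocab * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=probs)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed text-topic and topic-word distributions.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
             for doc, counts in zip(docs, doc_topic)]
    phi = [{w: (counts[w] + beta) / (topic_total[k] + n_vocab * beta) for w in vocab}
           for k, counts in enumerate(topic_word)]
    return theta, phi

# Hypothetical segmented records (English stand-ins).
docs = [["lost", "phone", "driver"], ["lost", "keys", "receipt"],
        ["fare", "detour", "meter"], ["fare", "invoice", "meter"]]
theta, phi = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` gives the topic proportions of one record, and each entry of `phi` gives one topic's word distribution; ranking the words of a topic by probability yields keyword lists of the kind used to label the six topics.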

By analyzing the topics and the corresponding characteristic terms, we successfully identified six distinct topics over a specific period.

3.5. Visualization

We utilized the pyLDAvis package to visually examine the output of the developed LDA model, which facilitated the interpretation of the topics. Figure 7 presents the main reasons for calls from passengers. The most prominent topic is “lost and found,” reflecting the ongoing urban development in cities, followed by “car blocking,” “attitude and behavior,” “online car-hailing,” “unlicensed taxis,” and “fare issues.” In addition, Figure 9 illustrates a two-dimensional intertopic distance map, where each topic is represented by a shaded circle, providing a clear visualization of the topic relationships.

4. Results and Discussions

4.1. Results

Through our analysis (as shown in Figure 7), we uncovered six distinct and interconnected topics within the taxi industry. The size of the circles on the left side indicates the prevalence of each topic, with the largest circle representing the topic of lost and found (labeled as number 1), which appears to be the most prominent concern among the texts examined. The map illustrates both the homogeneity within individual circles and the heterogeneity between different topics in the given documents. Topic 2 (car blocking), positioned in the fourth quadrant, represents an independent issue. The overlapping circles of the fourth topic (online car-hailing) and the fifth topic (unlicensed taxis) indicate a strong interconnection between them. Topics 3 and 6, which pertain to the traditional taxi industry, share a common theme.
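The proximity of circles in an intertopic distance map reflects how similar the topics' word distributions are. By default, pyLDAvis measures this with the Jensen-Shannon divergence before projecting the topics to two dimensions. The sketch below computes that divergence for hypothetical topic-word distributions; the numbers are invented for illustration and are not taken from the fitted model.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic-word distributions over a four-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.60, 0.30, 0.05, 0.05]
topic_c = [0.05, 0.05, 0.20, 0.70]

near = js_divergence(topic_a, topic_b)
far = js_divergence(topic_a, topic_c)
```

Topics that share most of their high-probability words, like `topic_a` and `topic_b`, have a small divergence and are drawn close together or overlapping, whereas topics with different vocabularies are drawn far apart.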

The right panel of Figure 6 presents a word list displaying the top 30 most salient terms related to the selected topic. The blue bar represents the overall term frequency in the data set, while the red bar represents the estimated term frequency in the selected topic. In addition, Table 2 showcases the identified topics along with ten frequently occurring relevant words derived from the top 30 salient terms identified by the model. For this research, the originally identified Chinese terms have been translated into English. Based on an analysis of 22,713 recorded call events, a clear conclusion emerges: the inquiry service constitutes nearly half of the activities within the taxi industry (Topic 1 and Topic 2, accounting for 47.7%), whereas complaint reporting comprises over half of the usage (Topics 3–6, totaling 52.3%).

Topic 1 (i.e., lost and found, which constitutes 27.1% of the distribution of frequently used words) predominantly concerns passengers who have lost their belongings in taxis and are seeking to contact the driver. However, a considerable number of callers are unable to provide the license number or receipt of the taxi. Mobile phones, keys, and ID cards are the most commonly reported lost items in this context.

The second topic of the model revolves around urban parking management. In Topic 2 (i.e., car blocking, which constitutes 20.6% of the distribution of frequently used words), callers request contact with taxi drivers to resolve issues related to blocked vehicles. Callers have attempted to reach the taxi company whose logo is displayed on the vehicle but have had difficulty establishing contact. We observed that the majority of individuals seeking assistance in this regard are private car owners.

Topic 3 (i.e., attitude and behavior, which constitutes 17.1% of the distribution of frequently used words) concerns the negative attitudes and problematic behaviors exhibited by taxi drivers, which often coexist with aggressive driving practices such as attempting to cut off other drivers and with violations like refusing passengers, carpooling, overcharging, and taking detours. This finding aligns with previous research indicating that driver behavior, including aspects related to attitude toward passengers, driving practices, and professionalism, significantly influences passengers’ overall assessment of service quality [18]. Previous studies have also highlighted the strong association between conflicts with clients, traffic accidents, and risky driving behavior among taxi drivers [39, 40]. The presence of these illegal operating behaviors poses security risks to passengers and undermines the city’s sense of civility [41]. Furthermore, the high mobility and wide dispersion of taxis make it difficult for law enforcement officials to quickly and accurately assess and address illegal activities committed by drivers [42].

Topic 4 (i.e., online car-hailing, which makes up 12.8% of the distribution of frequently used words) focuses on conflicts and issues related to popular online car-hailing platforms such as Didi and Hua Xiao Zhu, which provide on-demand transportation services. The rapid advancement of information and communication technologies has led to the emergence of ride-hailing services as a prominent component of the public transportation system, representing a typical form of the sharing economy in the road transport sector [30]. In response, the Chinese government introduced regulations in 2017 that established eligibility criteria for drivers and vehicles, thereby indirectly controlling the maximum number of registered drivers on on-demand ride-hailing platforms [43]. Table 2 highlights fare issues, including overcharging and automatic charges to passengers’ WeChat accounts, as the central theme in the online car-hailing industry. This aligns with previous evidence indicating that “higher charges” is a primary complaint in Shanghai [44]. However, a considerable number of these complaints cannot be substantiated because callers do not provide accurate information.

Topic 5 (unlicensed taxis, which constitute 11.2% of the distribution of frequently used words) concerns illegal taxi services, the fifth theme identified in this study. The regulation of entry into the taxi market plays a crucial role in taxi regulatory systems, aiming to ensure compliance and safety standards. Nevertheless, the allure of high profits in the taxi sector has resulted in the proliferation of illegal businesses. Consistent with the findings of similar studies, many taxi operators work without proper licenses and use vehicles that are not roadworthy, posing a significant risk to passengers’ safety and well-being [45]. The prominent keywords associated with this topic include “illegal,” “cross-regional,” and “personal car.” However, it is important to note that the evidence provided in support of these complaints is often insufficient, so that they are understandable but unfounded. To address these issues, the Chinese government has implemented stringent access controls in the taxi industry to mitigate risks and ensure compliance with regulations.

The last topic identified by the LDA model relates to transport expenses. Topic 6 (fare issues, which makes up 11.2% of the distribution of frequently used words) captures commuters’ grievances about being overcharged for taxi fares. Notably, the main factors contributing to these overcharging problems include drivers taking longer routes (detours), inaccuracies in the taximeter, and instances of drivers not adhering to the prices displayed on the taximeter. Additionally, passengers mention issues related to requesting an invoice and discrepancies in charging standards. These concerns highlight the importance of fair and transparent pricing practices in the taxi industry.

Over the years, the government has consistently proposed policies and strategies aimed at enhancing and promoting public transportation. While there have been some improvements in service quality and support capacity in recent years, the challenges facing the taxi industry have remained largely unchanged for the past two decades. Nourinejad et al. [46] emphasized the need to reduce on-street illegal parking as a fundamental aspect of mobility policy, while Ibitayo [47] addressed the issue of dangerous driving. Furthermore, persistently low service standards not only impair the customer experience but also contribute to increased private car usage and to environmental concerns such as pollution and increased energy consumption [48]. The government must establish a mechanism for ensuring service quality; taxi service providers must take appropriate measures; drivers must adhere to road traffic safety laws and regulations; and the public must continue to voice its demands. Addressing these fundamental problems is essential for achieving sustainable development in the taxi industry.

4.2. Strategies

In this section, we propose several potential mitigation strategies aimed at reducing the burden of traffic violations and ensuring smooth, safe, and efficient traffic operations in the study area.

Firstly, for Topic 1 (lost and found), the government should respond promptly to the public’s concerns about lost items. Taking cues from Guangzhou, Kunming, and other locations, local management regulations should be developed to govern the handling of items lost by taxi passengers. Moreover, active measures should be taken to assist passengers, including the implementation of open and transparent service procedures and the establishment of online platforms dedicated to lost and found items. Likewise, taxi drivers should consistently remind passengers to take their belongings when exiting the vehicle.

Secondly, the coping strategies for Topic 2 (car blocking) include strengthening urban parking management and encouraging taxi drivers to reserve parking spaces through mobile phone applications. In addition, the practical challenges that taxi drivers face in parking, dining, and restroom use should be addressed effectively, and more parking spaces should be allocated specifically for taxis. To tackle the issue of private (non-taxi) vehicles occupying taxi parking spaces, the district public security traffic police should increase their enforcement efforts, ensuring that taxis have designated locations for parking in an orderly manner. Equally, the transportation management department should make targeted adjustments to the allocation of special parking spaces for taxis based on regional passenger flow, ensuring efficient utilization of these spaces. Finally, every taxi driver should be required to leave a contact phone number when temporarily parking on the road so that the driver can be reached promptly.

Thirdly, for Topic 3 (attitude and behavior), various authors have discussed off-site law enforcement methods used by the Taxi Traffic Enforcement Corps [49]. To address dangerous driving violations, several measures can be implemented. These include installing enforcement cameras at urban intersections [50], providing signal-ahead signs, placing transverse rumble strips, and activating warning flashers. In addition, taxi service providers should take appropriate measures to enhance service quality and customer satisfaction in the city [51], such as implementing specialized training programs for drivers [52], establishing a service process control mechanism [44], and implementing a credit system [53].

Fourthly, for Topic 4 (online car-hailing), the issues raised relate primarily to fare problems rather than to the quality of service. Therefore, local authorities should take proactive measures by implementing policy regulations and guidelines to regulate and monitor the marketing, pricing, and product strategies of Didi and other online hailing platforms. This would ensure that they operate equitably without unfairly disadvantaging local taxi service providers.

Fifthly, for Topic 5, unlicensed taxis are widely considered major obstacles to city traffic regulation and public safety. Governments have therefore responded by imposing restrictions on car-hailing services and declaring the use of unlicensed vehicles illegal. However, enforcing traffic regulations against unlicensed taxis is difficult because of limited manpower and the time-consuming process of collecting on-site evidence [14]. To address this issue, Chen et al. [54] developed an unlicensed taxi detection model using an ensemble learning approach known as random forest (RF). In addition, Yuan et al. [55] proposed a smart model for identifying unlicensed taxis, which demonstrated efficiency and accuracy in real-life scenarios. These models provide valuable tools for governments to enhance traffic regulation and reduce associated costs.

Lastly, for Topic 6, fare issues should be addressed through stronger supervision. The government can enhance its supervision of taxi fares by adopting innovative methods. Instead of relying solely on the traditional manual inspection of randomly selected taxis, which is time-consuming and cumbersome, the government can leverage GPS trace data, which are already used in areas such as the study of city dynamics. By analyzing GPS trace data, instances of taximeter fraud can be detected more efficiently and accurately. This approach not only saves time but also provides a more reliable means of detecting fraudulent practices.
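To illustrate how GPS traces could support such supervision, the following minimal Python sketch compares the distance recorded by a taximeter with the distance reconstructed from a trip’s GPS points, and also compares the driven distance with the straight-line distance between pick-up and drop-off. It is not the method of any study cited here. The function names (`haversine_km`, `trace_length_km`, `flag_trip`) and the thresholds (a 10% meter tolerance and a detour ratio of 1.5) are hypothetical and would need to be calibrated against local data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trace_length_km(trace):
    """Sum of the distances between consecutive GPS points of one trip."""
    return sum(haversine_km(a, b) for a, b in zip(trace, trace[1:]))


def flag_trip(trace, metered_km, meter_tolerance=0.10, detour_limit=1.5):
    """Return a list of warning labels for one trip.

    meter_tolerance and detour_limit are illustrative assumptions,
    not values taken from the paper.
    """
    if len(trace) < 2:
        raise ValueError("a trip needs at least two GPS points")
    gps_km = trace_length_km(trace)
    direct_km = haversine_km(trace[0], trace[-1])
    flags = []
    # The meter reports noticeably more distance than the GPS trace covers.
    if gps_km > 0 and metered_km > gps_km * (1 + meter_tolerance):
        flags.append("meter_exceeds_gps")
    # The driven route is much longer than the straight line between endpoints.
    if direct_km > 0 and gps_km / direct_km > detour_limit:
        flags.append("possible_detour")
    return flags


# Synthetic traces near the equator, where 0.01 degrees is about 1.11 km.
straight = [(0.0, 0.0), (0.0, 0.01), (0.0, 0.02)]
detour = [(0.0, 0.0), (0.01, 0.0), (0.01, 0.01), (0.0, 0.01)]
print(flag_trip(straight, metered_km=3.0))  # ['meter_exceeds_gps']
print(flag_trip(detour, metered_km=3.3))    # ['possible_detour']
```

A straight-line comparison overstates detours in a real street network, so in practice the reference distance would more plausibly come from a routing engine; the sketch only shows the shape of the check. Flagged trips would still require manual review before any enforcement action.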

Moreover, in hotline systems, residents as consumers and the government as provider collaborate, forming a co-production model [56]. The effectiveness and fairness of public service delivery through co-production processes rely on the active involvement and engagement of all stakeholders [9].

Previous studies have demonstrated that passenger complaint mechanisms can effectively reduce incidents of drivers refusing passengers or taking detours during carpooling [57]. However, it should be acknowledged that not all residents tend to report such incidents due to their limited engagement in political processes [58].

5. Conclusion

This study examined academic research on text mining and its applications through an exploratory literature study based on contemporary published sources. In the empirical section of the study, we drew data from the 12328 transport service hotline database, specifically textual records from the city of Hohhot, China, between 1st January and 31st December 2021. The method involved extracting reports from this database and identifying latent topics in the data.

To enhance the analysis, two dictionaries (i.e., a professional dictionary and a stop-word dictionary) consisting of over 200 terms were created based on technical and oral language. Through this process, six topics were identified in the hotline content of the city, offering valuable policy insights for the taxi industry in other cities with similar service providers. Our proposed approach provides an automated means to filter and categorize 12328 messages based on public concerns, serving as an effective alternative to traditional questionnaires and interviews, which can be subjective. Furthermore, this framework can be extended to address other types of issues and applied to other cities using appropriate data sources.
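The role of the two dictionaries can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ pipeline: it uses forward maximum matching as a stand-in for whatever segmentation tool was actually used, and the function names (`segment`, `preprocess`) and the sample terms are hypothetical rather than entries from the dictionaries built in this study. The professional dictionary keeps domain terms intact as single tokens, and the stop-word dictionary removes tokens that carry no topical meaning before the text is passed to the topic model.

```python
def segment(text, domain_terms, max_len=None):
    """Forward maximum matching.

    At each position, take the longest substring found in domain_terms;
    if none matches, fall back to a single character.
    """
    max_len = max_len or max(map(len, domain_terms), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in domain_terms:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(text, domain_terms, stop_words):
    """Segment a record, then drop stop words and whitespace tokens."""
    return [t for t in segment(text, domain_terms)
            if t not in stop_words and not t.isspace()]


# Illustrative entries only (glosses: taxi, driver, detour, taximeter, invoice).
professional_dict = {"出租车", "司机", "绕路", "计价器", "发票"}
stop_word_dict = {"了", "的", "，", "。"}

# "The taxi driver took a detour."
print(preprocess("出租车司机绕路了", professional_dict, stop_word_dict))
```

Without the professional dictionary, a term such as 计价器 (taximeter) would be split into single characters and lose its meaning as a keyword, which is why domain terms are matched first.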

Although this research takes a step towards utilizing hotline textual data, it is important to acknowledge its limitations. Notably, the dictionaries constructed in this research, though helpful, are not exhaustive and may not sufficiently support the integration of more detailed information. In future investigations, it would be valuable to explore a method for named entity recognition in the transport domain, allowing for the exploitation of richer word boundary information. Another limitation concerns the data source. Because the “12328” logo is prominently displayed on taxi vehicles, many individuals perceive the hotline as a dedicated service line for the taxi industry, so issues related to taxis are more likely to be reported through it than other transport issues. To address this, future studies could analyze additional data sources, such as social media platforms and website comments. This would enable researchers, policymakers, and industry practitioners to gain a more nuanced perspective and to develop a rich, thick description of the phenomenon, ultimately leading to a more comprehensive understanding of the topic.

Data Availability

The data that support the findings of this study are available from the Transportation Development Center of Inner Mongolia upon reasonable request.

Disclosure

A preprint of this research has been previously published.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (52072025).