Abstract

Diabetes mellitus is a disease with no cure that leads, over time, to chronic complications and even death. Predictive models have been used to identify people with a tendency to develop diabetes mellitus, but information on the chronic complications of patients who already have diabetes remains limited. Our study aims to create a machine-learning model that identifies the risk factors for a patient with diabetes developing chronic complications such as amputations, myocardial infarction, stroke, nephropathy, and retinopathy. The design is a national nested case-control study with 63,776 patients and 215 predictors with four years of data. Using an XGBoost model, the prediction of chronic complications reached an AUC of 84%, and the model identified the risk factors for chronic complications in patients with diabetes. According to the analysis, the most important risk factors based on SHAP values (Shapley additive explanations) are continued management, metformin treatment, age between 68 and 104 years, nutrition consultation, and treatment adherence. We highlight two findings. The first is a reaffirmation that high blood pressure figures among patients with diabetes without hypertension become a significant risk factor (OR: 1.095, 95% CI: 1.078-1.113, or OR: 1.147, 95% CI: 1.124-1.171). The second is that overall obesity in people with diabetes (OR: 0.816, 95% CI: 0.8-0.833) is a statistically significant protective factor, which the obesity paradox may explain. In conclusion, our results show that artificial intelligence is a powerful and feasible tool for this type of study. However, we suggest that more studies be conducted to verify and elaborate upon our findings.

1. Introduction

Diabetes mellitus (DM) is a metabolic disorder that causes abnormal blood glucose (BG) regulation, resulting in short and long-term health complications and even death if not properly managed. Unfortunately, there is no cure for DM [1, 2]. Hence, it is an increasingly prevalent chronic disease with patients prone to an increased morbidity and mortality rate [3, 4]. Mexico has a substantially high prevalence of metabolic disorders; 75.2% of the Mexican population is obese or overweight, 10.3% have DM, and 19.5% have dyslipidemia [5]. In addition, DM has been related to long-term neurological, microvascular, and macrovascular complications. Over time, it leads to neuropathy, nephropathy, retinopathy, and cardiovascular disease [6].

Complications from DM represent the leading cause of morbidity and mortality among the population with diabetes [7]. Therefore, generating information on risk factors for complications is highly relevant for developing tools that patients or physicians can use to anticipate the disease or its complications and take appropriate measures [8]. For example, the American Diabetes Association (ADA) diabetes risk test identifies the risk of developing DM2 [9], and the diabetic foot self-care questionnaire (DFSQ) helps to identify risks and prevent injuries or amputations of the feet [8].

Additionally, the use of prediction models to identify people at high risk of DM has been recommended by NICE [10], European guidelines for the prevention of type 2 DM [11], and the International Diabetes Federation [12]. Nevertheless, studies have yet to be conducted about DM, its complications, and risk factors [13].

Scientists can tailor learning models to find complex patterns within big data [14] using machine learning methods to effectively deliver integrative solutions for multiview data to explain an event or predict an outcome [15]. Furthermore, the ability of a model to find statistical patterns across millions of features and examples enables superhuman performance [16].

Due to the high prevalence of DM in the Mexican population and the lack of a model trained within such a population, we decided to develop a machine learning model on a national program’s electronic medical records for patients with diabetes. Our goal is to identify risk factors for patients with diabetes complications through artificial intelligence.

2. Materials and Methods

2.1. Ethics

We have conducted the study under the Declaration of Helsinki and Guidelines on Good Clinical Practice. Furthermore, the ethics committee of the Faculty of Medicine of the University of Colima approved the project under the deidentification of medical records (2020-01-02). The protocol also complies with the FAIR principles for handling scientific data [17] and the three Mexican laws for managing and accessing confidential data of third parties [18–20]. Finally, we have followed the TRIPOD guidelines for the development and validation of predictive models [21].

2.2. Setting

We conducted a national nested case-control study [22, 23] with patients from 2011 to 2016 captured in Mexico’s electronic medical record in the public health sector (ISSSTE) at the national level.

2.3. Participants

We included all patients older than 18 years of age who were diagnosed with diabetes mellitus type 2 and captured, between 2011 and 2016, in the electronic medical record of the national program for patients with diabetes of the public health sector ISSSTE of Mexico. We excluded all records from outside that period and all patients with hypertension. Finally, we eliminated all empty and duplicated records.

2.4. Outcome

We considered five complications: amputations, myocardial infarction, stroke, nephropathy, and retinopathy. They have all been joined into a category labeled as a chronic complication.

2.5. Cases

Every patient within the study population with a physician’s diagnosis of one or more of the five events of interest in their medical record.

2.6. Controls

Every patient within the study population with no physician’s diagnosis of any of the five events of interest in their medical record.
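The outcome, case, and control definitions above can be sketched in a few lines. This is only an illustration: the field names (`amputation`, `stroke`, and so on) and the dictionary-based record layout are hypothetical, since the schema of the electronic medical record is not described here.

```python
# Hypothetical field names; the actual record schema is not given in the text.
COMPLICATIONS = (
    "amputation",
    "myocardial_infarction",
    "stroke",
    "nephropathy",
    "retinopathy",
)


def has_chronic_complication(record):
    """Return True if any of the five diagnoses is flagged in the record."""
    return any(record.get(name, False) for name in COMPLICATIONS)


def split_cases_controls(records):
    """Partition patient records into cases (any event) and controls (none)."""
    cases = [r for r in records if has_chronic_complication(r)]
    controls = [r for r in records if not has_chronic_complication(r)]
    return cases, controls


# Toy records for illustration only.
patients = [
    {"id": 1, "stroke": True},
    {"id": 2},
    {"id": 3, "nephropathy": True, "retinopathy": True},
]
cases, controls = split_cases_controls(patients)
```

A patient with several of the five events still counts once as a case, which matches the single composite category described in the Outcome subsection.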

2.7. Predictors

Among the variables included, we noted information on addictions, obstetric-gynecological history, medications (antidiabetic, antihypertensive, and lipid-lowering), physical activity, acute complications (ketoacidosis, hyperglycemia, hyperosmolarity, and hypoglycemia), the performance of routine tests (blood, urine, eyes, and feet), education, marital status, infections, hospitalizations, symptoms, referral to specialty, among others.

2.8. Sample Size

The sample was nonprobabilistic, in accordance with the minimum of 10 observations for each variable [24–26]. All patients diagnosed with diabetes mellitus type 2 treated by the ISSSTE MIDE program from 2011 to 2016 were included with the selection criteria specified above.
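As a rough check of the 10-observations-per-variable rule, the arithmetic can be worked through with the figures reported elsewhere in this paper (63,776 patients and 215 columns in the final database). Taking the 215 final columns as the variable count is an assumption made for this sketch; the text does not say which count the rule was applied to.

```python
# Figures reported in the paper; using the final column count is an assumption.
N_PATIENTS = 63_776
N_PREDICTORS = 215
MIN_PER_VARIABLE = 10

# Smallest sample that satisfies the rule for this number of predictors.
minimum_sample = N_PREDICTORS * MIN_PER_VARIABLE

# Observations actually available per predictor.
observations_per_variable = N_PATIENTS / N_PREDICTORS

meets_rule = N_PATIENTS >= minimum_sample
```

Under that assumption, the rule asks for at least 2,150 patients, and the study population provides close to 300 patients per predictor.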

2.9. Missing Data

We deleted empty medical records and used no imputation technique.

2.10. Statistical Analysis

All predictors used in the model were introduced in numerical form. The categorical variables were managed with dummy encoding, while the continuous numerical variables were stratified by quantiles [27, 28]. XGBoost is a machine learning algorithm based on gradient-boosted decision trees, and we trained it with cross-validation. We used 80% of the patients for training and 20% for validation. On each iteration over the dataset, the algorithm sets aside a subset that acts as an evaluation group. Throughout the whole training-testing process, we used 5-fold cross-validation. The area under the curve, sensitivity, and specificity were calculated on the evaluation group [29].
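The evaluation scaffolding described above (80/20 split, 5-fold cross-validation, AUC, sensitivity, and specificity) can be sketched without any model at all. The following is a minimal pure-Python illustration, not the pipeline used in the study: the function names, the random seed, the 0.5 decision threshold, and the toy labels and scores are all assumptions, and no XGBoost model is trained here.

```python
import random


def train_validation_split(n, validation_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into training and validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n * validation_fraction))
    return indices[n_val:], indices[:n_val]


def k_fold_indices(indices, k=5):
    """Yield (fit, evaluation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, folds[i]


def auc(labels, scores):
    """AUC as the share of case-control pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# 100 hypothetical patients: 80 for training, 20 held out for validation.
train_idx, val_idx = train_validation_split(100)
folds = list(k_fold_indices(train_idx, k=5))

# Toy labels (1 = chronic complication) and predicted scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_auc = auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores)
```

In this layout, the 20% validation set is never touched during cross-validation; each of the five folds serves once as the evaluation group while the other four are used for fitting.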

Machine learning models are usually seen as a “black box” that takes some features as input and produces some predictions as output. SHAP is the acronym for “Shapley additive explanations,” a method built on Shapley values, which are widely used in cooperative game theory. Essentially, the Shapley value measures the contribution of each variable to the outcome separately within the dataset while maintaining that the contributions sum to the outcome [30–33], which allows for a better interpretation of the results given by a machine learning model.
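The additivity property described above can be illustrated with an exact Shapley computation on a toy model. This sketch is a simplification: it replaces absent features with a single baseline value, whereas SHAP implementations for tree models average over background data. The function `toy_risk`, its coefficients, and the three feature names are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `model` at `instance` relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return model(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(x):
    """Hypothetical risk score with one interaction term."""
    smoker, older, metformin = x
    return (0.1 + 0.2 * smoker + 0.3 * older
            + 0.1 * smoker * older + 0.05 * metformin)


instance = [1, 1, 0]   # smoker, older, not on metformin
baseline = [0, 0, 0]   # reference patient
phi = shapley_values(toy_risk, instance, baseline)
```

The contributions in `phi` sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved.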

In addition, we have provided the odds ratio with its confidence interval, which provides a proper measure of association in case-control studies. In probability models, odds ratios are used to compare the relative odds that the outcome of interest will occur given exposure to the variable in question [34, 35].
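For a binary predictor, the odds ratio and its confidence interval can be computed from a 2x2 table of exposure by case status. The sketch below uses the common Wald interval on the log scale; this choice is an assumption, since the method used to obtain the intervals is not stated here, and the counts are invented for illustration.

```python
from math import exp, log, sqrt


def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Odds ratio with a Wald confidence interval (z=1.96 gives about 95%)."""
    a, b = exposed_cases, exposed_controls
    c, d = unexposed_cases, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Invented counts: 40 exposed cases, 20 exposed controls,
# 60 unexposed cases, 80 unexposed controls.
or_value, ci_lower, ci_upper = odds_ratio_ci(40, 20, 60, 80)
```

An interval that lies entirely above 1 marks the predictor as a risk factor, and one entirely below 1 marks it as a protective factor, which is how the odds ratios in this paper are read.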

3. Results

3.1. Preprocessing Phase

The initial database consisted of 1,852,766 records and 166 columns (predictors), with 97,044 patients receiving medical care from 2011 to 2016. After preprocessing, such as normalization, elimination of outliers, elimination of columns of identity (empty), data type conversion, and application of inclusion, exclusion, and elimination criteria, the final database consists of 234,847 records and 215 columns with a total of 63,776 patients only diagnosed with DM.

The reduction in the number of patients is mainly due to excluding patients diagnosed with hypertension. The number of columns increased in the final database because of preprocessing: we used percentiles to categorize continuous numerical columns to study the influence of the strata on the variables of interest. Thus, in the end, we have 153 binary columns, 31 categorical columns, and 31 numeric variables.
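The growth in the number of columns follows from how stratification works: one continuous column becomes one binary column per stratum. A minimal sketch of that step follows, assuming four strata and a full one-hot layout; the number of quantiles used in the study is not stated here, and the ages are invented.

```python
def quantile_edges(values, n_bins=4):
    """Cut points that split the sorted values into n_bins similar-sized groups."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def assign_bin(value, edges):
    """Index of the stratum a value falls into (0 is the lowest stratum)."""
    return sum(value >= edge for edge in edges)


def one_hot(bin_index, n_bins):
    """Binary indicator columns for a stratum index."""
    return [1 if i == bin_index else 0 for i in range(n_bins)]


# Invented ages for illustration only.
ages = [34, 41, 47, 52, 55, 58, 61, 66, 70, 83, 90, 104]
edges = quantile_edges(ages, n_bins=4)
encoded = [one_hot(assign_bin(age, edges), 4) for age in ages]

# Number of patients in each stratum (column sums of the indicator matrix).
stratum_sizes = [sum(row[i] for row in encoded) for i in range(4)]
```

Here a single age column turns into four indicator columns, so each stratum can receive its own SHAP value and odds ratio.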

3.2. Patient Characteristics

Men comprised 40.42% of the study population and women represented 59.58%. Of the patients, 10.69% suffered from alcoholism and 12.2% from smoking. There are 13 antidiabetic medications, 32 antihypertensive medications, and 8 lipid-lowering treatments. While 46.78% of patients took at least one antidiabetic drug, 14.71% took antihypertensive drugs, and 26.84% took lipid-lowering drugs. Concerning self-care, 50.98% adhered to treatment, 14.03% measured glucose by themselves, 68.79% received education on diabetes mellitus, and 64.90% performed physical activity. Only 2.95% had been hospitalized during the period studied, and only 1.80% had been referred to a specialty consultation. In addition, 15.63% had acute complications, while 51.66% had chronic complications, and 23.40% presented at least one type of infection (oral, genital, skin, or feet). Preventive examinations were performed as follows: 14.42% for eyes, 50.09% for feet, and 43.57% for urine.

While the average age of the patients was , the average diagnosis was , and the time with DM was . Anthropometrically, the population has a BMI of , a weight of , and a height of . Systolic blood pressure is , diastolic is , and pulse pressure is . The general lipid profile presents total cholesterol of , HDL , LDL , and triglycerides . Their glycemic control with HbA1c was , and their fasting blood glucose was .

For this study, we compared 14 predictive models, including random forest, linear regression, support vector machines, and naive Bayes. We evaluated all of them with the same training and test data and the same metrics, which we present in the following table. Given these metrics, the best-performing algorithm during the selection phase was the XGBoost model, which then went through a thorough fine-tuning of its hyperparameters. Table 1 shows the final performance of the XGBoost model, after fine-tuning, in predicting the development of chronic complications in nonhypertensive Mexican patients with diabetes.

3.3. Top Predictors from the XGBoost Model for Chronic Complications

Table 2 presents the 10 variables that the model determined to have the most significant predictive power for the development of chronic complications in Mexican patients with diabetes and without hypertension according to their SHAP value, together with their odds ratio (OR), confidence interval, and p value.

The risk factors identified, in descending SHAP order, are as follows. Continuing in management increases the odds by 1.75 times. Being treated with metformin increases the odds by 3.90 times. Age between 68 and 104 years increases the odds by 1.44 times. Having been referred to and seen in a nutrition consultation increases the odds by 1.66 times. Finally, adhering to treatment increases the odds by 1.35 times (see Table 2).

3.4. In-Depth View of the Predictors for Chronic Complications

For a more detailed review of the results obtained from the model, refer to Table 3, which presents a breakdown of all the information pertinent to this model. The variables are grouped by topic for better cohesion and understanding of the results discussed below.

3.5. Additional Statistical Analysis

To complete the results section, we present two tables. Table 4 reports the Chi-squared test performed for the binary variables, and Table 5 reports the Mann–Whitney test performed for the numeric variables. Although many articles on machine learning mention having performed these tests, publishing their results is not required.
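Both tests can be sketched in a few lines. The version below computes the Pearson Chi-squared statistic for a 2x2 table without continuity correction (with its p value for one degree of freedom) and the Mann–Whitney U statistic by pair counting. Whether the study applied a continuity correction or an exact test is not stated, so these choices are assumptions, and the counts and HbA1c-like values are invented.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]], no correction.

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # Survival function of the chi-squared distribution with 1 df.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


def mann_whitney_u(x, y):
    """U statistic for sample x: pairs where x exceeds y, ties counted as 0.5."""
    return sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y)


# Invented binary predictor: exposed/unexposed by case/control.
chi2, p = chi_squared_2x2(40, 20, 60, 80)

# Invented numeric predictor values for cases and controls.
u = mann_whitney_u([7.1, 8.4, 9.0], [6.2, 6.9, 8.4])
```

The pair-counting form of U is quadratic in the sample sizes, so it is suitable only for illustration; rank-based formulas give the same statistic for large samples.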

4. Discussion

Despite current scientific advances, the vast amount of data prevents humans from extracting the maximum benefit. This has become the main limitation for scientific research since as researchers, we cannot operate at the scope, scale, and speed necessary for the amount and complexity of the collected data. For this reason, artificial intelligence is the most appropriate tool to make the most of all this data and transform it into helpful information for doctors and patients [17].

For example, the earlier studies we consulted had preselected the predictive variables through a literature review and not a statistical method [36–39]. In contrast, this study allowed variables to be included or excluded through statistical discrimination, using the SHAP values of the machine learning model [30–33].

Despite this, there are authorities such as the Centers for Disease Control and Prevention (CDC) that issue publications with reported risk factors for patients with diabetes: smoking, overweight or obesity, physical inactivity, systemic arterial hypertension, and hyperglycemia [40, 41].

Starting with smoking, we agree that this risk factor increases the risk of suffering any of the five events studied by 2.045 times. However, our study dove deeper into smoking as a risk factor. Stratifying the patient’s years of smoking, we find an almost constant risk ratio, where smoking from 1 to 56 years increases the risk 1.22-1.33 times.

Obesity has been recognized as a health risk factor for many years. The risk of morbidity and mortality associated with it is high, as is insulin resistance, which frequently causes diabetes. However, a growing body of research indicates that patients with obesity have a higher chance of surviving [42–44]. According to the so-called “obesity paradox,” people with obesity survive health issues more often than people of average weight. People with obesity may have the metabolic reserves required to balance the increased catabolic load during a health issue. In our analysis (see Table 3), we found that a BMI greater than 32 (obesity) is a protective factor, which aligns with the obesity paradox theory, while a BMI of less than 30 is presented as a risk factor.

Turning to physical activity, we found that this variable acts as a risk factor, with an increase of 1.33 times in the probability of suffering from a chronic complication as a person with diabetes. We understand that these results are controversial. However, we remind readers that this study is based on what was recorded in the patient files. Therefore, physicians had no way to verify the type, duration, or intensity of the activity that patients reported. Furthermore, given the study’s observational nature, we limit ourselves to reporting what was found, recognizing its limitations.

This study used arterial hypertension as an exclusion criterion for patients, so the hypertension effect cannot be reflected in the variables. However, we have the blood pressure data of the patients involved in this study. We found that diastolic pressure becomes a risk factor from 70 mmHg with an increase of 1.09-1.20 times the risk. On the other hand, systolic pressure is presented as a risk factor from 120 mmHg, increasing the risk 1.14-1.23 times. In addition, we calculated the pulse pressure, which was a risk factor from 40 mmHg with an increase of 1.04-1.20 times the risk. These results suggest that in patients with diabetes but without hypertension, high blood pressure does act as a risk factor even when there is no official diagnosis.
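The derived variable and the thresholds above can be expressed compactly. The sketch assumes the standard definition of pulse pressure (systolic minus diastolic) and reads "from 120 mmHg" as greater than or equal to 120; the function name and the returned field names are hypothetical.

```python
# Thresholds reported in the discussion above (mmHg).
SYSTOLIC_THRESHOLD = 120
DIASTOLIC_THRESHOLD = 70
PULSE_PRESSURE_THRESHOLD = 40


def blood_pressure_flags(systolic, diastolic):
    """Pulse pressure and the three threshold flags for one reading."""
    pulse_pressure = systolic - diastolic
    return {
        "pulse_pressure": pulse_pressure,
        "systolic_risk": systolic >= SYSTOLIC_THRESHOLD,
        "diastolic_risk": diastolic >= DIASTOLIC_THRESHOLD,
        "pulse_pressure_risk": pulse_pressure >= PULSE_PRESSURE_THRESHOLD,
    }


# A reading below the systolic threshold can still trip the other two flags.
reading = blood_pressure_flags(systolic=118, diastolic=76)
```

The three flags are evaluated independently, which mirrors how each stratified variable received its own odds ratio in the analysis.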

The JNC 8 recommends a therapeutic control goal of less than 140/90 mmHg in patients with diabetes and hypertension, without a clinical trial as evidence. Based on a national analysis of nonhypertensive patients with diabetes, we add evidence to this recommendation. Although our goal (120/70 mmHg) is more stringent, it seeks to prevent the same complications exposed by JNC 8 (mortality, myocardial infarction, and cerebrovascular events). We base this goal on the data presented in Table 3, which indicates that blood pressure in people with diabetes without hypertension is a significant protective factor (OR: 0.84, 95% CI: 0.826-0.853, or OR: 0.974, 95% CI: 0.957-0.991).

After obesity, arterial hypertension is the second most common cause of cardiovascular disease, which is the leading cause of morbidity and mortality. Therefore, preventing, diagnosing, and controlling high blood pressure is a global health priority that can only be achieved by measuring blood pressure. In today’s modern world, self-measurement with validated devices would be beneficial to involve the patient in treatment and follow-up [45].

The last CDC risk factor is hyperglycemia. In this case, we find that hyperglycemia recorded as an acute complication is a risk factor that increases the probability of a chronic complication by 4.5 times. However, we can again use the stratification of the numerical variables to provide specificity. Fasting blood glucose is presented as a risk factor from 185 mg/dL, increasing the probability of suffering from a chronic complication 1.16-1.7 times. In addition, fasting glucose gives us a view of these patients’ adaptation to the variation in their blood glucose. When glycemic control is studied through HbA1c, it appears as a risk factor from 6%, increasing the probability of a chronic complication 1.07-1.18 times.

In addition to the variables considered by the CDC, there are publications, such as Abhari et al. and James et al., mentioning that blood pressure variables should be studied independently with machine learning techniques [46, 47]. This was a reason we included the stratification of blood pressure and added pulse pressure.

Knuiman et al. reported that age is the primary variable related to complications of diabetes mellitus [38]. However, we believe that, in this case, we can provide specificity. Although the variable may be a risk factor, we found that age becomes a risk factor after 55 years, increasing the probability of a chronic complication by 1.01-1.44 times.

Ross et al. published that by including all the variables of the electronic medical record, there was a decrease in its predictive performance [48]. However, this decision was recognized as a strength of the study, given that it reflected the reality of the patients [48–50]. Furthermore, we conducted the study without altering the database and without imputation of values, allowing the algorithm and accompanying techniques to study the discriminant variables for the model as recorded by physicians at the primary care level.

Additionally, in 2018, Dagliati et al. carried out a study with a similar approach to the present one [51]. However, testing only one machine learning algorithm introduced a bias. Moreover, before the study, they used logistic regression in a European population, which may reflect a different trend than a Hispanic population [51]. Therefore, for this case, we decided to carry out a fit test with 14 machine learning models.

It is important to note that this study was conducted to identify risk factors for chronic complications, encompassing amputations, myocardial infarction, stroke, nephropathy, and retinopathy, without focusing on any particular affliction. Nevertheless, our results have generated information that supports future research aimed at the genesis of applicable instruments or strategies to benefit patients. In this sense, we must continue promoting the use of current and validated strategies and instruments for the care and prevention of complications in patients with DM, such as the determination of the degree of stiffness of the ankle, the use of the Foot Posture Index, or the use of the DFSQ, beneficial to prevent injuries or amputations of the pelvic limbs [8, 52].

We are no strangers to the wide variety of initiatives on machine learning focused on diabetes mellitus that have been launched since 2005 in the United States, Europe, Asia, and Australia. The Nordic Precision Medicine Initiative is the largest, formed in 2015, and intended to collect data from more than one million Nordic citizens [53]. Unfortunately, no publication in Mexico has included any initiative to make a comparable database. However, this project can provide a starting point for such an initiative. In addition, we demonstrate that Mexican programs already have value-rich data collection to extract insights for our Mexican patients.

4.1. Limitations

We recognize that all the subjects in this study share the same pathology, diabetes mellitus type 2, declared in the inclusion criteria. Therefore, the results cannot be generalized to all individuals, although the complications studied occur in other conditions.

Machine learning can be applied to medical records to develop robust risk models. However, our healthcare system is reluctant to entrust a human’s task to a machine that may make mistakes. If we change the perspective, though, this may sound acceptable. Machine learning is not here to replace any human job. It is here to help as a second opinion based on our population’s thousands or even millions of records [54].

There is a dissociation among machine learning developers, regulatory bodies, health services researchers, doctors, and patients that hinders the generation of new or more in-depth knowledge about the pathologies under study [39]. In addition, our project only covered patients served by one of the three public systems in the country; therefore, collection initiatives from the other systems would contribute significant value to this type of study.

4.2. Strengths

Most studies do not use longitudinal data, employ relatively few predictive variables with a median of 27, and rarely develop multicenter models [48]. In our case, the data is longitudinal, we start with 166 variables, and the data comes from all over the country.

Furthermore, this is the first Mexican national study that uses machine learning to predict complications in Mexican patients with diabetes mellitus. The published articles have indicated their scale, volume, and quantity of medical records as strengths [55]. However, this study is more extensive in scale, volume, and with respect to patient records because a national program is used as a data source.

Therefore, we can refine their results and compare them with ours.

This work intentionally ignored factors such as the researcher’s experience, the use of specific algorithms in the literature, dependencies of hypothetical variables, and the interpretability of the model [56]. Instead, it favored techniques such as recursive elimination of variables, cross-validation, and hyperparameter tuning, which improve performance without altering patient data.

In addition, we have used a hypothesis-free machine learning approach for the research question. Thus, a prediction model capable of detecting variables that had not been previously reported as risk factors for complications in this population of patients was created [57]. In some cases, this model resulted in the ability to specify the ranges in which a variable becomes a risk factor.

We have tried to extract as much information as possible from the data source and to provide some new insights on the risk factors facing our Mexican population with diabetes mellitus.

4.3. Collecting

Although this section falls outside the usual structure, we included it because we presented a large amount of information in the tables that could not all be discussed in the same depth.

We remind readers that this study is not intended to find causes of effects but possible cofactors that lead to the effect. Therefore, we usually are brief in discussions to avoid misinterpreting the data. For this reason, we provide the tables with raw information as we obtained it after processing.

Many results may lack published support, but one of the study’s objectives was to show the tool’s scope with a certain degree of supervision without causing bias. We avoid theorizing or arguing through logic or experience that specific values do not adhere to what is published.

However, these contrasting results also support our starting premise. Each population provides different information. Perhaps the white European and North American people differ at a specific level from the Mexican population. Nevertheless, if we see it in general, the results are similar. Therefore, this study invites further studies with the Mexican people with artificial intelligence to provide our medical colleagues with accurate information about our patient population.

5. Conclusions

Artificial intelligence identified risk factors for complications in patients with diabetes with an AUC above 80%. Artificial intelligence provides a vision of medical data that is not influenced by the medical literature when utilized in exploratory medical data analysis. Various outcomes may result from this, such as new research, theories, or public health measures. The advent of personalized medicine appears increasingly likely given the large amount of data that can be exploited by this tool and the fact that the results are not all unexpected.

Data Availability

All relevant data appear in the present study. The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.

Ethical Approval

The study was conducted following the Declaration of Helsinki, and the Faculty of Medicine of the University of Colima approved the project under the deidentification of medical records (approval number R-2020-04, February 1st, 2020).

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

SAZF, IDE, and CMHS designed the study and wrote the manuscript. ALE, JDE, AGN, LMCA, JGE, MLMF, ISD, and IPRS did the statistical analysis. GCE, HODL, and FEG helped in the interpretation of the results. All authors contributed to the fieldwork and reviewed and approved the final version of the manuscript.

Acknowledgments

This study was funded by the University of Colima Faculty. This study used computer equipment funded by the National Council for Science and Technology of Mexico (CONACYT), Call for Frontier Science, Modality: Paradigms and Controversies of Science (author IDE).