Abstract

The aim of this study is to analyze the effect of serum metabolites on diabetic nephropathy (DN) and to predict the occurrence of DN using a machine learning approach. The dataset consists of 548 patients from April 2018 to April 2019 at the Second Affiliated Hospital of Dalian Medical University (SAHDMU). We select the optimal 38 features using a least absolute shrinkage and selection operator (LASSO) regression model with 10-fold cross-validation. We compare four machine learning algorithms, namely extreme gradient boosting (XGB), random forest, decision tree, and logistic regression, using AUC-ROC curves, decision curves, and calibration curves. We quantify feature importance and interaction effects in the optimal predictive model using the Shapley additive explanations (SHAP) method. The XGB model performs best in screening for DN, with the highest AUC value of 0.966. The XGB model also yields a higher clinical net benefit than the other models and is better calibrated. In addition, there are significant interactions between serum metabolites and duration of diabetes. We develop a predictive model based on the XGB algorithm to screen for DN. C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys contribute strongly to the model and may serve as biomarkers for DN.

1. Introduction

Diabetes mellitus is an extremely common chronic disease. By 2045, the prevalence of diabetes is projected to rise to 10.9% [1]. Of particular concern is that the Western Pacific region will have the highest number of adults with diabetes in the world [2]. In China, about 20-40% of diabetic patients have renal complications, and diabetic nephropathy (DN) has become the leading cause of end-stage chronic kidney disease [3]. Meanwhile, the all-cause mortality rate in patients with DN is nearly 20-40 times higher than that in patients with nondiabetic nephropathy [4]. New screening and treatment methods therefore have important implications for the prevention of diabetic nephropathy in China.

In recent years, there has been growing interest in metabolomic measurements to identify pathophysiological mechanisms and new diagnostic and prognostic biomarkers associated with disease development [5]. Among the serum metabolites that have been extensively studied, amino acids and acylcarnitines have received particular attention. Amino acids play various physiological roles in the body, such as cell signaling, gene expression, nutrient metabolism, and endocrine hormone production [6]. There is evidence that dysregulation of acylcarnitine homeostasis plays a role in the development and progression of various conditions, such as insulin resistance and metabolic syndrome [7, 8].

Because datasets that combine traditional clinical indicators and serum metabolites are high-dimensional and contain both correlated and uncorrelated features, traditional statistical methods are not sufficient to analyze them [9]. In recent years, machine learning methods, such as least absolute shrinkage and selection operator (LASSO) regression, support vector machine (SVM), decision tree (DT), random forest (RF), and artificial neural networks (NNs), have been widely used in healthcare [10], in areas such as cancer, medicinal chemistry, and medical imaging [11]. Studies have shown that machine learning can help improve the reliability, performance, predictability, and accuracy of disease diagnostic systems and can be used to examine important clinical parameters, biological indicators, and serum metabolites [12, 13].

The purpose of this paper is to develop and test a prediction model for DN using machine learning methods and the dataset of Dalian Second People’s Hospital, and to interpret the prediction model in order to quantify the influence of serum metabolites on DN.

2. Materials and Methods

2.1. Data
2.1.1. Data Source

Data for 1024 participants are obtained from April 2018 to April 2019 at the Second Affiliated Hospital of Dalian Medical University (SAHDMU). Demographic parameters; anthropometric, clinical, and laboratory parameters; medications; and disease conditions are extracted for the subjects from an electronic medical system. Demographics include age, sex, duration of diabetes mellitus, smoking, and alcohol consumption. Anthropometric measurements include body mass index (BMI), abdominal circumference (AC), systolic blood pressure (SBP), and diastolic blood pressure (DBP). Clinical and laboratory parameters include high-density lipoprotein cholesterol (HDL-C), fasting blood glucose (FBG), serum creatinine (SCR), and glycated hemoglobin (HbA1c). Disease conditions include hypertension, diabetic complications, and stroke. Medication use includes antidiabetic drugs, lipid-lowering drugs, and antihypertensive drugs.

2.1.2. Study Variables

BMI is calculated by dividing body weight (kg) by the square of height (m). The World Health Organization (WHO) classification criteria for BMI in Asia are as follows: <18.5 kg/m2 is considered underweight, normal weight is 18.5-24.0 kg/m2, overweight is 24.0-28.0 kg/m2, and obesity is >29.0 kg/m2 [14]. According to the recommendations of the American Diabetes Association [15], % is defined as hyperglycemia, and  mmol/L in men and  mmol/L in women are defined as dyslipidemia, all of which indicate that treatment goals were not met. The estimated glomerular filtration rate (eGFR) is calculated using the formula given in [16].
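As a minimal sketch of the BMI definition above, the following Python functions compute BMI and map it to the categories quoted in the text. The function names are hypothetical. The underweight cutoff (below 18.5 kg/m2) is assumed from the lower bound of the normal range, a value of exactly 24.0 kg/m2 is assumed to count as overweight, and values that the quoted ranges do not cover (above 28.0 and up to 29.0 kg/m2) are returned as unspecified rather than guessed.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by the square of height (m)."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value (kg/m2) to the categories quoted in the text."""
    if value < 18.5:       # assumed cutoff: lower bound of the normal range
        return "underweight"
    if value < 24.0:
        return "normal"
    if value <= 28.0:      # assumption: exactly 24.0 counts as overweight
        return "overweight"
    if value > 29.0:
        return "obese"
    return "unspecified"   # 28.0 < BMI <= 29.0 is not covered by the quoted ranges
```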

2.2. Statistical Analysis

The overall statistical analysis process of this paper is shown in Figure 1. It comprises data preprocessing (the elimination of missing values and feature selection), the optimization of hyperparameters using grid search, and the evaluation and analysis of classifiers. In addition, 10-fold cross-validation is used to reduce the effect of any particular division into training and test sets.

2.2.1. Data Preprocessing

The dataset used in this paper is balanced. In the prediction model, the occurrence of DN is defined as a binary variable: presence of disease is denoted as 1 and absence of disease as 0. Features with more than 50% missing values were excluded, and samples with missing values in the remaining features were then removed from the analysis (see Figure 2). In addition, the features are divided into continuous and categorical variables for preprocessing. Continuous features are normalized. Categorical features whose values have no ordinal meaning are mapped into Euclidean space using one-hot encoding.
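The preprocessing steps described above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function name `preprocess`, the use of `None` to mark a missing value, and min-max scaling as the normalization method are all assumptions, since the text does not specify them.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def preprocess(rows: List[Row], continuous: List[str], categorical: List[str],
               max_missing: float = 0.5) -> List[Dict[str, float]]:
    """Drop sparse features, drop incomplete samples, then scale and encode."""
    if not rows:
        return []
    features = continuous + categorical
    n = len(rows)
    # Step 1: exclude features with more than `max_missing` missing values.
    kept = [f for f in features
            if sum(r.get(f) is None for r in rows) / n <= max_missing]
    # Step 2: remove samples that still have a missing value in a kept feature.
    complete = [r for r in rows if all(r.get(f) is not None for f in kept)]
    if not complete:
        return []
    out: List[Dict[str, float]] = [{} for _ in complete]
    for f in kept:
        if f in continuous:
            # Step 3a: min-max normalization (assumed normalization method).
            vals = [float(r[f]) for r in complete]
            lo, hi = min(vals), max(vals)
            span = (hi - lo) or 1.0  # avoid division by zero for constant features
            for o, v in zip(out, vals):
                o[f] = (v - lo) / span
        else:
            # Step 3b: one-hot encoding, one 0/1 column per observed level.
            levels = sorted({str(r[f]) for r in complete})
            for o, r in zip(out, complete):
                for level in levels:
                    o[f"{f}={level}"] = 1.0 if str(r[f]) == level else 0.0
    return out
```

Applying the feature filter before the sample filter matters: removing sparse features first keeps more samples than dropping every incomplete row at the outset.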

2.2.2. Feature Selection

Feature selection was performed using least absolute shrinkage and selection operator (LASSO) regression. The LASSO regression model improves prediction performance by tuning the hyperparameter λ, which shrinks some regression coefficients to exactly zero, and thereby selects the feature set that performs best in DN prediction. The best value of λ was selected as the one with the minimum mean error under 10-fold cross-validation.
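For reference, the penalized objective minimized by LASSO can be written as below. The form shown is the standard one for a binary outcome (L1-penalized logistic regression); this is an assumption, since the text does not state which loss function was used. Here N is the number of samples, p the number of candidate features, y_i ∈ {0, 1} the DN label, x_i the feature vector of sample i, and λ ≥ 0 the penalty hyperparameter.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
  \left\{ -\frac{1}{N} \sum_{i=1}^{N}
    \left[ y_i \left( \beta_0 + x_i^{\top}\beta \right)
    - \log\!\left( 1 + e^{\beta_0 + x_i^{\top}\beta} \right) \right]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

Larger values of λ set more coefficients exactly to zero; the features whose coefficients remain nonzero at the cross-validated λ form the selected feature set.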

2.2.3. Model Training and Validation

In this paper, 10-fold cross-validation is used to divide the data into training and testing sets; i.e., in each cycle, 9 subsets are used as the training set and 1 subset is used as the testing set. The models are optimized using grid search. Four classification algorithms, namely extreme gradient boosting (XGB), random forest (RF), decision tree (DT), and logistic regression, are used to build DN prediction models for predicting the risk of diabetic nephropathy in individuals, with 10-fold cross-validation as the model evaluation strategy.
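The 9-to-1 splitting scheme described above can be sketched as follows. The generator name `k_fold_indices` is hypothetical, the split is a plain shuffled one (whether the study stratified by DN status is not stated), and 548 is the sample count quoted in the abstract.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0
                   ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The other k - 1 folds together form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each sample appears in exactly one test fold across the 10 cycles.
splits = list(k_fold_indices(548, k=10))
```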

The above models are evaluated on their generalization ability and practicality. Generalization ability is examined using the receiver operating characteristic (ROC) curve and the area under the curve (AUC), and clinical utility is examined using the decision curve and the calibration curve.
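To make the AUC criterion concrete, the sketch below computes it from its rank interpretation: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` is hypothetical, and the quadratic pair count is chosen for clarity, not speed.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive-negative pairs ranked correctly; a tie counts as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of DN and non-DN cases.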

3. Analysis of Results

3.1. Preprocessing Results

Through the above missing value processing (see Section 2.2.1), the final size of the dataset was obtained as (), which is a sufficient sample size to meet the statistical requirements and ensure the reliability of the study results [17, 18].

The clinical characteristics of the participants, stratified by DN status, are shown in Table 1. DN status is significantly associated with HDL, Apo AI, C4DC, C5DC, HbA1c, and hypertension (). Compared with patients with nondiabetic renal disease (NDRD), patients with DN tend to have hyperglycemia and no hypertension, higher levels of HDL, Apo AI, and C5DC, and lower levels of C4DC.

3.2. Feature Screening

Using the “glmnet” package in R, the best-performing features were screened from 70 clinical variables and 49 metabolic indicators to reduce the dimensionality, which significantly improved the predictive performance of the classifier. After LASSO regression screening (see Figure 3), the best feature set comprised the following. Clinical information: diabetes duration, AC, SBP, hemoglobin concentration (HB), packed cell volume (PCV), globulin (GLB), alkaline phosphatase (ALP), blood uric acid (UA), urinary microalbumin (MAU), cholesterol (CHOL), HDL, apolipoprotein AI (Apo AI), Apo B (AI0B), insulin (INS), FBG, glutamic acid decarboxylase antibody (GADA), insulin-like growth factor-1 (IGF-1), free triiodothyronine (FT3), thyroid-stimulating hormone (TSH), eGFR, HbA1c, and hypertension (recorded as 1 if present and 0 otherwise). Medications: thiazolidinediones (TZDs), glinides, lipid-lowering drugs, dipeptidyl peptidase-4 (DPP-4) inhibitors, glucagon-like peptide-1 (GLP-1) receptor agonists, and sodium-glucose cotransporter 2 (SGLT-2) inhibitors. Amino acids: cysteine (Cys), methionine (Met), serine (Ser), and tyrosine (Tyr). Acylcarnitines: acetylcarnitine (C2), succinylcarnitine (C4DC), glutarylcarnitine (C5DC), and tetracosanoylcarnitine (C24).

3.3. Hyperparameter Optimization Results

In this study, hyperparameters are tuned with GridSearchCV in sklearn. For each combination in the hyperparameter grid, each of the four machine learning models is instantiated and evaluated by 10-fold cross-validation, and the combination with the highest mean score is returned, using “roc_auc” as the scoring criterion (see Table 2).
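A minimal sketch of this tuning procedure for one of the four models is given below. It uses scikit-learn's `GridSearchCV` with `scoring="roc_auc"` as named in the text; everything else is an assumption made for illustration: the data are synthetic (generated with `make_classification`), the parameter grid is hypothetical and is not the grid from Table 2, and the stratified, shuffled 10-fold splitter is a choice of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the study's dataset is not reproduced here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid for the decision tree (not the values from Table 2).
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

# 10-fold cross-validation; stratification is an assumption of this sketch.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # mean AUC across folds decides the best combination
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other classifiers by swapping the estimator and its grid, for example scikit-learn's `RandomForestClassifier` and `LogisticRegression`, or `XGBClassifier` from the xgboost package.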

3.4. Classifier Results

The four classifiers, XGB, RF, DT, and logistic regression, were applied to the preprocessed Dalian dataset to classify diabetic nephropathy. The XGB model (, ) performed significantly better than the RF, logistic regression, and DT models. The AUC value of the DT model was greater than 0.8, but its false-positive rate was higher than that of the other three models, so it is not recommended (see Figure 4).

The decision curve represents the clinical utility of a model: if, at a given threshold probability, the net benefit of the model is higher than that of both reference strategies (intervention for no one and intervention for everyone), the model has practical value. As shown in Figure 5, all models were valid between the thresholds of 28% and 81%, and between the thresholds of 11% and 86% the net benefit of the XGB model exceeded that of the other three models.
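The net benefit plotted on a decision curve can be sketched as follows, using the standard definition: at threshold probability pt, net benefit = TP/n − (FP/n) × pt/(1 − pt), where TP and FP are the numbers of true and false positives among the n patients when everyone with predicted risk at or above pt is treated. The function names are hypothetical; the "intervention for no one" strategy has a net benefit of 0 by definition.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit of treating patients with predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Net benefit of the reference strategy that intervenes for everyone."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

A model is useful at a given threshold when `net_benefit` exceeds both 0 and `net_benefit_treat_all`; evaluating these functions over a range of thresholds yields the curves compared in a decision curve analysis.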

To plot the calibration curve of the XGB model, new sample datasets were obtained by the bootstrap method, sampling 10,000 times independently, using Python 3.10. As shown in Figure 6, after calibration the curve of the XGB model approaches the diagonal line, indicating that the screening results are close to the real situation and that the model has practical value.
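One way to read the bootstrap calibration step is sketched below. The text does not specify how the resamples were used, so this is only an illustration under stated assumptions: samples are drawn with replacement, predicted probabilities are grouped into equal-width bins, and each bin's mean predicted probability is compared with its observed event rate. The function names are hypothetical, the data are simulated, and 200 resamples are used here instead of 10,000 to keep the example fast.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def calibration_points(y_true: Sequence[int], y_prob: Sequence[float],
                       n_bins: int = 5) -> List[Tuple[float, float]]:
    """Per-bin (mean predicted probability, observed event rate)."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    return [(sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b))
            for b in bins if b]


def bootstrap_resamples(y_true: Sequence[int], y_prob: Sequence[float],
                        n_boot: int = 1000, seed: int = 0
                        ) -> Iterator[Tuple[List[int], List[float]]]:
    """Yield resampled (labels, probabilities), drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yield [y_true[i] for i in idx], [y_prob[i] for i in idx]


# Simulated, well-calibrated predictions: each label is 1 with probability p.
rng = random.Random(1)
probs = [rng.random() for _ in range(2000)]
labels = [1 if rng.random() < p else 0 for p in probs]

# Mean absolute gap between predicted and observed rates across resamples.
gaps = []
for ys, ps in bootstrap_resamples(labels, probs, n_boot=200):
    points = calibration_points(ys, ps)
    gaps.append(sum(abs(pred - obs) for pred, obs in points) / len(points))
mean_gap = sum(gaps) / len(gaps)
```

Points close to the diagonal (a small gap) indicate that predicted risks match observed frequencies.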

3.5. Model Interpretation

The effect of each feature on the screening score is measured by SHAP, which evaluates the importance of each feature using a game-theoretic approach based on the test set [19]. A positive Shapley value for a feature indicates an increased risk of DN; a negative value indicates a decreased risk. The colors of the scatter points in the figure indicate the magnitude of the feature values, with red being larger and blue being smaller. As shown in Figure 7, MAU, diabetes duration, PCV, FPG, and eGFR contributed more to the model; among the metabolites, C2, C5DC, Tyr, Ser, and Met contributed more.
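The game-theoretic idea behind SHAP can be illustrated with an exact Shapley value computation on a toy model. This sketch is not the SHAP library and not the study's model: it enumerates every feature coalition by brute force (feasible only for a handful of features), replaces features outside a coalition with a baseline value, and uses a made-up scoring function with generic feature names `a`, `b`, and `c`.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict


def shapley_values(predict: Callable[[Dict[str, float]], float],
                   x: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values of `predict` at `x` relative to `baseline`."""
    names = list(x)
    n = len(names)

    def value(coalition: set) -> float:
        # Features in the coalition take their observed value, others the baseline.
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# Toy score with an interaction between a and c (purely illustrative).
def toy_score(v: Dict[str, float]) -> float:
    return 0.5 * v["a"] + 2.0 * v["b"] + v["a"] * v["c"]


x = {"a": 2.0, "b": 1.0, "c": 3.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
phi = shapley_values(toy_score, x, baseline)
```

The contributions sum to the difference between the score at `x` and the score at the baseline, and the interaction term `a * c` is split equally between `a` and `c`, which is why interaction effects between features can be read from Shapley-based plots.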

When the duration of diabetes is greater than or equal to 15, the Tyr threshold that best describes the difference in outcomes is 45; beyond this point, the higher the Tyr value, the lower the risk of DN (see Figure 8(c)). In addition, among patients with longer diabetes duration, those with lower C5DC values had a lower risk of disease than those with higher C5DC values, whereas those with lower Tyr values had a higher risk of disease than those with higher Tyr values. The interactions involving C24 (with Tyr and with C5DC) are interpreted in the same way (see Figures 8(a) and 8(b)).

For adolescent patients with new-onset diabetes in whom most features are normal, the risk of developing DN is low (Figure 9(a)). When the duration of T2D is shorter but most features (PCV, ALP, UA, FT3, and HDL) are abnormal, the risk of DN increases (Figure 9(b)).

4. Discussion

This study focuses on the metabolites: C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys have a strong effect on DN and may serve as new biomarkers for DN.

Aromatic amino acids are a group of α-amino acids that contain an aromatic ring, including phenylalanine, tyrosine, and tryptophan. Phenylalanine is oxidized to tyrosine by phenylalanine hydroxylase and then involved in glucose metabolism [20]. In a prospective study, lower plasma tyrosine levels in diabetic patients were associated with an increased risk of microvascular disease [21]. A previous study confirmed the association between low tyrosine concentrations and diabetic nephropathy [22].

Methionine is an essential sulfur-containing amino acid that is required for normal growth and development of the body and is also associated with the percentage of fat mass (%FM). Methionine, which the organism generally obtains from food or gastrointestinal microorganisms, is a precursor of succinyl CoA, homocysteine, creatine, and carnitine. It plays a crucial role in the immune system because its catabolism leads to increased production of glutathione, taurine, and other serum metabolites [23]. Methionine and other methyl donors improve glucose tolerance and insulin sensitivity in the offspring of mice fed a high-fat diet [24]. Experiments in T2D rats have demonstrated that methionine ameliorates alterations in key one-carbon serum metabolites and T2D-induced disturbances in glucose and lipid metabolism [25]. There is also growing evidence that methionine activates AMPK and SIRT1 by a mechanism similar to that of metformin [26]. Given that diabetic nephropathy is one of the microvascular complications of type 2 diabetes, it is reasonable to speculate that methionine disorders are negatively associated with type 2 diabetes complicated by diabetic nephropathy.

Diabetes mellitus, as a metabolic disorder, damages several organs and systems, including the liver, kidneys, and peripheral nerves. Although essential amino acids are important for maintaining normal physiological activities of the body, abnormal metabolism of nonessential amino acids is also associated with the pathogenesis of diabetes [27, 28]. Levels of serine, a nonessential amino acid, have been found to be consistently reduced in patients with metabolic syndrome [29]. In a prospective study, elevated serum glycine levels were found to be associated with a reduced risk of developing type 2 diabetes [30]. Because glycine is a precursor of serine [31], there is even more reason to speculate about the importance of serine in the microvascular complications of type 2 diabetes.

Numerous studies have identified homocysteine, a precursor of cysteine, as a biomarker for microvascular diseases, including diabetic neuropathy, retinopathy, and nephropathy-like diseases [32]. Epidemiological studies have shown a U-shaped relationship between cardiovascular disease and cysteine after adjusting for other risk factors and homocysteine [33]. In this study, screening the metabolic indicators associated with diabetic nephropathy with the LASSO model revealed a positive association between cysteine and diabetic nephropathy. The risk trend corresponding to the first half of the U-shaped curve was not observed, possibly because this study was conducted in type 2 diabetic patients, who have much higher levels of oxidative stress and reactive oxygen species than normal subjects.

Acylcarnitines are known to play a key role in the β-oxidation of long-chain fatty acids by carrying them across the inner mitochondrial membrane. Comparisons of cases of obesity, insulin resistance, metabolic syndrome, and diabetes with relevant controls have revealed that acylcarnitine profiles differ between groups. A 6-year prospective study of 2103 community-dwelling individuals aged 50-70 years in Beijing and Shanghai, China, with type 2 diabetes as the observed outcome, found higher plasma concentrations of short-, medium-, and long-chain acylcarnitines at baseline, but only long-chain acylcarnitines were significantly associated with the risk of type 2 diabetes [34]. A previous study found that elevated levels of short- and medium-chain acylcarnitines in blood were associated with the risk of developing cardiovascular disease in T2DM [35]. A study on diabetic peripheral neuropathy (DPN) reported that plasma C4DC and C24 concentrations were significantly higher in non-DPN patients than in DPN patients and that factors containing the C2, C3, C4, and C5 short-chain acylcarnitines were positively associated with the risk of DPN in T2DM [36]. C2 is derived from carbohydrate catabolism and from acetyl-CoA, the end product of β-oxidation [37]. It was also found that C2 may be a biomarker of combined sugar and lipid toxicity. Animal experiments also showed that plasma C2 levels were elevated in T2DM rats [38].

Proteinuria and eGFR loss are both markers of DN, but they are nonspecific and have limitations as prognostic tools [39]. This is because a high percentage of T2DM patients in renal biopsy studies do not have DN and instead suffer from other renal diseases [40]. It is therefore important to identify new prognostic markers for DN based on serum metabolites, as this paper does. However, owing to the limitations of the data, this paper is restricted to a binary classification problem; a multiclass model for DN grade can be investigated in the future.

5. Conclusion

This paper constructs an XGB model to screen for DN, whose predictive performance is better than that reported in previous studies [37, 41, 42] (0.93, 0.79, and 0.90, respectively). LASSO feature selection plays a key role in ensuring the accuracy and stability of the predictive model by improving the quality of the dataset. C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys are shown to be highly correlated with DN risk.

This paper introduces serum metabolites as new DN markers, constructs several machine learning models to screen for DN, compares their screening abilities, and analyzes the impact of each important feature on DN. The results show that the XGB model has the best screening performance.

Data Availability

The datasets generated and analyzed during the current study are available from the corresponding authors on reasonable request.

Additional Points

Key Summary Points. Why carry out this study? (i) The prevalence of diabetic nephropathy has been increasing in recent years, but there are few screening methods for it. What was learned from the study? (i) The prediction model based on the XGB algorithm shows that C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys are highly correlated with DN. (ii) Patients with longer diabetes duration and lower C5DC values had a lower risk of disease compared to those with higher C5DC values. (iii) Patients with longer diabetes duration and lower Tyr values had a higher risk of disease compared to those with higher Tyr values.

Disclosure

A preprint has previously been published [43].

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Authors’ Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Jing-Mei Yin, Yang Li, and Guo-Wei Zong. The first draft of the manuscript was written by Jing-Mei Yin, and all authors commented on the previous versions of the manuscript. All authors read and approved the final manuscript. Jing-Mei Yin and Yang Li have contributed equally to this work and share first authorship.

Acknowledgments

The authors thank all the doctors, nurses, and research staff at the SAHDMU in Dalian, for their participation in this study. This work was supported by the National Key Research and Development Program of China (2021YFA1301202), the National Natural Science Foundation of China (82273676), the Liaoning Province Scientific and Technological Project (2021JH2/10300039), the Education Department of Hunan Province (23B0178), and the Science & Technology Development Fund of Tianjin Education Commission for Higher Education (2022KJ204). This work was supported in part by the High Performance Computing of Xiangtan University.