Abstract

In this paper, we present an alternative to the standard methods for classifying Raman spectra. Its grouping process is based on a clustering phenomenon observed in nature at the atomic level and accurately described by the statistical physics model known as the Potts model, which represents interacting spins on a crystalline lattice. This clustering method, known as superparamagnetic clustering (SPC), allows hierarchical structures to be identified in data banks. In this method, we assigned a Potts spin to each data point (Raman spectrum) and introduced an interaction between neighboring points whose coupling strength is a decreasing function of the distance between them. We found a hierarchical tree structure in our data bank of Raman spectra that allowed us to discriminate between the spectra from control and diabetes patients. The sensitivity and specificity of the diabetes detection technique by Raman spectroscopy were calculated directly because the SPC method accurately determines the members of each cluster. As a cross-check, SPC results were compared with published results of multivariate analysis, showing excellent agreement; however, the SPC method additionally allows the members of all identified clusters to be determined explicitly.

1. Introduction

In recent years, spectroscopic techniques such as Raman spectroscopy, Fourier-transform infrared spectroscopy, X-ray spectroscopy, and mass spectroscopy have become fundamental tools in the fields of chemistry, drugs, the agro-food sector, life sciences, and environmental analysis to study different biological systems based on the chemical and structural composition of biological samples [1–3].

In these techniques, once spectra are captured, mathematical tools to classify them are required; however, spectra corresponding to biological samples usually show a high complexity because they contain a large number of peaks of different intensities and forms, unlike spectra corresponding to nonbiological samples where discrimination between a pair of samples turns out to be relatively simple. Furthermore, the study of complex systems, where the comparison between a large set of spectra is necessary, has motivated the application of novel methods that allow identifying patterns in large banks of spectra.

Among the main techniques applied in the analysis of spectra are multivariate analysis (principal component analysis and linear discriminant analysis) [4, 5] and clustering analysis (K-means and spectral norm methods) [6]. Among the clustering methods, those of particular interest allow the exploration of hierarchical structures in data banks, facilitating the study of diseases that are either classified into different types or show various stages of progress [4].

Among these hierarchical clustering methods, one has attracted particular interest because its clustering process is based on a clustering phenomenon observed in nature at the atomic level, which is accurately described by a statistical physics model known as the Potts model, representing interacting spins on a crystalline lattice. This method is known as the superparamagnetic clustering (SPC) method, and it has already been successfully applied in the discrimination between leukemia, breast cancer, and cervical cancer [7]. Likewise, this method has been applied to study gene expression [8, 9] and protein sequences [10]. Because the temporal evolution of stock market returns is well described by random processes, SPC has also been used for stock exchange analysis [11, 12].

In this paper, we propose the SPC method as a novel way to classify Raman spectra hoping to observe a hierarchical structure in the bank of spectra and identify Raman spectra corresponding to healthy and type 2 diabetes patients. SPC method and Raman spectroscopy could form a better method of diabetes detection with high sensitivity and specificity.

2. SPC Method

In the ferromagnetic model, each point is considered to have a Potts spin, equivalent to one of q integer values, $s_i = 1, 2, \ldots, q$. The distance matrix, $d_{ij}$, represents the Euclidean distances between neighboring sites $i$ and $j$. Input data for the SPC method are represented by this distance matrix containing all the distances between the data points. The distance matrix is used to construct a graph whose vertices are the data points, and edges correspond to connections between neighboring points. Two points are considered to be neighbors (and thus have an edge) if they are within the K-nearest neighbors of each other.
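The neighbor graph described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation; the function names and the toy data are hypothetical, and only NumPy is assumed.

```python
import numpy as np

def distance_matrix(points):
    """Euclidean distances between all rows of `points` (n_points x n_features)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mutual_knn_edges(d, k):
    """Edges (i, j), i < j, where each point is among the other's k nearest neighbors."""
    n = d.shape[0]
    masked = d.copy()
    np.fill_diagonal(masked, np.inf)            # a point is not its own neighbor
    nearest = np.argsort(masked, axis=1)[:, :k]
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.arange(n)[:, None], nearest] = True
    mutual = is_nn & is_nn.T                    # keep only reciprocal neighbor pairs
    return [(i, j) for i in range(n) for j in range(i + 1, n) if mutual[i, j]]

# Toy data (illustrative only): two well-separated groups of three points
toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
d_toy = distance_matrix(toy)
edges_toy = mutual_knn_edges(d_toy, k=2)
print(edges_toy)
```

With k = 2 the toy graph has edges only inside each group of three points, so the two groups are already disconnected before any spin dynamics are run.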

A pair of neighboring points $i$ and $j$ that have the same spin ($s_i = s_j$) interacts via a short-range coupling:

$$J_{ij} = \frac{1}{\hat{K}} \exp\left(-\frac{d_{ij}^{2}}{2a^{2}}\right),$$

where $d_{ij}$ is the Euclidean distance between points $i$ and $j$, $a$ is the mean distance between interacting neighbors, and $\hat{K}$ is the average number of interacting neighbors of a point [13–15]. The strength $J_{ij}$ is a decreasing function of the distance $d_{ij}$ so that the closer the two points are to each other, the more they like to belong to the same cluster, and the interaction between points that are not neighbors is set to zero.
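As a rough sketch, the couplings can be computed from the distance matrix and the edge list. The Gaussian form J_ij = (1/K_hat) exp(-d_ij^2 / (2 a^2)) used here is the one given by Blatt et al. and is assumed to match the coupling of this section; the function name and toy input are hypothetical.

```python
import numpy as np

def couplings(d, edges):
    """Coupling matrix: Gaussian in d_ij on the given edges, zero for non-neighbors."""
    n = d.shape[0]
    dist = np.array([d[i, j] for i, j in edges])
    a = dist.mean()                   # mean distance between interacting neighbors
    k_hat = 2.0 * len(edges) / n      # average number of interacting neighbors per point
    J = np.zeros((n, n))
    for (i, j), dij in zip(edges, dist):
        J[i, j] = J[j, i] = np.exp(-dij ** 2 / (2.0 * a ** 2)) / k_hat
    return J

# Hypothetical toy input: three points on a line, consecutive points are neighbors
d_toy = np.array([[0.0, 1.0, 3.0],
                  [1.0, 0.0, 2.0],
                  [3.0, 2.0, 0.0]])
J_toy = couplings(d_toy, [(0, 1), (1, 2)])
```

In this toy case the closer pair (0, 1) receives a stronger coupling than the pair (1, 2), and the non-neighboring pair (0, 2) has zero coupling.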

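The energy function of the system is given by the Hamiltonian of an inhomogeneous ferromagnetic Potts model:

$$H(S) = \sum_{\langle i,j \rangle} J_{ij}\left(1 - \delta_{s_i,s_j}\right),$$

where the notation $\langle i,j \rangle$ stands for neighboring sites $i$ and $j$ and the summation is over interacting neighbors. $S = \{s_i\}$ is the state of the system, and $\delta_{s_i,s_j}$ is the delta function, $\delta_{s_i,s_j} = 1$ if $s_i = s_j$ and zero if $s_i \neq s_j$. The thermodynamic average of a physical quantity A at a temperature T can be calculated using $\langle A \rangle = \sum_{S} A(S)P(S)$, where $P(S)$ is the probability density of Boltzmann and $P(S) = \frac{1}{Z}\exp\left(-H(S)/T\right)$, where Z is the partition function, $Z = \sum_{S}\exp\left(-H(S)/T\right)$.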
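For a very small system the thermodynamic average can be evaluated exactly by enumerating every state, which makes the definitions concrete. The sketch below assumes the Hamiltonian H(S) = sum over neighbor pairs of J_ij (1 - delta(s_i, s_j)) as in Blatt et al.; the helper names are hypothetical, and brute-force enumeration is only feasible for a handful of spins (real SPC runs use Monte Carlo sampling instead).

```python
import itertools
import numpy as np

def potts_energy(spins, J):
    """H(S): sum of J_ij over pairs i < j whose spins differ (J is zero for non-neighbors)."""
    n = len(spins)
    return sum(J[i, j] for i in range(n) for j in range(i + 1, n)
               if spins[i] != spins[j])

def thermal_average(A, J, q, T):
    """Exact <A> = sum_S A(S) P(S) with Boltzmann weights exp(-H(S)/T) / Z."""
    n = J.shape[0]
    states = list(itertools.product(range(q), repeat=n))
    weights = np.array([np.exp(-potts_energy(s, J) / T) for s in states])
    Z = weights.sum()                              # partition function
    values = np.array([A(s) for s in states])
    return float((values * weights).sum() / Z)

# Two coupled spins: probability that they are aligned at low and high temperature
J_toy = np.array([[0.0, 1.0], [1.0, 0.0]])
aligned = lambda s: float(s[0] == s[1])
p_low = thermal_average(aligned, J_toy, q=2, T=0.1)
p_high = thermal_average(aligned, J_toy, q=2, T=100.0)
```

The two spins are almost always aligned at T = 0.1 and close to uncorrelated (probability near 1/q = 0.5) at T = 100, which is the ferromagnetic-to-paramagnetic behavior the method exploits.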

A Potts system may have three different phases depending on the temperature and interactions: ferromagnetic, paramagnetic, or superparamagnetic phase. The system is ferromagnetic at low temperatures and paramagnetic at high temperatures. By increasing the temperature from zero, the system passes from the ferromagnetic to the paramagnetic state either directly in a single transition or via the intermediate superparamagnetic phase. This last phase is of considerable interest in the study of disordered systems, especially in the context of data clustering as clusters of aligned spins automatically divide the data into their natural classes, and a clear hierarchical structure among the classes emerges when varying the temperature.

The average spin-spin correlation function, $G_{ij} = \langle \delta_{s_i,s_j} \rangle$, is used to decide whether or not two spins belong to the same cluster. In contrast to the mere interpoint distance, the spin-spin correlation function is sensitive to the collective behavior of the system and is, therefore, a suitable quantity for defining clusters.

In this study, the SPC method, as described by Blatt et al. [14, 15], was applied. Blatt et al. used the Swendsen–Wang Monte Carlo simulation [16, 17] to generate a Markov chain in the Potts model. In the procedure, an initial configuration is generated by assigning a random value (spin) to each point. Subsequently, frozen bonds are assigned between nearest neighboring points $i$ and $j$ with a probability

$$p_{ij} = 1 - \exp\left(-\frac{J_{ij}}{T}\,\delta_{s_i,s_j}\right).$$

Thus, subgraphs are formed by the points connected by frozen bonds. A new configuration is then created: each subgraph is assigned a new, randomly chosen spin value, so that all spins belonging to the same subgraph take the same value. This procedure is repeated up to a maximum number of iterations.
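One update of this procedure can be sketched as below. It assumes the standard Swendsen–Wang bond probability 1 - exp(-J_ij / T) for neighbors with equal spins (and zero otherwise); the function name, the union-find bookkeeping, and the toy inputs are illustrative choices, not the authors' code.

```python
import numpy as np

def swendsen_wang_step(spins, edges, J, T, q, rng):
    """One update: freeze bonds, find connected subgraphs, give each a random new spin."""
    n = len(spins)
    parent = list(range(n))

    def find(x):                        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        if spins[i] == spins[j]:        # only equal neighboring spins can be bonded
            p_frozen = 1.0 - np.exp(-J[i, j] / T)
            if rng.random() < p_frozen:
                parent[find(i)] = find(j)

    new_value = {}
    new_spins = np.empty(n, dtype=int)
    for i in range(n):
        root = find(i)
        if root not in new_value:       # one random spin value per subgraph
            new_value[root] = rng.integers(q)
        new_spins[i] = new_value[root]
    return new_spins

# Toy chain of four aligned spins at very low temperature: all bonds freeze
rng = np.random.default_rng(0)
toy_edges = [(0, 1), (1, 2), (2, 3)]
toy_J = np.ones((4, 4))
toy_spins = np.zeros(4, dtype=int)
cold = swendsen_wang_step(toy_spins, toy_edges, toy_J, T=1e-3, q=10, rng=rng)
```

At very low temperature every bond in the toy chain freezes, so the four points form a single subgraph and receive the same new spin value.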

To select the temperature at which the inherent emergence of clusters nested in hierarchies took place, the magnetic susceptibility or variance of the magnetization (m), $\chi = \frac{N}{T}\left(\langle m^{2} \rangle - \langle m \rangle^{2}\right)$, where N here denotes the number of data points (spins), is calculated [18]. The peaks of χ indicate phase transitions: the transition between the ordered state (magnetic) and the partially ordered state (superparamagnetic), and the transition between the partially ordered state and the unordered state (nonmagnetic). Starting at low temperature and increasing the temperature, χ increases quickly when clusters begin to split. As the temperature is raised, the system may break first into two clusters, each of which breaks into more subclusters and so on. Such a hierarchical structure of the magnetic clusters reflects a hierarchical organization of the data into classes and subclasses.
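The susceptibility can be estimated from magnetization samples collected along the Markov chain. The sketch below assumes the definitions of Blatt et al., m = (q N_max / N - 1) / (q - 1) with N_max the size of the largest spin group, and chi = (N / T)(<m^2> - <m>^2); the text above does not spell out the definition of m, so treat it as an assumption. Function names are hypothetical.

```python
import numpy as np

def magnetization(spins, q):
    """m = (q * N_max / N - 1) / (q - 1), N_max = number of points in the most common state."""
    n_max = np.bincount(spins, minlength=q).max()
    return (q * n_max / len(spins) - 1.0) / (q - 1.0)

def susceptibility(m_samples, n_points, T):
    """chi = (N / T) * (<m^2> - <m>^2), estimated from sampled magnetizations."""
    m = np.asarray(m_samples, dtype=float)
    return n_points / T * (np.mean(m ** 2) - np.mean(m) ** 2)

m_ordered = magnetization(np.array([0, 0, 0, 0]), q=2)      # fully aligned spins
m_disordered = magnetization(np.array([0, 1, 0, 1]), q=2)   # evenly split spins
chi_toy = susceptibility([0.0, 1.0], n_points=10, T=0.5)
```

A fully aligned configuration gives m = 1 and an evenly split one gives m = 0; large fluctuations of m between samples, as occur when clusters begin to split, produce a large χ.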

After the clusters have been determined, the most natural clusters (clusters without substructures) are identified. The natural clusters were chosen using the sequential procedure proposed by Ott et al., which takes those clusters that have the largest T-range (denoted by Tcl) [19]. Ott et al. define the T-stability, ST, of a cluster as

$$S_T = \frac{T_{cl}}{T_{max}},$$

where Tmax is the temperature of the paramagnetic transition. Thus, ST expresses the stability of the cluster relative to the stability of the whole data set. This procedure stops in a branch if no more stable substructures can be found, i.e., if the most stable cluster detected is less stable than a threshold value Sϴ (ST < Sϴ). The natural clusters themselves do not have any substructures since they show a direct transition from the ferromagnetic phase to the paramagnetic phase, so the temperature that marks the end of the ferromagnetic phase, Tferro, is a good indicator of how natural a cluster is. Thus, Sϴ is the main control parameter that is set from outside.
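The stopping criterion can be illustrated with a short sketch. It assumes that the T-stability is the ratio of the cluster's T-range to the paramagnetic transition temperature, S_T = T_cl / T_max, and that T_cl is the width of the temperature interval over which the cluster exists; the function names and the example temperatures are hypothetical and are not values from this study.

```python
def t_stability(t_lower, t_upper, t_max):
    """S_T = T_cl / T_max, with T_cl = t_upper - t_lower the cluster's T-range."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    return (t_upper - t_lower) / t_max

def is_stable(t_lower, t_upper, t_max, s_theta=0.5):
    """True when the cluster's T-stability reaches the threshold S_theta."""
    return t_stability(t_lower, t_upper, t_max) >= s_theta

# Hypothetical temperatures for illustration only
s_example = t_stability(t_lower=0.25, t_upper=0.75, t_max=1.0)
keep_example = is_stable(0.25, 0.75, 1.0, s_theta=0.5)
drop_example = is_stable(0.25, 0.50, 1.0, s_theta=0.5)
```

A cluster that persists over half of the temperature range up to T_max just meets a threshold of 0.5, while one that persists over a quarter of it does not, so the sequential procedure would stop exploring that branch.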

3. Methodology

We applied the SPC method to study the hierarchical structure of the data bank whose elements are Raman spectra. The data bank is made up of 182 Raman spectra, with 102 spectra from control patients and 80 spectra from diabetes patients. Each spectrum is composed of 2330 peaks with their respective intensities. The Raman spectra were measured from blood serum samples obtained from 15 patients who were clinically diagnosed with type 2 diabetes mellitus and 20 healthy volunteer controls. All patients were from the western central region of Mexico and had similar ethnic and socioeconomic backgrounds. To measure the Raman spectra, we focused an 830 nm laser (Jobin-Yvon LabRAM HR800 Raman apparatus) on different points of a small serum sample. To ensure statistically sound sampling, around five spectra from different regions of each serum sample were collected. Details of the samples used and spectra measured in the study are shown in Table 1.

Raw spectra were processed by carrying out baseline correction, smoothing, and normalization to remove noise, fluorescence, and shot noise [20]. Subsequently, a data matrix with N rows and D columns was constructed using the processed Raman spectra.

In the data matrix, each row represents a peak of the spectrum and each column a spectrum. The entries of the matrix are intensities of Raman spectra. Because we measured 182 spectra and all our spectra were measured in the same region of Raman shift, N = 2330 and D = 182 in the data matrix. The data matrix will allow studying the correlation between the spectra using the SPC method, that is, the existing relationship between the control and diabetes patients based on biochemical differences of blood serum samples.

The SPC method was implemented as described in Section 2. In the analysis, each processed Raman spectrum is represented by a point to which a Potts spin si is assigned. By using the Raman spectra as columns, the data matrix was constructed. The distance matrix dij was calculated using this data matrix. In the context of spectroscopy, only clusters of spectra with similar spectral profiles could occur.
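The step from the data matrix to the distance matrix can be sketched as follows, assuming Euclidean distances between columns (spectra). The random placeholder matrix merely stands in for the 2330 × 182 data matrix and does not reproduce the study's data; the function name is hypothetical.

```python
import numpy as np

def spectra_distance_matrix(data):
    """data: N x D array with one spectrum per column. Returns the D x D distance matrix."""
    spectra = data.T                                  # D x N: one spectrum per row
    sq = (spectra ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * spectra @ spectra.T
    np.maximum(d2, 0.0, out=d2)                       # clip tiny negatives from rounding
    np.fill_diagonal(d2, 0.0)
    return np.sqrt(d2)

rng = np.random.default_rng(0)
toy_data = rng.random((50, 6))       # placeholder intensities, NOT the study's data
d = spectra_distance_matrix(toy_data)
print(d.shape)
```

For the real data matrix the same call would return a 182 × 182 symmetric matrix with zeros on the diagonal.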

The Swendsen–Wang Monte Carlo simulation to generate a Markov chain was implemented using the optimal settings of the parameters for the simulation, q = 10, K = 15 and [7, 10, 11, 21, 22].

Finally, the most natural clusters were determined taking the typical default threshold value, Sϴ = 0.5 [23].

The calculation of dij and SPC algorithm were implemented in MATLAB on the platform of Windows 10. The running time on a SONY SVS13AA11U was 35 minutes.

4. Results and Discussion

We tested the ability of the SPC method to determine the number of clusters in the bank of Raman spectra from diabetes and control patients. To compare the control and diabetes Raman spectra, the spectra were processed as described in the previous section, and a 2330 × 182 data matrix was constructed in which the first 102 columns correspond to the spectra from control patients and the last 80 columns correspond to the spectra from diabetes patients (see Table 1). The 182 × 182 distance matrix was constructed using the data matrix.

A simple spectral comparison of the blood serum samples from the control and diabetes patients can be performed by analyzing only the most characteristic bands of the mean Raman spectra from control and diabetes patients; however, a more complete analysis, which classifies the samples taking into account all 2330 peaks from the 182 spectra, is obtained when the SPC algorithm is applied.

Figure 1 shows the mean processed Raman spectra of diabetes and control samples. De Gelder et al. [24] formed a reference database of Raman spectra of biological molecules that allowed identifying each of the molecules corresponding to the peaks shown in the control and diabetes spectra. In these spectra, equally intense peaks were observed at 695 cm−1, the doublet of tyrosine at 828 and 853 cm−1, phenylalanine at 1002 and 1028 cm−1, the phospholipid shoulder at 1300–1345 cm−1, and proteins (amide I) at 1654 cm−1. The main differences were found at 661 and 1404 cm−1 (glutathione), 714 cm−1 (polysaccharides), 605 cm−1 (phenylalanine), 545 cm−1 (tryptophan), and the shoulder of amide III at 1230–1282 cm−1 (which seems to disappear in the diabetes spectrum). In contrast, the region 897–955 cm−1 stood out because the diabetes spectrum peaks were more intense.

The intensities of the 2330 peaks from each of the 182 measured Raman spectra were recorded in our data matrix, from which the distance matrix was calculated, allowing the analysis of the similarity between all the spectra. Subsequently, the temperatures of the superparamagnetic phases were determined by locating peaks of the magnetic susceptibility shown in Figure 2(a). Two superparamagnetic phases at temperatures T = 0.073 and T = 0.115 were observed, where the first divisions of the leading cluster took place. Figure 2(b) shows the distance matrix calculated for the SPC clusters at these phase transition temperatures. The most intense colors correspond to smaller distances between points. The diagonal and off-diagonal blocks correspond to intra- and intercluster distances, respectively.

To determine the most natural clusters into which the leading cluster will be split, the Stoop method is applied to the SPC result, obtaining a hierarchical tree structure. Figure 3 demonstrates that the SPC method (K = 10) was able to determine the presence of three natural clusters in data correctly. In Figure 3, the two splits of clusters at temperatures T = 0.073 and T = 0.115 are observed, following what is shown in Figure 2. The leading cluster exhibited the first split into the clusters 1 and 2, and the cluster 2 showed the second split into the cluster 2 1 and 2 2.

In Figure 3 and Table 2, we observed that the leading cluster with 182 elements begins to split into cluster 1 with 95 elements and cluster 2 of size 87. These clusters essentially remained stable in their compositions until the superparamagnetic-to-paramagnetic transition temperature is reached (expressed in a sudden decrease of χ), and the cluster 2 split into the clusters, 2 1 (with size 76) and 2 2 (with size 11), while the cluster 1 remained without substructure (natural cluster). The clusters 2 1 and 2 2 remained unstructured, so they are also natural clusters.

Thus, the SPC method detected three natural clusters in the bank of Raman spectra, labeled 1, 2 1, and 2 2 in the tree diagram, whose members are shown in Table 2. Each member indicates the column number in the data matrix, i.e., the number of the spectrum from one given patient. Recall that columns 1–102 and 103–182 correspond to the spectra of the samples from the control and diabetes patients, respectively. We can observe that the members of clusters 1 and 2 correspond mostly to Raman spectra from our control and diabetes patient groups, respectively. Later, cluster 2 was divided into the groups 2 1 and 2 2. This second split is consistent with the second peak in the magnetic susceptibility curve. Because the SPC method lists the members of each cluster explicitly, the sensitivity and specificity were easily calculated: the numbers of true-positive, false-negative, true-negative, and false-positive cases were obtained in a less-biased way by merely comparing the members of the SPC clusters in Table 2 with the number of spectra measured from control and diabetes samples provided by the health centers. According to this information, the numbers of true-positive (TP), false-negative (FN) (members indicated in green, Table 2), true-negative (TN), and false-positive (FP) (members indicated in yellow, Table 2) cases are 78, 2, 93, and 9, respectively.

Thus, we were able to detect differences between control and diabetes spectra using SPC with 97.5% sensitivity and 91.2% specificity. The sensitivity and specificity of the proposed method are also high compared with the detection method currently used.
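These percentages follow directly from the reported counts (TP = 78, FN = 2, TN = 93, FP = 9) with the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The short check below uses hypothetical function names.

```python
def sensitivity(tp, fn):
    """Fraction of diabetes spectra assigned to the diabetes cluster."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of control spectra assigned to the control cluster."""
    return tn / (tn + fp)

sens = sensitivity(tp=78, fn=2)    # 78 of 80 diabetes spectra
spec = specificity(tn=93, fp=9)    # 93 of 102 control spectra
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
# prints "sensitivity = 97.5%, specificity = 91.2%"
```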

It is important to note that when a cross-check is made using another classification method such as principal component analysis and linear discriminant analysis [5], the members 132 and 174 from cluster 1, and 88, 91, and 99 from cluster 2 1 are also misclassified, in perfect agreement with our SPC result, although there is a disagreement for the members 86, 92, 98, 100, 101, and 102 from cluster 2 2. Despite this disagreement in cluster 2 2, the SPC method, based on concepts of statistical physics and stochastic aspects, has high sensitivity and specificity consistent with the number of control patients and the number of patients from the health centers detected with high glucose concentrations.

On the other hand, because we have only basic information about the diabetes patients, we lack a satisfactory explanation for the split of cluster 2 into the substructures, clusters 2 1 and 2 2. Nevertheless, the presence of a healthy patient classified by the SPC method as a diabetes patient (spectra 98, 99, 100, 101, and 102 correspond to the same healthy patient) suggests that the split could reflect some very marked characteristics within the group of diabetes patients, such as a patient in a prediabetes stage (healthy patients with glucose concentrations close to those of a diabetic patient). Another possible explanation for the split is a wrong diagnosis using Raman spectroscopy and the SPC method, as can happen with any other detection method.

Figure 4(a) shows the comparison of the average Raman spectra of the samples from healthy patients and one of the misclassified diabetes spectra (spectrum 132), marked with green in Table 2. The two spectra appear to contain the same Raman bands, only minimal differences in the intensities were observed, and therefore, Raman spectrum 132 was classified in the same cluster from healthy patients. On the other hand, Figure 4(b) shows the comparison of the average Raman spectra of the samples from diabetes patients and one of the misclassified control spectra (spectrum 100), marked with yellow in Table 2. The two spectra also appear to contain the same Raman bands with minimal differences in the intensities, so Raman spectrum 100 was classified in the same cluster from diabetes patients. One possible explanation for these facts is that the point of the blood serum sample of a healthy patient (diabetes patient), where the laser was focused, has chemical components almost identical to those at a point in the sample of a diabetes patient (control patient). It shows the importance of measuring as many spectra as possible by focusing the laser at different points throughout the sample, obtaining its complete characterization.

Given these spectral differences, it could be interesting to study the transpose of the data matrix, which would allow the analysis of the correlation between the different Raman peaks instead of the relationship between spectra. In this case, we would have clusters of peaks, where each cluster could identify specific molecules present in the samples, and several clusters of peaks inside a larger cluster would indicate that all those groups of molecules maintain some chemical relationship according to the biochemical information reflected in Raman spectra of the samples from control and diabetes patients. Molecules with a known functional role may be used to infer the functional role of molecules that are in the same cluster and whose role was initially unknown. Consequently, the hierarchy of clusters obtained using the SPC method could contribute to the understanding of the cellular biochemical behavior that gives rise to diabetes.

In addition, if we added Raman spectra of serum samples from type 1 diabetes patients to our bank of Raman spectra from type 2 diabetes and control patients, SPC might play a more significant role in the diagnosis of diabetes types, i.e., discriminating directly between type 1 and type 2 diabetes, with the expectation of again observing a hierarchical structure of clusters. We would expect the leading cluster to split into two clusters, one corresponding to control patients and the other to diabetes patients. Furthermore, the cluster corresponding to diabetes patients would be expected to split into two clusters, one corresponding to type 1 diabetes patients and the other corresponding to type 2 diabetes patients. This SPC result could be of great interest in the biomedical field.

5. Conclusions

In this paper, we proposed the superparamagnetic clustering method as an alternative to the standard methods for identifying patterns in large banks of spectra based on the similarity of their spectral bands. This method, which uses the Potts spin model from statistical physics, successfully discriminated diabetes spectra from control spectra with high sensitivity and specificity through a hierarchical structure of clusters. Although the split of the diabetes cluster into smaller clusters could not be satisfactorily explained because of the scarce biomedical information available on the diabetes patients, a possible explanation is either the existence of a control patient with high glucose concentrations (prediabetes patient) or merely a wrong diagnosis using Raman spectroscopy and the SPC method.

The SPC method presented the results in such a way that the sensitivity and specificity were easily calculated, obtaining the number of true-positive, false-negative, true-negative, and false-positive cases in a less-biased way by merely observing the number of members of the SPC clusters and the number of spectra measured from diabetes and control samples provided by the health centers. As a cross-check, SPC results were compared with published results of multivariate analysis, showing excellent agreement, but the SPC method explicitly determines the members of all identified clusters.

SPC could play an interesting role in the diagnosis of diabetes types, i.e., discriminating directly between the type 1 and type 2 diabetes, by observing a hierarchical structure of clusters from diabetes patients, that is, the leading cluster would split into two clusters, one corresponding to control patients and the other to diabetes patients, and the cluster corresponding to diabetes patients would split into two clusters, one corresponding to type 1 diabetes patients and the other corresponding to type 2 diabetes patients. These SPC results could be of enormous interest in the biomedical field.

Data Availability

The data used to support the findings of this study are included within the supplementary information file.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The authors wish to thank members from the Thematic Research Network of CONACYT, Soft Condensed Matter, for their comments and suggestions.

Supplementary Materials

(1) Data-Ramanspectra-Diabetes-Spcjournal of Spectroscopy.txt: it is a 2330 × 182 data matrix whose columns are Raman spectra. The data matrix is made up of 182 Raman spectra with the first 102 spectra from control patients and the next 80 spectra from diabetes patients. Each spectrum is composed of 2330 peaks with their respective intensities. (2) Cover Letter: brief description of the results reported in the article.