Abstract

Data are now generated at high speed, so recognizing informative features and reducing the dimensionality of data without losing useful information is of great importance. There are many approaches to dimensionality reduction, including principal component analysis (PCA), which reduces the dimension of the data by identifying the dimensions that explain an acceptable share of the variance. In the usual application of PCA, the data are either assumed to be normal or are normalized first, and then the method is applied. Many studies have used PCA as a data preparation step. In this paper, we propose a method that improves PCA and makes data analysis easier and more efficient: we first identify the relationships between the variables by fitting a multivariate copula function to the data, simulate new data using the estimated parameters, and then reduce the dimension of the simulated data with PCA. The aim is to improve the ability of PCA to find effective dimensions.

1. Introduction

In many real-world applications, reducing high-volume data is an important and necessary preprocessing stage. In data mining, for example, dimensionality reduction is considered one of the most important stages for removing redundancy, increasing measurement precision, and improving the decision making process. High-volume data are intrinsically difficult to analyze because of the heavy computations they impose on learning algorithms and data processing. In dimensionality reduction methods, the extraction of data features is crucial. A widely used method for reducing the dimension of data in data mining, during the data preparation phase, is principal component analysis. PCA is appropriate when the original variables are correlated and homogeneous; each derived component is guaranteed to be uncorrelated with the others, and the method performs best when the dataset is normally distributed [1, 2]. A critical issue for most dimensionality reduction studies is how to generate correlated multivariate random variables conveniently without imposing constraints on the types of marginal distributions. An appropriate approach to this problem is copula theory [3, 4]. In this paper, we first use the copula function to study the correlations and relationships in the data, identify and eliminate irrelevant attributes, and simulate new data using the estimated parameters; then we reduce the dimension of the simulated data with PCA [4-6].

1.1. Principal Component Analysis (PCA)

Principal component analysis was first developed by Karl Pearson in 1901. The analysis consists of an eigenvalue decomposition of the covariance matrix. In mathematical terms, principal component analysis is an orthogonal transformation taking the data to a new coordinate system such that the largest data variance lies on the first coordinate axis, the second largest variance on the second coordinate axis, and so on. Principal component analysis aims to transform a dataset with $p$ dimensions into data with $k < p$ dimensions. It is assumed that the data matrix $X$ is formed of $n$ observation vectors $x_1, \dots, x_n$, each placed as a column, so the data matrix has size $p \times n$. The principal components depend only on the covariance matrix $\Sigma$ (or the correlation matrix $\rho$) of the random variables [7].

1.2. Calculating Empirical Mean and Covariance Matrix and Data Normalization

To calculate the covariance matrix, the data must first be normalized. To do so, the vector of empirical means is calculated as
\[ u_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}, \quad i = 1, \dots, p. \tag{1} \]

Clearly, the empirical mean is taken along the rows of the matrix.

Then the matrix of deviations from the mean is obtained as
\[ B = X - u\,h^{T}, \tag{2} \]
where $h$ is a vector of size $n \times 1$ with every entry equal to 1.

The $p \times p$ covariance matrix is then obtained as
\[ C = \frac{1}{n-1}\, B B^{*}, \tag{3} \]
where $\frac{1}{n-1}$ is the normalization factor, $BB^{*}$ denotes the outer matrix product, and $B^{*}$ is the conjugate transpose of $B$.
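The mean-centering and covariance steps above can be sketched in a few lines of NumPy (a sketch only; NumPy stores observations as rows, the transpose of the column-per-observation convention used in the text):

```python
import numpy as np

# Empirical mean, centering, and sample covariance on a small toy matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # n = 5 observations (rows), p = 3 variables

n = X.shape[0]
u = X.mean(axis=0)                     # empirical mean of each variable
B = X - np.ones((n, 1)) @ u[None, :]   # deviations from the mean
C = (B.T @ B) / (n - 1)                # unbiased p x p sample covariance matrix

print(np.allclose(C, np.cov(X, rowvar=False)))  # matches NumPy's own estimator
```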

Consider the random vector $X^{T} = (X_1, X_2, \dots, X_p)$ and assume that it has covariance matrix $\Sigma$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$. Consider the following linear combinations:
\[ Y_i = a_i^{T}X = a_{i1}X_1 + a_{i2}X_2 + \dots + a_{ip}X_p, \quad i = 1, \dots, p. \tag{4} \]

Using relationship (4), we have
\[ \operatorname{Var}(Y_i) = a_i^{T}\Sigma a_i, \qquad \operatorname{Cov}(Y_i, Y_k) = a_i^{T}\Sigma a_k. \tag{5} \]

The principal components are the uncorrelated linear combinations whose variances in relationship (5) are as large as possible. The first principal component is the linear combination with maximum variance. Clearly, $\operatorname{Var}(Y_1)$ can be made arbitrarily large by multiplying $a_1$ by a constant; therefore, the first principal component is the linear combination $a_1^{T}X$ that maximizes $\operatorname{Var}(a_1^{T}X)$ subject to $a_1^{T}a_1 = 1$. The second principal component is the linear combination $a_2^{T}X$ that maximizes $\operatorname{Var}(a_2^{T}X)$ subject to $a_2^{T}a_2 = 1$ and $\operatorname{Cov}(a_1^{T}X, a_2^{T}X) = 0$, and so on up to the $p$th principal component.

According to relationship (5), we have
\[ \sum_{i=1}^{p} \operatorname{Var}(X_i) = \operatorname{tr}(\Sigma) = \lambda_1 + \lambda_2 + \dots + \lambda_p, \tag{6} \]
and the proportion of the total variance due to the $k$th component is
\[ \frac{\lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_p}, \quad k = 1, \dots, p. \tag{7} \]

If, for large $p$, most of the total population variance (80 or 90%) can be attributed to the first few components, these components can replace the primary variables without losing much information [2, 8-10].
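The selection rule above, keeping the first components whose cumulative variance ratio passes a chosen threshold, can be sketched as follows (the toy data and the 90% threshold are our own illustrative choices):

```python
import numpy as np

# Choosing the number of principal components by cumulative variance ratio.
# Toy data: 3 variables, the third nearly a linear combination of the first two,
# so two components should capture almost all of the variance.
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2))
noise = 0.05 * rng.normal(size=200)
X = np.column_stack([Z[:, 0], Z[:, 1], Z[:, 0] + Z[:, 1] + noise])

S = np.cov(X, rowvar=False)            # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]  # eigenvalues, sorted descending
ratios = eigvals / eigvals.sum()       # proportion of total variance per component
cum = np.cumsum(ratios)

# Smallest k whose cumulative ratio passes the (arbitrary) 90% threshold.
k = int(np.searchsorted(cum, 0.90) + 1)
print(k)  # the built-in redundancy makes 2 components sufficient
```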

2. Copula Function

In general, the copula function is the function that links a multivariate distribution to its marginal distributions. A copula is itself a multivariate distribution whose marginal distributions are uniform on the interval [0, 1] [11-13].

2.1. Characteristics of Copula Function

Assume that $C : [0,1]^2 \to [0,1]$ has the following characteristics: (1) For every $u, v \in [0,1]$, we have $C(u, 0) = C(0, v) = 0$, $C(u, 1) = u$, and $C(1, v) = v$. (2) For every $u_1 \le u_2$ and $v_1 \le v_2$ in $[0,1]$, we have
\[ C(u_2, v_2) - C(u_2, v_1) - C(u_1, v_2) + C(u_1, v_1) \ge 0. \]

A function $C$ satisfying the two conditions above is called a copula function [14].

2.2. Sklar’s Theorem

Sklar’s theorem states that if a joint distribution function $H$ is available with marginal distributions $F$ and $G$, then there exists a copula function $C$ such that, for every $x, y$, we have
\[ H(x, y) = C(F(x), G(y)), \]
and if $F$ and $G$ are continuous, then the copula function $C$ is unique. Otherwise, $C$ is uniquely defined on $\operatorname{Ran} F \times \operatorname{Ran} G$.

The most important application of the copula function is the formulation of a proper method for generating correlated multivariate random variables and for simplifying the density estimation problem via transformation [15].

Continuous random variables $X_1, \dots, X_d$ can be reversibly transformed, through their distribution functions, into variables $U_i = F_i(X_i)$, $i = 1, \dots, d$, each with uniform distribution on $[0,1]$; the probability density function of each $U_i$ equals 1 on $[0,1]$, and the joint probability density function of $(U_1, \dots, U_d)$ equals the copula density $c(u_1, \dots, u_d)$. Therefore, the probability density function can be handled in a nonparametric form (with unknown marginal distributions): the density $c(u_1, \dots, u_d)$ is estimated instead of the density of $(X_1, \dots, X_d)$, so the density estimation problem becomes simpler. Simulation then proceeds by generating uniform samples and obtaining random samples through the inverse transformation $x_i = F_i^{-1}(u_i)$.
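The probability integral transform described above can be illustrated directly (the exponential marginal is an arbitrary choice for the example):

```python
import numpy as np
from scipy import stats

# The probability integral transform: U = F(X) is Uniform(0,1) for continuous X,
# and X = F^{-1}(U) recovers the samples on the original scale.
rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=10_000)

u = stats.expon.cdf(x, scale=2.0)       # forward transform: uniform on (0, 1)
x_back = stats.expon.ppf(u, scale=2.0)  # inverse transform recovers the sample

print(np.allclose(x, x_back))           # round trip is (numerically) exact
print(abs(u.mean() - 0.5) < 0.02)       # mean of Uniform(0,1) is 1/2
```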

According to Sklar’s theorem, a unique $d$-dimensional copula function $C$ with uniform marginal distributions on $[0,1]$ is available for every continuous joint distribution function. That is, every distribution function $F$ with margins $F_1, \dots, F_d$ can be written as follows:
\[ F(x_1, \dots, x_d) = C(F_1(x_1), \dots, F_d(x_d)). \]

To evaluate a copula function selected via an estimated parameter, and to avoid imposing any hypothesis on the marginal distributions, the empirical distribution function can be used. The empirical copula function is useful for studying the dependence structure of multivariate random vectors. In general, the empirical copula function is
\[ C_n(u_1, \dots, u_d) = \frac{1}{n}\sum_{i=1}^{n} \prod_{j=1}^{d} \mathbf{1}\{ F_{n,j}(X_{ij}) \le u_j \}, \]
where $\mathbf{1}\{\cdot\}$ is the indicator function and $F_{n,j}$ is the empirical marginal distribution function of the $j$th variable [16].
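A minimal estimator following the empirical copula formula above might look as follows (bivariate case for readability; the rank-based marginal CDFs are one common convention):

```python
import numpy as np

def empirical_copula(data, u, v):
    """Empirical copula C_n(u, v) of a bivariate sample, via normalized ranks."""
    n = data.shape[0]
    # Double argsort yields ranks 1..n; rank/n approximates the marginal CDF
    # evaluated at each observation.
    r1 = np.argsort(np.argsort(data[:, 0])) + 1
    r2 = np.argsort(np.argsort(data[:, 1])) + 1
    return np.mean((r1 / n <= u) & (r2 / n <= v))

rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 2))        # independent margins
# For independent variables the true copula is C(u, v) = u * v.
print(empirical_copula(data, 0.5, 0.5))  # close to 0.25
```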

2.3. Gaussian Copula Function

The difference between the Gaussian copula function and the multivariate normal distribution function is that the former allows arbitrary marginal distribution functions to be combined into a joint distribution [14]. In probability theory and statistics, the multivariate normal distribution is considered the generalization of the one-dimensional normal distribution [17].

The Gaussian copula function is defined as
\[ C_R(u_1, \dots, u_d) = \Phi_R\left(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d)\right), \]
where $\Phi$ is the standard normal distribution function, $\Phi^{-1}$ is its inverse, and $\Phi_R$ is the joint distribution function of a standard multivariate normal vector with correlation matrix $R$. A copula function of this form is called a Gaussian copula function.
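Sampling from a Gaussian copula follows directly from this definition: draw a correlated multivariate normal vector and push each coordinate through $\Phi$. A minimal sketch (the value rho = 0.8 is illustrative):

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(rho, n, seed=0):
    """Draw n points from a bivariate Gaussian copula with correlation rho."""
    R = np.array([[1.0, rho], [rho, 1.0]])   # correlation matrix
    L = np.linalg.cholesky(R)                # factor so that L @ L.T == R
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 2)) @ L.T        # correlated standard normals
    return stats.norm.cdf(z)                 # push through Phi: uniform margins

u = gaussian_copula_sample(rho=0.8, n=50_000)
print(u.min() > 0.0 and u.max() < 1.0)       # samples live in the unit square
print(np.corrcoef(u[:, 0], u[:, 1])[0, 1])   # positive, slightly below 0.8
```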

3. Methodology

In this research, a two-stage method is used for dimensionality reduction. First, the empirical copula function and a Gaussian copula function fitted to the data are used to estimate the parameter $R$ for the variables $X_1, \dots, X_p$. An important advantage of using copula functions for multivariate distributions is that the correlation between variables is taken into account by these functions; in fact, there is no need to assume independence of the variables, since the correlation structure between them is modeled explicitly [18]. For estimation, the dependence parameter of the copula must be specified through a correlation coefficient. To do so, the Pearson correlation coefficient is used, defined for two variables $X$ and $Y$ as
\[ \rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}, \]
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively.

Then, the variables with lower correlation compared to the others are eliminated, and using the estimated parameters of the Gaussian copula, uniform variables $u_1, \dots, u_d$ are generated and used in place of the original data in the principal component analysis. After dimensionality reduction, the results are compared with those obtained by applying the method to the raw data [16, 19].
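The two-stage procedure can be sketched end to end. This is our own illustrative reading of the method, not the authors' code; the elimination rule (drop the variable with the smallest average absolute copula correlation) and all names here are assumptions:

```python
import numpy as np
from scipy import stats

# Sketch of the two-stage method: (1) estimate the Gaussian-copula correlation
# matrix from normal scores, (2) eliminate the least-correlated variable,
# (3) simulate from the fitted copula, (4) run PCA on the simulated data.
def copula_then_pca(X, n_sim=5000, seed=0):
    n, p = X.shape
    # Normal scores: empirical marginal CDFs (ranks) pushed through Phi^{-1}.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (n + 1))
    R = np.corrcoef(z, rowvar=False)            # estimated copula correlation

    # Average absolute off-diagonal correlation per variable; drop the weakest.
    strength = (np.abs(R).sum(axis=0) - 1.0) / (p - 1)
    keep = np.delete(np.arange(p), np.argmin(strength))
    R_red = R[np.ix_(keep, keep)]

    # Simulate from the reduced Gaussian copula, then PCA on the simulation.
    rng = np.random.default_rng(seed)
    z_sim = rng.multivariate_normal(np.zeros(len(keep)), R_red, size=n_sim)
    eigvals = np.linalg.eigvalsh(np.corrcoef(z_sim, rowvar=False))[::-1]
    return keep, eigvals / eigvals.sum()        # kept columns, variance ratios

# Toy data: columns 0 and 1 share a common factor; column 2 is nearly independent.
rng = np.random.default_rng(4)
base = rng.normal(size=300)
X = np.column_stack([base + 0.3 * rng.normal(size=300),
                     base + 0.3 * rng.normal(size=300),
                     rng.normal(size=300)])
keep, ratios = copula_then_pca(X)
print(keep)  # the weakly correlated third column (index 2) is eliminated
```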

4. Numerical Results

Over the past 30 years, an increasing prevalence of urinary stone disease has been observed. About 80% of kidney stones are of the calcium oxalate type. Here, 79 urine samples are analyzed to determine whether some physical characteristics of the urine are related to the formation of calcium oxalate crystals. These data include the following columns (variables) and are available at https://cran.r-project.org/web/packages/cond.

Using the Gaussian copula function, the correlation values of the variables are obtained as follows:

Considering Table 1, the correlation of variable X2 is lower than that of the other variables, so it is eliminated at the first stage. After estimation of the parameters, new data are generated. Figure 1 shows the copula function for the original data and for the data generated by this method.

Now, data are generated based on the estimated parameters. To check whether the data are generated correctly, a Q-Q plot is drawn.

Correct data generation is confirmed by Figure 2. In the second stage, after elimination of the X2 variable, principal component analysis is performed on the generated data. Figure 3 shows the principal components for the original data and for the generated data, after removal of the X2 variable.

The proportions of the population variance associated with the principal components are provided in the following tables, and the corresponding scree plot is shown below.

Considering Tables 2 and 3 as well as Figure 4, it is observed that with the dimensionality reduction method presented in this research, the first two components account for more than 80% of the population variance, and the first component alone accounts for more than 70%.

Example 1. To recognize character images on a rectangular monitor, the display is divided into boxes, and the numbers of black and white dots (pixels) in these boxes are measured. The character images are based on 20 different fonts, and each image is randomly distorted. A file containing 20,000 unique stimuli was produced. Each stimulus was transformed and scaled into the following 7 numerical variables so that they lie within the range 0-15 (the dataset is available at https://cran.r-project.org/web/packages/mlbench/index.html).

There are 2,000 observations of these variables available.

Using the Gaussian copula function, the correlation values of the variables are obtained as follows:

X is the box under study. X1 is the horizontal position of the box (x.box), X2 is the vertical position of the box (y.box), X3 is the width of the box (width), X4 is the height of the box (height), X5 is the total number of "on" dots in the box (onpix), X6 is the mean horizontal position of the "on" dots in the box (x.bar), and X7 is the mean vertical position of the "on" dots in the box (y.bar).

Considering Table 4, the correlations of variables X6 and X7 are lower than those of the other variables, so these two are eliminated at the first stage; then the Gaussian copula function is fitted to the reduced data, and new data are generated using the estimated parameter, as shown in Figure 5.

Now, the data are generated; the corresponding Q-Q plot is as follows.

Next, principal component analysis is performed on the generated data. The diagrams of the principal components are shown in Figure 6.

Scree plots of the proportions of population variance associated with the principal components for both methods are as follows.

According to Tables 5 and 6 as well as Figure 7, with the recommended method the first two components account for almost 85% of the population variance and the first component accounts for almost 80%, whereas for the original data the first three components are needed to account for almost 85% of the population variance.

5. Conclusion

Considering the two examples above, the data generated from the estimated parameters of the Gaussian copula distribution are consistent with the original data (see Figures 2 and 8). Using the recommended method, the copula function recognizes the dependence structure between variables and eliminates redundant data, which increases the efficiency of the principal component analysis method as well as the speed of obtaining the analysis results (see Figures 4 and 7 and Tables 2, 3, 5, and 6). Given that data are now generated at high speed, appropriate and efficient methods for dimensionality reduction without loss of information are important and necessary, and the method recommended in this research is a useful one. The recommended method can also be combined with other dimensionality reduction techniques to prepare data for further analysis, for example in data mining.

Data Availability

The data that support the findings of this study are openly available at https://cran.r-project.org/web/packages/cond and https://cran.r-project.org/web/packages/mlbench/index.html.

Conflicts of Interest

The authors declare that they have no competing interests.

Authors’ Contributions

All authors contributed equally. All authors read and approved the final manuscript.