Abstract

With respect to the problem of clustering the evaluation information of mass customers in service management, a new Gaussian kernel FCM (fuzzy C-means) clustering algorithm is proposed based on the idea of FCM. First, the paper defines a Euclidean distance formula between two data points and clusters them adaptively by a distance-based classification approach and nearest-neighbor deletion of redundant data. Second, the defects of the FCM algorithm are analyzed, and a solution algorithm is designed around the dual goals of short inner-class distances and long between-class distances. Finally, an example is given to compare the results with those of the existing FCM algorithm.

1. Introduction

Clustering is an unsupervised learning method that does not rely on predefined classes or training datasets with class labels. Objects are divided into classes or clusters on the basis of a feature similarity measure, so that objects within the same cluster are highly similar while objects in different clusters differ markedly. Traditional clustering methods are primarily based on partition, hierarchy, grid, density, and model. As the rapid development of data mining imposes higher requirements on clustering, clustering algorithms based on sample attribution, preprocessing, similarity measurement, allocation and scheduling, update strategy, and measurement [1, 2] have been advanced and applied to data mining [3, 4]. Considering the fuzziness of the membership between sample points and cluster centers, the objective function-based fuzzy c-means (FCM) algorithm still prevails in theory and practice.

The core of an FCM algorithm is the design and determination of the clustering centers. The design mainly consists of fixing the number of cluster centers, locating them, and formulating an objective function accordingly. The number of cluster centers is set manually in most cases, or its optimal value is determined within a given range using information entropy and other methods. For example, Duan and Wang [5] obtained the clustering center from multiattribute information with broken-line fuzzy numbers. A novel clustering algorithm, Nei Mu, was proposed in [6], in which datasets are converted into data points of an attribute space to construct a directed graph of K-nearest neighbors. This algorithm improves the clustering of data with large density fluctuations and arbitrary distributions, but not all data points have K-nearest neighbors. Xue and Sha [7] initiated a coordinate-based density method using a gray prediction model to determine the initial clustering center.

A clustering center should be determined and modified in a dynamic process. Existing determination methods largely include K-means clustering algorithms, partition- and density-based clustering algorithms, clustering algorithms based on the local density of data points, and KZZ algorithms. Of these, the K-means algorithm starts from a given initial center, whereas partition- and density-based clustering algorithms determine the initial clustering center from a density function of the sample points using max-min distance means or the maximum distance product method. Zhang and Wang [8] pointed out that the nearest data points can be grouped to facilitate locating the other clustering centers while a high-constraint penalty is added to the objective function. Chiu [9] defined a measure for each data point to identify the initial clustering center. Agustin et al. [10] studied a group genetic algorithm, aiming to improve the performance of group clustering by coding and defining fitness functions. A semisupervised clustering algorithm was put forward in [11] via the kernel FCM clustering algorithm, with clustering errors containing labeled and unlabeled data used to design the objective function. Since FCM fails to deal with noise, an efficient kernel-induced FCM based on a Gaussian function was presented in [12] to improve the objective function.

The following are some representative FCM studies. Qian and Yao [13] focused on the high sensitivity to the initial center point and introduced three incremental fuzzy clustering algorithms for large-scale sparse high-dimensional datasets. Niu and She [14] proposed a fast parallel clustering algorithm based on cluster initialization. By generating a hierarchical K-means clustering tree to autoselect the number of clusters, Hu [15] obtained better clustering results. Aiming at the high time complexity of traditional FCM algorithms, a single-pass Bayesian fuzzy clustering algorithm was advocated for large-scale data in [16], which boosted performance in time complexity and convergence. Zhou et al. [17] introduced the neighborhood information of multidimensional data to improve the clustering algorithm, increasing robustness to outliers and noise points. Chen and Liu [18] designed a clustering algorithm on the minimum connected dominating set to remedy the defect that common algorithms easily fall into local minimum points. Xie et al. [19] combined the GWO algorithm with the principle of maximum entropy in a multidimensional big data environment. Duan and Wang [5] described the multiple attributes of the objects to be clustered as polygonal fuzzy numbers and designed a clustering algorithm accordingly. By advancing an adaptive algorithm for the entropy weight of the feature weights of FCM, Huang et al. [20] focused on the influence of feature weights on a clustering algorithm. Taking the clustering degree of preference vectors as a neighborhood similarity, Xu and Fan [21] constructed a heuristic clustering algorithm for multiattribute complex large-group clustering and decision-making.

These studies address issues associated with the FCM algorithm, but few advances have been made for big data scenarios. In view of the differences between clustering large numbers of data points and clustering small samples, this paper simplifies the sample points of big data, making FCM more applicable to big data scenarios. Next, an FCM algorithm is designed that accounts for both long between-class distances and short inner-class distances, which traditional FCM algorithms fail to do. This study thus provides theoretical and practical guidance for data clustering in a big data environment.

2. Gaussian Kernel FCM Clustering Algorithm

Service resources are generally allocated through multiple channels, and because the resources are limited, the quantity allocated through one channel trades off against the quantities allocated through the others. Group consistency is therefore hard to reach when different resource consumers prefer different channels, leading to changing evaluation data. If the price mechanism fails to optimize the allocation of service resources, consumer demands should be considered alongside social benefits to attain higher allocation efficiency. Consumers are primarily characterized by heterogeneity, conflicts of interest, and differences in evaluation forms, which necessitates decomposing the customer group: the large-scale consumer group is divided into several small clusters, thus simplifying resource coordination.

Suppose that the consumer subject of a service resource is expressed as $X = \{x_1, x_2, \ldots, x_n\}$, an individual consumer as $x_j$, the number of channels (data dimension) as $p$, and the evaluation data as $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$; $u_{ij}$ is the membership of sample $x_j$ in Class $i$, with fuzzy matrix $U = [u_{ij}]_{c \times n}$ provided that there are $c$ classes, and $V = \{v_1, v_2, \ldots, v_c\}$ represents the cluster centers. The objective function of the Gaussian kernel FCM clustering algorithm [8] can be represented as

$$J(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \left(1 - K(x_j, v_i)\right), \tag{1}$$

where $K(x_j, v_i) = \exp\left(-\|x_j - v_i\|^2 / \sigma^2\right)$, $\sigma$ is a characteristic constant of the Gaussian function, and $m > 1$ is a fuzzy index used to control the fuzzy degree of classification. The higher the index, the higher the fuzzy degree. $\sigma^2$ is the variance of the given data. Hence,

$$u_{ij} = \frac{\left(1 - K(x_j, v_i)\right)^{-1/(m-1)}}{\sum_{k=1}^{c} \left(1 - K(x_j, v_k)\right)^{-1/(m-1)}}, \tag{2}$$

$$v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} K(x_j, v_i)\, x_j}{\sum_{j=1}^{n} u_{ij}^{m} K(x_j, v_i)}. \tag{3}$$

If $|J^{(t+1)} - J^{(t)}| < \varepsilon$ for a given tolerance $\varepsilon$, then the iteration is discontinued, at which point the classification is optimal. Both traditional FCM algorithms and the Gaussian kernel FCM clustering algorithm focus on the inner-class distance rather than the between-class distance; to obtain better clustering results, both should be considered. Because service resources are allocated to a large number of consumers, directly computing the memberships leads to high computational complexity and slow convergence to the optimal solution, reducing clustering efficiency. Therefore, data points should be preprocessed prior to clustering to reduce the number of points that need clustering and to enhance the scalability of the clustering algorithm.
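
To make the iteration above concrete, the following Python sketch implements equations (1) to (3) with vectorized NumPy operations. It is a minimal illustration rather than the authors' original implementation: the function names, the default parameter values, and the clipping of the kernel-induced distance away from zero are our own assumptions.

```python
import numpy as np

def gaussian_kernel(X, V, sigma):
    """K(x_j, v_i) = exp(-||x_j - v_i||^2 / sigma^2); returns a (c, n) array."""
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def kernel_fcm(X, V0, m=2.0, sigma=1.0, eps=1e-5, max_iter=100):
    """Gaussian kernel FCM: alternate updates (2) and (3) until (1) stabilizes."""
    X = np.asarray(X, dtype=float)       # (n, p) evaluation data
    V = np.asarray(V0, dtype=float)      # (c, p) initial cluster centers
    J_prev = np.inf
    for _ in range(max_iter):
        K = gaussian_kernel(X, V, sigma)           # (c, n) kernel values
        dist = np.clip(1.0 - K, 1e-12, None)       # kernel-induced distance, kept > 0
        w = dist ** (-1.0 / (m - 1.0))             # membership update, eq. (2)
        U = w / w.sum(axis=0, keepdims=True)
        coef = (U ** m) * K                        # center update, eq. (3)
        V = (coef @ X) / coef.sum(axis=1, keepdims=True)
        J = ((U ** m) * dist).sum()                # objective value, eq. (1)
        if abs(J_prev - J) < eps:                  # stop on |J(t+1) - J(t)| < eps
            break
        J_prev = J
    return U, V, J
```

Starting from the initial centers produced in Section 4, the loop alternates the membership update (2) and the center update (3) until the objective (1) stabilizes.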

3. Preprocessing of Evaluation Information of Consumers

Any two evaluation data points $(x_j, x_k)$ can be considered as a constrained data pair. The Euclidean distance formula is deployed to calculate their distance:

$$d(x_j, x_k) = \|x_j - x_k\| = \sqrt{\sum_{l=1}^{p} \left(x_{jl} - x_{kl}\right)^2}. \tag{4}$$

Set thresholds $\varepsilon_1$ and $\varepsilon_2$ in advance for $d(x_j, x_k)$, with $\varepsilon_1 < \varepsilon_2$ (both $\varepsilon_1$ and $\varepsilon_2$ can take lower values for more accurate classification).

(i) If $d(x_j, x_k) \le \varepsilon_1$, then $x_j$ and $x_k$ are considered extremely close and can be placed into one class.
(ii) If $d(x_j, x_k) \ge \varepsilon_2$, then $x_j$ is considered far from $x_k$, and bracketing them together is next to impossible.
(iii) Data points with distances between $\varepsilon_1$ and $\varepsilon_2$ cannot be effectively identified.
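
As a small illustration of this three-way screening, the sketch below classifies a single data pair by formula (4); the function name and the string labels are hypothetical, and the thresholds eps1 and eps2 are assumed to be supplied by the analyst.

```python
import numpy as np

def screen_pair(x_j, x_k, eps1, eps2):
    """Three-way screening of a data pair by Euclidean distance, formula (4)."""
    d = np.linalg.norm(np.asarray(x_j, float) - np.asarray(x_k, float))
    if d <= eps1:
        return "merge"         # extremely close: place into one class
    if d >= eps2:
        return "separate"      # far apart: merging is next to impossible
    return "undetermined"      # between eps1 and eps2: cannot be identified
```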

To delete data points quickly, the characteristics of the distances between data points and the possibility of clustering different data points should be considered in the preprocessing procedure. Deletion proceeds via the following steps (a code sketch of Steps 1 to 4 follows this list):

(i) Step 1: take the data points $x_a$ and $x_b$ with the smallest distance in $X$ that meet $d(x_a, x_b) \le \varepsilon_1$, and combine them into one cluster $B$. Then, $B = \{x_a, x_b\}$, and $X = X \setminus \{x_a, x_b\}$.
(ii) Step 2: take the mean value of $x_a$ and $x_b$ as a new data point, and identify a data point in Set $X$ whose distance to this mean is less than $\varepsilon_1$; move it from $X$ into $B$.
(iii) Step 3: take the mean value of all points in $B$ as a new data point. Repeat Step 2 until no new data points can be found, and form the new sets $B$ and $X$.
(iv) Step 4: repeat Steps 1 to 3 for Set $X$ to form Set $S_1$, consisting of the clusters and the remaining isolated points, wherein the final mean values of the data points in each cluster are taken as new data points, respectively.
(v) Step 5: let the data points in $S_1$ be nodes in a graph based on graph theory, and let the connecting line of two nodes carry their distance. If the distance is greater than $\varepsilon_2$, then the connecting line is deleted, thus forming a connected network graph. Assuming that points $x_a$, $x_b$, and $x_c$ make a cycle, and $x_c$ is the farthest from $x_a$, it can be considered that there is a higher probability of forming a cluster by $x_a$ and $x_b$ than by $x_a$ and $x_c$, so the connecting line between $x_a$ and $x_c$ can be deleted. The resulting graph $G$ without cycles is the connected network graph.
(vi) Step 6: in Graph $G$ with a plurality of nodes, the nodes are sorted by the number of adjacent points. Each node together with its nondominated neighboring nodes makes a cluster.
(vii) Step 7: since cluster sides vary in length, it is difficult to generate an effective cluster set using the average side length alone. After deletion, each data point in a cluster lies at a roughly equal distance from the cluster center, so point estimation may be adopted to work out the expected value of this distance. Given that a cluster has $n_k$ neighbors, its sample variance is $s_k^2 = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} (d_i - \bar{d})^2$, where $\bar{d}$ is the mean distance to the cluster center. An adaptive k-nearest-neighbor algorithm is used to search for the data point closest to, or within a set distance from, the given data point and merge it into the clustering components for clustering fusion. A data point whose distance satisfies this condition will be included in the cluster; otherwise, the data point is deleted from it. If a data point qualifies for multiple classes, it enters the cluster to whose center it is nearest. The average value of all data points in a cluster then becomes its representative data point, and these new points form Set $S_2$.
(viii) Step 8: the data points are downsized using the approach above, and the pertinence of clustering the simplified set is strengthened. The original dataset $X$ becomes Set $S_2$.
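
The following sketch shows one possible reading of Steps 1 to 4: repeatedly seed a cluster from the closest remaining pair within $\varepsilon_1$, grow it by absorbing points that lie within $\varepsilon_1$ of the running mean, and replace the finished cluster by its mean point. The greedy growth order and the stopping rule are our assumptions, not the authors' exact bookkeeping.

```python
import numpy as np

def merge_close_points(X, eps1):
    """Steps 1-4: collapse points closer than eps1 into their mean points."""
    pts = [np.asarray(x, dtype=float) for x in X]
    reduced = []
    while len(pts) >= 2:
        # Step 1: find the closest remaining pair
        P = np.array(pts)
        D = np.linalg.norm(P[:, None] - P[None, :], axis=2)
        np.fill_diagonal(D, np.inf)
        a, b = np.unravel_index(np.argmin(D), D.shape)
        if D[a, b] > eps1:                  # no pair close enough: stop merging
            break
        cluster = [pts[a], pts[b]]
        for idx in sorted((a, b), reverse=True):
            pts.pop(idx)
        # Steps 2-3: absorb points within eps1 of the running mean
        grown = True
        while grown and pts:
            center = np.mean(cluster, axis=0)
            dists = [np.linalg.norm(p - center) for p in pts]
            i = int(np.argmin(dists))
            if dists[i] < eps1:
                cluster.append(pts.pop(i))
            else:
                grown = False
        reduced.append(np.mean(cluster, axis=0))  # final mean as new data point
    reduced.extend(pts)                     # Step 4: leftovers stay as single points
    return np.array(reduced)
```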

4. Clustering Algorithm of Consumer Evaluation

The number of clusters and the initial cluster centers must be determined before clustering with the FCM algorithm. The former may be obtained by manual specification or by defining an interval range and selecting the best cluster number within it. From the perspective of consumer clustering, the number of clusters equals the number of evaluation channels, that is, $c = p$, since clustering serves to coordinate the needs of consumers who prefer different service channels.

The initial cluster centers change with the objective function value of the optimal fuzzy classification. However, it is difficult to meet the difference requirements between classes. In general, scholars add a penalty function to the objective function of the existing model (1) or similar models to maximize the between-class distance $D_b$. However, the following problems are encountered:

(i) The inner-class distance is a function of the distances between each data point and its cluster center weighted by powers of the memberships, while the between-class distance is the average of the distance differences between cluster centers. The two vary widely in magnitude and are thus not directly comparable. Incorporating both into a single objective function (to be minimized) may fail to balance maximizing the between-class distance against minimizing the inner-class distance in an iteration and may instead focus on the former.
(ii) Iteration terminates when the difference between the objective function values of two successive iterations falls within a specific range, at which point the optimal cluster centers and membership function are taken to be obtained. However, the objective function is likely nonconvex with local optimal solutions, so the value at the end of the iteration may not be the minimum; the difference between the objective values of two successive iterations may be small at one stage yet large at a later one. The convergence of the algorithm therefore cannot be proven.

To maintain a short inner-class distance and a long between-class distance, the two indexes should be separated and a more appropriate iteration termination condition set, based on the above considerations. Determination of an optimized cluster center then proceeds through the following steps (a code sketch of Steps 1 and 4 follows this list):

(i) Step 1: as the clustering result is sensitive to the selection of the initial cluster centers, the distance between initial cluster centers should be made as large as possible. The dominant point with the most neighbors in the dataset is taken as the first cluster center, the data point farthest from the dominant point as the second, the data point with the largest product of distances to the first two centers as the third, and so on, until all $c$ initial clustering centers are obtained.
(ii) Step 2: calculate $U$ and $V$, respectively, by equations (2) and (3). $\sigma^2$ can be given or estimated by the sample variance $\sigma^2 = \frac{1}{n-1} \sum_{j=1}^{n} \|x_j - \bar{x}\|^2$.
(iii) Step 3: set the threshold $\varepsilon$ of the inner-class distance and $\delta$ of the between-class distance. The variance of the cluster centers, $D_b = \frac{1}{c} \sum_{i=1}^{c} \|v_i - \bar{v}\|^2$, is used to characterize the between-class differences.
(iv) Step 4: if $|J^{(t+1)} - J^{(t)}| < \varepsilon$ and $D_b > \delta$, then the iteration is terminated, and the obtained $U$ and $V$ are the most suitable membership function and cluster centers, respectively.
(v) Step 5 (sample classification): work out the distance between each data point and each cluster center, and assign each point to the center with the minimum distance.
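
A sketch of the seeding rule in Step 1 and the dual termination test in Step 4 is given below. The "dominant point" is approximated here as the point with the most neighbors within $\varepsilon_2$, and $D_b$ is formalized as the variance of the cluster centers; both are our assumptions, consistent with the text.

```python
import numpy as np

def init_centers(S, c, eps2):
    """Step 1: seed c initial centers spread as far apart as possible."""
    S = np.asarray(S, dtype=float)
    D = np.linalg.norm(S[:, None] - S[None, :], axis=2)
    # first center: the dominant point with the most neighbors within eps2
    chosen = [int(np.argmax((D < eps2).sum(axis=1)))]
    # second center: the point farthest from the first
    chosen.append(int(np.argmax(D[chosen[0]])))
    # remaining centers: largest product of distances to the chosen centers
    while len(chosen) < c:
        prod = np.prod(D[chosen], axis=0)
        prod[chosen] = -np.inf               # never reselect a chosen point
        chosen.append(int(np.argmax(prod)))
    return S[chosen]

def should_stop(J_prev, J, V, eps, delta):
    """Step 4: stop only if J has converged AND the centers stay well separated."""
    D_b = np.mean(np.linalg.norm(V - V.mean(axis=0), axis=1) ** 2)
    return abs(J_prev - J) < eps and D_b > delta
```

The dual condition prevents the iteration from halting at a point where the objective has merely plateaued while the centers have drifted too close together.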

5. Simulation Research

Given that a service resource targets a large number of consumers and may be allocated via five channels, a random sample survey was conducted on 100 consumers to seek their service evaluation data on each channel. The consumer group is clustered to pursue a more effective allocation of resources.

Following the given steps, the evaluation data are preprocessed to assess the possibility of clustering each data point (consumer). Calculate the distance between each pair of evaluation data points by formula (4), and set the thresholds $\varepsilon_1$ and $\varepsilon_2$:

(i) Data close to each other are initially clustered by Steps 1 to 4 in Section 3 to obtain Set $S_1$, composed of the merged clusters and the remaining points, where the mean points of the merged clusters are regarded as new data and each remaining point as a separate datum. Thus, the initial evaluation set is simplified to Set $S_1$ with 72 data points.
(ii) By Step 5, the data in Set $S_1$ are processed into the connected graph $G$ without cycles (see Figure 1) formed by the elements of $S_1$, wherein isolated points and points not on the main connected graph are not drawn.
(iii) Process each node of Graph $G$ by employing Steps 6 and 7 to conclude dataset $S_2$ (including 41 data points) on the basis of the adaptive nearest-neighbor classification rule. Each data point in $S_2$ represents one or several data points of the original set, whose partial correspondence is shown in Table 1.
(iv) Since the evaluation data involve 5 channels, find the 5 initial cluster centers $v_1, v_2, v_3, v_4$, and $v_5$ by Step 1 in Section 4.
(v) Given the fuzzy index $m$, calculate $U$ and $V$ with Steps 2 to 4 under preset thresholds $\varepsilon$ and $\delta$. The iteration stops once $|J^{(t+1)} - J^{(t)}| < \varepsilon$ and $D_b > \delta$, so the cluster centers reach their optimal state after 14 iterations.
(vi) Perform Step 5 to classify the samples, procuring a clustering result for Set $S_2$.
(vii) Obtain the clustering of the original evaluation set from Table 2 and the correspondence between Set $S_2$ and the original set in Table 1. This is the clustering result of the original evaluation data, as shown in Table 3.

The distance (expressed by $D$) over 33 iterations, computed according to [9], is shown in Table 4.

The changes in $D$ are shown in Figure 2.

Figure 2 illustrates that the value of $D$ first increases, then decreases, and then increases again, without monotone convergence. The clustering center may therefore not be optimal when the iteration stops as soon as $|J^{(t+1)} - J^{(t)}| < \varepsilon$, and a more suitable center satisfying the conditions may appear in later iterations with smaller values of $D$.

In this paper, the iteration ceased when $|J^{(t+1)} - J^{(t)}| < \varepsilon$ and $D_b > \delta$, where the additional condition $D_b > \delta$ ensured an appropriate distance between different classes and made it easier to attain appropriate values for $U$ and $V$.

6. Conclusion

Clustering of complex, very large groups is the basis for the effective distribution of service resources and for group coordination; nevertheless, traditional FCM and its improved versions are incapable of processing the numerous data points to be clustered. In this paper, the deletion of data points was therefore studied using a graph-based clustering algorithm, an adaptive clustering algorithm, and a Gaussian kernel clustering algorithm. Meanwhile, a new Gaussian kernel algorithm was proposed that accounts for both the inner-class distance and the between-class distance, which the traditional objective function fails to do.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.