Abstract

Web service composition is one of the core technologies for realizing service-oriented computing. It forms new value-added services by composing existing ones to satisfy user requirements. As Cloud Computing develops, the emergence of Web services with similar functionality but different quality has brought new challenges to the service composition optimization problem, and solving large-scale service composition in the Cloud Computing environment has become an urgent problem. To tackle this issue, this paper proposes a parallel optimization approach based on the Spark distributed environment. Firstly, a parallel covering algorithm is used to cluster the Web services. Next, the resulting clustering centers are used as the starting points of the particles to improve the diversity of the initial population. Then, following the parallel data coding rules of the resilient distributed dataset (RDD), large-scale composite services are generated with the proposed algorithm, named Spark Particle Swarm Optimization (SPSO). Finally, a particle elite selection strategy removes inert particles to improve the performance of composite service selection. Extensive experiments on the real data set WS-Dream demonstrate the validity of the proposed method.

1. Introduction

As big data develops, more and more users publish their resources in the form of Web services to promote the use of those services. Web services follow a distributed computing model that is self-contained, modular, and loosely coupled, and many of them offer similar functional attributes while differing in their nonfunctional attributes. Quality of Service (QoS) represents the nonfunctional attributes of Web services, such as Availability, Price, and Reputation. With an ever-larger number of cloud services, selecting the optimal cloud service composition that satisfies the user's requirements has become a matter of great interest in the field of service composition [1]. Existing service selection methods search for the best service composition based on QoS information. In [2], the service composition model is studied under Cloud Computing. In [3], a service composition approach for service level agreements (SLA) is proposed, in which a vague semantic preference derived from the user's preference is used to select optimal services. Work [4] presents a service selection method based on fuzzy logic, which uses intelligent cloud storage and is supported by extensive theoretical proof. In [5], a variety of hybrid services in heterogeneous clouds are used to perform service discovery and composition by Skyline operations. Work [6] presents an approach that finds a reliable dynamic service composition in two phases. Work [7] uses the weighted principal component analysis method to select multimedia services.

These methods explore potential problems in service composition and put forward solutions accordingly, but some problems remain to be addressed, such as the inefficiency of large-scale service composition selection in a Cloud Computing environment. Therefore, we propose a novel large-scale service selection method based on the distributed computing environment Spark [8], using a parallel particle swarm method to solve the service composition problem.

The contributions of this paper are summarized as follows.
(1) Based on the service selection characteristics of big service, we propose the SPSO service selection method. This method combines the strengths of Spark, the covering algorithm, and the particle swarm algorithm: Spark is used for parallelization, the covering algorithm for reducing the initial search space, and the particle swarm algorithm for optimizing the service selection. These three techniques are combined to solve the problem of large-scale service selection.
(2) In the service selection, SPSO is mainly divided into three phases. In the first phase, an efficient parallel algorithm, built on the covering algorithm from neural networks, clusters the candidate Web service sets to reduce their search space. In the second phase, based on the RDD parallel computing strategy, we realize the storage of and parallel search over large-scale composite services. In the third phase, we use a new elite selection strategy to optimize the service selection capacity of the population particles.
(3) In order to reflect the effectiveness of the proposed method, we have also parallelized the algorithm used for comparison. A large number of simulation experiments have been conducted on the real data set WS-Dream to verify the feasibility of solving large-scale service composition.

The rest of this paper is organized as follows: Section 2 introduces related work; Section 3 presents Web service composition model; Section 4 introduces the improved particle swarm method; Section 5 verifies the effectiveness of our approaches through simulating experiments. Finally, summary and future work are presented.

2. Related Work

Service composition, mainly used in service-oriented architecture (SOA) and grid manufacturing, is a typical NP-hard optimization problem. In the literature mentioned above, scholars have put forward various solutions for selecting an appropriate service composition, such as improving the efficiency and quality of the service composition and reducing the size of the candidate service set. There are three ways to select a service composition: local search, global search, and intelligent optimization algorithms. The local search method chooses the best service in each candidate service set and then combines them [9]. However, the resulting composite service may not be globally optimal. As for the global search method, the literature [10, 11] uses integer coding to solve the service composition search problem, which is highly efficient when the problem is small. In the cloud environment, however, the effectiveness of global search is weakened by its poor scalability as the business flow model of the service composition becomes complex. To tackle the scalability issue, swarm intelligence optimization algorithms, which are efficient and fast, are widely used in the service composition field. In [1], the bee colony algorithm is applied. A time enhancement function is introduced to establish a trusted service composition model, thus transforming the service composition problem into a nonlinear integer coding problem. In [12], a correlation-aware service model is given, and the genetic algorithm is used to find the service composition in cloud manufacturing. In [13], a new gene coding together with the differential evolution algorithm is used to find the service composition, which improves the convergence of the algorithm. Work [14] combines the advantages of the FOA algorithm and the genetic algorithm to find the composite service. In [15], the particle swarm algorithm is applied to service composition in cloud manufacturing.

However, in the Cloud Computing environment, these service selection methods may not be effective. Many scholars have therefore proposed parallel service selection methods to deal with large-scale service composition. In [16], from the perspective of Pareto optimality, a partial selection strategy is used to perform QoS-aware service composition selection. The Pareto set model proposed in that work is first proved effective theoretically and then evaluated by a large number of experiments. In [17], a large-scale service composition selection method based on the Hadoop distributed computing platform is introduced, in which the discrete particle swarm optimization algorithm is combined with the Hadoop platform to select the service composition. In [18], a parallel k-means algorithm and the particle swarm algorithm are used to select the service composition on the Hadoop platform. Although the Hadoop parallel computing platform makes full use of its computational advantage, it is inefficient in reading data. Spark, a memory-based cluster computing platform widely used in distributed data processing, retains the characteristics of Hadoop and improves on it by abstracting distributed data into the resilient distributed dataset (RDD) [19].

3. Model of Service Composition

Cloud service computing has three important components that serve different purposes: Cloud Service Providers (CSPs), the Cloud Broker (CB), and Cloud Consumers. The process of composing services is mainly divided into the following steps:
(1) The Cloud Consumer publishes its requirements to the CB, and the CB receives the preference weights of the QoS attributes, $W = (w_1, w_2, \ldots, w_r)$, where $r$ represents the number of QoS properties.
(2) The CB divides the task $T$ into multiple subtasks $T = \{t_1, t_2, \ldots, t_n\}$.
(3) To fulfill subtask $t_i$, the CB selects a service $s_{ij}$ from the candidate services $S_i = \{s_{i1}, s_{i2}, \ldots, s_{im}\}$, in which $1 \le j \le m$. The services which address the same atomic task are classified as a set of candidate services. The services selected from each candidate service set constitute a composite service $CS = (s_{1j_1}, s_{2j_2}, \ldots, s_{nj_n})$. The nonfunctional attributes of a Web service $s$ can be represented as $Q(s) = (q_1(s), q_2(s), \ldots, q_r(s))$.
(4) The service quality of the selected services is calculated based on the workflow model.
(5) The fitness of the composite service is calculated. The optimal service is selected and returned to the Cloud Consumer.

4. SPSO

The standard particle swarm algorithm, proposed by Eberhart and Kennedy in 1995, is a kind of evolutionary computation which originated from the study of the foraging behaviour of bird flocks [20]. The search starts from a set of random solutions and finds and updates the optimum solution in each iteration. In the search space, each particle represents a solution, and the particles of the population move in parallel. It is, therefore, viable to solve large-scale service composition problems by parallelizing the particles in a distributed computing environment. The specific method is shown in Figure 1.

When the population position is initialized, parallel covering algorithm is used to obtain multiple clustering centers as the starting point. Then, subpopulation migration is completed in Spark distributed computing environment; inert particles are removed through particle elite selection strategy. In the end, relatively optimal service composition is selected.

4.1. Coding Scheme

Firstly, the parallel particle swarm in the Spark cluster is encoded. The population in the RDD is encoded as shown in Figure 2, where the population is $P = \{p_1, p_2, \ldots, p_N\}$ and $N$ is the number of particles. Each particle records its own information, including its current position, velocity, and historical optimum position. The task $T$ is divided into subtasks $\{t_1, t_2, \ldots, t_n\}$, where $n$ represents the number of abstract subtasks and also indicates that the search space of the particles has $n$ dimensions.
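The encoding described above can be illustrated with a small record type. This is only a sketch: the class name `Particle`, its field names, and the helper `random_particle` are illustrative assumptions, and the actual layout in Figure 2 may differ.

```python
from dataclasses import dataclass
from typing import List
import random


@dataclass
class Particle:
    """One candidate composite service (all names here are illustrative)."""
    position: List[int]        # position[j] = index of the service chosen for subtask j
    velocity: List[float]      # one velocity component per subtask
    best_position: List[int]   # historical optimum position of this particle
    best_fitness: float = float("inf")  # fitness at best_position (smaller is better)


def random_particle(num_subtasks: int, num_candidates: int, rng: random.Random) -> Particle:
    """Create a particle that picks a random service index for each subtask."""
    position = [rng.randrange(num_candidates) for _ in range(num_subtasks)]
    velocity = [0.0] * num_subtasks
    return Particle(position, velocity, list(position))


rng = random.Random(0)
population = [random_particle(num_subtasks=5, num_candidates=100, rng=rng) for _ in range(4)]
```

Each particle has as many dimensions as there are subtasks, so a population of such records is what would be distributed across the partitions of an RDD.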

4.2. Initialization

The initial location of the population particles is a critical factor for population diversity. If the particle positions are initialized purely at random, the particle swarm algorithm is prone to searching inefficiently. This paper, therefore, uses the parallel covering algorithm [21] to cluster the candidate services based on their QoS properties, and the population particles are then distributed randomly among the resulting clustering centers, which serve as the initial starting points.

As a kind of clustering algorithm, the covering algorithm, proposed by L. Zhang and B. Zhang on the basis of the neural network model, is developed from the idea of separating dissimilar samples into a set of distinct fields (covers). The QoS property set of a Web service is $Q = \{q_1, q_2, \ldots, q_r\}$, where $r$ is the number of properties. Each candidate service set is seen as an $r$-dimensional point set. The main steps are as follows:
(1) Calculate the center of gravity of all unclustered points, and select the point nearest to it in Euclidean distance as the initial center.
(2) Calculate the distances between the remaining points and the center. The average distance is taken as the radius, and the services whose distance is less than the radius are clustered as a cover.
(3) Calculate the distances between all unclustered points and the center. Select the farthest point as the new center, then recalculate the distances and take the average distance as the radius.
(4) Among the remaining unclustered points, those whose distance to the new center is less than the radius form a new cover.
(5) If any unclustered point is left, repeat steps (3)-(4).

As shown in Figure 3, each circle is a cover obtained after clustering, and the red dot is its cluster center. The size of each circular cover is proportional to the number of services which it contains. The clustering centers of a candidate service set can thus be obtained by applying the covering algorithm.

Based on the Spark distributed computing environment, this paper uses the parallel covering algorithm to cluster the services in each candidate service set (Algorithm 1).

Input: WS (the candidate service sets)
Output: the covers and their clustering centers
For each candidate service set in WS do
   While unclustered points exist
      If (no clustering center exists)
         Compute distances to the center of gravity and generate the center
         Compute Euclidean distances
         Generate cover
      End if
      If (a clustering center exists)
         Compute distances
         Take the farthest point as the new center
         Compute Euclidean distances
         Generate cover
      End if
   End while
End for
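The covering steps can be sketched sequentially for a single candidate service set as follows. This is a minimal single-machine sketch, not the parallel Spark version: the function name `covering_cluster` and the sample points are assumptions made for illustration, and the radius rule (mean distance from the unclustered points to the current center, strict "less than" test) follows the textual description of the steps.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def covering_cluster(points: List[Point]) -> List[Tuple[Point, float, List[int]]]:
    """Return the covers as (center, radius, indices of the member points)."""
    unclustered = set(range(len(points)))
    covers = []
    center = None
    while unclustered:
        idx = sorted(unclustered)
        if center is None:
            # First cover: take the point nearest to the center of gravity.
            dim = len(points[0])
            centroid = [sum(points[i][d] for i in idx) / len(idx) for d in range(dim)]
            c = min(idx, key=lambda i: math.dist(points[i], centroid))
        else:
            # Later covers: take the unclustered point farthest from the previous center.
            c = max(idx, key=lambda i: math.dist(points[i], center))
        center = points[c]
        # Radius = average distance from the other unclustered points to the center.
        others = [i for i in idx if i != c]
        dists = {i: math.dist(points[i], center) for i in others}
        radius = sum(dists.values()) / len(dists) if dists else 0.0
        members = [c] + [i for i in others if dists[i] < radius]
        covers.append((center, radius, members))
        unclustered -= set(members)
    return covers


# Made-up 2-dimensional QoS points: five services near the origin, three far away.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (10, 11), (11, 10)]
covers = covering_cluster(pts)
centers = [center for center, _, _ in covers]
```

The loop always removes at least the chosen center from the unclustered set, so it terminates; the list `centers` is what would serve as the starting points of the particles.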

Each candidate service set to be selected from is seen as an RDD. After the Web services in each candidate service set are clustered by the covering algorithm, there will be $r$-dimensional covers $C = \{C_1, C_2, \ldots, C_m\}$, where $m$ is the number of spherical covers and $r$ is the number of QoS properties. Each $r$-dimensional covering has multiple clustering centers, $\{c_1, c_2, \ldots, c_m\}$. After the clustering analysis, we thus obtain the multidimensional circular covers and their clustering centers.

4.3. Fitness Evaluation

The fitness value of each composite service has to be calculated. In the selection of Web services, the overall QoS of the composite service has a great impact on the service evaluation. The fitness is used to evaluate the composite Web service. The smaller the fitness, the better. The fitness function applied in this paper is

$$F = \sum_{k=1}^{r} w_k \, q_k,$$

where $w_k$ represents the preference of the Cloud Consumer for the $k$th QoS attribute of the composite service; $r$ is the total number of service QoS attributes; and $q_k$ represents the $k$th QoS attribute value of the composite service.
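As a small worked example, the fitness can be computed as a preference-weighted sum of the composite service's QoS values. The function name `fitness` is illustrative, and the sketch assumes the QoS values have already been aggregated over the workflow and scaled so that smaller is better for every attribute; the sample numbers are made up.

```python
from typing import Sequence


def fitness(weights: Sequence[float], qos: Sequence[float]) -> float:
    """Weighted sum of the composite service's QoS values (smaller is better)."""
    if len(weights) != len(qos):
        raise ValueError("weights and qos must have the same length")
    return sum(w * q for w, q in zip(weights, qos))


# Two attributes with equal preference weights and made-up normalised values.
score = fitness([0.5, 0.5], [0.2, 0.6])
```

With equal weights of 0.5, the score is simply the mean of the two attribute values.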

4.4. Parallel Particle Migration

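After initialization, the cluster particles are encapsulated into an RDD. Suppose that there are $N$ particles in the population and that the search space has $n$ dimensions, which indicates that the task is divided into $n$ subtasks. In the $t$th generation, the position of the $i$th ($1 \le i \le N$) particle can be expressed as $X_i^t = (x_{i1}^t, x_{i2}^t, \ldots, x_{in}^t)$, where each dimension of the position represents the selected Web service, and $V_i^t = (v_{i1}^t, v_{i2}^t, \ldots, v_{in}^t)$ is the flying speed of the particle. In the $t$th generation, the individual optimal position recorded by the $i$th particle is $P_i^t = (p_{i1}^t, p_{i2}^t, \ldots, p_{in}^t)$, and the current optimal position of the population is $P_g^t = (p_{g1}^t, p_{g2}^t, \ldots, p_{gn}^t)$. In the $(t+1)$th generation, the update formulas for the velocity and position of particle $i$ in the $d$th dimension are as follows:

$$v_{id}^{t+1} = \omega v_{id}^{t} + c_1 r_1 \left(p_{id}^{t} - x_{id}^{t}\right) + c_2 r_2 \left(p_{gd}^{t} - x_{id}^{t}\right),$$

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1},$$

where $c_1$ and $c_2$ are learning factors, $r_1$ and $r_2$ are random variables uniformly distributed over the interval $[0, 1]$, and $\omega$ is the inertia weight, which measures the effect of the current migration velocity on the next movement. The formula is

$$\omega = \omega_{\max} - \left(\omega_{\max} - \omega_{\min}\right) \frac{t}{t_{\max}},$$

where $\omega_{\max}$ is the maximum inertia weight value and $\omega_{\min}$ is the minimum inertia weight value, $t$ is the current evolutionary generation, and $t_{\max}$ is the total number of evolutionary generations. Generally, take $\omega_{\max} = 0.9$, $\omega_{\min} = 0.4$.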
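The update of a single particle can be sketched as below, assuming the standard particle swarm velocity and position update with a linearly decreasing inertia weight. Several details are assumptions rather than the paper's settings: the learning factors `c1 = c2 = 2.0`, the commonly used inertia bounds 0.9 and 0.4 as defaults, and the rounding and clamping that map a continuous position back to a valid service index. The function names are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def inertia(t: int, t_max: int, w_max: float = 0.9, w_min: float = 0.4) -> float:
    """Linearly decreasing inertia weight for generation t out of t_max."""
    return w_max - (w_max - w_min) * t / t_max


def migrate(position: Sequence[int], velocity: Sequence[float],
            personal_best: Sequence[int], global_best: Sequence[int],
            t: int, t_max: int, num_candidates: int, rng: random.Random,
            c1: float = 2.0, c2: float = 2.0) -> Tuple[List[int], List[float]]:
    """Apply one velocity and position update to a single particle."""
    w = inertia(t, t_max)
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()  # uniform samples from [0, 1)
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        # Assumption: round to the nearest service index and clamp to the valid range.
        x_next = min(max(round(x + v_next), 0), num_candidates - 1)
        new_velocity.append(v_next)
        new_position.append(x_next)
    return new_position, new_velocity


rng = random.Random(42)
pos, vel = migrate([10, 50, 90], [0.0, 0.0, 0.0], [12, 50, 80], [20, 40, 99],
                   t=1, t_max=500, num_candidates=100, rng=rng)
```

A particle that already sits on both its personal best and the global best with zero velocity does not move, which is the situation the elite selection strategy is meant to detect.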

Particle population migration can be seen as an RDD transformation, and selecting the global optimal particle in each iteration can be seen as an RDD action. The fitness of the best particle is broadcast to the population, and the particles of each subpopulation migrate to their next positions until the migration ends.
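The division of work between a per-partition step and a global reduction can be mimicked in plain Python, as a rough analogy rather than Spark code. The helper names are illustrative, and the toy fitness (the sum of the selected indices) is made up. In a PySpark implementation the corresponding calls would presumably be `RDD.mapPartitions` for the per-subpopulation step, `RDD.reduce` for picking the global best, and `SparkContext.broadcast` for sharing it, but that mapping is an assumption about how the method could be written.

```python
from functools import reduce
from typing import Callable, List, Tuple

Position = List[int]
Scored = Tuple[float, Position]  # (fitness, position)


def split(items: list, num_partitions: int) -> List[list]:
    """Split the population into sub-populations, one per partition."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def best_in_partition(partition: List[Position],
                      fitness: Callable[[Position], float]) -> Scored:
    """Per-partition step: score one sub-population and keep its best particle."""
    return min((fitness(p), p) for p in partition)


def global_best(population: List[Position], fitness: Callable[[Position], float],
                num_partitions: int) -> Scored:
    """Reduce the per-partition winners to the single best particle."""
    partitions = [part for part in split(population, num_partitions) if part]
    local_winners = map(lambda part: best_in_partition(part, fitness), partitions)
    return reduce(min, local_winners)


population = [[3, 1], [0, 2], [4, 4], [1, 0]]
best_fitness, best_position = global_best(population, fitness=sum, num_partitions=2)
```

Only the small `(fitness, position)` pair of each subpopulation crosses the partition boundary, which is what keeps the per-iteration communication cheap.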

4.5. Elite Selection Strategy

When the Spark cluster is used to search for the service composition, the diversity of the particles has a great influence on finding the optimal particle. If, after several searches under the search strategy, a particle's range of movement is small, it contributes little to the optimization of the whole population. This paper therefore introduces a particle elite selection mechanism, which increases the diversity of the population by removing the inert particles.

The specific idea of the mechanism is that, when the particles are encoded, a parameter recording the historical optimal position is added to each particle without changing the number of particles. If a particle is not the optimal particle, its historical optimal position has not been updated for a number of iterations that reaches a certain threshold value, and its migration range is small, then the particle is considered an inert particle: its historical best solution remains the same over multiple migrations.
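The inertness test can be written as a simple predicate. This is a sketch under assumptions: the parameter names, the two thresholds, and the restart rule (re-seeding an inert particle at randomly chosen clustering centers) are illustrative choices, not values or rules stated in the text.

```python
import random
from typing import List, Sequence


def is_inert(is_global_best: bool, stalled_iterations: int, recent_move: float,
             stall_threshold: int, move_threshold: float) -> bool:
    """Flag a non-optimal particle whose best position has stalled and which barely moves."""
    return (not is_global_best
            and stalled_iterations >= stall_threshold
            and recent_move < move_threshold)


def restart_position(centers_per_subtask: Sequence[Sequence[int]],
                     rng: random.Random) -> List[int]:
    """Pick, for every subtask, one of that subtask's clustering centers at random."""
    return [rng.choice(centers) for centers in centers_per_subtask]


rng = random.Random(1)
centers = [[3, 40, 77], [5, 61], [12, 48, 90]]  # made-up service indices of cluster centers
new_position = restart_position(centers, rng)
```

The current global best is never flagged, so the best solution found so far survives every round of elite selection.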

4.6. Algorithm Procedure

Based on the above analysis and design, the service composition optimization algorithm can be demonstrated as in Algorithm 2.

Input: the covering of the candidate service sets and the algorithm parameters
Output: the best particle
Initiate particle swarm and compute fitness
For each iteration do
   Update position and speed
   Compute fitness
   Update history information
   If (the particle is inert)
       Elite selection strategy
   End if
End for
Generate the best particle

In Algorithm 2, the input covering is the covering of the multiple candidate service sets, including the multiple clustering centers in each candidate service set. The population selects its random initial starting points from these clustering centers. Particles that are identified as inert by the particle elite selection strategy are reinitialized.

5. Experiments

In this paper, we evaluated the efficiency of the improved particle swarm algorithm by comparing it with the PSO algorithm [15]. Each reported value is the average over 20 runs of the experiment.

5.1. Parameter Setting of Algorithm

Our experiments use the real-world service quality data set WS-Dream [22], in which more than 30 million records of Web service data and their quality values are collected. We chose two properties as the QoS evaluation indexes, namely, response time (RT) and throughput (T). The QoS preference weight is (0.5, 0.5).

Experimental Environment. Spark cluster consists of 9 nodes. We adopted Spark 1.4. The number of cores that can be used in the cluster is 72.

5.2. Effects of Spark Parameter

This experiment was carried out to test the effect of the degree of parallelism and the number of cores on the algorithms in the Spark cluster.

Firstly, by setting different degrees of parallelism, the effect of parallelism on the time consumption of the two algorithms is investigated. A service selection scenario with five subtasks was taken as an example. The candidate service set corresponding to each subtask contains 100000 services. The number of particles is 5000, the number of iterations is 500, and the total number of cores is 30. The results are shown in Figure 4.

In Figure 4, as the degree of parallelism increases, the time consumption of the two algorithms for service selection increases. When the degree of parallelism is set between 10 and 30, the time consumption is significantly less than when it is set between 40 and 60. The reason is that the parallelization of the population particles depends on the idle cluster resources. When the degree of parallelism is 10–30, the population is divided into 10–30 subpopulations, and all of them migrate in parallel. When the degree of parallelism is 40–60, the subpopulations cannot all migrate in parallel because the 30 available cores are not enough computing resources, resulting in more time consumption.
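One simplified way to read this result is to count how many rounds of tasks the cluster needs per iteration. The sketch below assumes that each subpopulation is processed as one task occupying one core and ignores scheduling and communication overhead; the function name `waves` is illustrative.

```python
import math


def waves(num_partitions: int, num_cores: int) -> int:
    """Number of sequential rounds needed to run all partitions on the given cores."""
    return math.ceil(num_partitions / num_cores)


rounds = {p: waves(p, 30) for p in (10, 20, 30, 40, 50, 60)}
print(rounds)  # {10: 1, 20: 1, 30: 1, 40: 2, 50: 2, 60: 2}
```

Under this model, every setting above 30 partitions needs a second round on 30 cores, which is consistent with the jump in time consumption between the 10–30 and 40–60 ranges.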

Then, we examine the effect of the total number of cores on the time consumption of the two algorithms. Each candidate service set has 100000 services, the number of iterations is 500, the degree of parallelism is 30, the number of particles is 5000, and the number of cores is varied. The results are shown in Figure 5.

As shown in Figure 5, with the increase in the number of cores, the time for service selection gradually decreases and then stabilizes. When the number of cores is between 10 and 30, as the number of cores increases, the number of subpopulations that can migrate simultaneously increases and the time consumed decreases. When the number of cores is between 30 and 60, the cluster resources are sufficient and the time consumption tends to be stable.

5.3. Effectiveness

We tested the effect of the number of particles on the fitness value. This set of experiments tested 20 subtasks, each with 100000 candidate services. In the Spark cluster environment, the degree of parallelism is set to 20, the number of iterations is 500, and the total number of cores is 20. The results are shown in Figure 6.

Figure 6 shows that, as the number of particles increases, the selected composite service improves: the fitness value shows an overall downward trend.

The next experiment evaluates the effect of the number of iterations on the fitness value. This group of experiments tested 20 subtasks. The number of candidate services is set to 200000, the degree of parallelism is 30, the number of particles is 5000, and the total number of cores is 30. The experimental results are shown in Figure 7.

Figure 7 shows that SPSO is superior to PSO in its ability to find optimal solutions. The average fitness generally declines as the number of iterations increases.

5.4. Efficiency

In this experiment, we examine the efficiency of the SPSO algorithm. We tested the effect of the number of particles on the time consumption of the SPSO algorithm under different numbers of subtasks. The number of particles is varied in the experiment, and the number of candidate services in each set is 100,000. The degree of parallelism is set to 20 and the number of iterations to 500. The experimental results are shown in Figure 8.

Figure 8 shows that, as the number of particles increases, it takes more time to complete the parallelization of the particles and hence more time for the service selection. Meanwhile, for the same number of population particles, the time consumption increases as the number of subtasks increases.

6. Conclusion

In this paper, we proposed an improved particle swarm optimization algorithm on a Spark cluster to solve the problem of Web service composition optimization in a big data environment. In the simulation experiments, we studied the effects of the Spark parameters and the effectiveness of the improved algorithm. The experimental results show that the improved particle swarm optimization algorithm proposed in this paper outperforms the PSO algorithm used for comparison in Web service composition. In future work, we will examine the impact of service reliability on service selection in the context of large-scale service composition optimization.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key Technology R&D Program (no. 2015BAK24B01), the General Research for Humanities and Social Sciences Project of Chinese Ministry of Education (no. 15YJAZH112), and the Educational Commission of Anhui Province of China (no. KJ2016A038).