Abstract

Existing approaches to cyber attack-defense analysis based on stochastic games adopt the assumption of complete rationality, but in actual cyber attack-defense it is difficult for either the attacker or the defender to meet this demanding requirement. To address this, we analyze the influence of bounded rationality on the attack-defense stochastic game and construct a stochastic game model. To counter the state-explosion problem that arises as the number of network nodes grows, we design an attack-defense graph that compresses the state space and from which network states and defense strategies are extracted. On this basis, the intelligent learning algorithm WoLF-PHC is introduced to carry out strategy learning and improvement, and a defense decision-making algorithm with online learning ability is designed, which selects the optimal defense strategy with the maximum payoff from the candidate strategy set. The obtained strategy is superior to the previous evolutionary equilibrium strategy because it does not rely on prior data. By introducing the eligibility trace to improve WoLF-PHC, the learning speed is further increased and the timeliness of defense decision making is significantly improved.

1. Introduction

With the continuous advance of informatization, cyber attacks are becoming more frequent and cause tremendous losses to defenders [1]. Because of the complexity of networks and the limits of defenders' capabilities, a network cannot achieve absolute security. A technology is therefore urgently needed that can analyze attack-defense behavior and effectively balance network risk against security investment, so that the defender can make reasonable decisions with limited resources. Cyber attack-defense exhibits the opposition of goals, non-cooperative relationships, and strategic interdependence that game theory is designed to capture [2], and research applying game theory to cyber security is growing rapidly [3]. The analysis of attack-defense confrontation based on stochastic games has become a research hotspot. A stochastic game combines game theory with Markov decision processes: it extends the single state of a traditional game to multiple states and captures the randomness of cyber attack-defense. Cyber security analysis based on stochastic games has achieved some results, but shortcomings and challenges remain [4–7]. Existing attack-defense stochastic games rest on the assumption of complete rationality and use Nash equilibrium for attack prediction and defense guidance. Complete rationality imposes many requirements of perfection, such as rational awareness (pursuit of maximum benefit), analytical reasoning ability, identification and judgment ability, memory, and accurate execution; imperfection in any of these aspects constitutes bounded rationality [8]. The requirement of complete rationality is too harsh for both attacker and defender, which makes Nash equilibria derived under this assumption difficult to observe in practice and reduces the accuracy and guiding value of the existing results.

To solve the above problems, this paper studies a defense decision-making approach based on a stochastic game under the restriction of bounded rationality. Section 2 introduces the research status of defense decision making based on stochastic games. Section 3 analyzes the difficulties of studying the cyber attack-defense stochastic game under bounded rationality and outlines our solution; it constructs the attack-defense stochastic game model under bounded-rationality constraints and proposes a host-centered attack-defense graph model to extract the network states and attack-defense actions of the game model. Bowling et al. [9] first proposed WoLF-PHC (Win or Learn Fast Policy Hill-Climbing) for multiagent learning; Section 4 further improves WoLF-PHC with the eligibility trace to increase the defender's learning speed and reduce the algorithm's dependence on data, and then uses the improved intelligent learning algorithm to analyze the stochastic game model of the previous section and design the defense decision-making algorithm. Section 5 verifies the effectiveness of the proposed approach through experiments. Section 6 summarizes the paper and discusses future research.

There are three main contributions of this paper:

(1) The extraction of network states and attack-defense actions is one of the keys to constructing a stochastic game model. In existing models, each network state contains the security elements of all nodes in the network, which leads to a "state explosion" problem. To solve this problem, a host-centered attack-defense graph model is proposed and an attack-defense graph generation algorithm is designed, which effectively compresses the game state space.

(2) Bounded rationality means that both attacker and defender must find the optimal strategy through trial and error and learning, so determining the players' learning mechanism is a key point. In this paper, reinforcement learning is introduced into the stochastic game, which extends the stochastic game from complete rationality to bounded rationality. The defender uses WoLF-PHC to learn during the attack-defense confrontation so as to make the best response to the current attacker. Most existing bounded-rationality games use a biological evolutionary mechanism for learning and take the group as the research object. Compared with them, the proposed approach reduces the information exchange among game players and is better suited to guiding individual defense decision making.

(3) The WoLF-PHC algorithm is improved with the eligibility trace [10], which speeds up the defender's learning and reduces the algorithm's dependence on data; the effectiveness of the approach is verified through experiments.

2. Related Work

Progress has been made in game-theoretic cyber security research, but most current studies are based on the assumption of complete rationality [11]. Under complete rationality, the work can be divided into single-stage and multistage games according to the number of decisions made by the two sides during the game. Research on single-stage cyber attack-defense games started earlier. Liu et al. [12] used static game theory to analyze the effectiveness of worm attack-defense strategies. Li et al. [13] established a non-cooperative game model between attackers and sensor trust nodes and gave the optimal attack strategy based on Nash equilibrium. Although some simple attack-defense confrontations are single-stage games, in most scenarios the attack-defense process lasts for many stages, so multistage cyber attack-defense games have become a trend. Zhang et al. [14] regarded the defender as the source of a signal and the attacker as the receiver and modeled the multistage attack-defense process with a differential game. Afrand and Das [15] established a repeated game model between an intrusion detection system and wireless sensor nodes and analyzed the packet-forwarding strategy of nodes. Although these results can analyze multistage confrontation, the state transition between stages is affected not only by the attack-defense actions but also by the system operating environment and other external factors, and is therefore random. The above results ignore this randomness, which weakens their guiding value.

A stochastic game combines game theory with Markov theory. It is a multistage game model that can accurately analyze the impact of randomness on the attack-defense process by describing state transitions with a Markov process. Wei et al. [16] abstracted cyber attack-defense as a stochastic game problem and gave a more scientific and accurate quantification of attack-defense benefits applicable to attack-defense stochastic game models. Wang et al. [17] used stochastic game theory to study the network confrontation problem, proved the existence of an equilibrium using convex analysis, and transformed the equilibrium solution into a nonlinear programming problem. Based on an incomplete-information stochastic game, Liu et al. [3] proposed a decision-making approach for moving target defense. All the aforementioned schemes are based on the assumption of complete rationality, which is too strict for both sides of attack-defense. In most cases both sides have only bounded rationality, which causes the above results to deviate when analyzing the attack-defense game. Therefore, exploring the laws of the cyber attack-defense game under bounded rationality has important research value and practical significance.

Bounded rationality means that neither side finds the optimal strategy at the beginning; both sides learn during the attack-defense confrontation, and an appropriate learning mechanism is the key to winning the game. At present, research on bounded-rationality attack-defense games is mainly centered on evolutionary games [18]. Hayel and Zhu [19] established an evolutionary Poisson game model between malware and antivirus programs and used the replicator dynamic equation to analyze the antivirus program's strategy. Huang and Zhang [20] improved the traditional replicator dynamic equation by introducing an incentive coefficient, improved the calculation of the replication dynamic rate, and on this basis constructed an evolutionary game model for defense. Evolutionary games take the group as the research object, adopt a biological evolution mechanism, and complete learning by imitating the advantageous strategies of other members. In an evolutionary game there is too much information exchange among players, and the focus is on the adjustment process, trend, and stability of the group's strategy, which is not conducive to guiding the real-time strategy selection of individual members.

Reinforcement learning is a classic online intelligent learning approach in which players learn independently through environmental feedback. Compared with evolutionary learning, reinforcement learning is better suited to guiding individual decision making. This paper introduces the reinforcement learning mechanism into the stochastic game, extends the stochastic game from complete rationality to bounded rationality, and uses the bounded-rationality stochastic game to analyze cyber attack-defense. On the one hand, compared with existing attack-defense stochastic games, this approach uses the bounded-rationality hypothesis, which is more realistic. On the other hand, compared with evolutionary games, it uses a reinforcement learning mechanism, which is more suitable for guiding real-time defense decision making.

3. Modeling of Attack-Defense Confrontation Using Stochastic Game Theory

3.1. Description and Analysis of Cyber Attack-Defense Confrontation

Cyber attack-defense confrontation is a complex problem, but at the level of strategy selection it can be described as a stochastic game, as depicted in Figure 1. We take a DDoS attack exploiting the Sadmind vulnerability of the Solaris platform as an example. The attack is implemented through multiple steps, including IP sweep, Sadmind ping, Sadmind exploit, installing DDoS software, and conducting the DDoS attack. Each attack step can change the security state of the network.

Taking the first step as an example, the initial network state is denoted as S0 (H1, none), meaning that the attacker Alice does not have any privilege on host H1. Alice then implements an IP sweep attack on H1 through its open port 445 and gains the User privilege of H1; this network state is denoted as S1 (H1, User). Afterwards, if the defender Bob selects and implements a defense strategy from the candidate strategy set {Reinstall Listener program, Install patches, Close unused port}, the network state is transferred back to S0; otherwise, the network may continue to evolve to another, more dangerous state S3.

The continuous time axis is divided into time slices, and each time slice contains only one network state (the network state may be the same in different time slices). Each time slice is one round of the attack-defense game: both sides detect the current network state, select attack or defense actions according to their strategies, and obtain immediate returns. Attack-defense strategies are related to the network state. The network system transfers from one state to another under the joint actions of the attacking and defending sides. The transition between network states is affected not only by the attack-defense actions but also by factors such as the system operating environment and the external environment, and is therefore random. The goal of this paper is to enable defenders to obtain higher long-term benefits in the attack-defense stochastic game.
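
To make the time-slice structure concrete, the following minimal Python sketch simulates one attack-defense episode under these assumptions. All names here (the policy callables, sample_transition, and the reward callables) are illustrative placeholders, not part of the model defined later.

import random

def run_episode(states, attacker_policy, defender_policy, sample_transition,
                attacker_reward, defender_reward, n_slices=10, s0=None):
    """Simulate a sequence of time slices: in each slice both sides observe the
    current network state, act simultaneously, receive immediate returns, and the
    network moves stochastically to the next state."""
    s = s0 if s0 is not None else random.choice(list(states))
    history = []
    for t in range(n_slices):
        a = attacker_policy(s)               # attacker picks an action for state s
        d = defender_policy(s)               # defender picks an action for state s
        s_next = sample_transition(s, a, d)  # random transition influenced by both actions
        history.append((t, s, a, d,
                        attacker_reward(s, a, d, s_next),
                        defender_reward(s, a, d, s_next)))
        s = s_next
    return history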

Under complete rationality both sides can anticipate the existence of a Nash equilibrium, so the Nash equilibrium is the best strategy for both. From the description of complete rationality in the introduction, however, this requirement is too strict, and both attacker and defender are constrained by bounded rationality in practice. Bounded rationality means that at least one of the two sides will not adopt the Nash equilibrium strategy at the beginning of the game; it is difficult for either side to find the optimal strategy in the early stage, and each must constantly adjust and improve its strategy against its opponent. The game equilibrium is therefore not the result of a single choice but something both sides approach by continually learning during the confrontation, and, because of the influence of the learning mechanism, they may deviate again even after an equilibrium is reached.

From the above analysis, the learning mechanism is the key to winning the game under bounded rationality. For defense decision making, the learning mechanism of the attack-defense stochastic game under bounded rationality needs to satisfy two requirements. (1) Convergence of the learning algorithm: the attacker's strategy under bounded rationality changes dynamically, and because the attack and defense strategies are interdependent, the defender must learn the corresponding optimal strategy against each different attack strategy in order not to be put at a disadvantage. (2) The learning process must not require too much attacker information: the two sides of cyber attack-defense have opposing objectives and do not cooperate, and both deliberately hide their key information. If too much opponent information were needed during learning, the practicability of the learning algorithm would be reduced.

The WoLF-PHC algorithm is a typical policy-gradient intelligent learning approach, which enables the defender to learn from network feedback without much information exchange with the attacker. The WoLF mechanism guarantees the convergence of WoLF-PHC [9]: once the attacker has learned to adopt the Nash equilibrium strategy, the WoLF mechanism enables the defender to converge to the corresponding Nash equilibrium strategy, and while the attacker has not yet learned the Nash equilibrium strategy, it enables the defender to converge to the corresponding optimal defense strategy. In conclusion, the WoLF-PHC algorithm meets the requirements of the attack-defense stochastic game under bounded rationality.

3.2. Stochastic Game Model for Attack-Defense

The mapping relationship between cyber attack-defense and the stochastic game model is depicted in Figure 2. The stochastic game consists of an attack-defense game in each state and a transition model between states. Two further elements, "information" and "order of play", must be specified. Constrained by bounded rationality, the attacker's historical actions and the attacker's payoff function are set as the attacker's private information.

Here we use the example of Figure 1 to explain Figure 2; the security states correspond to S0 (H1, none) and S1 (H1, User), and the candidate strategy set against the DDoS attack is {Reinstall Listener program, Install patches, Close unused port}. The network state is common knowledge of both sides. Because the attacker and defender do not cooperate, each side can only observe the other's actions by probing the network, which delays that knowledge by at least one time slice, so within each time slice the two sides act simultaneously. "Simultaneous" here is a concept of information rather than of time: because neither side knows the other's choice when choosing its own action, they are considered to act simultaneously even if their actions are not taken at the same instant.

Next, the network state transition model is constructed, with probabilities expressing the randomness of state transitions. Because the next network state mainly depends on the previous network state, a first-order Markov process is used to represent the state transition relationship, in which the transition is driven by the network state together with the attack-defense actions. Because both attacker and defender are constrained by bounded rationality, and to increase the generality of the model, the transition probabilities are set as information unknown to both sides.

On the basis of the above, a game model is constructed to solve the defense decision-making problem.

Definition 1. The attack-defense stochastic game model (AD-SGM) is a six-tuple AD-SGM = (N, S, D, R, Q, π), in which

(1) N = (attacker, defender) are the two players who participate in the game, representing the cyber attacker and defender, respectively

(2) S = {s1, s2, ..., sn} is the set of stochastic game states, which is composed of network states (see Section 3.3 for their specific meaning and generation approach)

(3) D = {D(s1), D(s2), ..., D(sn)} is the action set of the defender, in which D(sk) is the action set of the defender in game state sk

(4) R(si, d, sj) is the immediate return obtained when the state transfers from si to sj after the defender performs action d

(5) Q(s, d) is the state-action payoff function of the defender, indicating the expected payoff of the defender after taking action d in state s

(6) π(s) is the defense strategy of the defender in state s. Defense strategy and defense action are two different concepts: a defense strategy is a rule for selecting defense actions, not an action itself. For example, π(sk) = (π(sk, d1), ..., π(sk, dm)) is the strategy of the defender in network state sk, where π(sk, di) is the probability of selecting action di and the probabilities sum to 1.
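
As a concrete illustration of Definition 1, the following minimal Python sketch stores the six components of the AD-SGM and initializes the strategy to the average (uniform) strategy; the field names are our own illustrative choices, not notation from the model.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ADSGM:
    """Illustrative layout of the attack-defense stochastic game model."""
    players: tuple = ("attacker", "defender")                       # N
    states: List[str] = field(default_factory=list)                 # S
    actions: Dict[str, List[str]] = field(default_factory=dict)     # D(s) per state
    reward: Dict[tuple, float] = field(default_factory=dict)        # R[(s_i, d, s_j)]
    q: Dict[tuple, float] = field(default_factory=dict)             # Q[(s, d)]
    policy: Dict[str, Dict[str, float]] = field(default_factory=dict)  # pi[s][d]

    def init_uniform_policy(self):
        # Average (uniform) strategy: every candidate defense action in a state
        # is selected with equal probability, and the probabilities sum to 1.
        for s in self.states:
            ds = self.actions.get(s, [])
            self.policy[s] = {d: 1.0 / len(ds) for d in ds} if ds else {}

For example, ADSGM(states=["s0", "s1"], actions={"s0": ["patch", "close_port"], "s1": ["reinstall"]}) followed by init_uniform_policy() gives each candidate action in a state equal probability, matching the strategy initialization used later in Section 5.2.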

3.3. Network State and Attack-Defense Action Extraction Approach Based on Attack-Defense Graph

Network states and attack-defense actions are important components of the stochastic game model, and their extraction is a key point in constructing the attack-defense stochastic game model [21]. In current attack-defense stochastic games, each network state contains the security elements of all nodes in the network, so the number of network states grows with the power set of the security elements, which produces a state explosion [22]. Therefore, a host-centered attack-defense graph model is proposed in which each state node describes only the state of a single host, effectively reducing the number of state nodes [23]. Using this attack-defense graph to extract network states and attack-defense actions is more conducive to analyzing the cyber attack-defense confrontation.

Definition 2. An attack-defense graph is a binary group G = (S, E), in which S = {s1, s2, ..., sn} is the set of node security states, with s = (host_id, privilege), where host_id is the unique identity of the node and privilege ∈ {none, user, root} indicates, respectively, that the attacker has no privilege, ordinary user privilege, or administrator privilege on the node. A directed edge e = (s_src, a/d, s_dst) ∈ E indicates that the occurrence of an attack or defense action a/d causes the node state to transfer from s_src to s_dst, where s_src is the source node and s_dst is the destination node.
The generation process of the attack-defense graph is shown in Figure 3. Firstly, the target network is scanned to acquire the cyber security elements; then attacks are instantiated against the attack template and defenses are instantiated against the defense template; finally, the attack-defense graph is generated. The state set of the attack-defense stochastic game model is extracted from the nodes of the attack-defense graph, and the defense action set is extracted from its edges.

3.3.1. Elements of Cyber Security

The elements of cyber security consist of the network connection matrix C, vulnerability information V, service information F, and access rights P. The matrix C describes the connection relationship between nodes: the rows of the matrix represent source nodes, the columns represent destination nodes, and each element records the ports through which the destination node can be accessed. When an element of C is empty, there is no connection between the corresponding nodes. V records the vulnerabilities of the services on a node's host, including security vulnerabilities and improper or erroneous configuration of system and application software. F indicates which services are open on a node. P indicates the access rights the attacker holds on a node.

3.3.2. Attack Template

The attack template AM describes how vulnerabilities are exploited; each entry is a triple AM = (id, precondition, postcondition). Here, id is the identification of the attack mode; precondition describes the set of prerequisites for the attacker to exploit a vulnerability, including the attacker's initial access rights P on the source node, the vulnerability information V of the target node, the network connection relationship C, and the running services F of the node; only when this set of conditions is satisfied can the attacker successfully exploit the vulnerability. Postcondition describes the consequences of a successful exploit, including the increase of the attacker's access rights on the target node, changes in the network connection relationship, and service destruction.

3.3.3. Defense Template

The defense template DM records the response measures taken by defenders after predicting or identifying attacks; each entry is a binary group DM = (D, postcondition). D is the set of defense strategies for a specific attack, and postcondition describes the impact of each defense strategy on the cyber security elements, including its effect on node service information, vulnerability information, attacker privilege information, node connection relationships, and so on.
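
Putting the security elements of Section 3.3.1 and the two templates together, one possible in-memory representation is sketched below in Python; all field names and the sample entries are illustrative assumptions, not the paper's template database.

from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

# Connection matrix C: (source_host, destination_host) -> set of reachable ports;
# a missing key means there is no connection between the two hosts.
Connectivity = Dict[Tuple[str, str], FrozenSet[int]]

@dataclass
class AttackTemplate:
    attack_id: str
    pre_privilege: str        # privilege required before the attack
    pre_vulnerability: str    # vulnerability required on the target host
    pre_service: str          # service that must be running on the target host
    post_privilege: str       # privilege gained on the target host if the attack succeeds

@dataclass
class DefenseTemplate:
    defense_id: str
    counters: List[str]                                   # attack_ids this defense responds to
    effect: Dict[str, str] = field(default_factory=dict)  # e.g. remove a vulnerability, close a port

# Illustrative instances only
C: Connectivity = {("A", "W"): frozenset({80, 445})}
V = {"W": {"sadmind_overflow"}}     # vulnerabilities per host
F = {"W": {"rpc"}}                  # open services per host
P = {"A": "root", "W": "none"}      # attacker privilege per host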

In the process of attack-defense graph generation, if there is a connection between two nodes and all the prerequisites of an attack are satisfied, an edge from the source node to the destination node is added. If the attack changes security elements such as connectivity, the cyber security elements are updated in time. If a defense strategy is implemented, the connection between nodes or the attacker's existing rights are changed accordingly. As shown in Algorithm 1, the first step uses the cyber security elements to generate all possible state nodes and initializes the edge set. Steps 2–10 instantiate attacks and generate all attack edges. Steps 11–15 instantiate defenses and generate all defense edges. Steps 16–20 remove all isolated nodes, and step 21 outputs the attack-defense graph.

Input: Elements of cyber security (C, V, F, P), attack template AM, defense template DM
Output: Attack-defense graph G = (S, E)
(1) S ← all candidate node states generated from (C, V, F, P); E ← ∅ /∗ Generate all nodes ∗/
(2) for each attack a ∈ AM do:/∗ Attack instantiation to generate attack edges ∗/
(3)  refresh the cyber security elements (C, V, F, P) /∗ Updating cyber security elements ∗/
(4)  if C(src, dst) ≠ ∅ and vul(a) ∈ V(dst) and ser(a) ∈ F(dst) and pri(a) ⊆ P(src): /∗ vul, ser, pri are the precondition fields of a ∗/
(5)   e ← ((dst, P(dst)), a, (dst, pa′)) /∗ attack edge from the current to the gained privilege pa′ ∗/
(6)   E ← E ∪ {e}
(7)   P(dst) ← pa′
(8)   update C, V, F according to the postcondition of a
(9) end if
(10)end for
(11)for each defense d ∈ DM do:/∗ Defense instantiation to generate defense edges ∗/
(12) if d counters an attack edge (s_src, a, s_dst) ∈ E and the precondition of d holds:
(13)  E ← E ∪ {(s_dst, d, s_src)}
(14) end if
(15)end for
(16)for each node s ∈ S do:/∗ Remove isolated nodes ∗/
(17) if s has no incoming edge and no outgoing edge in E:
(18)  S ← S \ {s}
(19) end if
(20)end for
(21)Return G = (S, E)
Algorithm 1: Generation algorithm of the attack-defense graph.

Input: AD-SGM; learning parameters α, γ, λ, δ_w, δ_l
Output: Defense action d
(1)initialize Q(s, d) ← 0, e(s, d) ← 0, C(s) ← 0, and π(s) for every s ∈ S, d ∈ D(s) /∗ Network states and attack-defense actions are extracted by Algorithm 1 ∗/
(2) s ← the current network state /∗ Getting the current network state from the network ∗/
(3)repeat:
(4) select defense action d ∈ D(s) according to π(s) /∗ Select defense action ∗/
(5)Output d; /∗ Feedback the defense action to the defender ∗/
(6) s′ ← the state after the action is executed /∗ Get the state after the action is executed ∗/
(7) ΔQ ← R(s, d, s′) + γ max_{d′∈D(s′)} Q(s′, d′) − Q(s, d)
(8) e(s, d) ← e(s, d) + 1
(9) for each state-action pair (s̃, d̃) except (s, d) do:
(10)  Q(s̃, d̃) ← Q(s̃, d̃) + α·ΔQ·e(s̃, d̃)
(11)  e(s̃, d̃) ← γλ·e(s̃, d̃)
(12) end for/∗ Update noncurrent eligibility traces and Q values ∗/
(13)Q(s, d) ← Q(s, d) + α·ΔQ·e(s, d) /∗ Update Q of (s, d) ∗/
(14) e(s, d) ← γλ·e(s, d) /∗ Update trace of (s, d) ∗/
(15)C(s) ← C(s) + 1
(16) Updating the average strategy π̄(s) based on formula (6)
(17) Selecting the strategy learning rate δ ∈ {δ_w, δ_l} based on formula (5)
(18)d∗ ← argmax_{d′} Q(s, d′)
(19)π(s, d′) ← π(s, d′) − min(π(s, d′), δ/(|D(s)| − 1)) for every d′ ≠ d∗
(20) π(s, d∗) ← 1 − Σ_{d′≠d∗} π(s, d′) /∗ Update defense strategy ∗/
(21)s ← s′
(22)end repeat
Algorithm 2: Defense decision-making algorithm based on the improved WoLF-PHC.

Assume the number of nodes in the target network is n and the number of vulnerabilities on each node is m. Since each node has three privilege levels, the maximum number of nodes in the attack-defense graph is 3n. In the attack instantiation stage, the computational complexity of analyzing the connection relationship between every two nodes is O(n^2), and the computational complexity of matching the nodes' vulnerabilities against the connection relationships is O(mn^2). In the defense instantiation and pruning stages, the computational complexity of traversing the edges of all nodes to remove isolated nodes is O(n^2). In summary, the computational complexity of the algorithm is of order O(mn^2). The network states can be extracted from the nodes of the attack-defense graph G, and the attack-defense actions can be extracted from its edges.
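
A minimal Python sketch of host-centered attack-defense graph generation in the spirit of Algorithm 1 is given below. It enumerates the three privilege states per host, adds an attack edge when the (simplified) preconditions hold, adds defense edges that reverse the countered attacks, and prunes isolated nodes. The simplified template fields (pre_priv, gained_priv, the defenses mapping) are assumptions for illustration, not the exact structures used in Algorithm 1.

from itertools import product

PRIVS = ("none", "user", "root")

def generate_ad_graph(hosts, conn, vulns, services, attacks, defenses):
    """Build a host-centered attack-defense graph.

    hosts: iterable of host ids.
    conn: set of (src_host, dst_host) pairs that are reachable.
    vulns, services: dict host -> set of vulnerability / service names.
    attacks: list of dicts {"id", "vuln", "service", "pre_priv", "gained_priv"} (simplified AM).
    defenses: dict attack_id -> list of defense action names (simplified DM).
    Returns (nodes, edges); a node is (host, privilege), an edge is (u, action, v).
    """
    nodes = set(product(hosts, PRIVS))          # step 1: generate all candidate nodes
    edges = []

    # Attack instantiation: add an attack edge when the target host is reachable,
    # exposes the required vulnerability and service, and the attacker currently
    # holds the required privilege on it.
    for a in attacks:
        for (src, dst) in conn:
            if a["vuln"] in vulns.get(dst, set()) and a["service"] in services.get(dst, set()):
                u, v = (dst, a["pre_priv"]), (dst, a["gained_priv"])
                if u != v:
                    edges.append((u, a["id"], v))

    # Defense instantiation: a defense edge undoes the state change of the attack it counters.
    for (u, attack_id, v) in list(edges):
        for d in defenses.get(attack_id, []):
            edges.append((v, d, u))

    # Remove isolated nodes (nodes without any incident edge).
    used = {u for (u, _, _) in edges} | {v for (_, _, v) in edges}
    return nodes & used, edges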

4. Stochastic Game Analysis and Strategy Selection Based on WoLF-PHC Intelligent Learning

In the previous section, cyber attack-defense was described as a bounded-rationality stochastic game problem, and the attack-defense stochastic game model AD-SGM was constructed. In this section, the reinforcement learning mechanism is introduced into the bounded-rationality stochastic game, and the WoLF-PHC algorithm is used to select defense strategies based on AD-SGM.

4.1. Principle of WoLF-PHC
4.1.1. Q-Learning Algorithm

Q-learning [24] is the basis of the WoLF-PHC algorithm and a typical model-free reinforcement learning algorithm. Its learning mechanism is shown in Figure 4. The agent in Q-learning obtains knowledge of returns and environmental state transitions through interaction with the environment. The knowledge is expressed by the payoff function Q(s, a) and is learned by the update

Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)],  (1)

where α is the payoff learning rate and γ is the discount factor. The strategy of Q-learning is π(s) = argmax_a Q(s, a).
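
As a quick reference, a dictionary-based Python sketch of one step of this standard update (parameter values are placeholders) is:

def q_update(Q, s, a, r, s_next, actions_next, alpha=0.2, gamma=0.9):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions_next), default=0.0)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q[(s, a)]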

4.1.2. PHC Algorithm

The Policy Hill-Climbing (PHC) algorithm [25] is a simple and practical gradient-based learning algorithm suitable for mixed strategies and is an improvement of Q-learning. The state-action payoff function of PHC is the same as in Q-learning, but the greedy policy update of Q-learning is no longer followed; instead, the mixed strategy is updated by a hill-climbing step toward the greedy action, as shown in equations (2)–(4), where δ ∈ (0, 1] is the strategy learning rate:

π(s, a) ← π(s, a) + Δ(s, a),  (2)

Δ(s, a) = −δ(s, a) if a ≠ argmax_{a′} Q(s, a′), and Δ(s, a) = Σ_{a′≠a} δ(s, a′) otherwise,  (3)

δ(s, a) = min(π(s, a), δ/(|A(s)| − 1)).  (4)

4.1.3. WoLF-PHC Algorithm

The WoLF-PHC algorithm improves PHC by introducing the WoLF ("Win or Learn Fast") mechanism, which gives the defender two different strategy learning rates: a low learning rate δ_w when winning and a high learning rate δ_l when losing, as shown in formula (5). The two learning rates enable the defender to adapt quickly to the attacker's strategy when performing worse than expected and to learn cautiously when performing better than expected. Most importantly, the WoLF mechanism guarantees the convergence of the algorithm [9]. WoLF-PHC uses the average strategy as the criterion for winning and losing, as shown in formulae (6) and (7).
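
For clarity, the following Python sketch shows the standard form of the WoLF-PHC policy-improvement step on top of a Q table: the average strategy is updated first, the comparison of expected payoffs under the current and average strategies selects the learning rate (win or lose), and a hill-climbing step shifts probability toward the greedy action. This is a generic sketch of the published algorithm [9], not the exact implementation used here; pi and pi_avg are per-state probability tables assumed to be initialized (e.g., uniformly), and counts tracks how often each state has been visited.

def wolf_phc_update(Q, pi, pi_avg, counts, s, actions, delta_win=0.01, delta_lose=0.04):
    """One WoLF-PHC policy-improvement step for state s (standard form)."""
    counts[s] = counts.get(s, 0) + 1
    # Update the average strategy toward the current strategy.
    for a in actions:
        pi_avg[s][a] += (pi[s][a] - pi_avg[s][a]) / counts[s]
    # Win or Learn Fast: compare expected payoffs of current vs. average strategy.
    expected_now = sum(pi[s][a] * Q.get((s, a), 0.0) for a in actions)
    expected_avg = sum(pi_avg[s][a] * Q.get((s, a), 0.0) for a in actions)
    delta = delta_win if expected_now > expected_avg else delta_lose
    # Policy hill climbing: move probability mass toward the greedy action.
    greedy = max(actions, key=lambda a: Q.get((s, a), 0.0))
    for a in actions:
        if a != greedy and len(actions) > 1:
            pi[s][a] -= min(pi[s][a], delta / (len(actions) - 1))
    pi[s][greedy] = 1.0 - sum(pi[s][a] for a in actions if a != greedy)
    return pi[s]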

4.2. Defense Decision-Making Algorithm Based on Improved WoLF-PHC

The decision-making process of our approach is shown in Figure 5, which consists of five steps. It receives two types of input data: attack evidence and abnormal evidence. All these pieces of evidence come from real-time intrusion detection systems. After decision making, the optimal security strategy is determined against detected intrusions.

In order to improve the learning speed of the WoLF-PHC algorithm and reduce its dependence on the amount of data, the eligibility trace is introduced into WoLF-PHC. An eligibility trace tracks the recently visited state-action pairs and assigns part of the current return to them. WoLF-PHC is an extension of Q-learning, and there are already several algorithms combining Q-learning with eligibility traces; this paper improves WoLF-PHC using the typical algorithm in [10]. The eligibility trace of each state-action pair is denoted e(s, d). Suppose the current network state is s and the executed action is d; the eligibility traces are updated as shown in formula (8), where λ ∈ [0, 1] is the trace attenuation factor.

WoLF-PHC, as an extension of Q-learning, is an off-policy algorithm: it evaluates the defense actions of each network state with the greedy policy while occasionally executing non-greedy defense actions in order to explore. To preserve this off-policy property, the state-action values are updated by formulae (9)–(12) along the actions actually selected for execution. Because only the recently visited state-action pairs have eligibility traces significantly greater than 0 while most other pairs have traces close to 0, in practical applications only the most recently visited state-action pairs need to be stored and updated, which reduces the memory and running-time overhead caused by the eligibility traces.
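
The value update described above can be sketched as follows: a generic Watkins-style Q(λ) step, which is the flavor of trace that [10] combines with Q-learning; names and default values are illustrative. The trace of the executed pair is incremented, every stored pair is moved toward the TD error in proportion to its trace, and the traces decay (or are cut after an exploratory, non-greedy action).

def q_lambda_update(Q, E, s, d, r, s_next, actions_next,
                    alpha=0.2, gamma=0.9, lam=0.9, greedy_action_taken=True):
    """Update Q values and eligibility traces E after executing action d in state s."""
    best_next = max((Q.get((s_next, a), 0.0) for a in actions_next), default=0.0)
    delta = r + gamma * best_next - Q.get((s, d), 0.0)      # TD error
    E[(s, d)] = E.get((s, d), 0.0) + 1.0                    # accumulate trace for the visited pair
    for key in list(E):
        Q[key] = Q.get(key, 0.0) + alpha * delta * E[key]   # credit recently visited pairs
        # Traces decay; Watkins's variant cuts them after an exploratory action.
        E[key] = gamma * lam * E[key] if greedy_action_taken else 0.0
    return delta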

To achieve good results, the defense decision-making approach based on WoLF-PHC needs four parameters to be set reasonably. (1) The payoff learning rate α lies in (0, 1]. The larger α is, the more weight newly observed returns carry and the faster the learning; the smaller α is, the better the stability of the algorithm. (2) The strategy learning rates satisfy 0 < δ_w < δ_l ≤ 1; their concrete values are chosen experimentally, and our experiments show that a suitable ratio of δ_l to δ_w gives better results. (3) The eligibility-trace attenuation factor λ lies in [0, 1] and is responsible for allocating credit to historical state-action pairs; it can be regarded as a time scale, and the larger λ is, the more credit is allocated to historical state-action pairs. (4) The discount factor γ lies in [0, 1) and represents the defender's preference between immediate and future returns: when γ approaches 0, future returns become nearly irrelevant and immediate returns dominate; when γ approaches 1, immediate returns matter less and future returns dominate.
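
As an illustration only (these concrete numbers are our assumptions, not the tuned settings of Table 3), a parameter configuration consistent with the ranges above could look like:

params = {
    "alpha": 0.2,        # payoff learning rate in (0, 1]: larger -> faster but less stable
    "delta_win": 0.01,   # strategy learning rate when winning (cautious)
    "delta_lose": 0.04,  # strategy learning rate when losing, with delta_lose > delta_win
    "lambda": 0.9,       # eligibility-trace attenuation factor in [0, 1]
    "gamma": 0.9,        # discount factor in [0, 1): closer to 1 weights future returns more
}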

The agent in WoLF-PHC corresponds to the defender in the attack-defense stochastic game model AD-SGM: the agent's state corresponds to the game state s ∈ S, the agent's behavior corresponds to the defense action d ∈ D(s), the agent's immediate return corresponds to the defender's immediate return R, and the agent's strategy corresponds to the defense strategy π.

On the basis of the above, the specific defense decision-making approach is given as Algorithm 2. The first step of the algorithm initializes the attack-defense stochastic game model and the related parameters; the network states and attack-defense actions are extracted by Algorithm 1. The second step detects the current network state. Steps 3–22 make defense decisions and learn online: steps 4-5 select a defense action according to the current strategy, steps 6–14 update the payoffs using the eligibility traces, and steps 15–21 use the new payoffs and the hill-climbing rule to update the defense strategy.

The spatial complexity of Algorithm 2 is concentrated in the storage of the state-action tables Q(s, d), e(s, d), π(s, d), and the average strategy π̄(s, d). Let |S| be the number of states, and let |D| and |A| be the numbers of measures available to the defender and the attacker in each state, respectively, so the required storage grows on the order of |S|(|D| + |A|). Because the proposed algorithm does not need to solve the game equilibrium, its computational complexity is far lower than that of the recent method based on the evolutionary game model [14], which greatly increases the practicability of the algorithm.

5. Experimental Analysis

5.1. Experiment Setup

In order to verify the effectiveness of the approach, a typical enterprise network as shown in Figure 6 is built for the experiment. Attacks and defenses occur on the intranet, with the attacker coming from the extranet. As the defender, the network administrator is responsible for the security of the intranet. Due to the settings of Firewall 1 and Firewall 2, legitimate users of the external network can only access the web server, which in turn can access the database server, FTP server, and e-mail server.

The simulation experiment was carried out on a PC with an Intel Core i7-6300HQ @3.40 GHz, 32 GB RAM, and the Windows 10 64-bit operating system. The Python 3.6.5 interpreter was installed, the vulnerability information in the experimental network was scanned with the Nessus toolkit as shown in Table 1, and the network topology information was collected with the ArcGIS toolkit. We wrote the project code in Python. During the experiment, about 25,000 rounds of attack-defense strategy learning were carried out. The experimental results were analyzed and displayed using Matlab 2018a, as described in Section 5.3.

Referring to the MIT Lincoln Laboratory attack-defense behavior database, the attack and defense templates are constructed. The attack-defense graph is generated over the attacker host A, web server W, database server D, FTP server F, and e-mail server E. To facilitate display and description, it is divided into an attack graph and a defense graph, shown in Figures 7 and 8, respectively. The meanings of the defense actions in the defense graph are given in Table 2.

5.2. Construction of the Experiment Scenario AD-SGM

(1) N = (attacker, defender) are the players participating in the game, representing the cyber attacker and defender, respectively

(2) The state set of the stochastic game is S, which consists of the network states extracted from the nodes shown in Figures 7 and 8

(3) The action set of the defender is D = {D(s)}, where the defense actions in each state are extracted from the edges in Figure 8

(4) The immediate returns of the defender are quantified following [16, 26]

(5) In order to test the learning performance of Algorithm 2 more fully, the defender's state-action payoff Q is initialized uniformly to 0, without introducing additional prior knowledge

(6) The defender's defense strategy is initialized with the average strategy, that is, π(s, d) = 1/|D(s)| for every d ∈ D(s), so that no additional prior knowledge is introduced

5.3. Testing and Analysis

The experiment in this section has three purposes. The first is to test the influence of different parameter settings on Algorithm 2 so as to find the experimental parameters suitable for this scenario. The second is to compare this approach with existing typical approaches to verify its advantages. The third is to test the effectiveness of the eligibility-trace-based improvement to the WoLF-PHC algorithm.

From Figures 7 and 8, we can see that one network state has the most complex and therefore most representative attack-defense strategy selection. The performance of the algorithm is analyzed by selecting this state for the experiments; the analysis of the other network states proceeds in the same way.

5.3.1. Parameter Test and Analysis

Different parameters affect the speed and effect of learning, and at present there is no theory that determines the specific values. In Section 4, the relevant parameters were preliminarily analyzed; on this basis, different parameter settings are further tested to find those suitable for this attack-defense scenario. Six parameter settings were tested, as listed in Table 3. In the experiment, the attacker's initial strategy is a random strategy, and the attacker's learning mechanism is the same as the approach in this paper.

The probabilities with which the defender selects the candidate defense actions in the chosen state are shown in Figure 9. The learning speed and convergence of the algorithm under different parameter settings can be observed from Figure 9: settings 1, 3, and 6 learn faster, obtaining the best strategy after fewer than 1500 rounds of learning, but the convergence of settings 3 and 6 is poor. Although settings 3 and 6 can also learn the best strategy, they oscillate afterwards, and their stability is not as good as that of setting 1.

The defense payoff reflects how well the strategy is optimized. To ensure that the payoff value does not reflect only a single defense result, the average of every 1000 defense payoffs is taken, and the change of this average is shown in Figure 10. As can be seen from Figure 10, the payoff of setting 3 is significantly lower than that of the other settings, but the relative merits of the remaining settings are difficult to distinguish. For a more intuitive comparison, the averages of the 25,000 defense payoffs under the different settings in Figure 10 are shown in Figure 11, from which we can see that the averages of settings 1 and 5 are higher. For further comparison, the standard deviations of settings 1 and 5 are additionally calculated on the basis of the averages to reflect the dispersion of the payoffs. As shown in Figure 12, the standard deviations of settings 1 and 5 are both small, and that of setting 1 is smaller than that of setting 5.

In conclusion, setting 1 of the six sets of parameters is the most suitable for this scenario. Since setting 1 has achieved an ideal effect and can meet the experimental requirements, it is no longer necessary to further optimize the parameters.

5.3.2. Comparisons

In this section, the stochastic game approach [16] and the evolutionary game approach [20] are selected for comparative experiments with our approach. According to the attacker's learning ability, two groups of comparative experiments are designed. In the first group, the attacker's learning ability is weak and the attacker does not adjust its strategy according to the attack-defense results. In the second group, the attacker's learning ability is strong and the attacker adopts the same learning mechanism as the approach in this paper. In both groups, the attacker's initial strategy is a random strategy.

In the first group of experiments, the defense strategy of this approach is shown in Figure 9(a). The approach of [16] yields a fixed equilibrium defense strategy, while the approach of [20] yields an evolutionarily stable equilibrium defense strategy. The change of the average payoff per 1000 defenses of the three approaches is shown in Figure 13.

From the strategies and payoffs of the three approaches, we can see that the approach in this paper learns from the attacker's strategy and adjusts to the optimal strategy, so it obtains the highest payoff. Wei et al. [16] adopt a fixed strategy against any attacker; when the attacker is constrained by bounded rationality and does not adopt the Nash equilibrium strategy, the payoff of that approach is low. Although [20] takes the learning of both attacker and defender into account, the parameters required by its model are difficult to quantify accurately, which makes the final result deviate from the actual one, so its payoff is still lower than that of the approach in this paper.

In the second group of experiments, the strategies of [16, 20] remain the same as in the first group. The decision making of this approach is shown in Figure 14: after about 1800 rounds of learning, the approach becomes stable and converges to the same defense strategy as that of [16]. As can be seen from Figure 15, the payoff of [20] is lower than that of the other two approaches. The average payoff of this approach in the first 2000 defenses is higher than that of [16], and afterwards it is almost the same as that of [16]. Combining Figures 14 and 15, we can see that while the learning attacker has not yet obtained the Nash equilibrium strategy in the initial stage, the approach in this paper performs better than [16]; when the attacker learns the Nash equilibrium strategy, the approach in this paper converges to the Nash equilibrium strategy, and its performance becomes the same as that of [16].

In conclusion, when facing attackers with weak learning ability, the approach in this paper is superior to those in [16, 20]. When facing an attacker with strong learning ability, if the attacker has not obtained the Nash equilibrium through learning, this approach is still better than [16, 20]; if the attacker does obtain the Nash equilibrium through learning, this approach converges to the same Nash equilibrium strategy as [16], achieving the same effect as [16] while remaining superior to [20].

5.3.3. Test Comparison with and without Eligibility Trace

This section tests the actual effect of the eligibility trace on Algorithm 2. The effect of the eligibility trace on strategy selection is shown in Figure 16, from which we can see that the algorithm learns faster with the eligibility trace: it converges to the optimal strategy after about 1000 rounds of learning, whereas without the eligibility trace it needs about 2500 rounds to converge.

The change of the average payoff per 1000 defenses is shown in Figure 17, from which we can see that after convergence the payoffs with and without the eligibility trace are almost the same. Figure 17 also shows that in the roughly 3000 defenses before convergence, the payoff with the eligibility trace is higher than that without it. To verify this further, the average of the first 3000 defense payoffs with and without the eligibility trace is computed 10 times each. The results, shown in Figure 18, further demonstrate that in the preconvergence defense phase the eligibility trace yields higher payoffs than no eligibility trace.

Adding the eligibility trace accelerates learning but also brings additional memory and computation overhead. In the experiment, only the 10 most recently visited state-action pairs were saved, which effectively limited the increase in memory consumption. To test the computational cost of the eligibility trace, the time taken by the algorithm to make 100,000 defense decisions was measured 20 times each with and without the eligibility trace. The averages over the 20 runs were 9.51 s with the eligibility trace and 3.74 s without it. Although introducing the eligibility trace increases the decision-making time by a factor of nearly 2.5, the time required for 100,000 decisions is still only 9.51 s, which meets the real-time requirement.

In summary, the introduction of the eligibility trace, at the expense of a small amount of memory and computation overhead, effectively increases the learning speed of the algorithm and improves the defense payoff.

5.4. Comprehensive Comparisons

This approach is compared with some typical research results in Table 4. The approaches of [3, 12, 14, 16] are based on the assumption of complete rationality; the equilibrium strategies they obtain are difficult to realize in practice and offer limited guidance for actual defense decision making. Both [20] and this paper adopt the bounded-rationality hypothesis and are therefore more practical, but [20] is based on the theory of biological evolution and mainly studies population evolution. Its core is not the optimal strategy choice of an individual player but the strategy adjustment process, trend, and stability of a group of bounded-rationality players, where stability refers to the proportions of group members adopting particular strategies rather than the strategy of an individual player; it is therefore not suitable for guiding individual real-time decision making. In contrast, the defender in the proposed approach adopts a reinforcement learning mechanism, learning from systematic feedback during the confrontation with the attacker, which is more suitable for studying individual strategies.

6. Conclusions and Future Works

In this paper, the cyber attack-defense confrontation is abstracted as a stochastic game problem under the restriction of bounded rationality. A host-centered attack-defense graph model is proposed to extract network states and attack-defense actions, and an algorithm to generate the attack-defense graph is designed, which effectively compresses the game state space. A WoLF-PHC-based defense decision-making approach is proposed, which enables defenders under bounded rationality to make optimal choices when facing different attackers. The eligibility trace is introduced to improve the WoLF-PHC algorithm, which speeds up the defender's learning and reduces the algorithm's dependence on data. This approach not only satisfies the constraints of bounded rationality but also does not require the defender to know much information about the attacker, making it a more practical defense decision-making approach.

Future work will further optimize the win/lose criterion of the WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning and increase the defense payoff.

Data Availability

The data that support the findings of this study are not publicly available due to restrictions as the data contain sensitive information about a real-world enterprise network. Access to the dataset is restricted by the original owner. People who want to access the data should send a request to the corresponding author, who will apply for permission of sharing the data from the original owner.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National High Technology Research and Development Program of China (863 Program) (2014AA7116082 and 2015AA7116040) and National Natural Science Foundation of China (61902427). In particular, we thank Junnan Yang and Hongqi Zhang of our research group for their previous work on stochastic game. We sincerely thank them for their help in publishing this paper.