Abstract

Although existing malicious domain detection techniques have shown great success in many real-world applications, the problem of learning from imbalanced data has received little attention to date. Yet actual DNS traffic is inherently imbalanced; thus how to build a malicious domain detection model oriented to imbalanced data is an important issue worthy of study. This paper proposes a novel imbalanced malicious domain detection method based on passive DNS traffic analysis, which can effectively deal with not only the between-class imbalance problem but also the within-class imbalance problem. Experiments show that the proposed method performs favorably compared to existing algorithms.

1. Introduction

With the rapid development of the Internet and information technology, network security threats are escalating: threats in cyberspace are becoming more complex and covert, network security risks are increasing, and malicious network attacks emerge endlessly. Most of these malicious attacks are based on DNS (Domain Name System). DNS provides an available infrastructure for attackers precisely because it is open and easy to use.

The core of a DNS-based malicious network attack is the C&C (Command and Control) server. By means of the C&C server, attackers can order remote hosts to perform malicious activities such as spamming, phishing, DDoS (Distributed Denial of Service), and distributing malware, which may be used to steal information, disrupt computers, extort money, etc. Therefore, it is urgent to detect the malicious domains of C&C servers and take corresponding countermeasures.

It is very popular in current research to employ classification algorithms from machine learning to detect malicious domains [1, 2]. However, these existing studies pay no or little attention to the problem of imbalanced data. In fact, actual DNS traffic is inherently imbalanced: most cases are benign and far fewer are malicious. As a result, the training datasets constructed from such traffic contain many more samples of some categories than of others. When learning from an imbalanced dataset, the class distribution must be taken into account; otherwise the classifier will be overwhelmed by the majority classes and ignore the minority ones, and the overall classification performance will undoubtedly degrade. To address this shortfall, this paper proposes an imbalanced malicious domain detection method which builds a detection model by learning from an imbalanced dataset based on passive DNS traffic analysis.

In this paper we make the following contributions:
(1) We focus specifically on learning from imbalanced data in the malicious domain detection field, and bring the latest research progress on imbalanced learning from other fields into malicious domain detection.
(2) We construct stronger discriminative features to profile malicious domains based on passive DNS traffic analysis.
(3) We propose an improved imbalanced malicious domain detection method which is an extension of EasyEnsemble and demonstrate its favorable performance through comparative experiments.

The remainder of this paper is organized as follows. In Section 2, we briefly review related work. Section 3 describes how to profile malicious domains based on passive DNS traffic analysis. We elaborate on the imbalanced malicious domain detection method in Section 4. Section 5 presents our comparative experiments with this new method. Finally, we conclude the paper in Section 6.

2. Related Work

2.1. Learning from Imbalanced Data

Although rarely addressed in network security, learning from imbalanced data has already made considerable progress in other fields. In general, there are three ways to tackle the imbalanced learning problem. The first is from the data perspective, which mainly uses resampling approaches to modify the class distribution of the data. The second is from the algorithm perspective, which mostly focuses on optimizing various algorithms, such as SVM (Support Vector Machine), Decision Tree, and Neural Network, based on cost-sensitive learning, which considers the costs associated with misclassifying samples [3]. In addition, some research also utilizes one-class learning [4], which is particularly useful on extremely imbalanced datasets. The third is from the data-feature perspective, which builds a fair feature space attaching more weight to the minority classes by means of improved feature selection methods. This third approach is applied in many applications, including fraud/churn detection, text categorization, medical diagnosis, detection of software defects, and many others [5].

Most research has focused on the first approach, resampling, which is more practical than the other two. Resampling includes undersampling, oversampling, and the integration of undersampling and oversampling [6]. The key idea of undersampling is to remove majority class samples from the original dataset, and the key idea of oversampling is to add minority class samples to the original dataset.

The simplest resampling technique is random sampling. But random undersampling can potentially remove important samples, and random oversampling can lead to overfitting. Various improved undersampling algorithms, including EasyEnsemble and BalanceCascade, have been proposed [7]. Both methods utilize ensemble learning to overcome the information loss introduced by traditional random undersampling, since ensemble learning is based on multiple subsets which contain more information than a single one [8]. The well-known improved oversampling algorithms are SMOTE (Synthetic Minority Oversampling Technique) [9] and its variants, such as Borderline-SMOTE [10] and ADASYN (Adaptive Synthetic Sampling) [11], which strive to create high-quality artificial minority class samples using different strategies.

In practical applications, when the samples of the minority classes are absolutely rare, oversampling is generally employed to increase them; when the samples of the minority classes are only relatively rare, undersampling is generally employed to decrease the samples of the majority classes.

2.2. Malicious Domain Detection Based on Passive DNS Traffic Analysis

The majority of detection methods based on DNS traffic are data-driven, most commonly with machine learning algorithms at their core. These methods require accurate ground truth of both malicious and benign DNS traffic for model training as well as for performance evaluation [12]. DNS data collection methods can generally be divided into two subcategories: active and passive. Active methods obtain DNS data by deliberately sending DNS queries and recording the corresponding DNS responses, while passive methods passively record real DNS queries and responses.

Compared with active DNS data collection, passive DNS data collection is more representative and more comprehensive. As a result, the detection of malicious domains based on passive DNS traffic analysis has received increasing attention from the research community over the past decade. "Passive DNS" was invented by Weimer [13] in 2004. Since then, many researchers have gained an insight into the value of passive DNS for incident-response investigations, and many passive DNS systems have been developed, of which the most famous and popular is DNSDB from Farsight Security. Farsight collects passive DNS data from its global sensor array and then filters and verifies the DNS transactions before inserting them into the DNSDB [14]. The trends within this dataset are believed to be representative of Internet-wide trends and therefore provide valuable insight.

Antonakakis et al. [1] proposed a dynamic reputation system for DNS, called Notos, to automatically assign a low reputation score to a malicious domain. To measure a number of statistical features of a domain, Notos used historical DNS information collected passively from multiple recursive DNS resolvers distributed across the Internet. Bilge et al. [2] introduced a passive DNS analysis approach and a detection system, EXPOSURE, to detect domain names involved in malicious activities. The data EXPOSURE used for its initial training consisted of real-time DNS response traffic from authoritative Name Servers located in North America and Europe.

Perdisci et al. [15] presented FluxBuster, a novel detection system that uses a purely passive approach for detecting and tracking malicious flux networks. FluxBuster is based on large-scale passive analysis of DNS traffic generated by hundreds of local recursive DNS (RDNS) servers located in different networks and scattered across several geographical locations. Zhou et al. [16] proposed a model that detects fast-flux domains using the random forest algorithm; it used passive DNS to log the domain name query history of a real campus network environment.

Analyzing these existing works, we found that most of them collect DNS traffic over a limited period of time to form a passive DNS set. This kind of passive DNS set is only a DNS data fragment and incurs a higher collection cost, while DNSDB is relatively comprehensive. As a result, we decided to use the passive DNS traffic from DNSDB in this paper.

3. Profiling Malicious Domains Based on Passive DNS Traffic Analysis

To profile malicious domains based on passive DNS traffic analysis, we extract two groups of features: static lexical features and dynamic DNS resolving features. Static lexical features mainly originate from the lexical information of the domain name. Dynamic DNS resolving features are constructed from DNS response attributes. Table 1 gives an overview of these features.

The results of the statistical analysis of selected features are shown in Figure 1. From these, we can see that the features have a strong ability to distinguish malicious domains from benign ones.

In this section, we present the 12 static lexical features and the 4 dynamic DNS resolving features, together with the motivation for constructing them to profile malicious domains.

3.1. Static Lexical Features

To avoid detection, attackers generally employ domain generation algorithms (DGAs) to dynamically produce a large number of random domain names. The lexical features of these malicious domain names differ markedly from those of benign domain names. We construct 12 static lexical features to profile malicious domains.

Since most short domain names have already been registered, the majority of malicious domain names generated by DGAs are longer than benign domain names, and the maximum length of labels (i.e., parts delimited by dots) in the subdomain of a malicious domain name is also commonly greater. So we construct two features based on the length measure: length of the domain name (Feature 1) and maximum length of labels in the subdomain (Feature 2).

The most distinctive property of domain names generated by a DGA is that the distribution of characters is random. Information entropy is defined as the average amount of information produced by a stochastic source of data [17], so we employ information entropy to measure the disorder of the characters.

Let $d$ be a domain name and $m$ be the number of distinct characters in $d$. We define $\mathrm{entropy}(d)$ as the character entropy of $d$ (Feature 3):

$$\mathrm{entropy}(d) = -\sum_{i=1}^{m} \frac{n_i}{L}\log_2\frac{n_i}{L},$$

where $c_i$ denotes a distinct character in $d$, $n_i$ is the number of occurrences of $c_i$ in $d$, and $L$ is the length of $d$.

The greater the character entropy value of $d$, the more likely $d$ is to be identified as malicious.
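As an illustration, here is a minimal Python sketch of Feature 3; the preprocessing (lower-casing and stripping the dots) is an assumption, since the paper does not fix these details.

import math
from collections import Counter

def char_entropy(domain: str) -> float:
    # Character entropy of a domain name (Feature 3), per the formula above.
    name = domain.lower().replace(".", "")   # assumption: ignore case and label separators
    if not name:
        return 0.0
    counts = Counter(name)                   # n_i for each distinct character
    length = len(name)                       # L
    return -sum((n / length) * math.log2(n / length) for n in counts.values())

# A DGA-like name typically scores higher than a dictionary-based one:
print(char_entropy("x7k9qzt3ab.example.com"))   # relatively high
print(char_entropy("google.com"))               # relatively low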

In addition, malicious domain names are used by malware, not by humans, so they are neither easy to remember nor human-pronounceable. Thus the appearance of numerical and alphabetic characters in malicious domain names is also a very important indicative sign. With this insight, we construct six features as follows: number of numerical characters (Feature 4), ratio of numerical characters (Feature 5), conversion frequency between numerical and alphabetic characters (Feature 6), maximum length of continuous numerical characters (Feature 7), maximum length of continuous alphabetic characters (Feature 8), and maximum length of continuous identical alphabetic characters (Feature 9).

As we all know, consonant letters far outnumber vowel letters in the English alphabet. Therefore, in random malicious domain names, the ratio of vowels (Feature 10) is smaller, the maximum length of continuous consonants (Feature 11) is longer, and the conversion frequency between vowels and consonants (Feature 12) is higher.
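For concreteness, the following Python sketch computes rough versions of Features 1-2 and 4-12 for a single domain name; the exact counting rules (dots, case, hyphens) are assumptions made for illustration and may differ from the implementation behind the experiments.

import re

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")
DIGITS = set("0123456789")

def switches(seq, class_a, class_b):
    # count adjacent positions where the character class flips between class_a and class_b
    return sum(1 for x, y in zip(seq, seq[1:])
               if (x in class_a and y in class_b) or (x in class_b and y in class_a))

def max_run(pattern, text):
    # length of the longest substring matching the given character-class pattern
    return max((len(m) for m in re.findall(pattern, text)), default=0)

def lexical_features(domain: str) -> dict:
    labels = domain.lower().split(".")
    name = "".join(labels)                    # characters without dots
    letters = [ch for ch in name if ch.isalpha()]
    digits = [ch for ch in name if ch.isdigit()]
    return {
        "length": len(domain),                                                # Feature 1
        "max_label_len": max(len(l) for l in labels),                         # Feature 2 (approximation)
        "num_digits": len(digits),                                            # Feature 4
        "digit_ratio": len(digits) / max(len(name), 1),                       # Feature 5
        "digit_alpha_switches": switches(name, DIGITS, VOWELS | CONSONANTS),  # Feature 6
        "max_digit_run": max_run(r"\d+", name),                               # Feature 7
        "max_alpha_run": max_run(r"[a-z]+", name),                            # Feature 8
        "max_same_char_run": max((len(m.group(0)) for m in re.finditer(r"([a-z])\1*", name)), default=0),  # Feature 9
        "vowel_ratio": sum(ch in VOWELS for ch in letters) / max(len(letters), 1),   # Feature 10
        "max_consonant_run": max_run(r"[bcdfghjklmnpqrstvwxyz]+", name),      # Feature 11
        "vowel_consonant_switches": switches(letters, VOWELS, CONSONANTS),    # Feature 12
    }

print(lexical_features("x7k9qzt3ab.example.com"))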

3.2. Dynamic DNS Resolving Features

Internet-scale attacks using DNS inevitably leave a trail of footprints hidden in the DNS resolving records, so we can mine these footprints (i.e., DNS resolving features) to profile malicious domains. In this section, we present 4 dynamic resolving features originating from the DNS resolving records.

In order to evade blacklists and resist takedowns, the DNS answer returned by the server for a malicious domain generally consists of multiple DNS A records (i.e., Address records) or NS records (i.e., Name Server records), and slippery attackers do not usually confine themselves to a specific Name Server or IP range. Therefore, we construct four statistical features as follows: number of distinct A records (Feature 13), IP entropy of the domain name (Feature 14), number of distinct NS records (Feature 15), and similarity of NS domain names (Feature 16).

Number of distinct A records (Feature 13) records the total number of IP addresses resolved in DNSDB. Furthermore, IP entropy of the domain name (Feature 14) is constructed to measure the dispersion of these resolved IP addresses. Let $d$ be a domain name, $S$ be the set of its resolved IP addresses, and $n$ be the number of distinct IP/16 prefixes in $S$. We define $\mathrm{entropy}_{IP}(d)$ as the IP entropy of the domain name (Feature 14):

$$\mathrm{entropy}_{IP}(d) = -\sum_{i=1}^{n} \frac{m_i}{|S|}\log_2\frac{m_i}{|S|},$$

where $p_i$ denotes an IP/16 prefix in $S$, $m_i$ is the number of IP addresses in $S$ with prefix $p_i$, and $|S|$ is the size of $S$.

The greater the IP entropy value of $d$, the more likely $d$ is to be identified as malicious.
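A minimal sketch of Feature 14, assuming the distinct A-record addresses for the domain have already been retrieved from DNSDB, might look as follows.

import math
from collections import Counter

def ip_entropy(resolved_ips):
    # IP entropy of a domain (Feature 14): entropy over the /16 prefixes of its resolved IPs.
    prefixes = [".".join(ip.split(".")[:2]) for ip in resolved_ips]   # /16 prefix, e.g. "203.0"
    total = len(prefixes)                                             # |S|
    if total == 0:
        return 0.0
    counts = Counter(prefixes)                                        # m_i for each distinct prefix
    return -sum((m / total) * math.log2(m / total) for m in counts.values())

# A flux-like domain scattered over many /16 ranges scores high:
print(ip_entropy(["1.2.3.4", "5.6.7.8", "9.10.11.12"]))   # three distinct prefixes -> log2(3)
print(ip_entropy(["1.2.3.4", "1.2.9.9", "1.2.0.1"]))      # a single prefix -> 0.0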

Number of distinct NS records (Feature 15) records the total number of Name Servers resolved in DNSDB. Furthermore, similarity of NS domain names (Feature 16) is constructed to measure the difference between these resolved Name Servers. We calculate the edit distance between every pair of Name Server names of a domain, and the average of these distances is defined as the similarity of NS domain names. The larger this value is for a domain $d$, the more likely $d$ is to be identified as malicious.
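The sketch below illustrates Feature 16 as the average pairwise edit distance just described; the plain dynamic-programming Levenshtein distance is an assumption standing in for whatever edit-distance implementation the authors used.

from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    # classic Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ns_name_feature(ns_names):
    # Feature 16: average pairwise edit distance between a domain's Name Server names.
    # A larger value means the NS names are more diverse, hence more suspicious.
    pairs = list(combinations(ns_names, 2))
    if not pairs:
        return 0.0
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

print(ns_name_feature(["ns1.example.com", "ns2.example.com"]))        # small
print(ns_name_feature(["ns1.evil-dga.biz", "srv9.other-host.ru"]))    # large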

4. An Imbalanced Malicious Domains Detection Method

Almost all classification algorithms seem to be powerless when learning from an extremely imbalanced training dataset. In consideration of the actual imbalanced distribution of DNS traffic data (i.e., malicious domains are relatively rare), and inspired by existing methods, our research focuses on the combination of undersampling and ensemble learning.

Among existing methods, EasyEnsemble [7] is a typical improved algorithm combining undersampling with ensemble learning. As we know, the main deficiency of undersampling is that potentially useful information contained in the unselected examples is neglected. To remedy this deficiency, EasyEnsemble incorporates ensemble learning into undersampling.

The idea behind EasyEnsemble is quite simple. Given the majority class instance set $N$ and the minority class instance set $P$, this method independently samples several subsets $N_1, N_2, \ldots, N_T$ from $N$, where $|N_i| = |P|$ ($i = 1, 2, \ldots, T$). For each subset $N_i$, a base classifier $H_i$ is trained using $N_i$ and $P$. All base classifiers are combined for the final decision. Remarkably, many learning algorithms can be employed to generate the base classifiers.
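A minimal Python sketch of this procedure, with plain decision trees as base classifiers and majority voting as the combination rule (both simplifications; the original EasyEnsemble boosts each subset), could look like this:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble(X_maj, y_maj, X_min, y_min, n_subsets=10, seed=0):
    # Undersample the majority class into n_subsets balanced subsets and
    # train one base classifier on each subset together with the minority class.
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)   # |N_i| = |P|
        X_bal = np.vstack([X_maj[idx], X_min])
        y_bal = np.concatenate([y_maj[idx], y_min])
        classifiers.append(DecisionTreeClassifier().fit(X_bal, y_bal))
    return classifiers

def ensemble_predict(classifiers, X):
    # combine the base classifiers by majority vote (labels assumed to be 0/1)
    votes = np.stack([clf.predict(X) for clf in classifiers])
    return (votes.mean(axis=0) >= 0.5).astype(int)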

EasyEnsemble makes better use of the majority class than plain undersampling by virtue of ensemble learning, so it is very helpful for between-class imbalance learning. However, EasyEnsemble ignores within-class imbalance, especially in the majority class. That is, some instances in the majority class are highly similar and form several clusters, while many other instances are almost unique. This phenomenon is commonly called a "long-tailed distribution" in the statistical sense.

We should therefore select a representative subset from each cluster and combine it with a subset selected randomly from the remaining unique instances to form a preliminary subset. Following this idea, we propose an improved EasyEnsemble method to learn from imbalanced DNS traffic data.

In this novel method, the instances in the majority class $N$ are first clustered into several small groups $C_1, C_2, \ldots, C_m$ by Hierarchical Agglomerative Clustering (HAC). From each cluster $C_j$ ($j = 1, 2, \ldots, m$) we randomly select a number of instances in proportion to the size of $C_j$, with a total of $K$ instances. We then randomly select $|N_i| - K$ ($0 < K < |N_i|$) instances from $N - \sum_j C_j$ to form a subset $N_i$, where $|N_i| = |P|$. A base classifier $H_i$ is trained using $N_i$ and $P$, and all base classifiers are combined for the final decision. Note that the Decision Tree algorithm is employed to generate the base classifiers.

The pseudocode of the improved EasyEnsemble, named HAC_EasyEnsemble, is shown in Algorithm 1.

(1) Input: a set of minority class examples P, a set of majority class examples N with |P| < |N|, and T, the number of subsets to sample from N
(2) N is clustered into several small groups C_1, C_2, ..., C_m by HAC
(3) i <- 0
(4) repeat
(5)   i <- i + 1
(6)   Select randomly instances from each cluster C_j (j = 1, 2, ..., m) with a total of K
(7)   Select randomly |P| - K instances from N - Sum_j C_j
(8)   Combine the datasets sampled in steps (6) and (7) to form a subset N_i, where |N_i| = |P|
(9)   Learn H_i using N_i and P, where H_i is a base classifier trained with a Decision Tree
(10) until i = T
(11) Output: an ensemble H(x) = arg max_c Sum_{i=1}^{T} I(H_i(x) = c)

Note that $I(\cdot)$ here is an indicator function and $c$ is the class label: if its argument is true, it returns 1, otherwise 0. In HAC, various cluster proximity measures may be employed; typical ones are complete link, group average, and Ward's method [18]. For complete link, the proximity of two clusters is defined as the maximum distance (minimum similarity) between any two points in the two different clusters. For group average, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters. For Ward's method, the proximity of two clusters is defined as the increase in the squared error that results when the two clusters are merged [19].
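To show how these pieces fit together, the following sketch implements the steps of Algorithm 1 with scikit-learn's AgglomerativeClustering for the HAC step. The number of clusters and the choice of K (here half of |P|, allocated to clusters in proportion to their size) are assumptions, since the paper does not fix these implementation details.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier

def hac_easy_ensemble(X_maj, y_maj, X_min, y_min,
                      n_subsets=10, n_clusters=20, linkage="ward", seed=0):
    rng = np.random.default_rng(seed)
    # Step (2): cluster the majority class with HAC (linkage = "ward", "complete" or "average").
    clusters = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage).fit_predict(X_maj)
    subset_size = len(X_min)            # |N_i| = |P|
    k_total = subset_size // 2          # assumption: K is half of |P|
    classifiers = []
    for _ in range(n_subsets):          # steps (4)-(10): build T subsets and base classifiers
        chosen = []
        # Step (6): draw from each cluster in proportion to its size, K instances in total.
        for c in np.unique(clusters):
            members = np.where(clusters == c)[0]
            quota = max(1, int(round(k_total * len(members) / len(X_maj))))
            chosen.extend(rng.choice(members, size=min(quota, len(members)), replace=False))
        chosen = list(dict.fromkeys(chosen))[:k_total]
        # Step (7): fill the remainder randomly from the majority instances not yet chosen.
        rest = np.setdiff1d(np.arange(len(X_maj)), chosen)
        fill = rng.choice(rest, size=subset_size - len(chosen), replace=False)
        idx = np.concatenate([np.asarray(chosen, dtype=int), fill])
        # Steps (8)-(9): form the balanced subset and learn a decision-tree base classifier.
        X_bal = np.vstack([X_maj[idx], X_min])
        y_bal = np.concatenate([y_maj[idx], y_min])
        classifiers.append(DecisionTreeClassifier().fit(X_bal, y_bal))
    return classifiers   # step (11): combine by majority vote, e.g. with ensemble_predict above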

5. Experiment

To evaluate the novel HAC_EasyEnsemble algorithm for learning from imbalanced DNS traffic data, we conduct a series of experiments comparing the performance of HAC_EasyEnsemble and EasyEnsemble on the same dataset. We use three different cluster proximity measures in HAC: complete link, group average, and Ward's method.

We first construct an imbalanced training set containing 6400 benign domains (from alexa.com) and 3000 malicious domains (from cybercrime-tracker.net, malwaredomains.com, hosts-file.net, etc.). We choose this ratio because the HAC_EasyEnsemble algorithm is designed for malicious domains that are relatively rare rather than absolutely rare. The DNS resolving records of these domains are obtained through the DNSDB API, and then the 12 static lexical features and 4 dynamic DNS resolving features listed in Section 3 are constructed from these records.

Commonly, the evaluation measures for imbalanced classification are the macro-averaged precision, macro-averaged recall, and macro-averaged F1 [20]. Since macro-averaged scores are averaged over the number of categories, the performance of the classifier is not dominated by the major categories. Let $P_i$ and $R_i$ be the precision and recall of category $i$, and let $Q$ denote the total number of categories. Then the macro-averaged precision is $P_{macro} = \frac{1}{Q}\sum_{i=1}^{Q} P_i$, the macro-averaged recall is $R_{macro} = \frac{1}{Q}\sum_{i=1}^{Q} R_i$, and the macro-averaged F1 is $F1_{macro} = \frac{1}{Q}\sum_{i=1}^{Q} F1_i$, where $F1_i = \frac{2 P_i R_i}{P_i + R_i}$.
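These scores can be computed per class and then averaged with equal weight, as in the following sketch; the toy example shows why the benign majority cannot mask a miss on the minority class.

from sklearn.metrics import precision_recall_fscore_support

def macro_scores(y_true, y_pred):
    # per-class precision/recall/F1, averaged over the classes with equal weight
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
    return p, r, f1

# 9 benign samples vs. 1 malicious one; predicting everything benign still
# drags the macro-averaged recall down to 0.5.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(macro_scores(y_true, y_pred))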

To determine the number T of base classifiers of the HAC_EasyEnsemble classification model mentioned in Section 4, we first perform a series of experiments. In these experiments we set different values of T for the HAC_EasyEnsemble classification model and observe the classification error rate for each T. Figure 2 shows the relationship between the number of base classifiers of the HAC_EasyEnsemble classification model and the classification error rate.

From Figure 2, we can see that when the number of base classifiers reaches approximately 10, the classification error rate levels off. Consequently, in the following comparative experiments, the number of base classifiers is set to 10.

Tenfold cross validation is performed on the experimental dataset. For this purpose, the corpus is initially partitioned into ten folds. In each experiment, nine folds are used for training while one fold is used for testing. The ten experiment results are shown in Figure 3, and the average of the ten results is reported in Table 2.
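This protocol can be sketched as below, with a hypothetical train_and_score callback standing in for training an HAC_EasyEnsemble model on nine folds and returning its macro-averaged scores on the remaining fold; the use of stratified splitting is an assumption.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def tenfold_evaluation(X, y, train_and_score, seed=0):
    # stratified splitting keeps the benign/malicious ratio similar in every fold
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = [train_and_score(X[tr], y[tr], X[te], y[te]) for tr, te in skf.split(X, y)]
    return np.mean(scores, axis=0)   # average of the ten experiment results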

Figure 3 gives further insight into the comparison of complete link clustering, group average clustering, Ward's method clustering, and nonclustering in line-chart form. It can be seen that, in nearly every experiment, the scores with clustering are higher overall than those without clustering. Ward's method performs best among them, while complete link and group average are at almost the same level.

Table 2 shows the macro-averaged P, R, and F1 scores of each scheme. Compared to nonclustering, the macro-averaged F1 scores of Ward's method clustering, group average clustering, and complete link clustering are improved by approximately 3.5%, 2.6%, and 2.3%, respectively. We can thus conclude that sampling with HAC is very helpful in improving the performance of the classifier.

Finally, to find out whether HAC_EasyEnsemble retains its advantage at different ratios of malicious domains, we conduct 6 further experiments comparing its detection performance. In these 6 experiments, the number of benign domains in the training set is 6400, and the number of malicious domains is 700, 1000, 1500, 2000, 4000, and 5000, respectively. Figure 4 shows the experimental results.

From Figure 4, we can see that the detection performance of HAC_EasyEnsemble is almost in line with the previous 3000:6400 setting (see Figure 3 and Table 2) for any ratio greater than 1000:6400 (about 16%). So, if properly used, HAC_EasyEnsemble can detect malicious domains by learning from imbalanced DNS traffic data.

6. Conclusions

In this paper, we proposed an improved version of EasyEnsemble for detecting malicious domains, named HAC_EasyEnsemble, which can effectively deal with the within-class imbalance problem in tandem with the between-class imbalance problem, whereas EasyEnsemble can only deal with the between-class imbalance problem. The key idea of this improvement is to incorporate HAC into the undersampling of EasyEnsemble, and three typical cluster proximity measures, namely complete link, group average, and Ward's method, are compared by experiments. Moreover, to profile malicious domains, we construct 12 static lexical features and 4 dynamic DNS resolving features based on passive DNS data from DNSDB. The comparative experiments show that HAC_EasyEnsemble is superior for malicious domain detection oriented to imbalanced DNS traffic. It is worth emphasizing that this method is particularly suitable for tasks in which enough malicious domains cannot be obtained in a limited amount of time.

We believe that HAC_EasyEnsemble is an effective method that can help cope with cybercrime. As future work, we plan to construct more discriminative features to profile malicious domains and to further enhance the performance of the HAC_EasyEnsemble algorithm.

Data Availability

The authors declare that the data used in our manuscript can be accessed by the following method. The benign domain names are downloaded from alexa.com and the malicious domain names are downloaded from cybercrime-tracker.net, malwaredomains.com, hosts-file.net, etc. The DNS resolving records of these domains are then obtained through the DNSDB API, which involves an additional charge.

Disclosure

The funders (Dr. Xue and Dr. Shan) were involved in the manuscript editing, approval, or decision to publish.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was financially supported by National Key R&D Program of China (2016YFB0801304) and Scientific Research Project of Beijing Institute of Technology (2017CX02029).