Abstract

This paper investigates the performance enhancement of base classifiers within the AdaBoost framework applied to medical datasets. Adaptive boosting (AdaBoost), being an instance of boosting, combines other classifiers to enhance their performance. We conducted a comprehensive experiment to assess the efficacy of twelve base classifiers with the AdaBoost framework, namely, Bayes network, decision stump, ZeroR, decision tree, Naïve Bayes, J-48, voted perceptron, random forest, bagging, random tree, stacking, and AdaBoost itself. The experiments are carried out on five datasets from the medical domain based on various types of cancers, i.e., global cancer map (GCM), lymphoma-I, lymphoma-II, leukaemia, and embryonal tumours. The evaluation focuses on the accuracy, precision, and efficiency of the base classifiers in the AdaBoost framework. The results show that the performance of Naïve Bayes, Bayes network, and voted perceptron is highly improved compared to the rest of the base classifiers, attaining accuracies as high as 94.74%, 97.78%, and 97.78%, respectively. The results also show that in most cases, the base classifiers perform better with AdaBoost compared to their individual performance, e.g., for voted perceptron, the accuracy is improved by up to 13.34%, and for bagging, by up to 7%. This research aims to identify base classifiers with optimal boosting capabilities within the AdaBoost framework for medical datasets. The significance of these results is that they provide insight into the performance of base classifiers when used in the boosting framework to enhance classification performance in scenarios where individual classifiers do not perform up to the mark.

1. Introduction

Boosting in machine learning (ML) refers to the ability of an ML technique to boost the functionality of other classifiers when combined with them [1]. Boosting is a very effective technique for solving bi-class classification problems [2]. The boosting technique enhances the functioning and improves the accuracy of any given learning algorithm by adding new modules. This procedure forms a new classifier, an ensemble of both classifiers, with improved accuracy on a given training set [2]. For this reason, the name boosting is assigned to such techniques, as they boost the performance of other classifiers.

An example of a boosting technique, namely, adaptive boosting (AdaBoost), has a mechanism for training on the data set based on allocating weights. Uniform weights are initially assigned to the training instances, and the probability of data selection is based on these weights. Once a training instance is classified accurately by one classifier, the chance of that instance being utilized by the successive classifier is reduced [3]. Therefore, the selection of the training set is based on the classifier trained on it and the assigned weights. AdaBoost trains the classifier on beneficial, informative, and complicated patterns by iteratively running its algorithm. After each iteration, the training error is calculated, and the weights are allocated to the classifier.

Schapire and Freund proposed the first solid boosting algorithm, which forms the basis of AdaBoost's methodology [4, 5]. Viola and Jones proposed an updated version of the AdaBoost technique by taking weak classifiers built on weak features [6, 7]. Therefore, the Viola and Jones version of the AdaBoost technique is a repetitive process combining multiple weak classifiers that approximate the base classifiers [8]. The AdaBoost classifier is mathematically expressed in the following equation:

H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)),

where H(x) represents a linear classifier, i.e., a weighted linear combination of all the constituent classifiers; h_t(x) represents the t-th base classifier; and α_t is the weight assigned to it. AdaBoost has a large margin and generalization capability, making its performance better than other boosting techniques. There are some limitations of the AdaBoost technique, such as the considerations taken for each constituent classifier and the extent to which they influence the generalization performance of the constituent classifiers [9]. AdaBoost also faces an accuracy-diversity dilemma, which means that attaining higher accuracy with the component classifiers reduces the chance for them to disagree. Maintaining a balanced trade-off between accuracy and diversity is required to accomplish acceptable generalization performance.
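The iterative weight-update scheme described above can be sketched in a few lines. This is a minimal, illustrative discrete AdaBoost on a made-up one-dimensional dataset, not the implementation used in the paper; the stump search, the threshold grid, and the toy data are all assumptions for demonstration only.

```python
import math

# Toy 1-D dataset: points and ±1 labels (hypothetical illustration)
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [+1, +1, +1, -1, -1, -1]

def stump_predict(threshold, polarity, x):
    # A decision stump: predict +1 on one side of the threshold
    return polarity if x < threshold else -polarity

def train_adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n          # uniform initial weights
    ensemble = []              # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        # Pick the stump minimising the weighted training error
        best = None
        for t in [0.5 + i for i in range(7)]:
            for pol in (+1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if stump_predict(t, pol, xi) != yi)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = max(err, 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # Reweight: misclassified points gain weight, correct ones lose it
        w = [wi * math.exp(-alpha * yi * stump_predict(t, pol, xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    # H(x) = sign(sum_t alpha_t * h_t(x))
    s = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return +1 if s >= 0 else -1

model = train_adaboost(X, y)
print([predict(model, xi) for xi in X])  # separates the toy data perfectly
```

The key steps match the equation: each round contributes a weight α_t = ½ ln((1−ε_t)/ε_t), and the final decision is the sign of the weighted vote.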

The rest of the paper is organized as follows: Section 2 provides details of the related materials and methods. Section 3 gives details about the experimental setup. Similarly, Section 4 provides an insight into the results of the proposed algorithms and their comparisons with recent algorithms. Lastly, Section 5 concludes the paper and provides insight into future enhancements.

2. Materials and Methods

As mentioned earlier, we conducted 60 experiments on five medical datasets with twelve base classifiers. These results are prepared to analyze the base classifiers with the AdaBoost framework based on their percentage accuracy, error, precision, F-measure, and recall. Initially, the datasets are extracted and preprocessed to be employed by the machine learning algorithms. The data is cleaned by removing redundant information and empty cells, replacing missing records, and normalizing the data into a uniform format. Essential features are extracted from the data and used to train and evaluate the ML techniques. The performance of the ML techniques is evaluated, and the most efficient ones are selected for the allocation of suitable base classifiers in a given scenario. This process ultimately selects the optimal base classifier for a given problem. The proposed methodology is depicted in Figure 1 as a block diagram.

This diagram provides an overview of our proposed algorithm for the performance evaluation of base classifiers in the AdaBoost Framework for given medical datasets. The algorithms are monitored to attain the best accuracy. The model attaining the best accuracy is then deployed to classify the tumors for best performance in a given scenario. The details for base classifiers and the setups used for these base classifiers are given in Section 2.1.

2.1. Base Classifiers

The performance of AdaBoost with the base classifiers is evaluated in this research with twelve base classifiers in combination with AdaBoost in almost sixty experiments. The base classifiers are chosen from almost all the major categories of classifiers, such as Bayes, Functions, Rules, Networks, Trees, and Meta Functions. In this way, all the major classifier groups have been evaluated, and therefore, these experiments comprehensively illustrate the role of base classifiers in the AdaBoost framework. The details of the base classifiers are provided in the following sections.

2.1.1. Naïve Bayes

Naive Bayes (NB) classification is a supervised learning technique used as a statistical method for classification. NB has good performance in classification and pattern recognition [10]. NB acquired its name from the well-known theorem of Thomas Bayes and categorizes the training set by opting for the class with the closest relation to the dataset. The naive assumption is that the existence or nonexistence of one input feature has no connection with that of another [11]. All the attributes are supposed to contribute independently and equally to the output probability. NB uses a maximum-likelihood method for parameter estimation [12]. This NB assumption is mathematically illustrated in the following equation:

f(E) = argmax_C P(C) ∏_{i=1}^{n} P(a_i | C),

where C ranges over the classification values of the classes, P is the probability, E is an example to be classified, f(E) is the NB classifier, and a_1, …, a_n are the attribute values of E. Each attribute node has only one parent, i.e., the primary parent node C in NB. Attributes do not depend on other attributes in NB [13]. Hence, knowing the class variable is enough in NB to conduct the classification procedure. Moreover, all the attributes are statistically independent and equally essential in NB. The NB classifier requires only moderate training data to approximate the mean, variance, and other classification parameters [14]. Depending upon the characteristics of the probability model in supervised learning settings, the class with the most significant posterior value leads the hypothesis.
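The argmax rule above can be illustrated with a tiny categorical NB classifier. The weather-style training rows, the add-one (Laplace) smoothing, and the query instance are all assumptions made purely for this sketch; they do not come from the paper's datasets.

```python
from collections import Counter, defaultdict

# Tiny categorical training set (hypothetical): (features) -> class
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
]

classes = Counter(c for _, c in data)   # class priors P(C)
n = len(data)

# Per-class, per-attribute-position value counts
counts = defaultdict(Counter)
for feats, c in data:
    for i, v in enumerate(feats):
        counts[(c, i)][v] += 1

def posterior(feats, c):
    # P(C) * prod_i P(a_i | C), with add-one smoothing
    p = classes[c] / n
    for i, v in enumerate(feats):
        seen = counts[(c, i)]
        p *= (seen[v] + 1) / (sum(seen.values()) + len(seen) + 1)
    return p

def classify(feats):
    # f(E) = argmax_C P(C) * prod_i P(a_i | C)
    return max(classes, key=lambda c: posterior(feats, c))

print(classify(("rainy", "hot")))  # -> "yes"
```

Each attribute contributes an independent conditional probability factor, which is exactly the NB assumption stated in the equation.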

2.1.2. Voted Perceptron

Voted perceptron (VP) is a linear classifier used in supervised classification scenarios. The underlying perceptron algorithm was proposed by Frank Rosenblatt in 1957 [15]; the voted variant extends it for better generalization. VP works with the online learning technique by processing the training set one instance at a time and making predictions based on linear predictor functions. The weight vector is initialized to zero and acts as the parameter vector. The VP algorithm stores the parameter vectors generated while passing over the training set [16]. Errors, for example, are handled with modifications to the parameter vector on the fly. Hence, the VP technique uses f(x) to map an input vector x to a single-valued output y. The mathematical expression for the binary classifier VP is given in the following equation:

f(x) = sign(w · x + b),

where w is the learned weight vector and b is the bias.

VP accumulates more information in the training phase, and improved predictions are generated with highly structured information on the test data. The VP algorithm uses the batch training mechanism for learning purposes: it runs iteratively over the training set until it locates a prediction vector [17, 18]. The prediction vector that learns the training set accurately is then used to estimate the labels of the test set.
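A minimal sketch of the voted perceptron follows. Every weight vector produced during training is stored together with its survival count (the number of examples it classified correctly before the next mistake), and prediction is a vote among the stored vectors weighted by those counts. The 2-D toy points and the fixed epoch count are assumptions for illustration.

```python
# Minimal voted-perceptron sketch (assumption: 2-D points, ±1 labels)
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_voted(X, y, epochs=10):
    w = [0.0, 0.0]
    survivors = []          # list of (weight_vector, survival_count)
    c = 0                   # survival count of the current vector
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:      # mistake: store old vector, update
                survivors.append((w[:], c))
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                c = 1
            else:
                c += 1
    survivors.append((w[:], c))
    return survivors

def predict_voted(survivors, x):
    # Each stored vector votes with weight equal to its survival time
    s = sum(c * (1 if dot(w, x) >= 0 else -1) for w, c in survivors)
    return 1 if s >= 0 else -1

X = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
model = train_voted(X, y)
print([predict_voted(model, xi) for xi in X])  # -> [1, 1, -1, -1]
```

Long-surviving vectors dominate the vote, which is what gives the voted variant its improved stability over the plain perceptron.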

2.1.3. Bayes Network

Bayes network (BN) is a simple network structure comprising nodes and edges, which was proposed in 1988 by Pearl [19]. The network assumes that every attribute, i.e., every leaf, is independent of every other attribute given the class variable [20]. Random variables represent the BN nodes, including variables, unknown parameters, observable quantities, or hypotheses. Disconnected nodes characterize conditionally independent variables, and the edges represent conditional dependencies. Additional edges are formed between the attributes in BN to capture correlations. All possible edge combinations that form the whole network must be searched by BN. The joint probability distribution of the random variables is used for training, as shown mathematically in the following equation:

P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i)),

where X_1, …, X_n are the random variables of the network and parents(X_i) denotes the parent nodes of X_i.

BN takes its shape in various kinds of acyclic networks depending upon the problem state for efficiently searching the whole network space. BN comprises the two-stage learning process of natural division, giving it a dual nature [21]. The first stage learns a network structure, and the second learns probability tables. BN has an influence diagram structure that represents and resolves the decision problems [22, 23]. BN forms sequences of variables known as dynamic BN in speech signals or protein sequence applications.

2.1.4. Decision Stump

A decision stump (DS) is a single-layered decision tree (DT), which makes it comparatively easy to build. Instances in DS are classified by matching them with feature values [24, 25]. DS has a finite number of splits on the attributes, so only one attribute is necessary for its network. In DS classifiers, the instance feature to be classified is represented by a node, and a node value represents the corresponding branch [26]. The learning model of DS is based on a single internal root node. The architecture of the network is such that the root is immediately connected to the terminal nodes that make up its leaves; unlike a full DT, these leaves do not expand further into a deeper tree structure. Hence, DS makes decisions based on a single input feature, and such classifiers are also known as one-rules [27].

DS is often used as a module (called a "weak learner") in ML ensemble methods, such as bagging and boosting [27]. For nominal features, a DS may be constructed with a leaf for each possible feature value, or with two leaves, one for a chosen category and another for all the remaining categories. These scenarios are analogous to binary features; a missing feature value may be treated as yet another category. For numeric features, threshold levels are used to classify instances into two different leaves, depending on whether the value is above or below the threshold, or into multiple leaves with multiple threshold levels.
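The one-rule idea for a numeric feature can be sketched directly: scan candidate thresholds (midpoints between sorted values) and keep the one with the fewest training errors. The numbers below are made up for illustration.

```python
# One-rule stump: pick the single threshold on one numeric feature
# that minimises training error (toy data, purely illustrative)
xs = [1.2, 2.3, 3.1, 4.8, 5.5, 6.0]
ys = [0, 0, 0, 1, 1, 1]

def best_stump(xs, ys):
    best = (None, len(xs) + 1)        # (threshold, error count)
    candidates = sorted(set(xs))
    for i in range(len(candidates) - 1):
        t = (candidates[i] + candidates[i + 1]) / 2  # midpoint threshold
        # Rule: predict 1 iff x > t; count disagreements with the labels
        errs = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errs < best[1]:
            best = (t, errs)
    return best

threshold, errors = best_stump(xs, ys)
print(threshold, errors)  # -> 3.95 0
```

On its own such a rule is weak, which is precisely why it is a popular base learner for boosting.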

2.1.5. Random Tree

A random tree (RT) classifier was introduced by Cutler and Breiman to address regression and classification problems [28]. RT has a network of tree predictors, known as an ensemble, whose decision splits efficiently model the data on attributes. The RT algorithm requires the input feature space, and each tree in the forest classifies the input. Ultimately, the class label that receives the most votes is produced, and decisions are taken based on the weights assigned to the nodes.

RT is a very efficient decision algorithm that performs well in terms of accuracy [29]. In the case of noisy data sets, however, this classifier is observed to have inadequate performance. If a significant portion of data is missing, RT approximates the missing data with its inbuilt technique and retains accuracy.

2.1.6. Boosting and Bagging

Breiman developed the bagging technique as a procedure to enhance the performance of classification under ML methods [30]. The name bagging is derived from the term "bootstrap aggregating", as bagging aggregates classifiers trained on bootstrap samples. Bagging generates the individual classifiers of its ensemble, each based on a random redistribution of the training dataset [31]. To create a classifier's training set, the data is randomly sampled with replacement, and the resultant set is equal in size to the original training set. Significant variations are reflected in the model from small changes in the training data, which means the base classifier is an unstable predictor. Bagging amalgamates multiple hypotheses with large errors and generates a classifier with reduced error on the training set.

Boosting is a collection of methods that primarily aims to produce and combine a series of classifiers [32]. It combines hypotheses generated by related learning methods that invoke various distributions of the training set. Boosting attains improved recognition for unstable classifiers and smooths over discontinuities in a similar way. A boosting classifier is comparatively more prolific and efficient and has a straightforward ensemble learning approach [33]. Bagging and boosting both train their methods on different data sets established with the bootstrap, which resamples the original data. Bagging and boosting algorithms combine base classifiers whose outputs are assessed to determine the ultimate output.
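The bootstrap-and-vote idea behind bagging can be sketched as follows. The base learner here is a trivially simple threshold rule invented for the example; the data, seed, and ensemble size are likewise assumptions, not the paper's configuration.

```python
import random

# Bagging sketch: bootstrap-resample the training set (same size,
# sampling WITH replacement), fit one simple model per replicate,
# then combine them by majority vote.
random.seed(0)
data = [(x, int(x > 3.5)) for x in [1, 2, 3, 4, 5, 6]]  # toy labelled points

def train_threshold(sample):
    # Crude base learner: decision boundary halfway between the classes
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        return 3.5  # fallback when a bootstrap sample contains one class
    return (min(pos) + max(neg)) / 2

models = []
for _ in range(11):
    sample = [random.choice(data) for _ in data]   # bootstrap replicate
    models.append(train_threshold(sample))

def bagged_predict(x):
    votes = sum(1 if x > t else 0 for t in models)
    return int(votes > len(models) / 2)

print([bagged_predict(x) for x, _ in data])
```

Because each replicate sees a slightly different sample, the individual thresholds vary, and the vote averages away that instability, which is bagging's main benefit for unstable predictors.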

2.1.7. Random Forest

Random forest (RF) is one of the finest classification techniques for massive data [34]. A large number of input variables does not degrade its performance. Compared to other techniques, it deals with huge amounts of data without much adverse effect on accuracy. The RF technique efficiently approximates missing data and, because of this, preserves accuracy even with a large amount of missing data. RF addresses the issues with unbalanced data and balances the error across the class populations [35]. This classifier has great potential to resolve the problems associated with vague data sets, which helps deal with unsupervised clustering, outlier detection, and data views. The RF method calculates the proximities between pairs of cases, which aid in locating outliers in clustering. The RF technique also offers an experimental method for detecting variable interactions in data [35].

2.1.8. ZeroR

ZeroR is based on a rule that works as a straightforward classification method targeting the prediction of the majority class. ZeroR ignores all predictors and bases its decisions on the majority of occurrences of the class values [36]. The predictive power of ZeroR is negligible, but it has a significant role as a standard baseline for other classifiers. ZeroR maintains a frequency table for the target class and selects the class with the majority frequency. After identifying the most common class value, ZeroR uses it for every classification. ZeroR is often employed as a baseline for other ML algorithms to evaluate their results, since it returns a value for each instance.
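ZeroR is simple enough to state completely in code: build the class frequency table, pick the majority, and predict it regardless of the attributes. The ALL/AML toy labels below are an assumption, borrowed loosely from the leukaemia class names used later in the paper.

```python
from collections import Counter

# ZeroR baseline: ignore all predictors and always predict the
# majority class observed in the training labels.
def zero_r(train_labels):
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda _instance: majority  # constant classifier

train_labels = ["ALL", "AML", "ALL", "ALL", "AML"]  # toy label column
classify = zero_r(train_labels)
print(classify({"gene_1": 0.3}))  # -> "ALL", whatever the attributes are
```

Any classifier that cannot beat this constant prediction has learned nothing useful, which is exactly why ZeroR serves as the standard baseline.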

2.1.9. J-48

J-48 is based on Quinlan's C4.5 DT algorithm, one of the most frequently used DT algorithms. The J-48 technique follows the common DT approach of dividing the data into small subsets based on a decision criterion [37]. Leaves in J-48 represent similar subclasses, and potential information gains are associated with the attributes based on the test. Instances are categorized into the class or leaf they are associated with, and each attribute is tested to find the one that provides the highest gain on the data. Eventually, the selection parameter is used to select the best-suited attribute [38]. There are some limitations to the J-48 algorithm, such as empty branches, over-fitting, and insignificant-branch problems, which must be resolved and handled well when working with J-48. Some solutions to these problems have been proposed, such as adding RT and Kendall's rank correlation (KRC) to J-48, besides many others that target the mentioned issues and improve the overall performance [37, 38].

3. Experimental Setup

The experimental setup for this research is based on twelve base classifiers in combination with AdaBoost and five datasets. These data sets are mostly taken from medical problems such as various types of cancer. The datasets have many attributes and instances; as in medical problems, a large amount of information is required for making decisions. A brief description of the data sets is given in the following sections.

3.1. Data Collection and Preprocessing

The data is preprocessed to prepare the training and testing data sets, including significant steps such as removing redundant data, data discretization and feature construction, feature selection, retrieval of missing records, separation of the testing and training sets, and data normalization. In the data preprocessing module, the data is analyzed for redundant and missing data and divided into training and testing sets.
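Three of the preprocessing steps above (removing redundant rows, replacing missing records, and normalization) can be sketched as plain Python. The toy rows and the use of None to mark missing cells are assumptions; the paper does not specify its exact preprocessing code.

```python
# Minimal preprocessing sketch: drop duplicate rows, mean-impute
# missing cells, then min-max normalize each column into [0, 1].
rows = [
    [2.0, None, 10.0],
    [4.0, 8.0, 30.0],
    [2.0, None, 10.0],   # duplicate of the first row
    [6.0, 4.0, 20.0],
]

# 1) remove verbatim duplicate rows, preserving order
seen, unique = set(), []
for r in rows:
    key = tuple(r)
    if key not in seen:
        seen.add(key)
        unique.append(r)

# 2) replace missing cells with the column mean of the observed values
n_cols = len(unique[0])
for c in range(n_cols):
    vals = [r[c] for r in unique if r[c] is not None]
    mean = sum(vals) / len(vals)
    for r in unique:
        if r[c] is None:
            r[c] = mean

# 3) min-max normalization per column (uniform [0, 1] format)
for c in range(n_cols):
    lo = min(r[c] for r in unique)
    hi = max(r[c] for r in unique)
    for r in unique:
        r[c] = (r[c] - lo) / (hi - lo) if hi > lo else 0.0

print(unique)  # -> [[0.0, 0.5, 0.0], [0.5, 1.0, 1.0], [1.0, 0.0, 0.5]]
```

After these steps every attribute lies on the same scale, which also justifies the later use of plain Euclidean distance in Section 3.2.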

3.1.1. Global Cancer Map

The global cancer map (GCM) is a dataset for multiclass cancer (MCC) diagnosis and is assessed using the technique of tumour gene expression signatures (TGES) [39]. There are about 16,064 attributes, each of which has 144 instances. The classifiers verify the output for fourteen classes: breast, colorectal, prostate, uterus-adeno, lung, renal, lymphoma, melanoma, mesothelioma, pancreas, bladder, leukaemia, central nervous system (CNS), and ovary. The decision is taken to classify the datasets into any of the aforementioned classes. Figure 2 shows the output distribution among these classes for a randomly chosen attribute.

3.1.2. Lymphoma-I

Lymphoma is a cancer that attacks the lymphocytes, a constituent part of the immune system. Typically, lymphoma is an undetectable solid tumour of lymphoid cells that affects the body's immune system. Detecting lymphoma tumours is challenging and requires special methods such as gene expression profiling (GEP) to identify these tumours. The lymphoma-I dataset used in this research has two classes of diffuse large B-cell lymphoma (DLBCL), labelled ACL and GCL, which are classified for medical diagnosis [39]. By GEP, distinct types of DLBCL are diagnosed. Numerous attributes are utilized to evaluate the results, essential for detecting lymphoma. There are 4,027 attributes, each of which has 45 instances. The output distribution between these classes for a randomly selected attribute is shown in Figure 3.

3.1.3. Lymphoma-II

In the lymphoma-II data set, the lymphoma is monitored for nine classes instead of two, i.e., NIL, DLBCL, ABB, GCB, RAT, RBB, FL, TCL, and CLL [39]. These nine classes are attributed to the GEP analysis. The GEP technique is used to classify distinct types of DLBCL. The lymphoma-II data set is employed to monitor various attributes for testing the results, which is essential for detecting lymphoma. There are 4,027 attributes, each of which has 96 instances. The data has been taken for its medical diagnosis by classifying it into nine classes; the output distribution for a randomly selected attribute is shown in Figure 4.

3.1.4. Leukaemia

Leukaemia is a bone marrow or blood cancer, while in general, it is also attributed to a wide range of diseases. Leukaemia is identified by an abnormal rise in immature white blood cells (WBCs), also known as blasts. The leukaemia dataset is used to monitor the molecular classification of cancer (MCC), which includes methods such as class discovery and prediction by GEP [39]. There are 7,130 attributes, each of which has 38 instances. The data has been taken for its medical diagnosis by classifying it into two classes, i.e., acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML). The output distribution between these classes for a randomly selected attribute is shown in Figure 5.

3.1.5. Embryonal Tumours

Embryonal tumour (ET) data is taken from the results of CNS for ET. It is monitored for the results of the GEP for the prediction of CNS ET. This data set has 7,130 attributes, each of which has 60 instances. The data is classified into positive results, denoted by one, and negative results, denoted by 0. ET data is taken for medical diagnosis based on genes [39]. The output distribution between these classes for a randomly selected attribute is shown in Figure 6.

3.2. Environment Setup for Machine Learning Classifiers

Certain parameters are considered as input features and are varied for different scenarios. In order to validate the model, 10-fold cross-validation is utilized for all machine learning methods. To increase efficiency, the input attributes most highly correlated with the output are chosen. Additionally, feature selection methods are employed to eliminate attributes that are not relevant to the output variable. To minimize the complexity of the model, a ridge regularization technique is applied, which prevents any coefficient from reaching an excessive value by penalizing the sum of the squares of the learned coefficients. The distance between data points is calculated using Euclidean distance, as the data is on the same scale. The data is searched and stored using a linear nearest-neighbour (NN) search method, and no windowing is required; the nearest neighbour is located through a linear search mechanism.
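The 10-fold cross-validation mentioned above splits the data into ten folds, each serving once as the test set while the remaining nine form the training set. A minimal index-level sketch (the 45-instance count echoes the lymphoma-I dataset; the fold-construction details are an assumption, as the paper does not specify them):

```python
# Sketch of k-fold cross-validation index generation
def k_fold_indices(n_samples, k=10):
    folds = []
    # Distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(45, k=10)           # e.g., a 45-instance dataset
print(len(folds))                          # 10 train/test splits
print(sum(len(test) for _, test in folds)) # every instance tested exactly once
```

Averaging a classifier's score over the ten test folds gives the cross-validated estimate reported in the experiments.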

A learning rate of 0.001 is established, and the regularization parameter is adjusted based on the number of epochs: it is reduced as the number of epochs increases to 1,012. Seventy percent of the total data is used for training, and thirty percent is used for testing. During the training phase, the RMSE generally decreased as the number of nodes in the hidden layer increased, and then began to increase once the model started to over-fit. An early stopping criterion is used to avoid over-fitting. Various internal parameters are chosen by trial and error. Excessive use of input variables usually has a negative influence because it decreases the processing speed and adds redundancy across the different variables.

3.3. Performance Evaluation Metrics

The results for our research are collected from 60 experiments conducted on five medical datasets with twelve base classifiers, as mentioned in earlier sections. Based on their percentage accuracy, error, precision, F-measure, and recall, these results are prepared to analyze the base classifiers with the AdaBoost framework. The percentage accuracy of each base classifier is reported in terms of correctly classified instances against incorrectly classified instances for all twelve base classifiers. The precision of a base classifier is computed as the proportion of examples predicted as class x that truly belong to class x. Recall is calculated as the proportion of examples actually in class x that are classified as class x. Similarly, the F-measure is computed from precision and recall. In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), these performance evaluators are given by the following equations:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F-measure = 2 × (Precision × Recall) / (Precision + Recall).
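The four metrics can be computed directly from the confusion-matrix counts. The toy label and prediction vectors below are invented solely to exercise the formulas.

```python
# Evaluation metrics from confusion-matrix counts (toy predictions)
def confusion_counts(actual, predicted, positive=1):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_measure)  # all 0.75 on this toy example
```

For the multiclass datasets (GCM, lymphoma-II), the same formulas are applied per class and then averaged.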

4. Results and Discussions

The results are shown in Table 1, where the rows represent various classification algorithms and the percentage of correctly and incorrectly classified examples, while the columns represent data sets. The results show that the performance of Naïve Bayes, Bayes network, voted perceptron, and bagging as base classifiers in AdaBoost is better than the rest. These base classifiers outperformed the others in AdaBoost, attaining accuracies of 94.74%, 97.78%, 97.78%, and 93.33%, respectively, while their individual accuracies are lower, i.e., 84.44% for voted perceptron and 86.67% for the bagging technique. The results also show that in most cases, the base classifiers perform better with AdaBoost compared to their individual performance, i.e., for voted perceptron, the accuracy is improved by up to 13.34%, and for bagging, it is improved by up to 7%. Table 2 shows the precision, recall, and F-measure of the base classifiers in AdaBoost. Table 2 highlights the best precision values, i.e., 97.9%, achieved by VP and BN. The finest recall values, highlighted in Table 2, are 97.8%, achieved by VP and BN. The best values for F-measure, also highlighted in Table 2, are 97.8%, achieved by VP and BN. The precision, recall, and F-measure of each base classifier are evaluated through comparisons among the base classifiers so that their role in the AdaBoost framework can be examined.
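The central comparison of this section, a weak classifier on its own versus the same classifier inside AdaBoost, can be reproduced in miniature with scikit-learn. This is a hedged stand-in, not the paper's WEKA setup: the moons dataset, the stump base learner, and all parameter values are assumptions chosen for illustration.

```python
# Weak base classifier alone vs. the same classifier boosted by AdaBoost
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# A lone decision stump underfits this nonlinear dataset
stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)

# AdaBoost's default base estimator is a decision stump, so this
# boosts the very classifier trained above
boosted = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

acc_stump = stump.score(X_te, y_te)
acc_boost = boosted.score(X_te, y_te)
print(f"stump: {acc_stump:.3f}  boosted: {acc_boost:.3f}")
```

The same pattern, boosted accuracy exceeding the base classifier's individual accuracy, mirrors the gains reported in Table 1 for voted perceptron and bagging.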

Using attributes to analyse the behaviour of base classifiers plays a prognostic role in identifying the best combination of base classifiers with AdaBoost. Hence, this performance evaluation provides an analytical model for choosing a base classifier in a given problem domain, making it easy to select an AdaBoost environment with a suitable base classifier. As a future enhancement of the proposed research, these observations can be applied to applications of AdaBoost, such as Viola-Jones object detection, to improve performance with suitable base classifiers.

5. Conclusions and Future Work

Adaptive boosting (AdaBoost), being an instance of boosting, combines other classifiers to enhance their performance. This boosting functionality of AdaBoost is highlighted in this work by monitoring the performance of several base classifiers with AdaBoost. Sixty experiments were carried out to observe the responses of twelve base classifiers on five significant medical data sets. The results of these experiments show that the AdaBoost framework attains better results with some base classifiers (Naïve Bayes, Bayes network, and voted perceptron) than with others (J-48, bagging, decision stump, random forest, and random tree). The reason is that base classifiers have a unique role in the AdaBoost classification. This research aims to track the unique role of the base classifiers in the AdaBoost framework and identify the classifiers with the best performance for the given medical dataset. The performance of the base classifiers is monitored in terms of their accuracy, precision, recall, and F-measure. The results show that the performance of Naïve Bayes, Bayes network, voted perceptron, and bagging as base classifiers in AdaBoost is better than the rest of the base classifiers. These base classifiers outperformed the others in AdaBoost, attaining accuracies of 94.74%, 97.78%, 97.78%, and 93.33%, respectively, while their individual accuracies are lower, i.e., 84.44% for voted perceptron and 86.67% for the bagging technique. The results also show that in most cases, the base classifiers perform much better with AdaBoost compared to their individual performance, i.e., for voted perceptron, the accuracy is improved by up to 13.34%, and for bagging, it is improved by up to 7%. One of the limitations of this research is that the proposed algorithm is applied only to datasets that belong to a similar category, i.e., cancer data.
Hence, as a future extension of this work, these experiments will be applied to other types of medical datasets, i.e., brain tumours, skin cancers, ECGs, medical imaging, clinical trials, Oasis, and CT datasets, and enhanced by taking some other applications of AdaBoost such as Viola-Jones object detection to improve its performance with the base classifiers.

Data Availability

The dataset could be made available on request to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.