Abstract

There are a lot of bacteria in the environment, and Gram-positive bacteria are the most common ones. Some Gram-positive bacteria are very harmful to the human body, so it is significant to predict Gram-positive bacterial protein subcellular location. And identification of Gram-positive bacterial protein subcellular location is important for developing effective drugs. In this paper, a new Gram-positive bacterial protein subcellular location dataset was established. The amino acid composition, the gene ontology annotation information, the hydropathy dipeptide composition information, the amino acid dipeptide composition information, and the autocovariance average chemical shift information were selected as characteristic parameters, then these parameters were combined. The locations of Gram-positive bacterial proteins were predicted by the Support Vector Machine (SVM) algorithm, and the overall accuracy (OA) reached 86.1% under the Jackknife test. The overall accuracy (OA) in our predictive model was higher than those in existing methods. This improved method may be helpful for protein function prediction.

1. Introduction

The cell is the most basic unit of life, and it contains many protein molecules. When a protein is in the right subcellular position, it can perform the right function [1]. So, studying protein subcellular location can help us better understand the biological function of proteins at the cellular level. In the postgenetic era, the amount of biological information has grown rapidly and the traditional experimental method became time-consuming and exhausting. So, the prediction of protein subcellular location based on the machine method has gradually become a hot research topic in bioinformatics [27].

Gram-positive bacteria are those that retain their original blue-violet color after being stained by Gram staining. Gram-positive bacteria exist widely in the human body, and they are harmful to the environment and human health. So, it is important to study the protein subcellular location of Gram-positive bacteria. There are a few researches on the protein subcellular location of Gram-positive bacteria. In 2007, Shen and Chou [8] established a Gram-positive bacteria dataset of five categories. They used the GO-PseAA discrete model and the Fusion OET-KNN method, and the overall success accuracy was 82.7% with the Jackknife test. In 2009, Shen and Chou [9] rebuilt the Gram-positive bacteria dataset with four categories: cell wall, cell membrane, cytoplasm, and extracell. The feature of gene ontology information and functional domain information were extracted, and the total success accuracy reached 82.2% with the Jackknife test. In 2012, the total success accuracy was 85.9% for the GP25 dataset constructed by Hu et al. [10]. In the 9th international conference on electrical and computer engineering, Rahman et al. [11] proposed two hybrid features, AACPPM and PAACPPM, which combined PPM with AAC and PseAAC, respectively. The accuracy of both AACPPM and PAACPPM were 73.2%. In 2017, Xiao et al. [12] took advantage of the dataset established by Shen and Chou in 2009 and applied the new algorithm, and a better result was obtained. In 2018, Xiao et al. [13] developed a new bias-reducing predictor. The results showed that this predictor was very helpful in predicting the training dataset.

In this paper, we reconstructed the Gram-positive bacterial protein subcellular location dataset. The amino acid composition information [14], the amino acid dipeptide composition information [15, 16], the gene ontology [17] annotation information, the hydropathy dipeptide [18] composition information, and the autocovariance average chemical shift [19] information were selected as characteristic parameters, then these parameters were combined. Finally, the overall accuracy in the Jackknife test was 86.1% by using the combined parameter AAC+DC+hpDC for the Support Vector Machine.

2. Materials and Methods

2.1. Dataset

In order to collect as much desired information as possible while ensuring a high quality for the dataset, the protein sequences were collected from the Swiss-Prot [20] database at http://www.uniprot.org/. The dataset was established in strict accordance with the following criteria: (1) We conducted a search for all protein sequences with “actinobacteria” and “firmicutes” in the OC firmicutes from the UniProtKB/Swiss-Prot database. (2) Different locations of the protein in the “Subcellular Location” annotation were selected, and the ambiguous or uncertain terms, such as “By similarity” and “Probably” were removed. (3) The protein sequence of 50 aa-3000 aa in the “Sequence” information were selected. (4) Sequences annotated by two or more locations were not included. (5) Sequences annotated with “fragment,” “B,” “X,” and “Z” were excluded. (6) To avoid any homology bias, the software CD-HIT [21] was used to winnow those sequences which have ≥25% sequence identity to any other sequence in the same subcellular location.

After completing the above steps, we obtained 700 Gram-positive proteins, and the specific distribution is shown in Table 1.

2.2. Amino Acid Composition (AAC)

The sequence information of proteins is the most basic feature information of all characteristic parameters [22]. The protein sequence consists of 20 amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y). The feature of the occurrence frequency of the 20 amino acids in the protein is important. So, the occurrence frequency of the 20 amino acids in the protein sequence can be selected as one of the characteristic parameters. The amino acid composition can be expressed as a 20-dimensional feature vector:

where , is the occurrence number of the 20 native amino acids of the protein, is the length of the protein, and is the transpose operator.

2.3. Dipeptide Composition (DC)

One of the main drawbacks of the amino acid composition is that it only emphasizes on overall sequence information but ignores the sequence order information. In order to make full use of the sequence information of amino acids, we proposed using the amino acid dipeptide composition information. The amino acid dipeptide information is an improvement based on the AAC parameter, and it denotes the frequency of two adjacent amino acids in a 400-dimensional vector [2325]. The dipeptide composition can be formulated as follows:

where is the absolute occurrence frequencies of the 400 dipeptides and calculated by where is the occurrence number of the 400 dipeptides of the protein and is the length of the protein.

2.4. Gene Ontology (GO)

Gene ontology is a directed acyclic graph ontology widely used in bioinformatics, and gene ontology consists of three parts: biological process (P), molecular function (F), and cellular component (C). In the gene ontology database, we found that each AC number has a corresponding GO identification number: XXXXXXX. In this paper, since cellular component (C) contains the location information of a protein, in order to ensure the accuracy of the prediction, only biological process (P) and molecular function (F) were extracted.

The specific steps are as follows:

Step 1. The “Text” documents of all protein sequences were downloaded in Swiss-Port, and the annotation information of all biological processes (P) and molecular functions (F) was extracted.

Step 2. BLAST [26] was used to find homologous sequences of biological process (P) and molecular function (F) without annotation information. The homology threshold was set to 60%, and the value was set to 0.001.

Step 3. The frequency of occurrence of each GO term was calculated: where denotes the frequency of the th GO terms at the position of Gram-positive bacteria and is the total number of amino acid sequences at the position of Gram-positive bacteria. A threshold valuewas set; when , the corresponding GO terms were retained.

Step 4. The GO terms of all target sequences were integrated and repeated, then 2573 GO terms were acquired. Finally, the 2573 GO terms were integrated into one vector, : where is 0 or 1, and the GO number with the corresponding location information of the proteins was set to 1; otherwise, it was 0.

2.5. Autocovariance Average Chemical Shifts (acACS)

The most important issue is how to extract features from primary sequences of a protein in a predictor. Hence, the acACS [27, 28] algorithm that uses simple secondary structure information to represent the sample of a protein was proposed. The average chemical shift of a protein is closely related to the protein’s secondary structure [29] and the function of this protein. The secondary structure of the protein sequence (C, H, and E) was obtained by submitting the protein sequence to the PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/) online tool, and then the secondary structure was submitted to Fan et al.’s [30] average chemical shift service website acACS (http://wlxy.imu.edu.cn/college/biostation/fuwu/acACS/index.asp) to obtain the results of the chemical shifts. For a protein where means the length of the protein sequence and is the 20 amino acid residues; thus, can be expressed as follows: where represents the correlation factor of the average chemical shift for with the average chemical shift for along the protein sequence. The factor means the rank of correlation. The factor can be represented in a different composition of , , , and . In order to obtain the best accuracy, an appropriate number factor and the best combination mode were selected to predict the results.

2.6. Hydropathy Dipeptide Composition (hpDC)

Hydropathy dipeptide composition is based on the improvement of hydrophilic and hydrophobic proteins. Firstly, 20 kinds of amino acids were divided into 6 categories [31] according to the hydrophilic and hydrophobic standards, namely, strong hydrophilic amino acids (H), strong hydrophobic amino acids (L), weak hydrophilic amino acids or weak hydrophobic amino acids (W), and three types of proline (P), glycine (G), and cysteine (C) with special chemical structures. Hydrophilic and hydrophobic dipeptide composition is a discrete method that uses protein sequence representation, and it can be represented as a 36-dimensional vector: where represents the occurrence frequencies of the 36 hydropathy dipeptides, while denotes the occurrence number of the 36 hydropathy dipeptides of the protein and is the length of the protein.

2.7. Support Vector Machine (SVM)

The Support Vector Machine is a machine learning method to solve classification and regression problems based on statistical principles. The SVM model is a representation of the examples as points in space, mapped by a kernel function so that the examples are divided by a clear gap that is wide enough. The new examples are mapped into the same space and predicted according to which side of the gap they fall on. The radial basis kernel function (RBF) was used to obtain the best classification hyperplane. The regularization parameter and the kernel width parameter were tuned via the grid search method. So far, the risk minimization of the SVM algorithm has become the latest research hotspot and it has been successfully applied to various fields [3238], especially in the field of biological computing, such as in the prediction of protein sequence structure and in the classification of protein structure [28, 3946]. In this paper, the LIBSVM algorithm has been used to predict various feature information, which can be downloaded from http://www.csie.ntu.edu.tw/cjlin/libsvm/.

3. Results

3.1. Cross-Validation

In statistical prediction, three test methods of prediction accuracy are used: the Jackknife test, the-fold cross-validation test, and the independent test [8, 4753]. In this paper, a strict and objective method for the Jackknife test was adopted to examine the performance of the proposed model. The principle of the Jackknife test is to select one from among all protein sequences as a testing set and the other remaining sequences as a training set until all protein sequences are recycled once.

3.2. Evaluation of the Predictive Performances

In order to evaluate the performance of related predictive methods and the reliability of the algorithm, the sensitivity (), specificity (), accuracy (ACC), Matthew’s correlation coefficient (MCC), and overall accuracy (OA) [5459] were used and defined by where is the total number of protein sequences in the dataset, TP represents the numbers of the correctly recognized positives, FN is the numbers of the positives recognized as negatives, FP means the numbers of the negatives recognized as positives, while TN is the numbers of correctly recognized negatives.

3.3. The Prediction of Gram-Positive Bacteria

In this paper, in order to investigate the effectiveness of our approaches, we have used five feature extraction strategies and the SVM is used as classification algorithm.

The autocovariance average chemical shift (acACS) vectors were formed based on protein sequence, and in order to obtain the best prediction results, we need to find the best chemically shifted atom combination and the best parameter . Figure 1 shows that the predicted results for ranges from 0 to 56, and the best is 40. Figure 2 shows that the prediction result was the best when the combination mode of chemically shifted atoms was . For gene ontology information, the first 2573-dimensional vector was obtained. Since the redundancy of data has a detrimental effect on the prediction results, we used the method of principal component analysis to reduce the vector to 854 dimensions. First of all, the 2573 GO terms were integrated into one vector, then the frequency of each GO term was counted. According to the sum of frequencies, the first 854 data was selected.

The predicted results by the Jackknife test for the different information parameters are recorded in Table 2, and the predicted results based on the combined parameter information with the Jackknife test are shown in Table 3. The results showed that the combined parameters were better than a single characteristic parameter. And the combined parameter AAC+DC+hpDC obtained the best accuracy which was 86.1%. The results indicated that the combined parameter was helpful to predict the protein subcellular location of Gram-positive bacteria. The reason that the accuracies of AAC+GO+acACS+hpDC, AAC+DC+GO+hpDC, and AAC+DC+GO+acACS+hpDC were lower than AAC+DC+hpDC was probably due to the redundancy of data.

4. Discussion

For the purpose of comparing the predictive capability of our method, the predicted results of Shen’s, Hu’s, and Julia Rahman’s method are enumerated in Table 4. It can be seen from Table 4 that our results were superior to others. The accuracy of our method was 3.4% higher than Shen’s first work, 3.9% higher than Shen’s second work, 0.2% higher than Hu’s work, and 12.9% higher than Julia Rahman’s work.

Gram-positive bacteria exist widely in nature and could cause many diseases, so studying Gram-positive bacteria subcellular location could solve the many problems of disease. In this paper, the dataset of protein subcellular location of Gram-positive bacteria was reconstructed, and the subcellular location of Gram-positive bacterial protein was predicted. The method in this paper had the advantages of a simple algorithm and an automatic process. The results showed that the combined parameter can improve the prediction accuracy of protein subcellular location of Gram-positive bacteria.

The protein data used to support the findings of this study are included within the supplementary information file.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest.

Authors’ Contributions

Gao XW conceived the selection of feature parameters and carried out the computation by SVM. Li FM analysed the results and wrote the manuscript. All authors reviewed the manuscript.

Acknowledgments

This work was supported by the Natural Science Foundation of Inner Mongolia of China (2019MS03015) and the National Natural Science Foundation of China (31360206).

Supplementary Materials

The protein data used to support the findings of this study are included within the supplementary information file. (Supplementary Materials)