Abstract

High-accuracy alignment of sequences that carry disease information contributes to disease treatment and prevention. The results of multiple sequence alignment depend on the parameters of the objective function, including the gap open penalty (GOP), the gap extension penalty (GEP), and the substitution matrix (SM). First, theory parameter formulas for GOP, GEP, and SM are inferred from the length, number, and identity of the unaligned sequences. Second, we tested the rationality of the theory parameter formulas through experiments with the ClustalW and MAFFT programs. In addition, we obtained a group of MAFFT program parameters according to the proposed formulas. In all experiments, the SPS (sum-of-pair score) obtained with the theory parameters is better than the SPS obtained with the default parameters of ClustalW and MAFFT. In both theory and practice, our method for determining the parameters is feasible and efficient, and it can provide high-accuracy alignment results for precision medicine.

1. Introduction

In 2015, US President Barack Obama stated his intention to fund a United States national “Precision Medicine Initiative” [1, 2]. A short-term goal of the Precision Medicine Initiative is to expand cancer genomics to develop better prevention and treatment methods. With the explosive growth of medical data, the complexity of disease, and the demand for personalized medicine, the results of genome sequencing research are changing the process of disease treatment. As a result, multiple sequence alignment (MSA) is becoming increasingly important.

MSA has wide applications in sequence analysis, gene recognition, protein structure prediction, and phylogenetic tree reconstruction [3]. Notredame [4] stated that most modern programs for constructing an MSA consist of two components: (1) an objective function to assess the quality of a candidate alignment and (2) an optimization procedure for identifying the highest scoring alignment with respect to the chosen objective function. Currently, MSA has three main objective functions: (1) the sum-of-pairs score function (SPS), (2) the consensus function, and (3) the tree function. The SPS function is the most commonly used objective function, and its parameters include the substitution matrix, the gap opening penalty (GOP), and the gap extension penalty (GEP).

How to obtain optimal parameters for the objective function has generated much discussion. Thompson et al. [5] determined that substitution matrices vary at different alignment stages according to the divergence of the sequences to be aligned. Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions can cause new gaps to appear in potential loop regions rather than in regular secondary structure. Reese and Pearson [6] discussed the relational formula between the PAM distance and the PAM matrix as well as the gap penalty. Madhusudhan et al. [7] proposed a variable penalty formula according to the structure of the sequence, based on dynamic programming. However, these formulas are not widely used. Gondro and Kinghorn [8] indicated that gap penalty parameters were determined by experience. At present, there is no theoretical framework for determining the optimum parameters. The parameters of the objective function in most of the literature are empirical values that do not depend on the sequences being aligned [9]. BAliBASE is a database of manually refined multiple sequence alignments [10] and is usually used to test the performance of MSA methods [11].

Many open source online alignment tools are available that can align hundreds of thousands of sequences in hours. These include CLUSTAL Omega, T-COFFEE, and MAFFT [5, 12–14], which often become the primary source of sequence alignment solutions. However, the results of these MSA tools strongly depend on the gap penalty and the substitution matrix, and different parameter combinations can produce different MSA results. The majority of users apply a single set of default parameters when using these alignment tools, which does not necessarily give the best results. Moreover, an effective methodology has not yet been developed to directly determine optimal MSA parameters, which means that current online tools cannot guarantee the best solution. Compared with other MSA alignment tools, MAFFT has the advantage of simple input parameters and obtains better results than the other tools [12, 13]. This paper uses MAFFT as the basic experimental tool to verify the accuracy of the original formulas presented herein as they relate to the substitution matrix and the gap penalty.

2. Sum-of-Pairs (SP) Objective Function

The sum-of-pairs (SP) function, given in (1), is commonly used as an objective function for MSA. The score is greater than 0, and a higher score indicates a more accurate MSA [15]. The SP function has two components: the total score of the amino acid residues in the aligned sequences and the total penalty incurred by inserting gaps.

The total residue score is calculated from the residues of each sequence, over the length L of the aligned sequences and the number of sequences.

Cost is computed from a substitution matrix. Currently, two main kinds of substitution matrices are available: PAM and BLOSUM. The BLOSUM series is used in this research. In a substitution matrix, the scores for matched residues differ from one another, and the scores for mismatched residues also differ. To simplify the calculation, however, we need a single representative value for each of these characteristics of the matrix, and the average is a good summary of a group of different values. Therefore, the average of the match scores represents the match of the matrix, and the average of the mismatch scores represents the mismatch of the matrix.
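The averaging step can be illustrated with a minimal sketch. The three-letter matrix below is a toy placeholder (its values are assumptions for illustration, not BLOSUM entries), and the function name `average_match_mismatch` is hypothetical.

```python
# Toy symmetric substitution scores over a 3-letter alphabet.
# Placeholder numbers for illustration only, not BLOSUM values.
TOY_MATRIX = {
    ("A", "A"): 4, ("C", "C"): 9, ("D", "D"): 6,
    ("A", "C"): 0, ("A", "D"): -2, ("C", "D"): -3,
}


def average_match_mismatch(matrix):
    """Return (mean score of identical pairs, mean score of differing pairs)."""
    match = [s for (a, b), s in matrix.items() if a == b]
    mismatch = [s for (a, b), s in matrix.items() if a != b]
    return sum(match) / len(match), sum(mismatch) / len(mismatch)


avg_match, avg_mismatch = average_match_mismatch(TOY_MATRIX)
print(avg_match, avg_mismatch)
```

Because the matrix is symmetric, averaging over unordered pairs gives the same mismatch mean as averaging over all off-diagonal cells.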

The calculation of the gap penalty falls into two categories: linear penalty and affine penalty. A linear penalty charges the same score for each gap. The affine penalty is more commonly used because it is biologically meaningful [16–18]. It distinguishes two types of penalty, the gap open penalty (GOP) and the gap extension penalty (GEP), so the affine penalty is the number of gap openings multiplied by GOP plus the number of gap extensions multiplied by GEP, with GOP > GEP.
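As a rough illustration of how the residue score and the affine gap penalty combine, the following sketch scores a small alignment. Several choices are assumptions made only for this example: a toy match/mismatch score stands in for a BLOSUM lookup, residue-gap pairs contribute 0 to the residue score, the first position of each gap run is charged GOP and every further position GEP, and the SP score is taken as the residue total minus the gap total. The function names are hypothetical.

```python
from itertools import combinations

GAP = "-"


def pair_cost(a, b, match=2, mismatch=-1):
    """Toy stand-in for a substitution matrix lookup (assumed values)."""
    if a == GAP or b == GAP:
        return 0  # assumption: gaps are charged only through the affine penalty
    return match if a == b else mismatch


def affine_penalty(row, gop, gep):
    """Charge GOP for the first gap of each run and GEP for each further gap."""
    opens = extends = 0
    prev_gap = False
    for ch in row:
        if ch == GAP:
            if prev_gap:
                extends += 1
            else:
                opens += 1
        prev_gap = ch == GAP
    return opens * gop + extends * gep


def sp_score(alignment, gop, gep):
    """Residue score over all sequence pairs and columns, minus the gap penalty."""
    residue_total = sum(
        pair_cost(x[c], y[c])
        for x, y in combinations(alignment, 2)
        for c in range(len(x))
    )
    gap_total = sum(affine_penalty(row, gop, gep) for row in alignment)
    return residue_total - gap_total


aln = ["ACD-", "AC--", "ACDE"]
print(sp_score(aln, gop=3, gep=1))  # residue total 14, gap total 7
```

Other conventions for scoring residue-gap pairs or for counting the opening position exist; the sketch fixes one of them only to keep the arithmetic easy to follow.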

3. The Theory Parameters Determination of SP Function for MSA

Symbol Description. The quantities used below are the number of unaligned sequences, the length of the longest sequence, the length of the shortest sequence, the mean identity, the number of amino acid residues matched, and, after alignment, the number of gaps inserted into each sequence.

Table 1 summarizes, for each data set in BAliBASE 2.0 and BAliBASE 3.0, the ratio between the length of the longest sequence and the number of gaps inserted into it. It shows that the number of gaps in the longest sequence is not more than 0.2 times the length of the longest sequence. The number of gaps allowed in each sequence is therefore derived from this bound with a rounding function. Figure 1 shows how the sequence length and the number of gaps are related.

Figure 1 is an example. If , , and , the number of gaps inserted into the longest sequence is , and the ratio between the sequence and gaps is . The number of gaps in the sequence is . The number of gaps inserting the shortest sequence is , and the number of gaps in sequence is . The number of gaps in other sequences is .

The following parameter formulas are inferred from the situation shown in Figure 2. Figure 2(a) shows the best case for the unaligned sequences: each sequence has the same length and no gaps. The longest unaligned sequence has length 10, so up to 2 gaps can be inserted. Figure 2(b) shows the worst alignment result (maximum gap insertion and minimum matching). If the score of Figure 2(b) is higher than the score of Figure 2(a), the parameters of the objective function cover all cases of alignment, because Figure 2(b) is the worst alignment.
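The gap bound used in this example (a longest sequence of length 10 allows up to 2 gaps) can be written as a one-line helper. This is a minimal sketch: the function name `max_gaps` is hypothetical, and Python's built-in `round` is assumed as the rounding function, since the rounding mode is not specified here.

```python
def max_gaps(lengths, ratio=0.2):
    """Upper bound on inserted gaps: rounded value of ratio * longest length."""
    return round(ratio * max(lengths))


print(max_gaps([10, 10, 10]))  # longest length 10 gives a bound of 2
```

For integer lengths and a ratio of 0.2, the product is never exactly halfway between two integers, so the choice of tie-breaking rule should not matter in practice.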

3.1. Substitution Matrix Theory Formula

According to (1), the SP score of the unaligned sequences is obtained first, and according to (1) and Figure 2(b), the SP score of the aligned sequences is obtained in the same way. In theory, the alignment score must be greater than the unaligned sequence score, which gives inequality (9). Equation (9) can be simplified to (10), the formula of the substitution matrix, which can be simplified further to (11).

The rationality of the substitution matrix can be judged according to (11).

3.2. GOP and GEP Theory Formulas

Under the affine penalty, suppose that the number of gaps in each sequence is a fixed multiple of the number of gap openings; this determines the number of GOP and the number of GEP. Because GOP > GEP, GOP is taken to be a multiple of GEP, where the multiplier is a positive integer, which gives (12). According to (12), (9) can be expressed as (13). Equation (13) is the upper limit of GOP; a lower limit is also specified.

If the upper limit of GOP is multiplied by weight coefficients, the estimation formula of GOP, (14), is obtained. It uses a rounding function, the length of the shortest sequence in the unaligned set, and the mean identity of the unaligned set.

The estimation formula of GEP is (15). The optimal value of each weight coefficient in (14) and (15) can be obtained through the following experiments.

4. Simulation and Results

In order to test the rationality of the parameter formulas and determine the optimal value of each weight coefficient, we designed the following experiments on the BAliBASE 2.0 and BAliBASE 3.0.

4.1. Experiment Setting

BAliBASE version 2.0 [10] is an improved version, extended from version 1, with 167 reference alignments containing over 2100 sequences, organized into eight reference sets. Because all the reference alignments of BAliBASE are manually aligned, it is often used to test algorithms [19–21]. Because our study is based on the global SP function, we used 113 reference alignments in References 1–3 as test objects. BAliBASE version 3.0 is the most widely used multiple alignment benchmark. The database contains 218 multiple protein sequence alignments, which are divided into five reference sets. The first reference set includes equidistant sequences whose identity is less than 20% (RV11) or between 20 and 40% (RV12) [22]. The other reference sets have no similarity information. Because the formulas proposed in this paper need the similarity of the sequences, BAliBASE 2.0 and BAliBASE 3.0 (RV11 and RV12) were both used to establish data sets.

SPS (sum-of-pair score) serves as the scoring measure: the score increases when sequences are correctly aligned. If the SPS is higher, the alignment is closer to the reference alignment and can be even better than the reference alignment [20]. To test the rationality of the presented formulas and to determine the optimal parameter combination for MSA tools, the most popular alignment program, MAFFT [16], is used in this research. The alignment results are obtained through scripts written in the Perl programming language. The MAFFT program has some advantages: (1) it has few parameters, which are easy to control, using only the substitution matrix, GOP, and GEP, (2) through Perl, the MAFFT program can align in batches, and (3) its alignment accuracy is for the most part better than that of ClustalW, MUSCLE, and T-COFFEE.
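For readers unfamiliar with the measure, the sketch below computes an SPS under the usual BAliBASE-style definition, which is assumed here: the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment. The function names are hypothetical, and the two small alignments are made-up inputs, not benchmark data.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Collect every residue pair that shares a column.

    Each pair is recorded as (seq_i, residue_index_i, seq_j, residue_index_j),
    so the result does not depend on where gaps shift the columns.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != gap:
                present.append((s, counters[s]))
                counters[s] += 1
        for (i, ri), (j, rj) in combinations(present, 2):
            pairs.add((i, ri, j, rj))
    return pairs


def sps(test_alignment, reference_alignment):
    """Share of reference residue pairs reproduced by the test alignment."""
    ref = aligned_pairs(reference_alignment)
    test = aligned_pairs(test_alignment)
    return len(ref & test) / len(ref)


reference = ["AC-D", "ACED"]
test = ["ACD-", "ACED"]
print(sps(test, reference))  # 2 of the 3 reference pairs are recovered
```

The sketch assumes that both alignments contain the same sequences in the same order and that the reference has at least one aligned residue pair.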

In our experiment, ,. The GOP step is 1, the GEP step is 0.2, and the substitution matrices are BLOSUM30, BLOSUM45, and BLOSUM62. For each group of sequences, through batch processing, the number of alignment results is 1,590 because there are 1,590 different combined parameter patterns.
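A parameter sweep of this kind can be sketched as follows. The study drove MAFFT from Perl; Python is used here only for illustration. The value ranges are hypothetical placeholders (the actual bounds are not reproduced), so the sketch yields 27 combinations rather than 1,590. Mapping GOP to MAFFT's `--op` option, GEP to `--ep`, and the BLOSUM number to `--bl` is likewise an assumption about how the parameters were passed.

```python
from itertools import product

# Hypothetical ranges for illustration only; the study's bounds are not reproduced.
GOP_VALUES = [1.0, 2.0, 3.0]      # GOP step of 1
GEP_VALUES = [0.0, 0.2, 0.4]      # GEP step of 0.2
BLOSUM_VALUES = [30, 45, 62]      # BLOSUM30, BLOSUM45, BLOSUM62


def mafft_commands(fasta_path):
    """Yield one MAFFT argument list per (matrix, GOP, GEP) combination."""
    for bl, gop, gep in product(BLOSUM_VALUES, GOP_VALUES, GEP_VALUES):
        yield ["mafft", "--bl", str(bl), "--op", str(gop), "--ep", str(gep), fasta_path]


commands = list(mafft_commands("input.fasta"))
print(len(commands))
print(" ".join(commands[0]))
```

Each argument list could then be passed to `subprocess.run` with the output captured, giving one alignment per parameter combination.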

4.2. Experiment Results
4.2.1. The Verification of Substitution Matrix Formula

This section shows how the rationality of the substitution matrix was established (see (11)). Figure 3 illustrates the calculated value and the reference value of each of the three substitution matrices for Reference 2 (the figures for the other references are similar). When the calculated value and the reference value satisfy (11), the substitution matrix is rational. The results show that BLOSUM30, BLOSUM45, and BLOSUM62 meet the requirement for all sequences.

Table 2 lists the number of sequences that meet the substitution matrix requirement (see (11)). It shows that all three BLOSUM substitution matrices meet the requirement for all the sequences in References 1–3.

4.2.2. The Verification of Gap Penalty Formulas

Based on the SPS and the MAFFT program (MAFFT-7.220-WIN64 version), we tested the rationality of (14) and (15). The optimal GOP is the one that corresponds to the maximal SPS, as illustrated in Figure 4. From Figure 4, we can conclude that the GOP theory values inferred from (14) and (15) almost coincide with the optimal GOP, so (14) can be used to calculate the optimal value of GOP.
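Selecting the optimum from a sweep amounts to taking the parameter combination with the maximal SPS. The sketch below shows that step only; the function name is hypothetical, and the scores are made-up placeholders, not values from the experiments.

```python
def best_parameters(results):
    """Return the parameter tuple with the highest SPS, and that SPS."""
    best = max(results, key=results.get)
    return best, results[best]


# Made-up placeholder scores keyed by (matrix, GOP, GEP); not values from the study.
example = {
    ("BLOSUM45", 2.0, 0.2): 0.71,
    ("BLOSUM45", 3.0, 0.2): 0.78,
    ("BLOSUM62", 3.0, 0.4): 0.74,
}
print(best_parameters(example))
```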

Table 3 reports the number of sequences in Reference 1 (Test 2) for which the SPS obtained with the theory parameters is greater than the SPS obtained with the default parameters. In Test 2, there are 24 sequences. Table 3 shows that when , , , and , the number of sequences is greater than , , , and . The best result is obtained with BLOSUM45, where the number of sequences is 19 and the SPS is 0.8003 (set in boldface in Table 3). For Test 2 sequence sets, , is relatively rational and corresponds to = 0.05. The other sequence sets can also obtain the value of , , , , and , which are listed in Table 4.

4.2.3. Finding Optimal Value of Other Parameters in Derivation Formula

From the aforementioned experiments, we can determine the substitution matrix and , , and in (14). The other parameters are related to the sequences where is the ratio of GOP and , and . The number of GOP is limited and it will not increase too much, while the distribution of GEP is more concentrated. These parameters are more consistent with the biological characteristics of multiple sequence alignment.

Table 4 lists the optimal parameters, including the optimal value of each weight coefficient in the proposed formulas, together with the corresponding SPS values and the number of sequences corresponding to each SPS. Using the weight coefficients, we can obtain the optimal GOP, GEP, and substitution matrix parameters.

Figure 5 shows, for each sequence set, the SPS obtained with the theory parameters and with the default parameters of MAFFT (MAFFT-7.220-WIN64 version) and CLUSTALW (CLUSTALW-2.1-WIN version). With the default parameters, the SPS obtained by the MAFFT program is better than that obtained by the CLUSTALW program, so we chose the MAFFT program as our test method. The SPS obtained with our theory parameters is better than the SPS obtained with the default parameters of MAFFT and CLUSTALW. Thus, the theory parameters we propose can optimize the results of MSA.

Table 5 shows the mean SPS values for the References 1–3 sequences of BAliBASE 2.0 and for RV11/RV12 of BAliBASE 3.0. The alignments were obtained with the MAFFT default parameters, the CLUSTALW default parameters, and MAFFT with the theory parameters proposed in this study. The SPS values obtained with the MAFFT default parameters are better than those obtained with the CLUSTALW default parameters, and the SPS values obtained with our theory parameters are the best. So, the theory parameters optimized the results of MSA.

5. Conclusions

This paper shows that the parameters of MSA tools influence MSA results. The relevant factors include not only the substitution matrix, GOP, and GEP but also the length, number, and identity of the sequences. Our goal was to find a group of combined optimal parameters. Based on the SP function, we established a series of formulas that determine the substitution matrix, GOP, and GEP. To test the rationality of the formulas, our experiments were conducted with the MAFFT program on the BAliBASE 2.0 and BAliBASE 3.0 (RV11 and RV12) databases. Moreover, we obtained the optimal values of the substitution matrix, GOP, and GEP, and these values proved to be better than the default values of the MAFFT program. From the theoretical and experimental analysis, we conclude that the proposed method can effectively address the MSA parameter problem and improve MSA accuracy, which can provide more accurate information for precision medicine in disease analysis and prediction.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

ManZhi Li and HaiXia Long contributed equally to this work. ManZhi Li and HaiXia Long carried out the multiple sequence alignment parameter studies, participated in the experiments, and drafted the manuscript. HaiYan Fu and HongTao Wang participated in the design of the study and performed the statistical analysis. All authors read and approved the final manuscript.

Acknowledgments

This work was supported by the China Scholarship Council, the National Natural Science Foundation of China (no. 61762034, no. 71461008, no. 61663007, and no. 61163042), and the HaiNan Province Natural Science Foundation (no. 614235, no. 617122, and no. 20166222).