Protein Local Tertiary Structure Prediction by Super Granule Support Vector Machines with Chou-Fasman Parameter

Received Dec 10th, 2011; Revised Jan 20th, 2012; Accepted Jan 30th, 2012

Prediction of a protein's tertiary structure from its sequence information alone is considered a major task in modern computational biology. In order to close the gap between protein sequences and their tertiary structures, we discuss the correlation between protein sequence and local tertiary structure information in this paper. The strategy used in this work is to predict small (local) portions of protein tertiary structure with high confidence from conserved protein sequences, called "protein sequence motifs". 799 protein sequence motifs that transcend protein family boundaries were obtained from our previous work. The prediction accuracy generated from the best group of protein sequence motifs always stays higher than 90% while more than 8% of the independent testing data segments are predicted. Since the most meaningful result in the latest publication is merely 70.02% accuracy at a coverage of 4.45%, the results achieved in this paper clearly outperform it. In addition, we set up a stricter evaluation of our prediction to further understand the relation between protein sequence motifs and tertiary structure prediction. The results suggest that the hidden sequence-to-structure relationship can be uncovered using the Super Granule SVM Model with the Chou-Fasman parameter. With the high local tertiary structure prediction accuracy provided in this article, the hidden relation between protein primary sequences and their 3D structures is uncovered considerably.


INTRODUCTION
Proteins are used by organisms for virtually every life function. Understanding the relationship between the amino acid sequence and the resulting protein structure is one of the most important research topics. First of all, based on many biochemical experiments, it is believed that a sequence is the sole determinant of a polypeptide's structural conformation. Most proteins have just one shape for their lifetime, but a handful -- in particular, proteins associated with viruses such as HIV and influenza, and alpha-synuclein (a protein involved in Parkinson's disease) -- have two dramatically different shapes: one before the disease attacks and one after. Second, the function of a protein is directly dependent on its three-dimensional structure. Last but not least, structure-based drug design in the medical field relies heavily on protein tertiary structural information, which is usually obtained from expensive X-ray crystallography or NMR spectroscopy.
Sequence motifs are conserved sequence patterns that are either functionally or structurally similar in a group of related proteins. Motifs are used to predict functional or structural portions of other proteins, including prosthetic attachment sites, enzyme-binding sites, DNA/RNA binding sites, and so on. Even though the discovery of new motifs requires tremendous time and effort, the modification of known motifs and the generalization of new motifs are major issues in academia. Protein sequence motifs are usually categorized into families. The signatures can be derived as complex descriptors or simple consensus patterns, such as blocks or profiles [1]. Some popular motif databases include PROSITE [2], BLOCKS [3], PRINTS [4], SBASE [5], and PFAM [6]. In terms of techniques, protein sequence motif discovery tools such as MEME [7], Gibbs Sampling [8], Block Maker [9], MITRA [10] and Profile Branching [11] are extensively adopted by the bioinformatics community. These applications, however, share a common issue: they limit the size of the input dataset. Consequently, little information that crosses family boundaries can be discovered by these databases and tools. In order to find protein sequence motif information that crosses family boundaries, the input dataset needs to be big enough to cover representative sequences for all known protein sequences. As a result, efficient techniques are demanded. Clustering is one of the most popular data mining techniques and has been studied extensively for protein sequence motif discovery [13,14,16,20-27]. Han et al. produced high-quality protein clusters from protein sequence frequency profiles [13,21] using the K-means clustering algorithm. These recurring patterns were regarded as vocabularies for understanding the whole sentence encoded in protein structure. Subsequently, they used the sequence clusters combined with a Hidden Markov Model (HMM) [28] to predict local protein structures. However, these conventional clustering algorithms assume that the distance between data points can be calculated. When the distance function is not well characterized, this approach might not reveal the true sequence-to-structure relationship [30].
Support Vector Machines (SVMs) [31] have established their importance in various research fields. SVMs implement the soft-margin concept to tolerate mislabeled examples while maximizing the margin, and they are capable of handling non-linear classification by implicitly mapping input samples into a higher dimension for maximum-margin hyperplane generation. From this point of view, the SVM can be more effective at discovering the non-linear sequence-to-structure relationship than the K-means clustering algorithm [30]. However, applying the SVM directly to this problem is not feasible because of the high computational cost of the SVM algorithm [17]: it is almost infeasible to train an SVM over half a million data segments, which is the necessary requirement for generating protein sequence motifs that cross protein family boundaries. Combining the SVM with granular computing, however, allows for uncovering the hidden sequence-to-structure relationship.
Recently, Zhong et al. [30] proposed the Clustering SVM for protein local tertiary structure prediction. With an aim to evaluate recurring pattern quality, 3D information including RMSD and torsion angles was integrated into their motif evaluation process. Our research goal is to reveal the correlation between protein primary sequences and their structures; as a result, no 3D information is included during the generation of protein sequence motifs. In this paper, we explain how to combine granular computing, the SVM, and the Chou-Fasman parameter to achieve our research goal. A detailed report on local protein structure prediction based on sequence information is also provided.

SUPER GRANULE COMPUTING MODELS
The Super Granule Support Vector Machine (Super GSVM) with Chou-Fasman parameter is a new model specifically designed for protein local tertiary structure prediction. It is founded on the FGK model [16] and the Super GSVM-FE model [23]. In this section, we explain the FGK model and the Super GSVM-FE model, and then propose the Super GSVM with Chou-Fasman parameter model.

The FGK Model for Protein Sequence Motifs Discovery
Granular computing represents information in the form of aggregates, also called "information granules" [17,18]. For a huge and complicated problem, it uses the divide-and-conquer concept to split the original task into several smaller subtasks, saving time and space complexity. In the process of splitting the original task, it also comprehends the problem without including meaningless information. As opposed to traditional data-oriented numeric computing, granular computing is knowledge-oriented [18].
A granular-computing-based model called the "Fuzzy Greedy K-means model" (FGK model) was proposed in our previous work [16]. This model works by using Fuzzy C-means (FCM) clustering to build a set of information granules and then applying our greedy K-means clustering algorithm to obtain the final information. The basic idea of the FGK model is shown in Figure 1. The greedy method collects five traditional K-means results and then selects the initial centroids based on those results. Because the centroids of higher-quality clusters have the potential to generate better clusters in the sixth round, we divide the initial-centroid selection procedure into five steps: initially gathering centroid seeds belonging to clusters with structural similarity greater than 80%, and then proceeding with 75%, 70%, 65% and 60%. The major advantages of the FGK model are reduced time and space complexity, filtered outliers, and higher-quality granular information results.
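The five-step greedy seeding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `greedy_centroid_seeds` is a hypothetical name, and the structural-similarity scores are assumed to be precomputed per cluster for each of the five K-means runs.

```python
import numpy as np

def greedy_centroid_seeds(runs, sims, thresholds=(0.80, 0.75, 0.70, 0.65, 0.60)):
    """Collect initial centroids for the sixth K-means round.

    runs -- list of centroid arrays, one per traditional K-means run
    sims -- list of per-centroid structural-similarity scores, aligned with runs
    Centroids from higher-similarity clusters are gathered first (>= 80%),
    then the cutoff is relaxed in 5% steps down to 60%.
    """
    seeds = []
    for cutoff in thresholds:
        for centroids, scores in zip(runs, sims):
            for c, s in zip(centroids, scores):
                # skip centroids already taken at a stricter cutoff
                if s >= cutoff and not any(np.array_equal(c, p) for p in seeds):
                    seeds.append(c)
    return seeds
```

In this sketch a centroid that qualifies at 80% is not re-added at the looser cutoffs; only clusters below 60% similarity contribute no seed at all.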

The Super GSVM-FE Model for Protein sequence Motifs Extraction
Basically, this new model is the next generation of the FGK model. It also uses the fuzzy concept to divide the original dataset into several smaller information granules. For each granule, after five iterations of traditional K-means clustering, the greedy K-means is applied. The next step differs from the FGK model: we adapt the ranking SVM to rank all members of each cluster generated by the greedy K-means clustering algorithm, and then filter out the lower-ranked members. The number of segments to eliminate is decided by a user-defined filtration percentage. The results for different percentages are discussed in [23], and 20% provides the best tradeoff. After the feature elimination step, we collect all surviving data points in each granule. Finally, we combine the results from all granules to create the final protein sequence motif information. Figure 2 demonstrates the Super GSVM-FE model.

The Super Granule Support Vector Machine (Super GSVM) with Chou-Fasman Parameter Model for Protein Local Tertiary Structure Prediction
The sketch of the proposed model is shown in Figures 3 and 4. The whole model can be divided into two parts: 1. Generate and extract protein sequence motifs mainly from primary sequence information (Figure 3); 2. Predict protein local tertiary structures through the obtained motifs (Figure 4).
In order to discover protein sequence motif information that is universally conserved across protein family boundaries, our original input dataset is extremely large. Therefore, an efficient granular computing technique is applied: the Fuzzy C-means clustering algorithm is utilized as the first step to softly divide the huge training dataset into 10 smaller information granules. For each information granule, we then carry out the greedy K-means clustering algorithm [16], which performs five iterations of traditional K-means clustering and then brings together the good clusters' centroids as the starting centroids for the sixth round. 343 of the 799 clusters are considered meaningful recurring patterns (for more information, including parameter setup and detailed results, please refer to [16]). After the quality evaluation, the Chou-Fasman parameter is calculated and appended to all data segments. Since the size of each cluster (905.75 members on average) is much smaller than the original training dataset (more than half a million data segments), we are able to train a ranking SVM based on secondary structure for each cluster. Based on the trained ranking-SVM models, we generate the rank of all members within each cluster. The research results in [23] have shown that eliminating the 20% lowest-ranked members of each cluster generates the optimal protein sequence motif information from the biological and biochemical perspectives. Thus, we purge the 20% lowest-ranked members from each cluster, resulting in 536 out of 799 meaningful recurring patterns. To conclude the first part of the model, we collect all extracted recurring patterns for the second part: local tertiary structure prediction. It is important to note that during the first part of the model, no 3D information is involved. After the sequence motifs are formed, for each cluster we use all members' 3D structures to calculate the representative 3D structure of the cluster. 3D information is only appended after the cluster is generated and extracted. Our objective is to anticipate similar 3D structure between discovered protein sequence recurring patterns and the independent testing dataset on the basis of similarity shared in the primary sequence.
The second part of the Super GSVM with Chou-Fasman parameter model is straightforward: for each independent testing sequence segment, we first append its Chou-Fasman values and then calculate the total distance (including the difference in primary sequence and in Chou-Fasman values) to all sequence clusters by formula (2). Because the protein sequence motifs we discovered transcend protein family boundaries, we can directly search for a match without pre-processing the testing dataset into protein categories or families. If we find a closest cluster within a given distance threshold, we can say that the testing segment is close enough to our discovered sequence motif and should have a tertiary structure similar to the representative 3D structure of that motif. Needless to say, how to set this threshold is a research problem: the stricter the threshold, the higher the prediction accuracy, but also the fewer testing segments that can be predicted. Detailed results relating the threshold, the prediction coverage and the prediction accuracy are shown in the results section. Because sequence motifs, by definition, only occur in a limited number of positions within a protein's sequence, we emphasize "local" tertiary structure prediction [29] instead of complete tertiary structure prediction.
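The matching step above can be sketched compactly. This is an illustrative fragment, not the paper's code: `predict_local_structure` is a hypothetical name, and the per-cluster distances are assumed to have been computed with formula (2).

```python
def predict_local_structure(distances, cluster_structures, threshold):
    """Pick the closest sequence-motif cluster for one testing segment.

    distances          -- dict: cluster id -> total distance from formula (2)
    cluster_structures -- dict: cluster id -> representative 3D structure (ADM)
    Returns the representative structure of the closest cluster, or None
    when no cluster lies within the distance threshold (segment not predicted).
    """
    closest = min(distances, key=distances.get)
    if distances[closest] <= threshold:
        return cluster_structures[closest]
    return None
```

Tightening `threshold` raises accuracy but shrinks coverage, exactly the tradeoff discussed in the text.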

Training dataset
Since the major purpose of this work is to obtain protein sequence motif information across protein family boundaries, the dataset of our work is supposed to represent all known protein sequences. However, without a systematic approach, it is very difficult to extract useful knowledge from an extremely large volume of data. The basic principle we use is to choose representative protein files from the whole PDB database, and then use the profiles in HSSP to expand each file. The dataset used in this work includes 2710 PDB protein sequences obtained from the Protein Sequence Culling Server (PISCES) [19]. Among these 2710 protein sequences, no two share more than 25% sequence identity. HSSP is a derived database merging structural (3D) and sequence (1D) information. For each protein of known 3D structure from the Protein Data Bank (PDB), the database contains a multiple sequence alignment of all available homologues and a sequence profile characteristic of the family [35]. At the end of each HSSP file, the occurrence percentage of every amino acid at each sequence position is calculated. An example of the 1b25 HSSP file is given in Figure 5. The sliding window technique with nine successive residues is applied to the protein sequences: each window represents one sequence segment of nine contiguous positions, and more than 560,000 segments are generated by this method. Figure 6 shows how we apply the sliding window technique to the 1b25 HSSP file. Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix: nine rows represent the positions of the sliding window and twenty columns represent the 20 amino acids. In the frequency-profile (HSSP) representation of sequence segments, each position of the matrix holds the frequency of a specified amino acid residue at a sequence position in the multiple sequence alignment.
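The sliding-window segmentation can be sketched as follows, assuming the HSSP frequency profile has already been parsed into a NumPy array with one row per sequence position and 20 columns for the amino acids (`sliding_windows` is an illustrative name, not the authors' code).

```python
import numpy as np

def sliding_windows(profile, window=9):
    """Split an HSSP frequency profile (length x 20) into overlapping
    segments of `window` consecutive positions, each a window x 20 matrix."""
    n = profile.shape[0]
    return [profile[i:i + window] for i in range(n - window + 1)]
```

A sequence of length n yields n − 8 nine-residue segments, which is how 2710 protein files expand to more than 560,000 segments.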

Independent testing dataset
The latest release of PISCES includes 4345 PDB files. Compared with the dataset used in our experiment (obtained in 2005), 2419 PDB files are new. In this experiment, we use the protein sequence motif information generated from our old dataset (2710 protein sequence files) to predict the tertiary structure of these 2419 protein files. Therefore, we regard the 2710 protein files as the training dataset and the 2419 protein files as the independent testing dataset, which yields 486,234 segments with the sliding-window approach.

The source of secondary and tertiary structure information
We also obtained the secondary structure of each sequence segment from DSSP [34], a database of secondary structure assignments for all protein entries in the Protein Data Bank. The main uses of the secondary structure information are to evaluate sequence clusters and to train the ranking SVM. Originally, DSSP assigns secondary structure to eight different classes. In this study, however, those eight classes are reclassified into three categories according to the following conversion model: H, G, and I to H (Helices); B and E to E (Sheets); and all others to C (Coils). The tertiary structures of the protein sequence segments in the training and testing sets are available from the Protein Data Bank (PDB).
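The eight-to-three-state conversion above is a simple mapping; a minimal sketch:

```python
def dssp_to_three_state(ss):
    """Collapse DSSP's eight secondary-structure classes into three:
    H/G/I -> H (helix), B/E -> E (sheet), everything else -> C (coil)."""
    if ss in ('H', 'G', 'I'):
        return 'H'
    if ss in ('B', 'E'):
        return 'E'
    return 'C'
```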
In the Super GSVM with Chou-Fasman parameter model, the Chou-Fasman parameter is encoded right after the protein recurring patterns (clusters) are generated and the testing data are read in. The encoded value is computed as follows. For each position within one window, we calculate the propensity values for helices, sheets and turns. Since the window size selected in this paper is 9 and three different secondary structures are considered, an additional 9 × 3 information segment is added by the Chou-Fasman encoding procedure. As previously mentioned, for each position within a window, HSSP provides the probability of each amino acid appearing. Since the Chou-Fasman parameter (Table 1) provides relative values for secondary structure determination, summing, over the twenty amino acids, the product of each amino acid's probability and its corresponding helix value in the Chou-Fasman parameter yields the total helix value; sheets and turns (or coil) follow the same approach. For example, for a position with A (10%), R (2%), D (20%), ..., the total helix value equals 10% × 142 + 2% × 98 + 20% × 101 + ... and so on.
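The frequency-weighted propensity sum described above can be sketched per position. This is an illustrative fragment: `chou_fasman_scores` is a hypothetical name, and the three propensity values per amino acid (helix, sheet, turn) are assumed to come from the published Chou-Fasman table (Table 1); verify the values against that table before use.

```python
def chou_fasman_scores(position_freqs, propensities):
    """Helix/sheet/turn scores for one window position.

    position_freqs -- dict: amino acid -> HSSP frequency (0..1) at this position
    propensities   -- dict: amino acid -> (P_helix, P_sheet, P_turn)
    Returns the (helix, sheet, turn) totals; applied to all 9 positions this
    yields the extra 9 x 3 segment appended by the encoding step.
    """
    h = sum(f * propensities[aa][0] for aa, f in position_freqs.items())
    e = sum(f * propensities[aa][1] for aa, f in position_freqs.items())
    t = sum(f * propensities[aa][2] for aa, f in position_freqs.items())
    return h, e, t
```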

Distance Measure
Since the Manhattan distance weighs every position of the frequency profile equally, this distance measure is the most suitable for this research [13]. The following formulation is adopted to obtain the distance between two sequence segments [13]:
D(Fk, Fc) = Σ_{i=1}^{L} Σ_{j=1}^{N} | Fk(i,j) − Fc(i,j) |

where L is the window size and N is 20, representing the 20 different amino acids. Fk(i,j) is the value of the matrix at row i and column j and represents the sequence segment; Fc(i,j) is the value of the matrix at row i and column j and represents the centroid of a given sequence cluster. The lower the dissimilarity value, the higher the similarity between the two segments.
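The city-block distance above is a one-liner over the two L × 20 frequency matrices; a minimal sketch:

```python
import numpy as np

def sequence_distance(Fk, Fc):
    """City-block (Manhattan) distance between a sequence segment Fk and a
    cluster centroid Fc, both L x 20 frequency matrices (L = 9, N = 20)."""
    return float(np.abs(Fk - Fc).sum())
```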

Distance Measure together with Chou-Fasman Parameter
The city-block distance measure is still valid after the Chou-Fasman parameter is encoded in each sequence segment. The following formula is used to calculate the total dissimilarity of two sequence segments:

D = w1 · Σ_{i=1}^{L} Σ_{j=1}^{N} | Fk(i,j) − Fc(i,j) | + w2 · Σ_{i=1}^{L} Σ_{j=1}^{M} | Ck(i,j) − Cc(i,j) |    (2)

where Ck and Cc are the 9 × 3 Chou-Fasman score matrices of the segment and the centroid; the weights w1 and w2 are explained under Figure 4. To evaluate cluster quality, the average intra-cluster secondary-structure similarity is computed as

Similarity = ( Σ_{i=1}^{ws} max(Pi,H, Pi,E, Pi,C) ) / ws

where ws is the window size and Pi,H is the frequency of occurrence of helix among the segments of the cluster at position i; Pi,E and Pi,C are defined in a similar way. If the structural homology for a cluster exceeds 70%, the cluster can be considered structurally identical [15]. If the structural homology for the cluster is between 60% and 70%, the cluster can be viewed as weakly structurally homologous [20].
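The combined distance of formula (2) simply adds a second city-block term over the 9 × 3 Chou-Fasman matrices; a sketch, assuming equal weights as in the paper (`combined_distance` is an illustrative name):

```python
import numpy as np

def combined_distance(Fk, Fc, Ck, Cc, w1=1.0, w2=1.0):
    """Total dissimilarity of formula (2): weighted sum of the sequence-profile
    city-block distance (L x 20) and the Chou-Fasman city-block distance
    (L x 3, one column each for H, E and C scores)."""
    seq_term = np.abs(Fk - Fc).sum()
    cf_term = np.abs(Ck - Cc).sum()
    return float(w1 * seq_term + w2 * cf_term)
```

With both weights equal to 1 this reduces to a single Manhattan distance over the concatenated 9 × 23 representation.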

Tertiary Structure Distance (dmRMSD)
In this research, we use a "Distance Matrix", the mutual distances among Cα carbons, to represent both the real and the predicted 3D structure. The distance matrix representing a structural segment stores the distance from the first residue's Cα carbon to the second residue's Cα carbon, from the first to the third, ..., from the first to the ninth, then from the second to the third, and so on. In our example, since the window size equals nine, the distance matrix stores 36 distances in total. To describe the representative 3D structure of a cluster, we introduce the Average Distance Matrix (ADM), which records the average over the distance matrices of all sequence segments in one cluster:

ADM(i,j) = (1/N) Σ_{k=1}^{N} d_k(i,j)

where d_k(i,j) refers to the distance between α-carbon atom i and α-carbon atom j in sequence segment k of length L, and N is the total number of sequences in the cluster. To calculate the structural distance between the real structure and the predicted one, we use dmRMSD [36,37], defined as

dmRMSD = sqrt( (1/M) Σ_{i<j} ( ADM(i,j) − s(i,j) )^2 )

where ADM(i,j) represents the predicted sequence cluster's 3D structure and s(i,j) is the structure information to be predicted. M is the number of distances in the distance matrix; since the window size is nine (L = 9), M = 36.
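The distance-matrix representation and dmRMSD can be sketched directly from the definitions above (illustrative names; Cα coordinates are assumed to be a 9 × 3 array):

```python
import numpy as np

def distance_matrix(coords):
    """Upper-triangle Calpha-Calpha distances of a length-L segment,
    flattened into a vector of L*(L-1)/2 values (36 for L = 9)."""
    L = coords.shape[0]
    return np.array([np.linalg.norm(coords[i] - coords[j])
                     for i in range(L) for j in range(i + 1, L)])

def dm_rmsd(adm, dm):
    """Distance-matrix RMSD between a cluster's average distance matrix
    (the predicted structure) and a real segment's distance matrix."""
    return float(np.sqrt(np.mean((adm - dm) ** 2)))
```

The cluster ADM is then just the element-wise mean of the member segments' distance-matrix vectors.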

Quality of Protein Sequence Motifs Information
Because our main research idea is to use protein sequence patterns generated from sequence (1D) information alone to predict protein tertiary (3D) structure, the quality of the protein sequence recurring patterns dominates the success of our experiment. As a result, improving the quality of our protein sequence pattern (motif) information is our first priority. Intra-cluster secondary structure similarity within the protein sequence clusters is the major evaluation criterion. According to [15,20], if the structural homology for a cluster exceeds 70%, the cluster can be considered structurally identical; if it is between 60% and 70%, the cluster can be viewed as weakly structurally homologous. Therefore, we separate cluster quality into four classes based on the secondary structure similarity of the clusters: Bad (<60%), Fair (60%~70%), Good (70%~80%) and Excellent (>80%).
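The four quality buckets above amount to a simple threshold ladder; a minimal sketch (similarity expressed as a percentage):

```python
def quality_class(similarity):
    """Bucket a cluster by its intra-cluster secondary-structure similarity:
    Bad (<60%), Fair (60%~70%), Good (70%~80%), Excellent (>80%)."""
    if similarity > 80:
        return 'Excellent'
    if similarity > 70:
        return 'Good'
    if similarity >= 60:
        return 'Fair'
    return 'Bad'
```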
In our previous work, we successfully obtained 343 meaningful clusters out of 799 (detailed cluster quality information is available in Table 2). Next, we further extract all clusters by training a ranking SVM for each cluster and discarding the 20% lowest-ranked data segments. An additional 200 high-quality (secondary structural similarity > 70%) protein sequence patterns are produced [29] (detailed cluster quality information is also available in Table 2). Since our focus is on finding protein sequence motifs that cross family boundaries, we are able to use our protein sequence patterns to predict protein local tertiary structures on any unknown protein sequence without being limited to a specific protein family. The key difference between this experiment and the latest study [29] is the inclusion of the Chou-Fasman parameter [32,33] on each data segment before the clusters are trained by the ranking SVM. The same training dataset and independent testing dataset were used in this experiment as in our previous work [29]. Table 2 shows the number of sequence clusters belonging to the different quality categories generated by the different approaches; the first row of Table 2 indicates the secondary structure quality category.
As shown in Table 2, both Super GSVM models outperform the original FGK model. Comparing the two Super GSVM models, we find that the one with the Chou-Fasman parameter produces one more high-quality cluster (secondary structural similarity > 70%). This suggests that the addition of the Chou-Fasman parameter enables the ranking SVM to rank the belongingness of each cluster member to its particular cluster more intelligently, resulting in higher-quality clusters. Although the total number of clusters above 60% is reduced and the difference in the number of high-quality clusters is not large, the prediction accuracy for protein local tertiary structure increases dramatically, as reported in the next section.

Prediction Accuracy Comparison
In Figure 7, we provide a visual description to explain the use of protein local tertiary structure prediction. The colored portions of the figure depict the local tertiary structure prediction. Researchers can use the predicted portions as anchors to expand consecutive fractions and form a global tertiary structure prediction. Undoubtedly, the prediction accuracy and the prediction coverage (how many colored portions are formed) play the key roles in the success of global prediction. In this subsection, we consider a prediction of local 3D structure successful if the average dmRMSD is less than 1.5 Å. A complete report on the prediction accuracy generated from different cluster groups and the number of predicted segments is provided in Table 3. The first column shows different distance thresholds (corresponding to step (4) in Figure 4) based on the distance calculation of "Distance Measure together with Chou-Fasman Parameter" described above. The second, fifth, and eighth columns give the prediction accuracy at the given distance threshold under the different cluster groups. The third, sixth, and ninth columns give the number of predicted sequence segments under the different cluster groups. The prediction coverage (fourth, seventh, and tenth columns) is the number of predicted segments divided by the total number of testing sequence segments, which equals 486,234.
A full comparison of the prediction accuracy between the Super GSVM model [29] and our newly proposed Super GSVM with Chou-Fasman parameter model is presented in Figures 8 and 9. Excellent (in Figure 8) and Good (in Figure 9) are the prediction results generated from the four different groups in this paper; P-Excellent (in Figure 8) and P-Good (in Figure 9) are the prediction results reported in [29]. As mentioned earlier, different distance thresholds generate different prediction accuracies and coverages. Since the distance is calculated differently in this research than in [29], it is not useful to directly compare the accuracy-versus-distance-threshold relationship. However, coverage is consistent between both experiments; as a result, we use coverage as the X-axis in Figures 8 and 9 to show a direct comparison.
The new prediction results show a clear increase in accuracy compared with the previous work [29]. The prediction accuracy line of the Excellent group stands alone at the top of the figure and always stays above 84%. The best prediction accuracy result in [29] is 71.98%, which covers a mere 0.14% of the testing dataset. At the same coverage, the prediction accuracy of this work is approximately 97%, a 25% improvement. Even the Good group in this experiment shows better quality than the P-Excellent group. Since the Fair group did not generate meaningful prediction results either in this research or in [29], we skip that comparison.

Fig. 2 The sketch of the Super GSVM-FE Model

FUTURE WORKS
Without any parallelization, it took our team 18 days to generate sequence clusters from our 500 MB training dataset and another three months to train ranking SVMs on all 799 clusters. Currently, we are adapting our model to support high-performance computing so that we can feasibly try many different parameters and adopt the latest data.
Multiple experiments naturally follow from this study. First, we can compare the newly generated clusters with the clusters from the previous study. This could reveal a new metric for cluster quality as well as increase our understanding of the impact that slight modifications at the primary structure level have on the overall tertiary structure. Second, we can search for the best weights (in equation (2)) between the protein sequence and the Chou-Fasman parameter to calculate the optimal distance between two sequence segments. Last but not least, an intelligent voting mechanism could be included for better prediction accuracy.

CONCLUSION
In conclusion, it appears that the inclusion of the Chou-Fasman parameter in the training set presented to the Super GSVM significantly increases prediction accuracy. The increase is achieved without a significant rise in the quality of the clusters as measured by secondary structure homology. This suggests that the Chou-Fasman parameter (used in the prediction of secondary structure) may hold some value for the prediction of tertiary structure beyond that held by secondary homology. To the best of our knowledge, this is the first time the Chou-Fasman parameter has been adopted into a mechanism for protein local tertiary structure prediction. Local tertiary structure prediction accuracy above 90% is achieved by our Excellent protein sequence pattern group. This high prediction accuracy implies that it is feasible to predict local tertiary structure information from sequence information alone.

Fig. 1
Fig. 1 The sketch of the Fuzzy Greedy K-means (FGK) Model

Fig. 3
Fig.3 The procedure of generating and extracting protein sequence motifs from primary sequence information with Chou-Fasman parameter

Fig. 4
Pseudo code for the super Granule Support Vector Machine Model (Super GSVM)
where w1 and w2 indicate the weights of the sequence dissimilarity and the Chou-Fasman value; in this paper, both weights equal 1. L is the window size and M is 3 for the three different secondary structure (H, E and C) score values. The lower the total dissimilarity value, the higher the similarity between the two segments.

Table 2 .
The comparison of the number of clusters belonging to different quality groups

Table 4 .
Prediction accuracy with the 1.0 Å criterion on three cluster groups under different distance thresholds