Protein Local Tertiary Structure Prediction by Super Granule Support Vector Machines with Chou-Fasman Parameter

Bernard Chen, Minwoo Kim, Matthew Johnson, Wooyoung Kim, Yi Pan

Abstract


Predicting a protein's tertiary structure from its sequence information alone is considered a major task in modern computational biology. To close the gap between protein sequences and their tertiary structures, we examine the correlation between protein sequence and local tertiary structure information in this paper. Our strategy is to predict small (local) portions of protein tertiary structure with high confidence from conserved protein sequences, which are called “protein sequence motifs”. We use 799 protein sequence motifs that transcend protein family boundaries, obtained in our previous work. The prediction accuracy obtained from the best group of protein sequence motifs stays above 90% while more than 8% of the independent testing data segments are predicted. Since the most meaningful result in the latest publication is only 70.02% accuracy at a coverage of 4.45%, the results achieved in this paper clearly outperform it. We also set up a stricter evaluation of our predictions to further understand the relation between protein sequence motifs and tertiary structure prediction. The results suggest that the hidden sequence-to-structure relationship can be uncovered using the Super Granule SVM Model with the Chou-Fasman Parameter. With the high local tertiary structure prediction accuracy reported in this article, the hidden relation between protein primary sequences and their 3D structures is uncovered to a considerable extent.

Keywords


Protein Structure Prediction, Local Tertiary Structure, Sequence Motif, Chou-Fasman Parameter, Super Granule SVM
