A Novel DNA Sequence Compression Method Based on Chaos Game Representation

K S Arun, Achuthsankar S. Nair, Oommen V Oommen

Abstract


Unique signature images derived out of Chaos Game Representation of bio-sequences is an area of research that has been confined to pattern recognition applications. In this paper we pose and answer an interesting question – can we reproduce a bio-sequence in a lossless way given the co-ordinates of the final point in its CGR image? We show that it is possible in principle, but would need enormous resolution for representation of coordinates, roughly corresponding to the information content of direct binary coding of the sequence. We go on to show that we can code nucleotide codon triplets using this method in which 16 codons can be coded using 4 bits, the remaining 48 using 6 bits. Theoretically up to 11% compression is possible with this method. However, algorithm overheads reduce this to very nominal compression percentage of less than 4% for human genome and 9% for bacterial genome. We report the results on a subset of standard test sequences and also an independent wider data set.

Keywords


Chaos Game Representation, Sequence Compression, Information-Content, Information-Entropy

Full Text:

PDF

References


Kathleen T. Alligood , Tim D. Sauer , James A. Yorke , “Chaos: An Introduction to Dynamical Systems”, Springer (1997).

Jeffrey H J Chaos game representation of gene structure. Nucleic Acids Res. 18:2163–2170 (1990).

Achuthsankar S Nair, Vrinda V Nair, K S Arun, Krishna Kant, AlpanaDey Bio-sequence Signatures Using Chaos Game Representation. In: Fulekar M H editor. Bioinformatics: Applications In Life And Environmental Sciences Springer Netherlands 62-76 (2010).

Deschavanne P J, Giron A, Vilain J, Fagot G, Fertil B Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol. 16:1391-1399 (1999).

Antonio Neme, Antonio Nido, VíctorMireles, Pedro Miramontes The self-organized chaos game representation for genomic signatures analysis. Learning and Nonlinear Models Revista da SociedadeBrasileira de RedesNeurais (SBRN) 6: 111-120 (2008).

Joseph J, Sasikumar R Chaos game representation for comparison of whole genomes. BMC Bioinformatics 7:243 (2006).

Bai-linHao, H C Lee, Shu-yu Zhang Fractals related to long DNA sequences and complete genomes. Chaos, Solitons& Fractals 11: 825-836 (2000).

Vrinda V Nair, KarthikaVijayan, Deepa P Gopinath, Achuthsankar S Nair ANN Based Classification of Unknown Genome Fragments Using Chaos Game Representation. Proc of the Second International Conference on Machine Learning and Computing IEEE Computer Society Washington DC USA 81-85 (2010).

Grumbach S, Tahi F A new challenge for compression algorithms: Genetic sequences. J. Inform. Process. Manage. 30: 875–886 (1994).

E Rivals, J P Delahaye, M Dauchet, and O Delgrange A guaranteed compression scheme for repetitive DNA sequences. LIFL Lille I Univ Tech. Rep. IT–285 (1995).

Chen X, Kwong S, Li M A compression algorithm for DNA sequences. IEEE Engineering in Medicine and Biology.61–66 (2001).

Chen X, Li M, Ma B, Tromp J DNA Compress: fast and effective DNA sequence compression. Bioinformatics 18:1696–1698 (2002).

Matsumoto T, Sadakane K, Imai H Biological sequence compression algorithms. In Genome Informatics Workshop Universal Academy Press 43–52 (2000).

Chang C H DNAC: A Compression Algorithm for DNA Sequences by Nonoverlapping Approximate Repeats. Master Thesis. National Taiwan University, Graduate Institute of Information, Taipei, Taiwan (2004).

B Behzadi, F L Fessant DNA compression challenge revisited: A dynamic programming, approach. CPM 190–200 (2005).

Neva Cherniavsky, Richard Ladner Grammar-based Compression of DNA Sequences. 2004, UW CSE Technical Report 2007-05-02(2004) .

KorodiG,Tabus I, Rissanen J, Astola J DNA sequence compression - Based on the normalized maximum likelihood model. Inst. of Signal Process. Tampere Univ. of Technol. Signal Processing Magazine IEEE 24: 47 – 53 (2007).

Craig G. Nevill-Manning, Ian H. Witten Protein is Incompressible. DCC '99 Proceedings of the Conference on Data

Compression IEEE Computer Society Washington, DC, USA 257-266.

Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears (2007) A Simple Statistical Algorithm for Biological Sequence Compression. IEEE Data Compression Conference, Snowbird, UT 42-45 (1999).

AchuthsankarS.Nair, Arun K.S .It’s 60 years since”KPB WCY XZ” became more informative than “I LOVE YOU” IEEE Potentials (ISSN: 0278-6648) vol.29, Issue 6, pp.16-19,Nov-Dec (2010).