DPAAR: a Database of Perfect Amino Acid Repeat

Received Jul 22nd, 2014; Revised Nov 20th, 2014; Accepted Dec 24th, 2014

Repeats of amino acids in a protein sequence have clinical and functional importance. The Database of Perfect Amino Acid Repeat (DPAAR) is both a relational and a flat-file database, created through a comprehensive analysis of 542,782 protein sequences from the Swiss-Prot database (released on 19 March 2014) to study the association between repeated sequences and disease. It provides a search engine for rapid access to a particular repeated amino acid, a particular Swiss-Prot ID, or a particular length of repeated amino acids in a protein sequence. It also provides flat files of single, oligo, and tandem repeat information, giving complete information about the amino acid repeat of interest. It includes tables of repeated sequences and their associated diseases in humans.


INTRODUCTION
Recent findings indicate that repeats of single, oligo, and tandem amino acids in a protein sequence play a crucial role in various functional and evolutionary aspects, especially through their close association with diseases such as neurodegenerative disorders, cancer, and muscular dystrophy. An amino acid repeat can be a perfect repeat or a mismatch repeat, and repeats range from a few amino acids to long spans [1].
Repeats can be classified as homopeptide (or monopeptide) repeats, in which the same amino acid is repeated over a stretch, and heteropeptide repeats (including oligopeptide and periodic repeats), in which an amino acid is repeated in combination with other amino acids [2].
The occurrence of repeated amino acids in protein sequences is strikingly heterogeneous: glycine and glutamine repeats are very frequent, whereas tryptophan repeats are negligible across the whole Swiss-Prot database. Recent findings indicate that approximately 14% of all proteins contain internal repeats, that repeats occur more often in eukaryotic proteins than in prokaryotic proteins, and that glutamine repetition (30-40) has been reported in various neural diseases [3].
Several online databases describe amino acid repeats. Tandem Repeat in a Protein Sequence (TRIPS) is exclusively a flat-file database [4]. ProtRepeat is a relational database covering 141 organisms [5], whereas COPASAAR [6] covers 244 organisms and is exclusively for single amino acid repeats. RepSeq is a database exclusively for lower eukaryotic pathogens [7].
A common platform is therefore required that combines a search engine, flat-file contents, and tables of repeated sequences and their associated diseases. In this work, the search engine is designed for rapid access to the database, and the flat files are intended to reduce run-time complexity. DPAAR has no sequence length limitation and can find repeats in sequences ranging from a few amino acids to thousands of amino acids in length. The database is built with MySQL, PHP, and PERL. It groups repeats into four categories: Single Amino Acid Repeat (SAAR), Oligo Amino Acid Repeat (OAAR), Periodically Conserved Amino Acid Repeats (PCAA), and repeats with their associated diseases.

IJCB
To detect repeats across the whole SwissProt database, sequences were downloaded from http://www.uniprot.org/downloads. A PERL program was written that uses regular expressions and a sliding window method to find a particular repeat within a defined range of occurrences (e.g., "A" repeated between 5 and 10 times). Sequences for all proteins were fetched from SwissProt as an array, and each was input as a string of length "l". A minimum required repeat count (min) and an upper limit (max) were selected, and the repeats were obtained with a regular expression (REGEX). The same method was applied to both single amino acid and oligo amino acid repeats. To detect repeats, amino acids in different permutations and combinations were selected and searched.
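The regular-expression step described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' PERL program; the function name `find_repeats` and its parameters are hypothetical, and the choice to match each run in full and then discard runs longer than `max_count` is an assumption, since the text does not say how runs above the upper limit are handled.

```python
import re

def find_repeats(sequence, unit, min_count=5, max_count=None):
    """Find perfect tandem runs of `unit` (one residue or an oligopeptide).

    Returns a list of (start_index, copies, matched_text) tuples.
    Assumption: a run is matched greedily in full, then dropped if it
    has more copies than max_count.
    """
    # (?:unit){min,} matches min_count or more back-to-back copies of the unit.
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_count))
    hits = []
    for match in pattern.finditer(sequence):
        copies = len(match.group(0)) // len(unit)
        if max_count is None or copies <= max_count:
            hits.append((match.start(), copies, match.group(0)))
    return hits
```

For example, `find_repeats("MK" + "A" * 7 + "GL", "A", 5, 10)` reports one run of seven alanines starting at index 2, and passing a unit such as `"GS"` searches for a dipeptide repeat with the same code.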
For periodically conserved amino acids, the same algorithm as for SAAR and OAAR was used, except that a(i) is compared with a(i+2) up to "n", where "n" is the minimum number of repeats, set to 5 for this database. The output was stored in flat files and displayed in HTML pages. The data were also inserted into MySQL for user queries.
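The a(i) versus a(i+2) comparison can be illustrated with a short scan. This is a sketch under stated assumptions, not the published implementation: the function name `find_periodic_repeats`, the `step` parameter, and the rule for skipping positions that continue an earlier run are all hypothetical.

```python
def find_periodic_repeats(sequence, min_count=5, step=2):
    """Find residues conserved at every `step`-th position.

    Returns (start_index, residue, count, spanned_text) tuples, where
    count is the number of conserved positions (at least min_count).
    """
    hits = []
    for i, residue in enumerate(sequence):
        # Assumption: report each periodic run once, from its first position.
        if i >= step and sequence[i - step] == residue:
            continue
        count = 1
        j = i + step
        # Compare a(i) with a(i+2), a(i+4), ... while the residue is conserved.
        while j < len(sequence) and sequence[j] == residue:
            count += 1
            j += step
        if count >= min_count:
            end = j - step + 1  # one past the last conserved position
            hits.append((i, residue, count, sequence[i:end]))
    return hits
```

With the input `"MQAQLQGQPQK"`, glutamine occupies positions 1, 3, 5, 7 and 9, so the scan reports a single hit spanning `"QAQLQGQPQ"`. Note that a plain homopeptide run also satisfies this test, so a real pipeline would need to decide whether to exclude runs already reported as SAAR.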

Search Engine Design:
The database provides a user interface where the user can enter the SwissProt ID of a desired protein, enter a monopeptide or oligopeptide amino acid, or fetch protein information based on a required amino acid length, as depicted in Figure 1. The search engine displays all proteins matching the query, with the SwissProt ID, accession number, name of the amino acid, protein name, number of repeats, amino acid length, and repeated sequence, as depicted in Figure 2. This also supports comparative study of structure and sequence similarity between organisms that share the same amino acid repeat with the same length; in other words, conserved repeat sequences can be analyzed across different organisms.
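The three search options can be sketched as a single parameterized query. The example below uses Python's built-in `sqlite3` module as a stand-in for the MySQL and PHP stack named in the paper; the table name `repeats`, its column names, and the rows are hypothetical placeholders chosen to mirror the displayed fields, not actual DPAAR schema or data.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repeats ("
        "swissprot_id TEXT, accession TEXT, amino_acid TEXT, "
        "protein_name TEXT, repeat_count INTEGER, "
        "repeat_length INTEGER, repeat_sequence TEXT)"
    )
    rows = [  # illustrative entries only, not real Swiss-Prot records
        ("DEMO1_HUMAN", "X00001", "Q", "placeholder protein 1", 1, 7, "QQQQQQQ"),
        ("DEMO2_MOUSE", "X00002", "Q", "placeholder protein 2", 1, 7, "QQQQQQQ"),
        ("DEMO3_YEAST", "X00003", "GS", "placeholder protein 3", 1, 10, "GSGSGSGSGS"),
    ]
    conn.executemany("INSERT INTO repeats VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn

def search(conn, swissprot_id=None, amino_acid=None, repeat_length=None):
    """Filter by any combination of ID, repeated amino acid, and length."""
    clauses, params = [], []
    filters = (
        ("swissprot_id", swissprot_id),
        ("amino_acid", amino_acid),
        ("repeat_length", repeat_length),
    )
    for column, value in filters:
        if value is not None:
            clauses.append(column + " = ?")  # column names are fixed, values are bound
            params.append(value)
    sql = ("SELECT swissprot_id, amino_acid, repeat_length, repeat_sequence "
           "FROM repeats")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY swissprot_id"
    return conn.execute(sql, params).fetchall()
```

Calling `search(build_demo_db(), amino_acid="Q", repeat_length=7)` returns the two placeholder proteins that share the same repeat and length, which is the kind of cross-organism comparison the paragraph above describes.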

Flat File Design:
As mentioned above, the database was categorized as follows: (a) Single Amino Acid Repeat, which contains information for the 20 amino acids and is further divided into three groups: sequences containing 5-10 repeats, 10-20 repeats, and more than 20 repeats. (b) Oligo Amino Acid Repeat, classified into di-, tri-, tetra-, penta-, and hexapeptide repeats, as depicted in Figure 3. (c) Periodically conserved repeat, which contains information for all 20 amino acids periodically repeated at every consecutive position. (d) Repeat and its associated disease, which contains information on diseases [8] caused by mono-, di-, and tripeptide repeats, as depicted in Figure 4.
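The flat-file grouping for categories (a) and (b) can be expressed as a small classifier. This is an illustrative sketch: the function names and returned labels are hypothetical, and because the stated ranges 5-10 and 10-20 share their boundary values, the code assumes that exactly 10 copies fall in the first group and exactly 20 in the second.

```python
OLIGO_NAMES = {1: "single", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}

def length_bin(copies):
    """Map a copy count to one of the three single-repeat groups."""
    if copies < 5:
        return None  # below the minimum repeat count used by the database
    if copies <= 10:
        return "5-10"
    if copies <= 20:  # assumption: 11 to 20 copies
        return "10-20"
    return "more-than-20"

def flat_file_key(unit, copies):
    """Return (repeat_class, unit, length_group) for one detected repeat."""
    repeat_class = OLIGO_NAMES.get(len(unit))  # None for units longer than six
    # Only single amino acid repeats are subdivided by length in the text.
    group = length_bin(copies) if repeat_class == "single" else None
    return (repeat_class, unit, group)
```

For instance, `flat_file_key("Q", 36)` yields `("single", "Q", "more-than-20")`, while `flat_file_key("GS", 5)` yields `("di", "GS", None)`.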

Current database summary:
Out of the complete set of protein sequences, we detected amino acid repeats in 49,400 protein sequences. Glutamine, proline, and aspartic acid repeats are very frequent, whereas tryptophan repeats are very rare, as shown in Table 1 and Figure 4. A glutamine repeat longer than 35 has been observed to be alarming, causing many neural diseases. According to the present survey, mainly alanine, glutamine, and glycine are involved in disease-causing repeats.

Figure 1: Searching options of the database.