Weighted Alignment Free Dissimilarity Metric for Promoter Sequence Comparison

Received Feb 10th, 2015; Revised Apr 7th, 2015; Accepted Apr 10th, 2015

Comparative sequence analysis is a powerful tool in bioinformatics that infers the functionality of a sequence from its structural information. Among the non-coding regions of DNA, the comparison of promoter sequences has received a great deal of attention in medical science, as promoter regions play a crucial role in gene regulation. In this work we propose an alignment-free metric for the comparison of promoter sequences. We use the binary and decimal position specific motif matrices (PSMM) of the promoters, which were created for our experiments using the TFSEARCH tool. A simple weighted algorithm is used to compute the dissimilarity between the PSMMs of promoter sequences, thereby analyzing their underlying homology and functionality. The NCBI database was used to obtain promoter sequences of 500 nucleotides upstream of the transcription start site (TSS): for one experiment, those of the enzyme pyruvate kinase (PKLR) from the glycolysis pathway of different organisms, and for the other, those of all the enzymes of the human glycolysis pathway. The proposed dissimilarity metric is successful in bringing out differences on both datasets, and the results regarding similarities and differences in promoter sequences could be essential for a clear understanding of the transcription regulation process in different organisms. The results reveal some useful findings which can be extended to a broader investigation.


INTRODUCTION
Understanding the knowledge accumulated in huge data sets has gained importance in recent years. Extracting knowledge from such data is more challenging than the analysis of simple conventional data. Biological sequences are among the most analyzed massive data. Promoter sequences in the non-coding regions of genes are one such kind of data, and they are less explored than the coding regions. Promoters can prove to be one of the keys to understanding gene expression and regulation, which is mainly controlled by transcription factors (TFs), that is, proteins that bind to promoter regions at specific sites (TFBSs) and regulate the process of transcription initiation. The information that governs RNA synthesis by RNA polymerase is located in the promoter region, which lies 200-2000 nucleotides upstream of the Transcription Start Site (TSS) of a gene. The TFs interact with sequence-specific elements or motifs, which are 5-12 nucleotides in length. The motifs appear to be arranged in a specific configuration that confers on each gene an individualized spatial or temporal transcription program [1]. The lack of nucleotide sequence conservation between functionally related promoter regions makes it challenging to devise an efficient computational model for promoter sequence analysis. Thus, the promoter regions of genes with similar expression patterns may not show sequence similarity even though they may be regulated by a similar configuration of TFs [2].
A considerable amount of work has been carried out on aligning coding regions of DNA sequences to find homology between different species. A variety of computational methods are available for sequence alignment, namely Bayesian methods [3], dot matrix [4] [5], scoring matrices [4], dynamic programming [6], Hidden Markov Models [7] [8], neural networks [9], genetic algorithms [10], word-based techniques, etc. Plenty of alignment programs based on these methods exist, notably Clustal W, Clustal X, GRAIL, BLASTX, FASTA and MEME. Most of these methods use nucleotide comparison, which is not suited for promoter regions. A few tools that align promoter sequences are CONREAL, Monkey and AVID. Despite the recent progress in phylogenetic footprinting techniques, the lack of nucleotide conservation partially explains the limited success of the available computational models for promoter comparison [11].
In promoters, functionality is based on motif conservation and not nucleotide conservation. Hence, nucleotide-based comparison is not suited for promoters. There have been many successful attempts at similarity analysis and comparison of promoter sequences; most perform alignment and a few are alignment-free [6][12][13][14]. In this work, we propose an alignment-free metric for promoter sequence comparison.
In this work, we have extracted promoter sequences from one prokaryote and nine eukaryotes. We use the NCBI database to extract the entire possible promoter of the enzyme pyruvate kinase from the glycolysis pathway (dataset 1) and also the promoters of 10 enzymes from the human glycolysis pathway (dataset 2). Only a few reports exist on the comparison of promoter sequences of the enzymes of a metabolic pathway. The promoters obtained vary in length (500 bp - 35,000 bp). For analysis we have considered 500 bp from each promoter sequence, used the TFSEARCH tool to obtain the motif details, and then created the normalized position specific motif matrices (PSMM) from them. The tool extracted a total of 73 motifs in dataset 1 and 79 motifs in dataset 2, which results in matrices of size 73x10 for each promoter of dataset 1 and 79x10 for each promoter of dataset 2. The matrix records the presence (entry in the cell is 1 or more) or absence (entry in the cell is 0) of a motif in the positions 1-50, 51-100, ..., 451-500 of the 500 bp considered. We then apply the proposed alignment-free sequence comparison method to analyze these promoter sequences. The steps are described in detail in the next section.

MATERIALS AND METHODS
The overall schema of the method adopted is shown in Fig 1. The PSMM creation and the proposed dissimilarity measure are discussed in detail in the following subsections.

The PSMM Creation
The PSMMs of the promoter sequences of both datasets are obtained by the following steps. First, we obtain from the NCBI the nucleotides in the region upstream of the start of the PKLR gene, up to the end (for forward strands) or the start (for backward strands) of the gene preceding it. This constitutes the entire promoter of that particular gene. Then, we locate the first 'TATA box' upstream of the initiator codon (ATG) of the gene and take the 500 bp upstream of that TATA box for our experiments. These sequences are then submitted to the 'TFSEARCH' [15] tool with the tool's default threshold of 85% to obtain the details of all the motifs with their positions and scores. This process is carried out for all the organisms listed in Table 1 and the enzymes listed in Table 2. The PSMM is generated by considering all the unique motifs identified by the TFSEARCH tool that are longer than 5 bp. The matrix has 10 columns representing positions 1-50, 51-100, ..., 451-500 of the 500 bp considered. The number of rows depends on the number of unique motifs present in that particular promoter. We build an initial matrix in which the multiple occurrences of a motif are counted. For our first experiment we use its binary form, indicating just the presence or absence of a motif in a position, and for our second experiment we use the matrix with the actual number of motif occurrences. If a motif is spread over two different ranges, we put '1' in each cell where at least 40% of the motif is present. Every promoter has a different number of motifs, resulting in a varying number of rows. Hence, before we submit the matrices to our algorithm, we pre-process them by normalizing the number of rows, simply taking the union of all the motifs from the different promoters. The input matrix of the organism Homo sapiens is shown in Table 3.
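The binning and normalization steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `(name, start, end)` hit format, the function names, and the use of 1-based positions within the 500 bp window are hypothetical, and real TFSEARCH output would first have to be parsed into this form.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit format: (motif_name, start, end), 1-based inclusive
# positions inside the 500 bp window.
Hit = Tuple[str, int, int]

N_BINS = 10          # columns 1-50, 51-100, ..., 451-500
BIN_WIDTH = 50
MIN_MOTIF_LEN = 6    # "longer than 5 bp"
MIN_FRACTION = 0.4   # a cell is marked if it holds at least 40% of the motif


def bins_for_hit(start: int, end: int) -> List[int]:
    """Return the 0-based column indices that hold >= 40% of the motif."""
    length = end - start + 1
    chosen = []
    for b in range(N_BINS):
        lo, hi = b * BIN_WIDTH + 1, (b + 1) * BIN_WIDTH
        overlap = min(end, hi) - max(start, lo) + 1
        if overlap > 0 and overlap >= MIN_FRACTION * length:
            chosen.append(b)
    return chosen


def count_rows(hits: Iterable[Hit]) -> Dict[str, List[int]]:
    """Build one row of occurrence counts per unique motif of a promoter."""
    rows: Dict[str, List[int]] = {}
    for name, start, end in hits:
        if end - start + 1 < MIN_MOTIF_LEN:
            continue  # skip motifs of 5 bp or less
        row = rows.setdefault(name, [0] * N_BINS)
        for b in bins_for_hit(start, end):
            row[b] += 1
    return rows


def build_psmms(promoters: List[Iterable[Hit]], binary: bool = True):
    """Normalize row counts by taking the union of motifs over all promoters."""
    per_promoter = [count_rows(hits) for hits in promoters]
    motifs = sorted(set().union(*per_promoter))
    matrices = []
    for rows in per_promoter:
        matrix = []
        for motif in motifs:
            counts = rows.get(motif, [0] * N_BINS)  # absent motif: all zeros
            matrix.append([int(c > 0) for c in counts] if binary else list(counts))
        matrices.append(matrix)
    return motifs, matrices
```

Calling `build_psmms(promoters, binary=True)` gives presence/absence matrices of the kind used in the first experiment, while `binary=False` keeps the occurrence counts used in the second. Every returned matrix has the same row order, so matrices of different promoters can be compared cell by cell.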

Triangle inequality: $D(A, B) + D(B, C) \geq D(A, C)$

Metric: Weighted column sum
Let A and B be the PSMMs of the two promoters under comparison. We perform the logical bitwise XOR operation on these matrices, as in equation 1. This operation results in a matrix 'X' containing 1 in a bit position if the corresponding bits of matrices A and B differ, and 0 if they are the same.

$X = A \oplus B$ (1)
The column sum 'CS' of each column 'j' in the matrix 'X' is computed using equation 2, and the weighted column sum 'WCS' is then calculated by multiplying each CS by numerical weights in increasing order starting with 1, as described in equation 3. Finally, the average weighted column sum 'AWCS' is calculated by dividing WCS by the size of matrix 'X'. The AWCS value constitutes the dissimilarity score between the promoters A and B.
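The column sum and weighted column sum described above can be written out explicitly. The following is a reconstruction from the prose, assuming the weight of column j is simply j (weights 1, 2, ..., c); the row index i and the symbol $w_j$ are notation introduced here for illustration.

```latex
CS(j) = \sum_{i=1}^{r} X(i, j), \qquad j = 1, \dots, c \tag{2}

WCS = \sum_{j=1}^{c} w_j \, CS(j), \qquad w_j = j \tag{3}
```

Here r and c denote the number of rows and columns of X, so that CS(j) counts the motifs whose presence differs between the two promoters in position j.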
$AWCS = WCS / (r \times c)$

where 'r' is the number of rows and 'c' is the number of columns in matrix 'X'. CS(j) measures the dissimilarity between the promoters due to the various motifs in position 'j'. Position 'j' is assigned a weight inversely proportional to its distance from the TATA box. WCS denotes the total weighted dissimilarity of the two promoters and AWCS gives the normalized dissimilarity.
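The steps above can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, the weights are assumed to be 1, 2, ..., c from the first to the last column, and any non-zero entry is treated as "present" before the XOR, which matches the binary matrices of the first experiment only.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[int]]


def xor_matrix(a: Matrix, b: Matrix) -> List[List[int]]:
    """X = A XOR B: 1 where presence/absence differs, else 0."""
    if not a or len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must be non-empty and of identical shape")
    return [[int(bool(x) != bool(y)) for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def column_sums(x: Matrix) -> List[int]:
    """CS(j): number of differing motifs in column j."""
    return [sum(col) for col in zip(*x)]


def awcs(a: Matrix, b: Matrix) -> float:
    """Average weighted column sum, assuming weights 1..c (an assumption)."""
    x = xor_matrix(a, b)
    cs = column_sums(x)
    wcs = sum(j * s for j, s in enumerate(cs, start=1))
    rows, cols = len(x), len(x[0])
    return wcs / (rows * cols)


def pairwise_dissimilarity(matrices: Sequence[Matrix]) -> List[List[float]]:
    """Dissimilarity matrix over all promoter pairs."""
    n = len(matrices)
    return [[awcs(matrices[i], matrices[j]) for j in range(n)] for i in range(n)]
```

As a small worked case, take A = [[1, 0, 0], [0, 1, 0]] and B = [[1, 1, 0], [0, 0, 1]]. Then X = [[0, 1, 0], [0, 1, 1]], the column sums are 0, 2 and 1, WCS = 0·1 + 2·2 + 1·3 = 7, and AWCS = 7/6. Comparing a matrix with itself gives 0, and the score is symmetric in A and B because XOR is symmetric.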

Experiment 1 (binary matrices ignoring multiple occurrences of a motif)
The results show lower dissimilarity for organisms belonging to the same family. This can be seen in Fig 3.b for organisms 1, 2 and 6, which belong to the family 'Hominidae'. Dissimilarity is considerably higher for organisms of different families, and it is considerably high between a prokaryote and a eukaryote in dataset 1. The results on dataset 2 show that the dissimilarity is quite uniform, suggesting some similarity between the enzymes of the glycolysis pathway, as plotted in Fig 4.

Experiment 2 (decimal matrices counting multiple occurrences of a motif)
In this case the dissimilarity values were scaled up because the multiple occurrences of a motif were taken into account. In dataset 2, we observe significant similarity between the promoter sequences of all enzymes involved in the glycolysis pathway, as plotted in Fig 6. All the enzymes had HSF, CdxA, ADR1 and cap in common in their promoters. This observation also suggests that the expression of the enzymes in a single metabolic pathway is likely to be regulated by a similar pattern of TFBSs in the promoter sequence. However, this similarity analysis should be carried out and compared for different metabolic pathways, which would give an insight into the specific or ubiquitous nature of the regulatory motifs in promoter sequences.

CONCLUSION
This paper demonstrates the use of the proposed weighted column sum dissimilarity metric for promoter sequence similarity analysis. A demarcation between prokaryotes and eukaryotes has been observed. These differences in prokaryotic and eukaryotic promoter sequence motifs suggest distinct patterns of transcription regulation, and this could be helpful in designing pathogen-specific drugs. Moreover, the information regarding the similarities and differences in promoter sequences could be essential for a clear understanding of the transcription regulation process in different organisms. These results might also lead to some inferences on the evolution of transcription from prokaryotes to eukaryotes.
We also notice that the promoter regions of genes with dissimilar motif patterns and arrangements may show sequence similarity even though they may not be regulated by a similar motif configuration. All these results come from the analysis of normalized promoter PSMMs. The experiments can also be extended to PSMMs of varied size and the results compared. Perhaps a combination of non-coding region (promoter) and coding region similarity scores can give a better understanding of the underlying homology between these genes.

Figure 1. Architecture of the proposed model

A dissimilarity measure D is considered a metric if it satisfies the following relations for all sequences A, B and C.

Figure 4.a. Dissimilarity score of the HK1 promoter plotted against the promoters of the remaining 9 enzymes of glycolysis (dataset 2).
Figure 4.b. Dissimilarity of HK1 with the remaining 9 enzymes.
Figure 5.a. Dissimilarity of the human promoter with the remaining 9 organisms (dataset 1).
Figure 5.b. Dissimilarity of organisms 1, 2 and 6 with the remaining 7 organisms.

Table 4 and Table 5 are the dissimilarity matrices of dataset 1 and dataset 2, respectively.

Table 6. The dissimilarity matrix of experiment 2 on dataset 1.

Table 7. The dissimilarity matrix of experiment 2 on dataset 2.