DNA Microarray Data Analysis: A New Survey on Biclustering

Received Oct 22nd, 2014; Revised Nov 20th, 2014; Accepted Dec 24th, 2014

Some subsets of genes behave similarly under certain subsets of conditions, in which case we say that they coexpress, yet behave independently under other subsets of conditions. Discovering such coexpressions can help uncover genomic knowledge such as gene networks or gene interactions. It is therefore of utmost importance to cluster genes and conditions simultaneously, in order to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate this problem by finding suboptimal solutions. In this paper, we present a new survey on biclustering of gene expression data, also called microarray data.

Keywords: Biclustering, Heuristic algorithms, Microarray data, Genomic knowledge


INTRODUCTION
A DNA microarray is a glass slide covered with a chemical product and DNA samples containing thousands of genes. By placing this glass slide under a scanner, we obtain an image in which colored dots represent the expression level of genes under experimental conditions [1]. This process is summarized in Figure 1. As shown in Figure 2, the obtained colored image can be coded by a matrix M, called gene expression data or microarray data, where the i-th row represents the i-th gene, the j-th column represents the j-th condition, and the cell $m_{ij}$ represents the expression level of the i-th gene under the j-th condition. Simultaneous clustering of the rows (genes) and columns (conditions) of this matrix makes it possible to identify subsets of genes that behave similarly under subsets of conditions, in which case we say that they coexpress, but behave independently under other subsets of conditions. This type of clustering is called biclustering. Biclustering of microarray data can help discover coexpression of genes and, hence, uncover genomic knowledge such as gene networks or gene interactions. Biclustering is an NP-hard problem [3]. Consequently, heuristic algorithms are typically used to approximate this problem by finding suboptimal solutions. In this paper, we present a new survey on biclustering of microarray data.
The rest of the paper is organized as follows: First, we introduce some definitions related to biclustering of microarray data. Then, we present in section 3 some evaluation functions and biclustering algorithms. Next, we show how to validate biclusters via biclustering tools on microarray datasets. Finally, we present our conclusion.
Figure 1. Generation, from a DNA microarray, of an image where colored dots represent the expression level of genes under experimental conditions [2]
Figure 2. Coding of the generated colored image into microarray data

BICLUSTERING OF MICROARRAY DATA
Let us introduce some definitions related to biclustering of microarray data [3]. We also note that a natural way to visualize a group of biclusters consists in assigning a different color to each bicluster and reordering the rows and the columns of the data matrix so that we obtain a data matrix with colored blocks, where each block represents a bicluster. The biclustering problem can be formulated as follows: given a data matrix M, construct a group of biclusters $B_{opt}$ associated with M such that:
$$f(B_{opt}) = \max_{B \in BC(M)} f(B)$$
where f is an objective function measuring the quality, i.e., degree of coherence, of a group of biclusters and BC(M) is the set of all the possible groups of biclusters associated with M. This problem is NP-hard [4,5].
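The block visualization described above can be illustrated with a minimal Python sketch. It assumes that a bicluster is represented simply as a list of row indices and a list of column indices of M; the function names `submatrix` and `reorder_for_display` are hypothetical and chosen only for this illustration.

```python
def submatrix(M, rows, cols):
    """Return the bicluster (rows, cols) of M as a list of lists."""
    return [[M[i][j] for j in cols] for i in rows]


def reorder_for_display(M, rows, cols):
    """Move the rows and columns of one bicluster to the top-left corner of M."""
    row_set, col_set = set(rows), set(cols)
    row_order = list(rows) + [i for i in range(len(M)) if i not in row_set]
    col_order = list(cols) + [j for j in range(len(M[0])) if j not in col_set]
    return [[M[i][j] for j in col_order] for i in row_order]


# Toy data matrix: genes 0 and 2 behave alike under conditions 1 and 3.
M = [
    [1, 9, 2, 9],
    [5, 0, 5, 0],
    [3, 9, 4, 9],
]
block = submatrix(M, rows=[0, 2], cols=[1, 3])
for row in reorder_for_display(M, rows=[0, 2], cols=[1, 3]):
    print(row)
```

After reordering, the bicluster appears as a contiguous block in the top-left corner, which is what makes coloring it meaningful.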

EVALUATION FUNCTIONS
An evaluation function is an indicator of the performance of a biclustering algorithm. There are two main classes of evaluation functions: intra-biclusters evaluation functions and inter-biclusters evaluation functions.

Intra-biclusters evaluation functions
An intra-biclusters evaluation function is a function that measures the quality of a bicluster, i.e., it quantifies the coherence degree of a bicluster. There are several intra-biclusters evaluation functions.
• The AVerage Similarity Score (AVSS) is defined as follows [6]:
$$E_{AVSS}(I',J') = \frac{\sum_{i \in I'} \sum_{j \in J'} s_{ij}}{|I'|\,|J'|}$$
where (I',J') is a bicluster and $s_{ij}$ is a similarity measure between the elements of row i and column j and the other elements belonging to I' and J'. It follows that a number of the functions below are particular cases of the AVSS.
• The Average Row Variance (ARV) is defined as follows [7]:
$$E_{ARV}(I',J') = \frac{1}{|I'|\,|J'|} \sum_{i \in I'} \sum_{j \in J'} \left(m_{ij} - m_{iJ'}\right)^2$$
where $m_{iJ'}$ is the average over row i. It follows that biclusters containing rows with large changes in their values across columns are characterized by a large row variance. The ARV guarantees that a bicluster captures rows exhibiting coherent trends under some subset of columns.
• The Mean Squared Residue (MSR) is defined as follows [8]:
$$E_{MSR}(I',J') = \frac{1}{|I'|\,|J'|} \sum_{i \in I'} \sum_{j \in J'} \left(m_{ij} - m_{iJ'} - m_{I'j} + m_{I'J'}\right)^2$$
where $m_{I'J'}$ is the average over the whole bicluster, $m_{I'j}$ is the average over column j and $m_{iJ'}$ is the average over row i. $E_{MSR}$ represents the variation associated with the interaction between the rows and the columns in the bicluster. It follows that a low (resp. high) $E_{MSR}$ value, i.e., close to 0 (resp. higher than a fixed threshold $\delta$), indicates that the bicluster is strongly (resp. weakly) coherent. The $E_{MSR}$ function is inadequate for assessing certain types of biclusters. For example, it is well suited to biclusters of coherent values under an additive model but not to coherent values under a multiplicative model.
• The Volume (V) is defined as follows [7]:
$$E_V(I',J') = |I'|\,|J'|$$
This function yields the maximum-sized bicluster that does not exceed a certain coherence value [9] expressed as an MSR score.
• The Mean Square Error (MSE) is defined in [10] in terms of $m_{IJ}$, the average over the whole matrix, $m_{Ij}$, the average over column j of the whole matrix, and $m_{iJ'}$, the average over row i. This function identifies constant biclusters.
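• The Average Correlation Value (ACV) is defined as follows [5,11]:
$$E_{ACV}(I',J') = \max\left\{ \frac{\sum_{i \in I'} \sum_{j \in I'} |r_{ij}| - |I'|}{|I'|^2 - |I'|},\; \frac{\sum_{k \in J'} \sum_{l \in J'} |r_{kl}| - |J'|}{|J'|^2 - |J'|} \right\}$$
where $r_{ij}$ (resp. $r_{kl}$) is the Pearson's correlation coefficient associated with the row indices i and j (resp. the column indices k and l) in the bicluster (I',J') [8]. The values of $E_{ACV}$ belong to [0,1]; hence, a high (resp. low) $E_{ACV}$ value, i.e., close to 1 (resp. close to 0), indicates that the bicluster is strongly (resp. weakly) coherent. However, the performance of the $E_{ACV}$ function decreases when noise exists in the data matrix [5,11].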
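• The Average Spearman's Rho (ASR) is defined as follows [2]:
$$E_{ASR}(I',J') = 2 \max\left\{ \frac{\sum_{i \in I'} \sum_{j \in I',\, j \geq i+1} \rho_{ij}}{|I'|\,(|I'|-1)},\; \frac{\sum_{k \in J'} \sum_{l \in J',\, l \geq k+1} \rho_{kl}}{|J'|\,(|J'|-1)} \right\}$$
where $\rho_{ij}$ (resp. $\rho_{kl}$) is the Spearman's rank correlation associated with the row indices i and j (resp. the column indices k and l) in the bicluster (I',J') [12]. Like Spearman's rank correlation itself, the values of the $E_{ASR}$ function belong to [-1,1]; hence, a high (resp. low) $E_{ASR}$ value, i.e., close to 1 (resp. close to -1), indicates that the bicluster is strongly (resp. weakly) coherent. Moreover, like Spearman's rank correlation, the $E_{ASR}$ is less sensitive to the presence of noise in data [2].
There are other intra-biclusters evaluation functions, such as the Average Correspondence Similarity Index (ACSI) [2].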
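To make the difference between the additive and multiplicative models concrete, here is a minimal Python sketch of three of the intra-biclusters functions above. It assumes the usual normalization by $|I'|\,|J'|$ for the MSR and the ARV, and represents a bicluster by lists of row and column indices; the function names are hypothetical and the code is an illustration, not a reference implementation.

```python
def submatrix(M, rows, cols):
    """Extract the bicluster (rows, cols) from the data matrix M."""
    return [[M[i][j] for j in cols] for i in rows]


def mean_squared_residue(M, rows, cols):
    """MSR: mean of (m_ij - row mean - column mean + bicluster mean)^2."""
    B = submatrix(M, rows, cols)
    n, m = len(B), len(B[0])
    row_mean = [sum(row) / m for row in B]
    col_mean = [sum(B[i][j] for i in range(n)) / n for j in range(m)]
    all_mean = sum(row_mean) / n
    return sum(
        (B[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in range(n) for j in range(m)
    ) / (n * m)


def average_row_variance(M, rows, cols):
    """ARV: mean of (m_ij - row mean)^2 over the bicluster."""
    B = submatrix(M, rows, cols)
    n, m = len(B), len(B[0])
    row_mean = [sum(row) / m for row in B]
    return sum(
        (B[i][j] - row_mean[i]) ** 2 for i in range(n) for j in range(m)
    ) / (n * m)


def volume(rows, cols):
    """Volume: number of cells in the bicluster."""
    return len(rows) * len(cols)


# Rows differ by an added constant (additive model).
additive = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
# Rows differ by a multiplied constant (multiplicative model).
multiplicative = [[1, 2, 4], [2, 4, 8], [3, 6, 12]]
rows, cols = [0, 1, 2], [0, 1, 2]

print(mean_squared_residue(additive, rows, cols))        # numerically zero
print(mean_squared_residue(multiplicative, rows, cols))  # clearly positive
print(average_row_variance(additive, rows, cols))
```

The additive bicluster has a residue of zero up to rounding, while the perfectly multiplicative one does not, which is the limitation of the MSR noted above.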

Inter-biclusters evaluation functions
An inter-biclusters evaluation function is a function that measures the quality of a group of biclusters, i.e., it assesses the accuracy of an algorithm in recovering true implanted biclusters in a data matrix. There are several inter-biclusters evaluation functions. In what follows, we present some of them. Let $M_1$ and $M_2$ be two groups of biclusters.
• The Prelic index is built on a score $S_{Prelic}$ that is based on the Jaccard index for two sets. This index compares two solutions based on the categorization of genes. However, it compares only gene sets.
• The Liu and Wang (LW) index compares two solutions by considering both genes and conditions.
• The wtjaccard index and the Dice index. The latter is proposed in [13] and called the F-measure in the biclustering case; it computes the overall relevance of two bicluster solutions.
• The Santamaría index is the most conservative of the above indices and is used for the biclustering case [14,13]. Indeed, while the Prelic index compares only object sets and the LW index compares object sets and feature sets independently, the Santamaría index compares two solutions using pairs of genes and conditions.
For the gene expression case, the Gene Match Score (GMS) function, $E_{GMS}$, does not take column match into account. Thereafter, more similar measures of match scores have been introduced [5,15,6]. For instance, the evaluation functions herein called Row and Column Match Scores, $E_{RCMS1}$ and $E_{RCMS2}$, are proposed in [6] and [15], respectively. All these measures of match score are used to assess the accuracy of an algorithm in recovering known biclusters and revealing true ones. Both $E_{RCMS1}$ and $E_{RCMS2}$ have the advantage of reflecting, simultaneously, the match of the row and column dimensions between biclusters, as opposed to $E_{GMS}$, which does not take column match into account. They vary between 0 and 1 (the higher, the better the accuracy). Let $B_{opt}$ denote the set of true implanted biclusters in the data matrix M and B the set of the output biclusters of a biclustering algorithm. Thus, $E_{GMS}(B_{opt},B)$ and $E_{RCMS1}(B_{opt},B)$ express how well each of the true biclusters is detected by the algorithm under consideration. $E_{RCMS2}(B_X,B_Y)$, where $B_X$ (resp. $B_Y$) denotes the set of biclusters detected by algorithm X (resp. algorithm Y), has the particularity of quantifying how well each bicluster identified by algorithm X is contained in some bicluster detected by algorithm Y.
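The difference between a score that ignores columns and one that uses both dimensions can be seen on a toy example. The Python sketch below assumes commonly used Jaccard-based forms: for each bicluster of the first group, take the best match in the second group, then average. The exact normalizations used in [6] and [15] may differ, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def gene_match_score(B1, B2):
    """Average, over biclusters of B1, of the best row-set Jaccard match in B2."""
    return sum(max(jaccard(I1, I2) for I2, _ in B2) for I1, _ in B1) / len(B1)


def row_column_match_score(B1, B2):
    """Same averaging scheme, but rows and columns both enter the similarity."""
    def similarity(b1, b2):
        (I1, J1), (I2, J2) = b1, b2
        shared = len(set(I1) & set(I2)) + len(set(J1) & set(J2))
        total = len(set(I1) | set(I2)) + len(set(J1) | set(J2))
        return shared / total

    return sum(max(similarity(b1, b2) for b2 in B2) for b1 in B1) / len(B1)


# One implanted bicluster and one detected bicluster: same genes, disjoint conditions.
implanted = [({0, 1, 2}, {0, 1})]
detected = [({0, 1, 2}, {2, 3})]

print(gene_match_score(implanted, detected))
print(row_column_match_score(implanted, detected))
```

Here the gene-only score is perfect even though the detected conditions are entirely wrong, whereas the row-and-column variant penalizes the mismatch.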

BICLUSTERING ALGORITHMS
As mentioned earlier, the biclustering problem is NP-hard [3,10]. Consequently, heuristic algorithms are typically used to approximate the problem by finding suboptimal solutions. We distinguish different approaches adopted by biclustering algorithms [3].

Iterative Row and Column Clustering Combination Approach
By adopting the Iterative Row and Column Clustering Combination (IRCCC) approach, we apply clustering algorithms to the rows and to the columns separately and then combine the results to obtain biclusters [56]. Table 5 is a synoptic table of biclustering algorithms adopting the IRCCC approach. This is conceptually the simplest way to perform biclustering, since it relies on existing clustering algorithms rather than requiring novel ones. However, this approach inherits approximately the same advantages and drawbacks as the clustering algorithms used. Among the algorithms adopting this approach we mention Croki2 [58], Crobin [58], DCC [59], ITWC [61], CTWC [54] and Bi-SOM [60].
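A minimal Python sketch of the combination step, under strong simplifying assumptions: the helper `threshold_clusters` is a hypothetical stand-in for whatever clustering algorithm is applied to the rows and to the columns (it groups vectors linked by a chain of Euclidean distances no larger than `radius`), and every pair (row cluster, column cluster) is returned as a candidate bicluster. It does not reproduce any of the algorithms cited above.

```python
from itertools import product
from math import dist


def threshold_clusters(vectors, radius):
    """Connected components of the graph linking vectors at distance <= radius."""
    unassigned = set(range(len(vectors)))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        component, frontier = [seed], [seed]
        while frontier:
            u = frontier.pop()
            near = [v for v in unassigned if dist(vectors[u], vectors[v]) <= radius]
            unassigned.difference_update(near)
            component.extend(near)
            frontier.extend(near)
        clusters.append(sorted(component))
    return clusters


def irccc_biclusters(M, radius):
    """Cluster rows and columns separately, then pair up the resulting clusters."""
    row_clusters = threshold_clusters([list(r) for r in M], radius)
    col_clusters = threshold_clusters([list(c) for c in zip(*M)], radius)
    return list(product(row_clusters, col_clusters))


M = [
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [0, 0, 5, 5],
]
for rows, cols in irccc_biclusters(M, radius=1.0):
    print(rows, cols)
```

Because rows and columns are clustered over the whole matrix, the quality of the candidate biclusters depends entirely on the underlying clustering step, which is the drawback mentioned above.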

Greedy Iterative Search Approach
By adopting the Greedy Iterative Search (GIS) approach, we first construct submatrices of the data matrix by adding/removing a row/column to/from the current submatrix so as to optimize a certain function. Then, we reiterate this process until no other row/column can be added/removed to/from any submatrix. This approach presents the same advantage and drawback as DC: algorithms adopting it may make wrong decisions and lose good biclusters, but they have the potential to be very fast. Among the algorithms adopting this approach we mention Spectral [16], Quest [17], RandomWalkBiclustering [18], BicFinder [19], MSB [6], ISA [17,20], OPSM [21] and SAMBA [17,22].
Here, n and m are respectively the numbers of genes and conditions in the data matrix, l is the number of the best partial models of order, K is the maximum number of iterations, $C_u$ is the cost of computing the new residue and the new row variance of the bicluster after performing a move, and $p_r$ is a user-provided probability that the algorithm is allowed to execute a random move.
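The greedy loop can be sketched in Python as follows. This is an illustrative assumption rather than any of the algorithms listed above: it starts from the whole matrix and repeatedly deletes the single row or column whose removal gives the lowest mean squared residue, until the residue falls below a threshold `delta` (a hypothetical parameter name).

```python
def msr(M, rows, cols):
    """Mean squared residue of the submatrix of M given by rows and cols."""
    n, m = len(rows), len(cols)
    row_mean = {i: sum(M[i][j] for j in cols) / m for i in rows}
    col_mean = {j: sum(M[i][j] for i in rows) / n for j in cols}
    all_mean = sum(row_mean.values()) / n
    return sum(
        (M[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows for j in cols
    ) / (n * m)


def greedy_bicluster(M, delta):
    """Greedily delete one row or column at a time until msr <= delta."""
    rows = list(range(len(M)))
    cols = list(range(len(M[0])))
    # Stop as soon as either dimension is down to 2.
    while msr(M, rows, cols) > delta and len(rows) > 2 and len(cols) > 2:
        candidates = [([r for r in rows if r != i], cols) for i in rows]
        candidates += [(rows, [c for c in cols if c != j]) for j in cols]
        # Locally best move: the deletion that leaves the lowest residue.
        rows, cols = min(candidates, key=lambda rc: msr(M, rc[0], rc[1]))
    return rows, cols


# Rows 0-2 follow an additive pattern; row 3 is noise.
M = [
    [1, 2, 3],
    [2, 3, 4],
    [4, 5, 6],
    [9, 0, 7],
]
print(greedy_bicluster(M, delta=1e-9))
```

Each step only looks one move ahead, which keeps the search fast but also explains why a greedy search can discard rows or columns that belonged to a good bicluster.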

Exhaustive Bicluster Enumeration Approach
By adopting the Exhaustive Bicluster Enumeration (EBE) approach, we identify all the possible groups of biclusters in order to keep the best one, i.e., the one that optimizes a certain evaluation function. The advantage of this approach is that it is able to obtain the best solutions. Its drawback is that it is costly in computing time and memory space. Among the algorithms adopting this approach we mention BSGP [28,29], OPC [30,6], CPB [30], IT [31], e-Bmotif [29], BIMODULE [32], RAP [26], BBK [33] and MSB [6].
Here, r is the maximum weight edge in the bipartite graph G.
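The cost of exhaustive enumeration is easy to see on a toy matrix. The Python sketch below is an illustrative assumption, not one of the algorithms listed above: it enumerates every submatrix with at least two rows and two columns and keeps the largest one whose mean squared residue does not exceed a threshold `delta` (a hypothetical parameter name).

```python
from itertools import combinations


def msr(M, rows, cols):
    """Mean squared residue of the submatrix of M given by rows and cols."""
    n, m = len(rows), len(cols)
    row_mean = {i: sum(M[i][j] for j in cols) / m for i in rows}
    col_mean = {j: sum(M[i][j] for i in rows) / n for j in cols}
    all_mean = sum(row_mean.values()) / n
    return sum(
        (M[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows for j in cols
    ) / (n * m)


def subsets(n, min_size=2):
    """All index subsets of range(n) with at least min_size elements."""
    for k in range(min_size, n + 1):
        yield from combinations(range(n), k)


def exhaustive_best_bicluster(M, delta):
    """Largest-volume submatrix with msr <= delta, plus the number examined."""
    best, best_volume, examined = None, 0, 0
    for rows in subsets(len(M)):
        for cols in subsets(len(M[0])):
            examined += 1
            vol = len(rows) * len(cols)
            if vol > best_volume and msr(M, rows, cols) <= delta:
                best, best_volume = (rows, cols), vol
    return best, examined


M = [
    [1, 2, 3],
    [2, 3, 4],
    [4, 5, 6],
    [9, 0, 7],
]
best, examined = exhaustive_best_bicluster(M, delta=1e-9)
print(best, examined)
```

Even for this 4 x 3 matrix, 44 candidate submatrices are examined; the count grows exponentially with the numbers of rows and columns, which is why such methods need restrictions on bicluster size in practice.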

Distribution Parameter Identification Approach
By adopting the Distribution Parameter Identification (DPI) approach, we use a statistical model to identify the distribution parameters and generate the data by iteratively minimizing a certain criterion. These algorithms certainly find the best biclusters, if they exist, but have a very serious drawback: due to their high complexity, they can only be executed by assuming restrictions on the size of the biclusters. Among the algorithms adopting this approach we mention QUBIC [38], PRMs [39], FABIA [40], BEM [41] and BCEM [42].

Divide and Conquer Approach
By adopting the Divide-and-Conquer (DC) approach, we start with a bicluster representing the whole data matrix, then we partition this matrix into two submatrices to obtain two biclusters. Next, we recursively reiterate this process until we obtain a certain number of biclusters verifying a specific set of properties. The advantage of DC is that it is fast; its drawback is that it may ignore good biclusters by partitioning them before identifying them.

Biological validation
Biological validation can qualitatively evaluate the capacity of an algorithm to extract biclusters that are meaningful from a biological point of view. To assess biclusters biologically, we can use Gene Ontology (GO) annotation [64]. In GO, genes are assigned to three structured, controlled vocabularies, called ontologies: biological process, cellular component and molecular function. The GO Consortium (GOC) [64,65] is involved in the development and application of the GO. In what follows, we briefly report some R tools related to GOC [66,67]:
• AnnotationDbi: It provides user interface and database connection code for annotation data packages using SQLite data storage.
• FunCluster: It performs functional profiling and analysis of microarray expression data based on GO and KEGG.
• GExMap: It is an intuitive visual tool that performs GO and statistical tests to unveil genomic clusters, and produces graphical interpretations and statistical results in PDF files.
• GO.db annotation: It provides detailed information about the latest version of the GOs and it is updated biannually.
• GOsummaries: It shows GO enrichment results in the context of experimental data.
• GOstats: It determines which GOs found in gene lists are statistically over/under-represented.
• goTools: It compares the GOs represented by the genes in the three gene lists (biological process, molecular function and cellular component).
• topGO: It provides tools for testing GO terms while accounting for the topology of the GO graph. Different test statistics and different methods for eliminating local similarities and dependencies between GO terms can be implemented and applied.
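Over-representation of a GO term in the genes of a bicluster is commonly assessed with a hypergeometric test. The Python sketch below illustrates that computation under stated assumptions: the function name and its parameters are hypothetical, the numbers are made up for illustration, and the code does not reproduce the interface of any of the R packages listed above.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement from a
    population of `population` genes, `annotated` of which carry the GO term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Illustrative numbers: 20 genes on the array, 5 annotated with the term,
# a bicluster of 4 genes of which 3 carry the annotation.
p = enrichment_p_value(population=20, annotated=5, sample=4, hits=3)
print(p)
```

A small p-value suggests that the annotation is more frequent among the genes of the bicluster than expected by chance; in practice a correction for testing many GO terms is also needed.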

TOOLS
There are also many R tools for biclustering of microarray data. Table 6 presents a few examples of such tools; here are some of them [68]:
• arules: It mines association rules and frequent item sets. It provides the infrastructure for representing, manipulating and analyzing transaction data and patterns. It also provides interfaces to the association mining algorithms Apriori and Eclat [69].
• lattice: It is a high-level data visualization system with an emphasis on multivariate data. It supports the creation of trellis graphs to display multivariate relationships between variables, conditioned on one or more other variables, via R graphics [69].
• rootSolve: It finds the roots of nonlinear functions and performs steady-state and equilibrium analysis of ordinary differential equations, including uni/multi-component problems, either by dynamically running or by the Newton-Raphson algorithm. It also generates gradient and Jacobian matrices.
• BicARE: It is based on the FLOC algorithm [23] for biclustering analysis and results exploration.
• BcDiag: It provides methods for data pre-processing, visualization and statistical validation to diagnose and visualize two-dimensional data, based on two-way ANOVA [40] and median polish residual plots, for the output of the biclust, eisa/isa2 and fabia packages [17][40]. In addition, the biclust package can be used via biclustGUI, i.e., an R Commander plug-in.
• blockcluster: It performs coclustering of binary, contingency and categorical datasets, with utility functions to visualize the coclustered data. It contains a function cocluster which performs coclustering and returns an object of the appropriate class. It also contains a coclust strategy function which returns an object of class strategy.
• rqubic: It is an implementation of the QUBIC algorithm [38] for qualitative biclustering of gene expression data.
• iBBig: It is optimized for applying binary data analysis to meta-gene set analysis of gene expression datasets. It iteratively extracts groups of phenotypes from multiple studies that are associated with similar gene sets, without requiring prior knowledge of the number or scale of clusters, and allows the discovery of clusters with diverse sizes.
• NMF: It provides a framework to perform Non-negative Matrix Factorization (NMF).It implements a set of already published algorithms and seeding methods, and provides a framework to test, develop and plug new/custom algorithms.It performs parallel computations on multicore machines.
• s4vd: It performs biclustering via sparse singular value decomposition (SVD) with a nested stability selection. The result is a biclust object and thus all methods of the biclust package can be applied.
• superbiclust: It generates a number of robust (or super) biclusters with no or low overlap from a bicluster set, i.e., ensemble biclustering [42], with respect to the initialization parameters for a given bicluster solution. The set of robust biclusters is based on the similarity of its elements, i.e., overlap, and on the hierarchical tree obtained via cut-off points.

DATASETS
There are many microarray datasets, available through R packages, that are used to evaluate biclustering algorithms [68]. Table 7 presents a few examples of these datasets.

CONCLUSION
The biclustering of microarray data has been the subject of extensive research. None of the existing biclustering algorithms is perfect. The construction of biologically significant groups of biclusters for large microarray data is still a problem that requires continuous work. Biological validation of biclusters of microarray data is one of the most important open issues. So far, there are no general guidelines in the literature on how to validate extracted biclusters biologically. We believe that the view and literature on biclustering presented here will help academics and researchers to select an appropriate approach and to apply it to the analysis of biological data.

Figure 3. Types of groups of biclusters

Hence, discovered biclusters have a maximized $E_V(I',J')$ and an $E_{MSR}$ lower than a given threshold $\delta \geq 0$.

ISSN: 2278-8115, IJCB Vol. 4, No. 1, April 2015, 21-37, http://www.ijcb.in

where $B_1$ and $B_2$ are two groups of biclusters and the pair (I,J) represents the submatrix whose rows and columns are given by the sets I and J, respectively.

where f is the standard normal distribution function, $|1_B|$ is the number of 1's in the bicluster B, p = k/(|I|*|J|) is the proportion of 1's in M(I,J), and k is the number of 1's in the binary matrix $M_b$. A bicluster B is considered as potentially significant at a level of significance $\alpha$ if p-value$_B$ < $\alpha$.

• HSSVD: It discovers and compares subgroups of patients and genes which simultaneously display unusual levels of variability. It detects both mean and variance biclusters by testing the biclustering with heterogeneous variance.

Table 1 is a synoptic table of biclustering algorithms adopting the GIS approach.

Table 2 is a synoptic table of biclustering algorithms adopting the EBE approach.

Table 3. Biclustering algorithms adopting the EBE approach.
is the bounded degree of gene vertices in a bipartite graph G whose two sides correspond to the set of genes and the set of conditions.

Table 3 is a synoptic table of biclustering algorithms adopting the DPI approach.

Table 4. Biclustering algorithms adopting the DPI approach.

Table 6. Tools used to evaluate and compare biclustering algorithms