Development of a New Pipeline for Identification and Characterization of Micro RNAs from Plants

Received Jul 26 th , 2015 Revised Aug 20 th , 2015 Accepted Sep 24 th , 2015 Open source microRNA analysis pipelines for next generation sequencing data (NGS) often make necessary use and working knowledge of command line interface, massive data processing resources and expertise which is a daunting task for biologists. Further, the microRNA data generated from NGS platforms will not be in a form from which one could understand or make use of it. Hence a comprehensive pipeline has been developed by integrating several open source NGS tools along with a graphical user interface called sRNAbench. It is useful for expression profiling of small RNAs and prediction of microRNAs from NGS data. The pipeline features functionalities such as read processing, sequence identification, target prediction and enrichment analysis. It provides even prediction of novel microRNAs and its sequences. The pipeline will be very useful for plant genomics community and it does not require knowledge in computational biology in order to discover miRNAs and utilize the same in genomics studies. Keyword:


INTRODUCTION
The central dogma of molecular biology, i.e. the flow of biological information from DNA to RNA to protein, has been challenged by the discovery of a gene regulation mechanism, known as RNA-mediate gene silencing (RNA silencing).It is the process whereby small, approximately 20-24 nucleotide (nt), non-coding RNA molecules direct the sequence-specific down-regulation of target genes at the post-transcriptional level.There are two known classes of RNAs to mediate this process, namely microRNAs (miRNAs) and short interfering RNAs (siRNAs), which act as cellular regulators to control mRNA stability and translation, to protect the genome from the invading nucleic acids, to facilitate epigenetic modifications of chromatin and histones, and to direct complex developmental pathways [1].
The term "microRNA" was first coined in 2001 when tens of small RNAs with regulatory potential were discovered in Caenorhabditis elegans [2].Later, the scientific world witnessed large number of research programmes in this area.For example, a term search for "microRNAs" in PubMed clearly reveals the increasing interest by listing about 36,740 research papers as on July 2015.These research were made possible through next-generation sequencing (NGS) which allows the sequencing of small RNA molecules and the estimation of their expression levels.Consequently, there is high demand of bioinformatics tools to cope with the several gigabytes of sequence data generated in each of the deep-sequencing experiments.Plant miRNAs target recognition mechanism was once thought to be simple and straightforward.Therefore, very few target prediction tools and algorithms were developed for plants as compared to those for animals.However, later studies revealed the enormous diversity and complexity of the gene regulation by miRNAs in plant systems.This, in turn, necessitates the need for advanced computational tools/algorithms for comprehensive miRNA target analysis to help understand miRNA regulatory mechanisms [3].In this regard, many new programs such as psRNATarget was developed [4] for target prediction of plant endogenous noncoding short small RNAs.miRBase was created as an online repository for miRNA sequence data which was integrated with interfaces for comprehensive analysis of miRNA sequence data, annotation and prediction of gene targets [5].Later, another powerful platform named miRanalyzer: a miRNA detection and analysis tool for NGS experiments was launched [6,7].Subsequently, in order to find differentially expressed miRNAs and to predict novel miRNA candidates, miRSeqNovel was developed [8].

ISSN
Even though excellent tools are available for miRNA analysis, no pipeline has been set-up using tools available in the public domain which require minimum computational knowledge.Hence, the present study was aimed at developing an analysis pipeline for miRNA sequence data using those freely available tools.The pipeline was formulated by analyzing the miRNAs that are expressed in Brachypodium distachyon roots and leaves under drought stress conditions (details of the experiment was available at NCBI database under SRA id 160390).

Sequence retrieval from SRA
RNA sequence reads of Brachypodium distachyon leaves and roots in control and drought stress conditions were accessed from NCBI SRA database (SRA 160390).

Read processing using sRNAbench
The reads were processed using sRNAbench (http://bioinfo5.ugr.es/srnatoolbox/srnabench),replacement for popular miRanalyzer program.It is a web based tool for processing small RNA reads that are obtained from high throughput sequencing data.The datasets were provided as input to the program by means of URL obtained from the SRA database.Then the "Do not map to genome (Library mode)" check box was activated followed by selecting Arabidopsis thaliana (tair 10) as a reference species (Brachypodium distachyon was not listed in the sRNAbenc as a reference species during our analysis).The sRNAbench will use the annotations from the sRNAbench database for the selected species during the reads mapping.If no species is selected and the 'Do not map to genome (Library mode)' check box is not activated, then, sRNAbench will only analyse microRNAs.If a species is selected and the 'Do not map to genome (Library mode)' check box is not activated sRNAbench will use the genome mapping mode.Adapter trimming was done by selecting Illumina RA3 with Recursive Adapter trimming selected.Then, the microRNA analysis was done for the species selected during the Select species step.Then the analysis parameters were specified following minimum read count as 2, number of allowed mismatches as 0, seed length of alignment as 20, alignment type as 'n', remove barcode as 0, minimum read length to predict new miRNAs as 15 and maximum number of multiple mapping as 10.The sRNAbench job IDs were used to specify the differential expression analysis.

microRNA sequence identification using miRBase
miRBase (http://www.mirbase.org) is a biological database that acts as an archive of miRNA sequences and annotations.The miRBase registry provides a centralized system for assigning new names to miRNA genes.The sequences obtained from the sRNAbench were given as input to miRBase for sequence identification.

microRNA target prediction using psRNATarget
psRNATarget, is a plant small RNA target analysis server, which features two important analysis functions: reverse complementary matching between miRNA and target transcript using a proven scoring schema, and target site accessibility evaluation by calculating unpaired energy (UPE) required to "open" secondary structure around miRNA"s target site on mRNA.The sequence obtained from miRBase was given as input to the psRNATarget and selected preloaded transcript or genomic library from the available list.This resulted in a list of predicted miRNA/Target pairs with its alignment.

Enrichment analysis using agriGO and REVIGO
The results obtained from the psRNATarget was given to agriGO toolkit as in the format of gene list obtained from phytozome which enabled analysis of GO relationships in graphical format along with GO terms, GO source and its description.This obtained separate graphs for each of the three GO categories, namely biological process, molecular function and cellular component.Singular enrichment analysis was carried out as to obtain desired results.REVIGO was used to create semantic similarity based scatter plots by removing redundancy which makes the results more easier to interpret.

RESULTS AND ANALYSIS
The role played by miRNAs at various stages of development was worked out in many plant species.For example, the role of miRNAs in drought response was investigated in young leaves of Brachypodium distachyon, a drought-tolerant monocot model species.NGS data analysis identified 66 annotated miRNA genes and 122 new high confidence predictions greatly expanding the number of known Brachypodium miRNAs.Most miRNAs showed a high expression level, consistent with their involvement in early leaf development and cell identity [9].
In the present study, the raw data from the 8 reads (SRX160390 though SRX160397) deposited at the NCBI SRA database after NGS were processed using sRNAbench.Tabulated outputs of the sRNAbench analysis are given as Table 1 summarizing read quality, genome mapped reads and MIR profiling results.Using sRNAbench our read was mapped to Arabidopsis thaliana (tair 10).Total of 338 mature miRNAs were detected and one novel miRNA was predicted with a total read count of 2017.The differential expression analysis using sRNAbench provided list of miRNAs that are differentially expressed.The miRNA sequences for the differentially expressed ones were fetched using miRBase (Table 2).Subsequently, miRNA target prediction was carried out using psRNA Target.The genelist obtained from psRNATarget was provided as input to agriGO and the annotation was performed.The miRNAs were assigned to various cellular functions as given in Figure 1 and graphical representations of the biological processes were obtained using agriGO (Figure 2).The results showed 36 GO terms enriched for biological process, 19 for molecular function, and 10 terms for cellular component.In general, the GO terms that were significantly enriched are related to cellular and metabolic processes.The REVIGO analysis was carried out to visualize the GO results.

CONCLUSION
An attempt was made to develop a user friendly pipeline for analyzing NGS data for identifying miRNAs.The aim was to make use of publicly accessible platforms with graphical user interfaces which will have less computational complexity.The pipeline was developed by analyzing NGS data from Brachypodium distachyon under drought stress conditions.An analysis pipeline was developed using sRNAbench for read processing, miRNA sequence identification using miRBase, target prediction using psRNATarget and enrichment analysis using agriGO and REVIGO.
The differential expression analysis using sRNAbench generated a list of miRNAs that are differentially expressed which was used to find out the gene targets using psRNATarget.From the miRNA target analysis we obtained 118 target genes for the differentially expressed 32 miRNA sequences, from which a gene list was prepared.The results showed many GO terms enriched for biological processes, molecular functions and cellular components.The new pipeline has successfully identified miRNAs related to primary and secondary metabolic process, catabolic process, carbohydrates metabolic process and cellular homeostasis.

Figure 1 .
Figure 1.GO flash chart obtained from the agriGO analysis of the gene list