
Genomatix: Comparative MicroRNA analysis
(only available on GGA)



Introduction

This task analyzes count data and performs a comparative analysis of microRNA (or more generally non-coding RNA) expression.
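Input for this task can be either a tab-separated file of IDs and read counts or the output files of the smallRNA mapping task (see the Parameters section for details).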
If two data sets are supplied (e.g. treatment vs. control, or condition A vs. condition B, or tissue A vs. tissue B), the task performs a differential analysis, i.e. calculates lists of significantly up- and down-regulated microRNAs/non-coding RNAs.
If replicates for treatment and control data are available, the user can select from different methods like 'DESeq' or 'edgeR' to calculate the differential expression.

Differential Analysis

If the expression values for two different conditions (here called "treatment" and "control" for simplicity) are to be compared, the following statistical testing methods for evaluating differential expression are available: DESeq, DESeq2, edgeR, and the procedure introduced by Audic and Claverie.

While the Audic-Claverie-method does not handle replicates, 'DESeq2', 'DESeq' and 'edgeR' were developed specifically for replicate data. Moreover, edgeR cannot be used if there are no replicates available.

Audic and Claverie introduced a formula to compute a conditional probability for observing N reads (treatment) in a class given that M reads were observed before (control). These p-values, in combination with the Genomatix normalized expression (NE) value are used to evaluate differential expression.
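The conditional probability mentioned above can be sketched in a few lines of Python. This is only an illustration of the published Audic-Claverie formula, not the Genomatix implementation: the function names, the use of library sizes as normalisation, and the "twice the smaller tail" p-value convention are assumptions, and the combination with the NE value is not shown.

```python
import math

def audic_claverie_prob(treatment_reads, control_reads, control_total, treatment_total):
    """P(y | x): probability of observing y reads (treatment) given x reads (control).

    control_total and treatment_total are the library sizes (assumed inputs).
    Computed in log space to avoid overflow for large counts.
    """
    x, y = control_reads, treatment_reads
    ratio = treatment_total / control_total
    log_p = (y * math.log(ratio)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log(1.0 + ratio))
    return math.exp(log_p)

def audic_claverie_pvalue(treatment_reads, control_reads, control_total, treatment_total):
    """Two-sided p-value as twice the smaller tail, capped at 1 (assumed convention)."""
    point = audic_claverie_prob(treatment_reads, control_reads, control_total, treatment_total)
    lower = sum(audic_claverie_prob(k, control_reads, control_total, treatment_total)
                for k in range(treatment_reads + 1))
    upper = max(0.0, 1.0 - lower + point)  # clamp against rounding error
    return min(1.0, 2.0 * min(lower, upper))

# Example: 5 reads in the control, 100 reads in the treatment, equal library sizes
print(audic_claverie_pvalue(100, 5, 1000, 1000))
```

With equal library sizes and no reads in the control, the probability of also seeing no reads in the treatment is 0.5, which is a quick sanity check of the formula.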

The 'DESeq2', 'DESeq' and 'edgeR' methods model count data (here the number of reads from an RNA-Seq experiment mapped to a transcript) by a negative binomial distribution. The parameters of the distribution (mean and dispersion) are estimated from the data, i.e. from the read counts in the input files. Each method computes a measure of read abundance, i.e. expression level (called 'base mean' or 'mean of normalized counts' in DESeq/DESeq2, and 'concentration' or 'counts-per-million' in edgeR), for each transcript and applies a hypothesis test to each transcript to evaluate differential expression. In particular, the three methods determine an (adjusted) p-value and a log2 fold change (in expression level) for each transcript.
One parameter can be set for DESeq: the dispersion estimates are found by fitting a curve through the per-transcript dispersion estimates. The way this fitting is done can be specified to be either 'parametric' (the default in DESeq) or 'local'. Default settings are used for the other parameters; in particular, single pooled values are used as empirical dispersion estimates, and the maximum of the empirical and fitted values is used as the dispersion for a transcript or cluster. If there are no replicates, the settings are changed to the 'blind' method for computing the empirical dispersion estimates, and the fitted dispersion values are used. Sometimes the parametric fitting fails; in this case, the analysis should be rerun with the 'fitting method' set to 'local'. For details please refer to the DESeq vignette.
For DESeq2, two parameters can be set. The test for differential expression can be either a Wald test or a likelihood-ratio test; the former is the default testing method in DESeq2, while the latter is the one in use for DESeq. The other parameter is, as for DESeq, the fitting method used in dispersion estimation. See the DESeq2 vignette for details.
edgeR normalizes the count data using the TMM (trimmed mean of M-values) method introduced by Robinson and Oshlack. All parameters used in the edgeR algorithm are set to their respective default values. In particular, tagwise (i.e. per-transcript) dispersion estimation is used, with the tagwise dispersions squeezed towards the common dispersion, as described in the edgeR vignette.
Before the analysis, transcripts without any mapped reads, i.e. those with a read count of '0' in all samples, are removed from the input file 91.input_replicate_analysis. In the output file 92.output_replicate_analysis, these transcripts are listed with the value 'NA' in every output column (p-value, fold change etc.) except 'id'.
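The zero-count filtering step can be pictured with the following minimal Python sketch. The function name and the in-memory dictionary layout are hypothetical and chosen only for illustration; the actual file handling of the task is not shown.

```python
def split_all_zero(counts):
    """Separate transcripts with at least one read from those with none.

    counts maps a transcript ID to its read counts, one value per sample
    (hypothetical layout). Returns (kept, removed_ids).
    """
    kept = {tid: values for tid, values in counts.items()
            if any(v != 0 for v in values)}
    removed_ids = [tid for tid in counts if tid not in kept]
    return kept, removed_ids

# Two samples per transcript; the first ID has no reads in any sample
counts = {
    "hsa-miR-101-5p": [0, 0],
    "hsa-miR-101-3p": [14, 0],
    "hsa-miR-103a-3p": [1636, 1502],
}
kept, removed_ids = split_all_zero(counts)
print(sorted(kept), removed_ids)
```

The removed IDs are the ones that would be reported with 'NA' values in the output file.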

For defining up- and down-regulated transcripts between two conditions or samples, two user-defined criteria are combined: an adjusted p-value threshold and a log2 fold change threshold (see the Differential Analysis Parameters for details).

Note that the first input set is regarded as "treatment", whereas the second input file is used as "control", i.e. "up-regulation" refers to a higher expression in set1 than in set2. Also note that the direction of up- and down-regulation will change if the two data sets are exchanged in the input.

Parameters

Input
ncRNA file(s) for condition 1 and 2
MicroRNA/non-coding input data is accepted if it is in one of the following two formats:
  • tab-separated file, containing an ID for microRNA/ncRNA in column 1 and count data (integer) in column 2

    Lines starting with a # are treated as comments and are ignored.
    Also, any additional values found after column two will be ignored.

    Note: When several files are compared (sample/control or replicates), the same number of IDs should occur in all input files.
    IDs that have no values in all input files will be deleted from the analysis.

    Example:
    #miRNA               read count
    hsa-miR-101-5p       0
    hsa-miR-101-3p       14
    hsa-miR-103a-3p      1636
    hsa-miR-103a-2-5p    7
    

  • output files from the smallRNA mapping task

    Here, the input files are tab-separated files consisting of exactly 10 columns generated by a mapping against the smallRNA library. There, the mapping result is classified into 10 classes of noncoding RNA according to the sequence ontology project (sequenceontology.org). These files are called <smallRNA class>_nc.tsv and contain a listing of the total read counts for each small RNA in the library, with genomic positions for all organisms.
    Example:

    #org chr   posfrom   posto    strand  GXN       name          total hits    relative hits   annotation
    hsa  chr9  97848308  97848330     0   GXN_144   hsa-miR-24-1*    74         Infinity     hsa-miR-24-1* MIMAT0000079 Homo sapiens miR-24-1*
    hsa  chr9  97848344  97848365     0   GXN_8422  hsa-miR-3074-5p  176481     Infinity     hsa-miR-3074-5p MIMAT0019208 Homo sapiens miR-3074-5p
    hsa  chrX  73438265  73438286     0   GXN_1233  eca-miR-421      2          Infinity     eca-miR-421 MIMAT0013216 Equus caballus miR-421
    hsa  chrX  73438391  73438413     0   GXN_2923  hsa-miR-374b*    14         Infinity     hsa-miR-374b* MIMAT0004956 Homo sapiens miR-374b*
    

    For this GGA task the microRNA_m_nc.tsv files are relevant. These files contain the read counts for the mature (21-23 bp) and biologically active microRNAs as annotated in the miRBase database. The microRNA_h_nc.tsv files, in contrast, cover the genomic hairpin structures (70-80 bp) from which the mature microRNAs are eventually generated via processing by the Dicer enzyme.
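As an illustration of the two input formats described above, the following Python sketch reads both of them. The function names are hypothetical, and filtering the mapping output on its first column (organism) is an assumption about how the species filter could be applied; the sketch does not handle the trans-mapped hairpin sequences.

```python
import io

def parse_simple_counts(handle):
    """Read the two-column format: ID in column 1, integer count in column 2.

    Comment lines starting with '#' and any columns after the second are ignored.
    """
    counts = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            raise ValueError("expected at least two tab-separated columns: %r" % line)
        counts[fields[0].strip()] = int(fields[1].strip())
    return counts

def parse_mapping_rows(handle, organism):
    """Read the 10-column smallRNA mapping format and return (name, total hits) pairs.

    Only rows whose first column equals the given organism code are kept
    (assumed filter, for illustration only).
    """
    rows = []
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 10:
            raise ValueError("expected exactly 10 tab-separated columns: %r" % line)
        if fields[0] != organism:
            continue
        rows.append((fields[6], int(fields[7])))  # columns: name, total hits
    return rows

example = io.StringIO("#miRNA\tread count\nhsa-miR-101-5p\t0\nhsa-miR-101-3p\t14\n")
print(parse_simple_counts(example))
```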

You can either
  • upload one or several ncRNA files from your local computer
  • or import the file(s) from the GMS (see more details)
  • or import the file(s) from the GGA (see more details)
Note: If multiple files are uploaded, they are treated as replicates for one condition.
Organism
Filter: Depending on the type of input, this parameter will have different consequences:
  • For tab-separated input files with ID and count data only, this organism will influence the links to the Genomatix GenomeBrowser in the output.
  • For input files from the smallRNA mapping task the following applies: Since the smallRNA Mapping reports matches to microRNAs in all available organisms, they need to be filtered in a first step to the pertinent species. This species can be set here.
    Hairpin microRNA sequences that align to 100% in the genome of the target organism are included even if they were originally found in another organism. Please refer to the smallRNA mapping help page for a more in-depth explanation of these "trans-mapped hairpin microRNA sequences".
Significance Analysis
Differential Analysis Parameters

The differential analysis parameter section will only appear if at least one control file was uploaded in the section above.
There are four available algorithms for calculating the differential expression/enrichment values:

  • DESeq (recommended for replicate data, but does work on non-replicates, too)
    It is possible to select the 'fitting type' parameter for DESeq, i.e. the way the curve is fitted through the dispersion estimates. For details on the meaning of this parameter please refer to the DESeq vignette.
  • DESeq2 (recommended for replicate data, but does work on non-replicates, too)
    As for DESeq, it is possible to select the 'fitting type' parameter for DESeq2, i.e. the way the curve is fitted through the dispersion estimates.
    Additionally, DESeq2 offers two alternative methods for testing for differential expression: Wald test and likelihood-ratio test (with the Wald test being the default).
    For details on the meaning of the parameters please refer to the DESeq2 vignette.
  • edgeR (recommended for replicate data, does not work on non-replicates)
  • the procedure introduced by Audic and Claverie (if no replicates are available)
For a short introduction to the different methods, see above in the Introduction to the Differential Analysis.

The thresholds that define a transcript as differentially expressed (or a region as enriched/depleted) can be set here. Two criteria are combined, and both must be satisfied for differential expression/enrichment:

  • an adjusted p-value threshold for the significance of observing the detected change
    Note that the p-values calculated by the different methods (DESeq/DESeq2, edgeR, Audic-Claverie) can differ.
    Also note that setting the p-value threshold to 1 effectively skips this criterion.
  • a threshold for the log2 fold change of expression/enrichment level
    A log2 ratio of 1 is a fold change of 2; a log2 ratio of 0.585 is a fold change of 1.5; e.g. if the log2 fold change of expression/enrichment is set to ≥ 1, the expression values must go up by at least 100% to appear in the differentially expressed transcripts/enriched regions list.
    The log2 fold change thresholds can be set separately for up- and down-regulation (enrichment/depletion).
    Note that setting the log2 fold change thresholds to 0 means that fold changes are ignored in the analysis.
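How the two criteria combine can be sketched as follows. This is a minimal illustration, not the task's actual code: the function names, the default threshold values, and the convention that the down-regulation threshold is given as a positive number are all assumptions.

```python
import math

def fold_change_from_log2(log2_fc):
    """Convert a log2 ratio into a fold change, e.g. 1 -> 2 and 0.585 -> about 1.5."""
    return 2.0 ** log2_fc

def classify(adj_p, log2_fc, p_max=0.05, up_min=1.0, down_min=1.0):
    """Return "up", "down" or "no" for one transcript.

    p_max, up_min and down_min are hypothetical defaults. Both criteria
    must hold; p_max=1 disables the p-value criterion and a threshold of 0
    disables the fold change criterion for that direction.
    """
    if adj_p is None or log2_fc is None or math.isnan(adj_p) or math.isnan(log2_fc):
        return "no"  # 'NA' values cannot be classified
    if adj_p > p_max:
        return "no"
    if log2_fc > 0 and log2_fc >= up_min:
        return "up"
    if log2_fc < 0 and -log2_fc >= down_min:
        return "down"
    return "no"

print(classify(0.00042195514524419, -3.40094966114274))
print(fold_change_from_log2(1.0))
```

Infinite log2 fold changes ("Inf"/"-Inf") pass the fold change criterion in this sketch, so they are classified by their sign whenever the p-value criterion is met.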

Principal Component Analysis
PCA: If this parameter is set, a Principal Component Analysis (PCA) is automatically started on the condition / control files in order to identify subgroups or outliers.
Output
Result: Here, you can edit the default name of the result file.
Email address: Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    In this option the URL of the result is directly shown in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available. Please restart the analysis using the option "Send the URL of the result via email" below.

  • Send the URL of the result via email
    With this option, an email with the URL of the results will be sent to the email address provided by the user when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!


Output

The output consists of the following:

  1. Analysis Parameters
  2. Differential ncRNA Analysis
  3. PCA Overview, if parameter PCA is set
  4. PCs (principal components), if parameter PCA is set
  5. Download of Data Files
The result sections are described in detail below.

1. Analysis Parameters


2. Differential ncRNA Analysis

For the differential expression analysis, the expression values of the two input data sets (treatment versus control, possibly each with replicate data) are compared. First the comparison is done on microRNA/ncRNA level by the selected method (DESeq, DESeq2, edgeR, or Audic/Claverie), i.e. each microRNA is checked against the user-defined thresholds for the adjusted p-value and the log2 fold change.

Note that the log2 fold change values cannot be calculated under certain conditions (e.g. if no expression is detected for a transcript in the control set). Such cases are indicated by a "-Inf", "Inf" or "NA" value in the output.

Overview

The following numbers are given in an overview table:
[Image: microRNA overview]

The download links below the numbers allow accessing tab-separated data files (suffix *.tsv), containing details like microRNA name, read numbers, p-value, log2 fold change for each microRNA; for details of the format see below.

The microRNAs that were found to be significantly differentially expressed (according to the user criteria) are listed in a table:
[Image: microRNA overview]
Please note that hairpin microRNA sequences that align to 100% in the genome of the target organism are included even if they were originally found in another organism.

Most columns of the table can also be found in the 10.microRNA_summary.tsv (for details see below). They are briefly explained here:

microRNA name: name of the microRNA
Annotation: annotation details for this microRNA.
    A microRNA originally found and annotated for one organism may raise a signal in an experiment involving another organism, due to sequence similarity or for biological reasons.
p-value: p-value (depends on the selected method)
adj. p-value: adjusted p-value (depends on the selected method)
log2(fold change): log2(expression value of treatment data set / expression value of control data set)
    Note that this value can be -Inf/+Inf if one of the conditions shows no expression.
Regulation: regulation of treatment (set1) compared to control (set2); values can be "up", "down" or "no"
rel.hits: relative number of hits for this microRNA for each replicate from the treatment sets and the control sets
Genome: link to the Genomatix GenomeBrowser for a graphical display of the region(s) where the corresponding hairpin structure of this microRNA is located. This can be relevant in examining the possible regulation of this microRNA.
    If there are multiple hairpin locations in the genome, there will be a link to each of them in the GenomeBrowser.
    For a given mature microRNA, a hairpin location is not necessarily known.

The table can be interactively sorted by various criteria like p-value or up-/down-regulation. This is done by clicking on the corresponding column header.
The width of a column can be changed by dragging the divider beside the column header.

Plots

Depending on the method used for differential analysis, scatter and volcano plots are displayed.

Note: The graphics are at first depicted as small icons, which can be enlarged by clicking on them.

For a detailed explanation of the various plots please refer to the help page of the differential expression task.


3. PCA Overview

Overview output of the Principal Component Analysis.

4. PCs (principal components)

This section lists the top loadings for the first principal components that together account for 90% of the variance in the data.
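The 90% criterion can be illustrated with a short Python sketch that counts how many leading principal components are needed to reach a given share of the total variance. The function name and inputs are hypothetical; the cap of 10 components follows the "maximum 10" noted for the loadings plots in the file overview below, and the PCA itself is not computed here.

```python
def components_for_variance(explained_variance, target=0.90, max_components=10):
    """Number of leading PCs whose cumulative variance share reaches the target.

    explained_variance holds one variance value per PC, sorted in
    decreasing order (assumed input). The result never exceeds max_components.
    """
    total = sum(explained_variance)
    if total <= 0:
        return 0
    cumulative = 0.0
    for k, variance in enumerate(explained_variance, start=1):
        cumulative += variance
        if cumulative / total >= target or k == max_components:
            return k
    return len(explained_variance)

# The first two PCs explain 95% of the variance in this toy example
print(components_for_variance([7.0, 2.5, 0.5]))
```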


5. Download of Data Files

All result files that can be downloaded separately from the result page can also be downloaded, together with the statistics files (in text format), as a single archive (tar file).

Here is an overview of results and their corresponding file names (for format details see below):

Details on                                                                        Filename
Main results:
all analyzed microRNAs                                                            10.ncRNA_summary.tsv
differentially expressed microRNAs                                                11.significant_ncRNA.tsv
If differential analysis was selected:
count data used for the test method                                               91.input_replicate_analysis
library size, i.e. the total read numbers                                         91.input_replicate_analysis.libsize
result from the test method                                                       92.output_replicate_analysis
If PCA analysis was selected:
Loadings plot for the first two PCs                                               Loadings.png
Information displayed in loadings tables                                          Loadings.xml
Plot of the top 40 loadings for PCs contributing to 90% of variance (maximum 10)  Loadings_PCx.png
Score plot for first two PCs (in png and pdf format)                              ScorePlot2D.png /.pdf
Scree plot for all computed PCs                                                   ScreePlot.png
Information displayed in overview tables                                          Statistics.xml
Loadings for each transcript/locus for each PC calculated                         loadings.tsv
Scores for all samples for each PC calculated                                     scores.tsv
Plots:
MA plot                                                                           93.MA_plot.png
Volcano plot of adjusted p-values                                                 94.volcano_plot.png
Histogram of unadjusted p-values                                                  95.pvalue_histogram.png
Dispersion plot (for DESeq/DESeq2)                                                96.dispersion_plot.png
BCV plot (for edgeR)                                                              96.bcv_plot.png

Data files for all analyzed microRNAs and those containing only significant microRNAs

 1: Genomatix microRNA Id
 2: microRNA name
 3: Annotation details for this microRNA
 4: p-value (depends on the selected method)
 5: adjusted p-value (depends on the selected method)
 6: log2(fold change), i.e. log2(expression value of treatment data set / expression value of control data set),
     note that this value can be -Inf/+Inf if one of the conditions shows no expression
 7: Regulation of treatment (set1) compared to control (set2), (values can be "up", "down", "no")
 8: was the microRNA found to be significant (this value depends on user settings for the analysis)
 
 the following columns depend on the number of input files:
 - number of reads for each replicate from the treatment sets and the control sets
 - relative number of hits for this microRNA for each replicate from the treatment sets and the control sets
Here is an example of the output format:
GX_ID     microRNA name       Annotation                                     ...
GXN_6537  bta-miR-1814a  bta-miR-1814a MIMAT0011874 Bos taurus miR-1814a ....
GXN_6624  bta-miR-2404   bta-miR-2404 MIMAT0011961 Bos taurus miR-2404   ....

... p-value               adj. p-value          log2 (fold-change)  Regulation  significant?  ...
... 1.06554329607119e-06  0.00042195514524419   -3.40094966114274     down        1           ...
... 5.34623064669077e-07  0.000264638417011193  -5.44369293283376     down        1           ...

...#hits treat1  #hits ctrl1    rel.hits treat1      rel.hits ctrl1
...  1276         4955          0.00242099511247382   0.0255730056410283
...     1           16          1.89733159284782e-06  8.25768093353083e-05
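The two example rows above can be used to check how the columns relate to each other. In the following Python sketch, the log2 fold change is recomputed as log2(rel.hits treat1 / rel.hits ctrl1) and agrees with the reported values for these two rows. The per-sample totals are not stated in the example; they are inferred here as hits divided by relative hits and should be read as an assumption. With replicate data, the value reported by DESeq, DESeq2 or edgeR need not match this simple ratio.

```python
import math

# (name, #hits treat1, #hits ctrl1, rel.hits treat1, rel.hits ctrl1, reported log2 fold change)
rows = [
    ("bta-miR-1814a", 1276, 4955, 0.00242099511247382, 0.0255730056410283, -3.40094966114274),
    ("bta-miR-2404", 1, 16, 1.89733159284782e-06, 8.25768093353083e-05, -5.44369293283376),
]

def log2_fold_change(rel_treatment, rel_control):
    """log2(treatment / control); infinite if exactly one side shows no expression."""
    if rel_treatment == 0 and rel_control == 0:
        return float("nan")
    if rel_control == 0:
        return float("inf")
    if rel_treatment == 0:
        return float("-inf")
    return math.log2(rel_treatment / rel_control)

for name, hits_t, hits_c, rel_t, rel_c, reported in rows:
    inferred_total_t = hits_t / rel_t  # inferred per-sample total, not given in the example
    inferred_total_c = hits_c / rel_c
    print(name, log2_fold_change(rel_t, rel_c), reported,
          round(inferred_total_t), round(inferred_total_c))
```

Both rows yield the same inferred total per sample, which suggests that the relative hits are the read count divided by the total read number of the respective sample.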

Data files if one of the test methods 'edgeR', 'DESeq' or 'DESeq2' was selected