RegionMiner subtask: Clustering NGS Data
(only available on GGA)



Introduction

This RegionMiner task analyzes tags from ChIP-Seq/RNA-Seq experiments, finds significant regions, and optionally evaluates clusters with a control file. The resulting clusters can be downloaded or stored into the Genomatix project management for further analysis with other RegionMiner tasks, e.g. to search for overrepresented transcription factor binding sites.
Additionally a classification of clusters regarding their overlap with genomic elements like exons, promoters or intergenic regions is given.

For the peak finding / clustering we provide different algorithms (NGS Analyzer, MACS and SICER). Details on the algorithms can be found on the respective help pages.

If the user supplies a control data set, there are different strategies for using the control data to evaluate the cluster peaks. Please see this help page for a detailed description of the different evaluation strategies.


Parameters

Input
Input file with read positions from ChIPSeq (Sample)

Input data are accepted as a tab-delimited file in BED / bigBed file format containing the input regions, specified at least by chromosome, start position and end position (in this order).
The maximum number of regions and their maximum length can differ between tasks. The limits are usually shown at the top of the input pages.

Within this section you can either
  • choose from previously uploaded BED files
  • or add a new bed file to the list (by clicking "Add Bed file...")

When adding a new file, a new window will open, asking you to either

  • upload one or several BED files from your local computer
  • or import a BED file from the GMS (see more details)
  • or import a BED file from the GGA (see more details)
For the new BED files, you will have to select the correct organism, as the organism and the genome build are associated with the BED file for future use (the default is your latest choice in the current session).
Note that BED files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a BED file. You can see the list of genomes available in ElDorado.

Note that almost all browsers have a general upload limit of 2 GB, so BED files larger than this should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.

Optionally you can specify a name for saving uploaded BED files on the server, otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as prefix for each BED file name.

If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.

After one or several BED files were uploaded successfully, and after closing the popup window, the list of available BED files will be automatically updated.

Uploaded BED files can be deleted from the project anytime via the project management.

Example input:

track description="sample treatment-control analysis with 3 treatments and 3 controls"
chr1       26519270        26519623
chr1       39723904        39724119
chr2       10841542        10841853
chr2       88937859        88938309
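To check a file locally before uploading, the minimal sketch below reads the three required columns (chromosome, start, end) and skips header lines such as the "track" line in the example above. The function name parse_bed is hypothetical and not part of RegionMiner, and the validation rules are assumptions; the server may apply additional checks against the selected genome build.

```python
import io

def parse_bed(handle):
    """Yield (chrom, start, end) tuples from a BED text stream."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        # Assumption: accept tabs as well as runs of spaces as separators.
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
        yield chrom, start, end

example = io.StringIO(
    'track description="sample treatment-control analysis"\n'
    "chr1\t26519270\t26519623\n"
    "chr2\t10841542\t10841853\n"
)
regions = list(parse_bed(example))
```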

Note: The analysis requires sorted reads, i.e. an additional sorting step is performed for unsorted input data. To speed up the analysis (especially if several analyses of the same input data will be performed), you can sort the input first, e.g. with the sorting action in the BED file toolbox, and then use the sorted data as input.
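If you prefer to sort the data on your own computer before uploading, something along the following lines may be sufficient. This is only a sketch: the sort order used here (chromosome name as text, then start, then end) is an assumption, and the sorting action in the BED file toolbox remains the reference.

```python
def sort_regions(regions):
    """Sort (chrom, start, end) tuples by chromosome name, start, then end.

    Assumption: chromosome names are compared as plain text, so "chr10"
    sorts before "chr2".
    """
    return sorted(regions, key=lambda r: (r[0], r[1], r[2]))

unsorted_regions = [("chr2", 10841542, 10841853),
                    ("chr1", 39723904, 39724119),
                    ("chr1", 26519270, 26519623)]
sorted_regions = sort_regions(unsorted_regions)
```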

Control file
Optional Control file

If an input control file is available, it can be uploaded here. This is an optional field, and should be left blank if no control file is available.

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Parameters
Read annotation statistics

By default, the read statistics (i.e. number of reads overlapping genomic elements like exons, introns, promoters and intergenic regions) is included in the output. Optionally, the read classification (i.e. the genomic annotation of each read) can be included.

Peak Finding Algorithm: For the peak finding / clustering, one of three different algorithms can be selected.

The MACS peak finding algorithm should only be used for ChIPSeq data, SICER is recommended for histone modifications, whereas NGS Analyzer can be used for clustering of ChIPSeq and RNASeq data. Details on the algorithms can be found on the NGS Analyzer help page, the Genomatix MACS help page, or the SICER help page, respectively.

Parameters for Peak Finding Algorithm

Depending on the Peak Finding Algorithm selected above, the corresponding parameters will appear (JavaScript required!).

NGS Analyzer parameters
Window size for clustering: The window size used for the clustering algorithm in basepairs. The default window size is 100 bp.
Minimum number of reads per cluster: A threshold for the clustering algorithm. By default (value -1), this number is calculated from the data set applying a Poisson distribution. Otherwise, values above 3 are allowed here.
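How the default threshold is derived is documented on the NGS Analyzer help page. As a rough illustration of the idea only, the sketch below picks the smallest read count that is unlikely under a Poisson background. The function names, the background rate (reads spread evenly over the genome) and the cutoff alpha are all assumptions and not taken from the NGS Analyzer implementation.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)

def min_reads_threshold(total_reads, genome_size, window_size=100,
                        alpha=1e-5, max_k=10000):
    """Smallest k with P(X >= k) < alpha under an even background (assumed)."""
    lam = total_reads * window_size / genome_size  # expected reads per window
    for k in range(1, max_k + 1):
        if poisson_upper_tail(k, lam) < alpha:
            return k
    raise ValueError("no threshold found below max_k")

threshold = min_reads_threshold(1_000_000, 3_000_000_000, window_size=100)
```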

For details on the implementation see NGS Analyzer help page.

MACS parameters
Tag size: The length of the input tags/reads. By default (-1), this value is determined from the input BED file by reading the first 100 BED regions and calculating the average region length.
p-value: P-value cutoff for peak detection. The default is 1e-5.
Bandwidth: This value is used while building the shifting model. The default is 300.
mfold: The upper model fold value for MACS to select the regions with a high-confidence enrichment ratio against background to build a model. The lower model fold value is automatically set to 1. If no models are found, the no-model option is used by MACS automatically.
Redundancy threshold: The number of copies of identical reads allowed in a library. Values can be 'auto', 'all' or an integer:
  • 'auto': MACS calculates the maximum number of tags at the exact same location based on a binomial distribution, using 1e-5 as p-value cutoff
  • 'all': all input tags are kept
  • integer: at most this number of tags will be kept at the same location
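The 'all' and integer settings of the redundancy threshold can be pictured with the sketch below. It is an illustration only: the function name and the choice of (chromosome, start, strand) as the definition of "same location" are assumptions, and the 'auto' setting (binomial estimate) is deliberately not implemented here.

```python
from collections import defaultdict

def filter_redundant(tags, max_copies):
    """Keep at most `max_copies` tags per location; 'all' keeps every tag.

    Each tag is a (chrom, start, end, strand) tuple. Assumption: two tags
    share a location if chromosome, start and strand are identical.
    """
    if max_copies == "all":
        return list(tags)
    seen = defaultdict(int)
    kept = []
    for tag in tags:
        chrom, start, _end, strand = tag
        key = (chrom, start, strand)
        if seen[key] < max_copies:
            seen[key] += 1
            kept.append(tag)
    return kept

tags = [("chr1", 100, 136, "+"), ("chr1", 100, 136, "+"),
        ("chr1", 100, 136, "+"), ("chr1", 100, 136, "-"),
        ("chr1", 250, 286, "+")]
kept_one = filter_redundant(tags, 1)
```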

For details on the implementation see the Genomatix MACS help page, for details on the algorithm see the MACS paper.
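The default tag size estimate described for the MACS parameters (average length of the first 100 BED regions) can be sketched as follows. The function name is hypothetical, and rounding to the nearest integer is an assumption; the help text does not state how the average is rounded.

```python
def estimate_tag_size(regions, sample_size=100):
    """Average length of the first `sample_size` (chrom, start, end) regions."""
    sample = regions[:sample_size]
    if not sample:
        raise ValueError("no regions to estimate the tag size from")
    total = sum(end - start for _chrom, start, end in sample)
    return round(total / len(sample))  # assumption: round to nearest integer

# 100 reads of length 50 followed by longer regions that are not sampled.
regions = [("chr1", i, i + 50) for i in range(100)]
regions += [("chr1", i, i + 1000) for i in range(50)]
tag_size = estimate_tag_size(regions)
```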

SICER parameters
Redundancy threshold: The number of copies of identical reads allowed in a library.
Window size: Resolution of the SICER algorithm. For histone modifications, one can use 200 bp.
Note from the SICER manual: The choice of window size and gap size has a large effect on the outcome. In general, the broader the domain, the bigger the gap should be. For the histone modification H3K4me3, W=200 and gap = 1 window are suggested. For H3K27me3, W=200 and gap = 3 windows are suggested for a first try. If an even bigger gap size is found to work better, you might also want to try increasing the window size (e.g. window size = 1K and gap = 3 windows).
Fragment size: Determines the amount of shift from the beginning of a read to the center of the DNA fragment represented by the read. FRAGMENT_SIZE=150 means the shift is 75.
Gap size: Must be a multiple of the window size, i.e. if the window size is 200, the gap size should be 0, 200, 400, 600, ...
FDR: The FDR is calculated using p-values adjusted for multiple testing, following the approach developed by Benjamini and Hochberg.
E-value: Number of islands expected in a random background; only used if no control data is supplied.
Note: The E-value is not a p-value. Suggestion for a first try on histone modification data: E-value=100. If you find ~10000 islands using this E-value, an empirical estimate of the FDR is 1E-2.
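The two FDR-related notions mentioned for SICER can be illustrated with a short sketch. The first function is the standard Benjamini-Hochberg adjustment of a list of p-values; the second reproduces the arithmetic of the note above (expected random islands divided by islands found). Both function names are hypothetical, and this is not the SICER code itself.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def empirical_fdr_from_evalue(e_value, islands_found):
    """Rough FDR estimate: expected random islands / islands found."""
    return e_value / islands_found

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
fdr = empirical_fdr_from_evalue(100, 10000)
```

With E-value=100 and about 10000 islands, the second function gives 0.01, which matches the 1E-2 quoted in the note.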

For details on the implementation see the Genomatix SICER help page, for details on the algorithm see the SICER paper.
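The relationship between the SICER window size, gap size and fragment size can be checked with a small sketch like the one below. The function names are hypothetical. The strand handling (plus-strand reads shifted downstream from their start, minus-strand reads shifted upstream from their end) is an assumption about how "the beginning of a read" is interpreted.

```python
def read_shift(fragment_size):
    """Shift from the read start to the fragment centre (half the fragment)."""
    return fragment_size // 2

def fragment_centre(start, end, strand, fragment_size):
    """Estimated fragment centre for a read given in BED coordinates."""
    shift = read_shift(fragment_size)
    # Assumption: a minus-strand read begins at its end coordinate.
    return start + shift if strand == "+" else end - shift

def gap_in_windows(window_size, gap_size):
    """Validate that the gap is a multiple of the window size."""
    if gap_size < 0 or gap_size % window_size != 0:
        raise ValueError("gap size must be 0 or a multiple of the window size")
    return gap_size // window_size

shift = read_shift(150)
windows = gap_in_windows(200, 600)
```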

Output
Result: Here, you can edit the default name of the result file.
Email address: Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    In this option the URL of the result is directly shown in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available; please restart the analysis using the option "Send the URL of the result via email" below.

  • Send the URL of the result via email
    In this option an email with the URL of the results will be sent to the user provided email address, when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!

We recommend using the email option for ChIPSeq analyses!

Output

The output generally consists of several sections:
  1. Analysis Parameters
  2. Read Classification
  3. Clustering Results
    • Detailed results for the input data set
    • Detailed results for the control data set, if control data was supplied
    • Results after peak evaluation, if control data was supplied
    • Cluster Classification
    • Cluster Classification: Details for each Cluster
  4. Download of Data Files

1. Analysis Parameters

2. Read Classification

If the read statistics were selected as a parameter, a table with the number of reads from the input overlapping genomic elements like exons, introns, promoters and intergenic regions is given.
Additionally a table with the distribution of the reads on the different chromosomes of the genome is shown. The content of this table is hidden by default, but can be shown by clicking the "Show details" link in the header.

If the read classification is included in the output, detailed annotation for each read can be downloaded as a tab-separated file. For a description of the format of this file, please see the cluster classification details.

3. Clustering Results

Depending on the selected peak finding algorithm and the parameters, the output can look different in this section:

For all settings, the total number of clusters found by the program is shown and the resulting clusters can either be downloaded as a BED file or saved directly to the Genomatix project management, to be used with other RegionMiner tasks. Also, a link to the complete algorithm output is given, including details as described for NGS Analyzer or MACS or SICER respectively.

If "Peak Evaluation with Audic-Claverie Algorithm" was selected, the BED file contains only those clusters, which show a significant enrichment of reads, i.e. a subset of all significant clusters. "Significant" in this context means, that the Audic-Claverie p-value is at most as high as the cut-off specified on the input page. Additionally, a tab-separated file containing the p-values and other details for each input cluster can be downloaded (details for p-value file). This file contains all significant clusters, no matter if there is an enrichment or decrease of reads.
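How RegionMiner applies the Audic-Claverie test is described on the linked help pages. For orientation only, the sketch below evaluates the probability published by Audic and Claverie for observing y reads in one library given x reads in the other, with library sizes n1 and n2. The function names and the simple upper-tail p-value are assumptions and may differ from the p-value reported in the downloadable file.

```python
import math

def ac_probability(y, x, n1, n2):
    """Audic-Claverie p(y | x) for library sizes n1 (x) and n2 (y)."""
    ratio = n2 / n1
    log_p = (y * math.log(ratio)
             + math.lgamma(x + y + 1)      # log (x + y)!
             - math.lgamma(x + 1)          # log x!
             - math.lgamma(y + 1)          # log y!
             - (x + y + 1) * math.log(1 + ratio))
    return math.exp(log_p)

def ac_upper_tail(y, x, n1, n2):
    """P(Y >= y | x): chance of seeing at least y reads (assumed one-sided)."""
    return max(0.0, 1.0 - sum(ac_probability(k, x, n1, n2) for k in range(y)))

p_equal = ac_probability(1, 1, 1_000_000, 1_000_000)
p_enriched = ac_upper_tail(30, 5, 1_000_000, 1_000_000)
```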

Note that the BED file format is zero-based and half-open, whereas numbering in the tab-separated p-value file is based at 1 and includes the end position.
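When comparing positions between the two files, the difference in numbering amounts to an offset of one on the start position only. A minimal sketch with hypothetical helper names:

```python
def bed_to_one_based(start, end):
    """Zero-based, half-open BED interval -> 1-based, inclusive interval."""
    return start + 1, end

def one_based_to_bed(start, end):
    """1-based, inclusive interval -> zero-based, half-open BED interval."""
    return start - 1, end

# BED interval from the example input, shown in p-value file numbering.
one_based = bed_to_one_based(26519270, 26519623)
```

The region length is the same in both systems: end - start in BED coordinates, and end - start + 1 in 1-based inclusive coordinates.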


Cluster Classification

The cluster classification provides information for

Cluster Classification: Details for each Cluster

Download cluster classification

The file that can be downloaded here contains the classification for each cluster:

The file contains 7 columns (tab-separated) for each region:

  1 : read id
  2 : contig/chromosome accession number
  3 : chromosome
  4 : strand
  5 : start position of the read
  6 : end position of the read
  7 : genomic elements the read is associated with
      intergenic (intergenic region)
      exon
      intron
      partial (overlapping with exon)
      promoter
An individual read is assigned to one of the four classes intergenic, exon, intron, partial and can be assigned to the class promoter in addition.

Example:
1428	NC_000001	chr1	0	1348237	1348455	intergenic
1429	NC_000001	chr1	0	2311642	2311953	promoter intron
1430	NC_000001	chr1	0	2450272	2450768	exon
1431	NC_000001	chr1	0	2469265	2469512	intergenic
1432	NC_000001	chr1	0	3556424	3556796	promoter partial
1433	NC_000001	chr1	0	3614623	3614831	exon
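The downloaded classification file can be processed with a few lines of code. The sketch below reads the seven tab-separated columns described above and counts how often each genomic element occurs. The column keys and the function name are hypothetical, and the assumption that multiple classes in column 7 are separated by spaces is taken from the example rows.

```python
import io
from collections import Counter

COLUMNS = ["read_id", "accession", "chromosome", "strand",
           "start", "end", "elements"]

def read_classification(handle):
    """Parse the 7-column tab-separated classification file into dicts."""
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        # Assumption: classes in column 7 are space-separated.
        row["elements"] = row["elements"].split()
        rows.append(row)
    return rows

sample = io.StringIO(
    "1428\tNC_000001\tchr1\t0\t1348237\t1348455\tintergenic\n"
    "1429\tNC_000001\tchr1\t0\t2311642\t2311953\tpromoter intron\n"
    "1432\tNC_000001\tchr1\t0\t3556424\t3556796\tpromoter partial\n"
)
rows = read_classification(sample)
element_counts = Counter(e for row in rows for e in row["elements"])
```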

4. Download of Data Files

All result files that can be downloaded separately from the result page can also be downloaded, together with the statistics files (in text format), as a single archive (tar file).