RegionMiner subtask: Distance correlation of sets of genomic elements (GenomeInspector)


Introduction

This task calculates distance correlations between elements from at least two input files (called Anchor Set and Partner Set(s)), or between elements in one input file and annotated genomic elements, e.g. promoters. The output is a correlation graph showing the distribution of elements from the Partner Set(s) relative to the anchors of the Anchor Set. The elements of the Anchor Set are defined to be at position 0 in the graph, and a separate curve is displayed for each Partner Set.
Additionally, the correlations as well as the elements of the sets can be extracted based on the distance between the elements. This makes it possible to extract those elements from large sets that fulfill certain distance requirements.

Note: Distances are always calculated and displayed in 5' -> 3' direction for both strands.
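
One way to read this convention is sketched below in Python. The function name and arguments are hypothetical, and the sketch assumes that negative values mean "upstream of the anchor" relative to the anchor's own strand; it is an illustration, not the GenomeInspector implementation.

```python
def signed_distance(anchor_pos: int, partner_pos: int, anchor_strand: str = "+") -> int:
    """Distance from anchor to partner, measured 5' -> 3' on the anchor's strand.

    Assumption: on the (-)-strand the genomic coordinate axis is simply
    mirrored, so a partner at a higher coordinate lies upstream (negative).
    """
    delta = partner_pos - anchor_pos
    return -delta if anchor_strand == "-" else delta


# A partner 200 bp to the right of the anchor on the genomic axis:
plus_strand = signed_distance(1000, 1200, "+")   # downstream of a (+)-strand anchor
minus_strand = signed_distance(1000, 1200, "-")  # upstream of a (-)-strand anchor
```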

Example screenshot


Parameters

Input data
Anchor Set
(Genomic elements used as anchor for correlation)

Input data are accepted as a tab-delimited file in BED / bigBed file format containing the input regions, each specified at least by chromosome number, start position and end position (in this order).
The maximum number of regions and their maximum length can differ between tasks. The limits are usually shown at the top of the input pages.
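
As a rough illustration of the minimal three-column layout, the following Python sketch parses a single BED line. The `Region` type and `parse_bed_line` function are hypothetical names chosen for this example. The sketch assumes the standard BED convention (0-based start, end not included) and reads the strand from the sixth column when present; how the server itself handles these details is not specified here.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: int   # 0-based start (standard BED convention, assumed here)
    end: int     # end position, not included in the region
    strand: str  # "+", "-", or "." when no strand column is given


def parse_bed_line(line: str) -> Optional[Region]:
    """Parse one BED line; return None for blank, comment, or header lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith(("#", "track", "browser")):
        return None
    fields = stripped.split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated columns: {line!r}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid coordinates: {line!r}")
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "."
    return Region(chrom, start, end, strand)
```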

Within this section you can either
  • choose from previously uploaded BED files
  • or add a new BED file to the list (by clicking "Add Bed file...")

When adding a new file, a new window will open, asking you to either

  • upload one or several BED files from your local computer
  • or import a BED file from the GMS (see more details)
  • or import a BED file from the GGA (see more details)
For the new BED files, you will have to select the correct organism, as the organism and the genome build are associated with the BED file for future use (the default is your latest choice in the current session).
Note that BED files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a BED file. You can see the list of genomes available in ElDorado.

Note that almost all browsers have a general upload limit of 2 GB, i.e. BED files bigger than this size should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.

Optionally you can specify a name for saving uploaded BED files on the server, otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as prefix for each BED file name.

If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.
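
The kind of check described above can be approximated locally before uploading. The Python sketch below is only an illustration: `split_valid_regions` is a hypothetical helper, and the chromosome sizes are toy values, not taken from any ElDorado genome build.

```python
def split_valid_regions(regions, chrom_sizes):
    """Separate regions that fit the given chromosome sizes from those that do not.

    regions: iterable of (chrom, start, end) tuples
    chrom_sizes: dict mapping chromosome name to its length in bp
    """
    valid, skipped = [], []
    for chrom, start, end in regions:
        size = chrom_sizes.get(chrom)
        # A region is kept only if its chromosome is known and it lies fully inside it.
        if size is not None and 0 <= start <= end <= size:
            valid.append((chrom, start, end))
        else:
            skipped.append((chrom, start, end))
    return valid, skipped


chrom_sizes = {"chr1": 1000}  # toy value for illustration only
regions = [("chr1", 100, 200), ("chr1", 900, 1200), ("chrZ", 1, 10)]
valid, skipped = split_valid_regions(regions, chrom_sizes)
```

In this toy input, the second region extends past the end of the chromosome and the third names an unknown chromosome, so both end up in `skipped`.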

After one or several BED files have been uploaded successfully and the popup window has been closed, the list of available BED files is updated automatically.

Uploaded BED files can be deleted from the project anytime via the project management.


Alternatively, the position data of a set of genomic elements from the ElDorado database can be selected. Available elements are:

  • Primary Transcripts
  • Promoter Regions
  • Transcriptional Start Regions based on CAGE tag annotation (TSR)
  • MicroRNAs
  • different Repeats
Partner Set(s)
(to be checked for correlations to Anchor Set)
The data for the Partner Set(s) of genomic elements can be uploaded or selected as described above for the Anchor Set.
Up to 5 Partner Sets can be selected here (this will result in 5 curves in the output graph), i.e. several BED files can be selected from the "previously uploaded"-list or several different genomic elements can be selected.
Output
Organism
If all input sets for GenomeInspector are genomic elements from ElDorado, the organism must be selected here. Otherwise, the species is inferred from the uploaded BED file(s) and the selection here is ignored. E.g. if a correlation between Primary Transcripts and MicroRNAs is analysed, the species must be given here.
Options
Distance:
This is the distance between an element in the Anchor Set and an element in the Partner Set that will be analyzed. The default is 1000 bp, resulting in a window of 2000 bp (-1000 to +1000 from the anchor position in the Anchor Set elements) being displayed in the output graph. Required calculation time increases with maximum distance.
Note that very long distances and large input sets can lead to a server timeout.

Anchor for elements of Anchor Set
Distances can be calculated using the

  • start (most 5' position)
  • middle
  • end (most 3' position)
of Anchor Set elements as the reference position. Note that "start" means the most 5' position within an element, i.e. for elements on the (-)-strand this is the higher-numbered position!
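
The three anchor choices can be sketched as follows in Python. The function name and the use of inclusive coordinates with start <= end are assumptions made for this example; in particular, how the tool rounds the middle of an element with an even number of positions is not documented here.

```python
def anchor_position(start: int, end: int, strand: str, mode: str = "start") -> int:
    """Return the reference position of an element for the given anchor mode.

    start, end: genomic coordinates with start <= end (inclusive, assumed)
    strand: "+" or "-"; anything other than "-" is treated as "+"
    mode: "start" (most 5' position), "middle", or "end" (most 3' position)
    """
    # On the (-)-strand the 5' end is the higher-numbered coordinate.
    five_prime, three_prime = (end, start) if strand == "-" else (start, end)
    if mode == "start":
        return five_prime
    if mode == "end":
        return three_prime
    if mode == "middle":
        return (start + end) // 2  # integer midpoint; rounding is an assumption
    raise ValueError(f"unknown anchor mode: {mode!r}")
```

For an element spanning 100 to 200, `anchor_position(100, 200, "-", "start")` gives 200, while the same call with strand "+" gives 100.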

Use only distinct elements from Anchor Set
If this option is activated, anchor positions of different elements in the Anchor Set that fall on the same genomic position will be counted only once.
This is important e.g. for correlations with Primary Transcripts from the database, since many alternative transcripts of a gene start at the same position (differing only in length). Without this option, each of these transcripts would contribute its own correlation, so the same partner element would be counted several times in the correlation graph.
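
A minimal Python sketch of this deduplication is shown below. It assumes that "same genomic position" means the same chromosome and the same anchor coordinate; whether the strand is also taken into account is not stated here, so it is left out of the key. The function and field layout are hypothetical.

```python
def distinct_anchors(anchors):
    """Keep only the first element for each (chromosome, anchor position) pair.

    anchors: iterable of (element_id, chrom, anchor_pos) tuples
    """
    seen = set()
    unique = []
    for element_id, chrom, pos in anchors:
        key = (chrom, pos)  # assumption: strand is not part of the key
        if key not in seen:
            seen.add(key)
            unique.append((element_id, chrom, pos))
    return unique


# Two alternative transcripts of one gene share the same start position:
transcripts = [("tx1a", "chr1", 5000), ("tx1b", "chr1", 5000), ("tx2", "chr1", 9000)]
unique = distinct_anchors(transcripts)
```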

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Colors
For each correlation graph to appear in the output (currently up to 5) a color can be selected, allowing a user-defined combination of colors.
These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Nucleotide Content
As optional additional output, the combined GC content, as well as the individual contents of A, C, G, and T can be displayed. Percentages are calculated for each position based on the alignment of elements of the Anchor Set at their anchor.

Note that nucleotide content statistics will slow down the program, especially for long distance ranges.
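
The per-position calculation can be pictured with the following Python sketch. It assumes that the sequence windows around each anchor have already been fetched and have equal length, and it ignores characters other than A, C, G, and T (such as N) when computing percentages; both points are assumptions for illustration, and `nucleotide_content` is a hypothetical name.

```python
from collections import Counter


def nucleotide_content(windows):
    """Compute A/C/G/T and GC percentages per column of anchor-aligned windows.

    windows: list of equal-length sequence strings, aligned at the anchor
    Returns one dict per position with keys "A", "C", "G", "T", "GC".
    """
    profile = []
    for column in zip(*windows):
        counts = Counter(base.upper() for base in column)
        total = sum(counts[b] for b in "ACGT")  # bases such as N are ignored
        pct = {b: (100.0 * counts[b] / total if total else 0.0) for b in "ACGT"}
        pct["GC"] = pct["G"] + pct["C"]
        profile.append(pct)
    return profile


profile = nucleotide_content(["ACGT", "AGGT"])
```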

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Result
Here, you can edit the default name of the result file.

Output

Analysis parameters

GenomeInspector Result

The correlation table summarizes the number of correlations. It lists, for each combination of Anchor Set with Partner Set,

screenshot of table

The correlation graph shows the distance distribution of elements in the Partner Set(s) to the anchor points of elements in Anchor Set. All anchor points are aligned at position 0 in the graph.
For elements in the Partner Set(s), the whole length of the elements is used for display. Therefore, a partner element of length 100 situated between 501 and 600 bp downstream of the anchor point in a correlation pair will increase the values of the correlation curve by 1 at all positions from 501 to 600.
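
The counting rule from this paragraph can be reproduced with a short Python sketch. The function name and the input layout (partner elements given as inclusive offsets relative to their anchor) are assumptions for the example; clipping at the edge of the window is likewise assumed.

```python
def correlation_curve(partner_offsets, max_distance):
    """Count, for each offset in the window, how many partner elements cover it.

    partner_offsets: list of (rel_start, rel_end) pairs, inclusive, relative
                     to the anchor at position 0
    max_distance: half-width of the window (e.g. 1000 for -1000..+1000)
    """
    counts = {d: 0 for d in range(-max_distance, max_distance + 1)}
    for rel_start, rel_end in partner_offsets:
        # Clip each element to the displayed window (assumed behaviour).
        lo = max(rel_start, -max_distance)
        hi = min(rel_end, max_distance)
        for d in range(lo, hi + 1):
            counts[d] += 1
    return counts


# The example from the text: one partner element covering +501 to +600.
curve = correlation_curve([(501, 600)], max_distance=1000)
```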

A, C, G, T, and GC contents are shown optionally in green, red, yellow, blue, and magenta.

The left ordinate shows the correlation count, and the right ordinate the optional nucleotide content.

If only one Partner Set was selected, the mean correlation count +/- standard deviation is additionally indicated by blue lines intersecting the left ordinate.
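
For values exported from the graph, comparable reference lines can be computed as in the Python sketch below. It uses the population standard deviation over all positions of the curve; whether the tool uses the population or the sample formula is not stated here, so this choice is an assumption, as is the function name.

```python
import statistics


def mean_band(counts):
    """Return (mean - sd, mean, mean + sd) for a list of correlation counts."""
    mean = statistics.mean(counts)
    sd = statistics.pstdev(counts)  # population standard deviation (assumed)
    return mean - sd, mean, mean + sd


lower, mean, upper = mean_band([2, 4, 4, 4, 5, 5, 7, 9])
```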

The button "Download Graph Values" lets you export the values from the graph in tab-separated format for import into other programs (e.g. Excel).

Extraction options

For each correlation (i.e. for each combination Anchor Set / Partner Set), tables can be extracted from the output. A distance range needs to be specified.
The information from the tables for the genomic elements in a correlation can be extracted as BED files, e.g. to be used as input for other RegionMiner tasks.

Example

Consider a correlation of transcripts (as Anchor Set, with the start of the transcript as anchor point) with uploaded regions from a ChIP-chip experiment as Partner Set elements. Setting the distance parameters for extraction to -500 to -300 allows you to extract transcripts that have a correlated ChIP-chip region overlapping the region 500 to 300 bp upstream of their TSS. Similarly, the corresponding ChIP-chip regions can be extracted.
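
The extraction in this example amounts to an interval overlap test, sketched below in Python. The identifiers, the tuple layout, and the toy offsets are hypothetical; the sketch assumes that a pair qualifies as soon as any part of the partner element falls inside the requested distance range.

```python
def overlaps_range(rel_start, rel_end, range_from, range_to):
    """True if the partner element [rel_start, rel_end] overlaps [range_from, range_to]."""
    return rel_start <= range_to and rel_end >= range_from


def extract_pairs(pairs, range_from, range_to):
    """Select correlation pairs whose partner element overlaps the distance range.

    pairs: list of (anchor_id, partner_id, rel_start, rel_end), offsets
           relative to the anchor (negative = upstream)
    """
    return [p for p in pairs if overlaps_range(p[2], p[3], range_from, range_to)]


# Toy data: offsets of ChIP-chip regions relative to each transcript start.
pairs = [
    ("tx1", "peak1", -450, -350),  # fully inside -500..-300
    ("tx2", "peak2", -250, -100),  # entirely outside
    ("tx3", "peak3", -320, -280),  # partially overlapping
]
selected = extract_pairs(pairs, -500, -300)
anchors = [p[0] for p in selected]
```

Collecting the first field of each selected pair gives the transcripts; collecting the second field gives the matching partner regions.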

Example screenshot

A corresponding correlation table might look like this:

Example table


References

GenomeInspector is described in the following publications: