Genomatix: Distance correlation of sets of genomic elements (GenomeInspector)


[Introduction] [Parameters] [Output] [References]

Introduction

This task calculates distance correlations between elements from at least two input files (called Anchor Set and Partner Set(s)), or between elements in one input file and annotated genomic elements, e.g. promoters. The output is a correlation graph showing the distribution of elements from the Partner Set(s) in correlation to the anchors of the Anchor Set. The elements of the Anchor Set are defined to be at position 0 in the graph, and for each Partner Set a separate curve will be displayed.
Additionally, the correlations as well as the elements of the sets can be extracted based on the distance between the elements. This makes it possible to extract elements that fulfill certain distance requirements from large sets.

Note: Distances are always calculated and displayed in 5' -> 3' direction for both strands.

Example screenshot


Parameters

Input data
Anchor Set
(Genomic elements used as anchor for correlation)

Input data are accepted in BED / bigBed file format or BAM file format containing the input regions. For some tasks BAM support might not be available.
The maximum amount of input regions and their maximum length can differ for the various tasks. The limits are usually shown on top of the input pages.

Within this section you can either
  • choose from previously uploaded BED/BAM files
  • or add a new BED or BAM file to the list (by clicking "Add BED/BAM file...")
For tasks that accept replicate data as input, you can use the Shift/Ctrl keys to select multiple files from the list. All selected files will then be treated as replicates.

When adding a new file, a new window will open, asking you to either

  • upload one or several BED/BAM files from your local computer
  • or import one or several BED/BAM files from the GMS (see more details)
  • or import one or several BED/BAM files from the GGA (see more details)
For new BED/BAM files, you will have to select the correct organism, as the organism and the genome build are associated with the file for future use (the default is your latest choice in the current session).
Note that files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a file. You can see the list of genomes available in ElDorado.

Note that almost all browsers have a general upload limit of 2 GB, so files bigger than this size should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.

Optionally you can specify a name for saving uploaded files on the server, otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as prefix for each file name.

If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.
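
The skipping behaviour described above can be sketched as follows. This is only an illustration, not the server's actual validation code: the function name `filter_valid_regions` and the chromosome size table are hypothetical, and BED coordinates are assumed to be 0-based with an exclusive end, as in the usual BED convention.

```python
def filter_valid_regions(bed_lines, chrom_sizes):
    """Split BED lines into regions that fit the genome and lines to skip.

    chrom_sizes maps chromosome name -> length in bp (hypothetical values).
    """
    valid, skipped = [], []
    for line in bed_lines:
        line = line.strip()
        # Ignore empty lines and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        try:
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            skipped.append(line)
            continue
        size = chrom_sizes.get(chrom)
        # A region must lie completely inside a known chromosome.
        if size is not None and 0 <= start < end <= size:
            valid.append((chrom, start, end))
        else:
            skipped.append(line)
    return valid, skipped


chrom_sizes = {"chr1": 1000}  # toy genome, for illustration only
bed_lines = [
    "track name=example",
    "chr1\t100\t200",   # valid
    "chr1\t900\t1100",  # runs past the chromosome end
    "chrZ\t1\t10",      # unknown chromosome name
]
valid, skipped = filter_valid_regions(bed_lines, chrom_sizes)
file_is_usable = bool(valid)  # a file without any valid region is skipped
```

Running a check like this locally before uploading can help to spot chromosome naming mismatches (e.g. "1" versus "chr1") early.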

Once one or several BED/BAM files have been uploaded successfully and the popup window has been closed, the list of available BED/BAM files is updated automatically.

Uploaded BED or BAM files can be deleted from the project anytime via the project management.


Alternatively, pre-defined position data of genomic elements from the ElDorado database or public sources can be selected. Available elements are:

  • Genomic elements from the ElDorado database :
    • Transcript related annotation
      • All primary transcripts from the ElDorado database
        (if this option is selected, the transcript source can also be selected, e.g. RefSeq or Ensembl transcripts only; see below)
      • All exons from all primary transcripts
      • Transcriptional start regions as defined in ElDorado
    • Promoter related annotation
      • Promoter Regions
      • PromoterInspector predictions
    • Repeats
      • All Repeats (only if primates or rodents are selected as current genome)
      • ALU/B1-Repeats (primates & rodents only)
      • THE-Repeats (primates only)
      • L1-Repeats (primates & rodents only)
  • Various data from public sources
    Please note that these options are only displayed in the selection if "human" is selected as current genome
    • ChIPSeq data from ENCODE, in supertracks or by transcription factor
    • ChIPSeq data compiled by Genomatix, in supertracks or by transcription factor
    • Evolutionary constrained elements according to Kerstin Lindblad-Toh et al., Nature 478: 476-482 (2011)
    • Chromatin state data combined into supertracks according to Jason Ernst et al., Nature 473: 43-49 (2011)
    • ENCODE Histone modification data combined into supertracks
    • ENCODE DNase hypersensitivity site data combined into supertracks
    For details on public data and further references please see this page.
Partner Set(s)
(to be checked for correlations to Anchor Set)
The data for the Partner Set(s) of genomic elements can be uploaded or selected as described above for the Anchor Set.
Up to 6 Partner Sets can be selected here (this will result in 6 curves in the output graph), i.e. several BED files can be selected from the "previously uploaded"-list or several different genomic elements can be selected.
Transcript Options
Source of transcripts
If one of the Anchor or Partner Sets is set to "transcripts" or "exons", the source of transcripts/exons used for the correlation analysis can be selected here.
By default, all non-redundant transcripts available in ElDorado are used. Depending on the organism, several transcript sources are available. For example, human and mouse transcripts are available from
  • NCBI RefSeq
  • Ensembl
  • NCBI GenBank
For plants, additional sources may be available (e.g. Phytozome for Glycine max).
Note that display of this option depends on the selection of the correlation partner sets.
Output
Range and Elements
Distance range
This is the maximum distance between an element in the Anchor Set and an element in the Partner Set that will be analyzed. The default is 1000 bp, resulting in a window of 2000 bp (-1000 to +1000 from the anchor position in the Anchor Set elements) being displayed in the output graph. Required calculation time increases with the maximum distance.
Note that very long distances and large input sets can lead to a server timeout.

Anchor for elements of Anchor Set
Distances can be calculated using the

  • start (most 5' position)
  • middle
  • end (most 3' position)
of Anchor Set elements as the reference position. Note that "start" means the most 5' position within an element, i.e. for elements on the (-)-strand this is the higher-numbered position!
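
The strand-aware choice of the reference position can be illustrated with a short sketch. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive with start <= end, and the integer rounding of "middle" is an assumption; the program itself may handle these details differently.

```python
def anchor_position(start, end, strand, mode="start"):
    """Return the reference position of an element.

    'start' is the most 5' position and 'end' the most 3' position,
    so both depend on the strand of the element.
    """
    if mode == "middle":
        return (start + end) // 2  # rounding down is an assumption
    five_prime, three_prime = (start, end) if strand == "+" else (end, start)
    if mode == "start":
        return five_prime
    if mode == "end":
        return three_prime
    raise ValueError("mode must be 'start', 'middle' or 'end'")


def signed_distance(anchor_pos, anchor_strand, partner_pos):
    """Distance from anchor to partner, measured in 5' -> 3' direction.

    Negative values are upstream, positive values downstream of the anchor.
    """
    delta = partner_pos - anchor_pos
    return delta if anchor_strand == "+" else -delta


plus_anchor = anchor_position(100, 200, "+", "start")
minus_anchor = anchor_position(100, 200, "-", "start")
downstream_on_minus = signed_distance(minus_anchor, "-", 150)
```

For the (-)-strand element the anchor is the higher-numbered coordinate, and a partner position with a lower coordinate counts as downstream.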

Use only distinct elements from Anchor Set
If this option is activated, anchor positions of different elements in the Anchor Set that fall on the same genomic position will be counted only once.
This is important e.g. for correlations with Primary Transcripts from the database, since many alternative transcripts for a gene start at the same position (only differing in length). Without this option, the same correlation would be counted once for each of these transcripts and would appear several times in the correlation graph.
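
A minimal sketch of this de-duplication, assuming the start (most 5' position) is used as anchor. The function name is hypothetical, and keying on chromosome and position without the strand is an assumption of this sketch, not a documented detail of the program.

```python
def distinct_anchor_positions(elements):
    """Collapse elements whose anchor falls on the same genomic position.

    elements: iterable of (chrom, start, end, strand) with start <= end.
    """
    seen = set()
    distinct = []
    for chrom, start, end, strand in elements:
        anchor = start if strand == "+" else end  # most 5' position
        key = (chrom, anchor)
        if key not in seen:
            seen.add(key)
            distinct.append(key)
    return distinct


# Four toy transcripts, two alternative forms per gene (made-up coordinates).
transcripts = [
    ("chr1", 100, 500, "+"),
    ("chr1", 100, 900, "+"),  # same start as the first transcript
    ("chr1", 300, 900, "-"),
    ("chr1", 700, 900, "-"),  # same 5' end as the third transcript
]
anchors = distinct_anchor_positions(transcripts)
```

Here four transcripts collapse to two anchor positions, so each correlated partner element is counted once per gene start instead of twice.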

Graphic Options
Colors
For each correlation graph to appear in the output (currently up to 6) a color can be selected, allowing a user-defined combination of colors.
These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Nucleotide Content
As optional additional output, the combined GC content as well as the individual contents of A, C, G, and T can be displayed. Percentages are calculated for each position based on the alignment of the Anchor Set elements at their anchor.

Note that nucleotide content statistics will slow down the program, especially for long distance ranges.

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
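
The per-position nucleotide percentages described under Nucleotide Content can be sketched as below. This assumes that the sequence windows around each anchor have already been fetched, oriented 5' -> 3' and cut to equal length; the function name and the handling of characters other than A, C, G, T (they are ignored) are assumptions.

```python
from collections import Counter


def nucleotide_content(aligned_windows):
    """Return one dict of percentages (A, C, G, T, GC) per window position."""
    length = len(aligned_windows[0])
    profile = []
    for i in range(length):
        column = Counter(seq[i].upper() for seq in aligned_windows)
        total = sum(column[base] for base in "ACGT")
        if total == 0:
            percentages = {base: 0.0 for base in "ACGT"}
        else:
            percentages = {base: 100.0 * column[base] / total for base in "ACGT"}
        percentages["GC"] = percentages["G"] + percentages["C"]
        profile.append(percentages)
    return profile


# Four toy windows aligned at their anchor (made-up sequences).
windows = ["ACGT", "AGGT", "ACCT", "TCGA"]
profile = nucleotide_content(windows)
```

Because every position of every window has to be read, the cost grows with both the number of anchors and the distance range, which is in line with the slowdown noted above.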
Output
Result
Here, you can edit the default name of the result file.
Email address
Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    With this option, the URL of the result is shown directly in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available. Please restart the analysis using the option "Send the URL of the result via email" below.

  • Send the URL of the result via email
    With this option, an email with the URL of the results will be sent to the email address you provide when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!

Output

Analysis parameters

GenomeInspector Result

The correlation table summarizes the number of correlations. It lists, for each combination of Anchor Set with Partner Set,

screenshot of table

The correlation graph shows the distance distribution of elements in the Partner Set(s) to the anchor points of elements in Anchor Set. All anchor points are aligned at position 0 in the graph.
For elements in the Partner Set(s), the whole length of the elements is used for display. Therefore, a partner element of length 100 situated between 501 and 600 bps downstream of the anchor point in a correlation pair will increase the values of the correlation curve by 1 at all positions from 501 to 600.
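
The counting rule for the correlation curve can be illustrated with a small sketch. It assumes that partner elements are already expressed as offsets relative to the anchor (in 5' -> 3' direction, inclusive on both ends) and that elements reaching beyond the distance range are simply clipped; the function name is hypothetical.

```python
def correlation_curve(partner_offsets, max_distance=1000):
    """Count, for each offset in [-max_distance, +max_distance], how many
    partner elements cover that offset.

    partner_offsets: list of (rel_start, rel_end) relative to the anchor.
    Index i of the result corresponds to offset i - max_distance.
    """
    counts = [0] * (2 * max_distance + 1)
    for rel_start, rel_end in partner_offsets:
        low = max(rel_start, -max_distance)
        high = min(rel_end, max_distance)
        for offset in range(low, high + 1):
            counts[offset + max_distance] += 1
    return counts


# One partner element of length 100, located 501 to 600 bp downstream.
curve = correlation_curve([(501, 600)], max_distance=1000)
value_at_550 = curve[1000 + 550]
value_at_500 = curve[1000 + 500]
```

The single element raises the curve by 1 at exactly 100 positions (offsets 501 to 600) and leaves all other positions at 0.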

A, C, G, T, and GC contents are shown optionally in green, red, yellow, blue, and magenta.

The left ordinate shows the correlation count, and the right ordinate the optional nucleotide content.

If only one Partner Set was selected, additionally the mean correlation count +/- standard deviation is indicated by blue lines intersecting the left ordinate.

The button "Download Graph Values" lets you export the values from the graph in tab-separated format for import into other programs (e.g. Excel).

Extraction options

For each correlation (i.e. for each combination Anchor Set / Partner Set), tables can be extracted from the output. A distance range needs to be specified.
The table with e.g. the correlation information contains
The information from the tables for the genomic elements in a correlation can be extracted as BED files, e.g. to be used as input for other tasks.
When GeneIds are shown in the tables (i.e. when one of the input sets was "Primary Transcripts"), all GeneIds can be extracted as a list or can be used to directly start GePS (Genomatix Pathway System).

Example

Consider a correlation of transcripts (as Anchor Set, with the start of the transcript as anchor point) with uploaded regions from a ChIP-Chip experiment as Partner Set elements. Setting the distance parameters for extraction to -300 to -100 allows you to extract transcripts that have a correlated ChIP-Chip region overlapping with the region 300 to 100 bp upstream of their TSS. Similarly, the corresponding ChIP-Chip regions can be extracted.
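
The extraction in this example boils down to an interval overlap test, sketched below. The function names and transcript identifiers are hypothetical, and partner regions are assumed to be given as offsets relative to the TSS (negative values upstream).

```python
def overlaps_window(rel_start, rel_end, window_from, window_to):
    """True if the partner region shares at least one position with the window."""
    return rel_start <= window_to and rel_end >= window_from


def extract_anchors(correlations, window_from=-300, window_to=-100):
    """Return the anchors that have at least one partner region in the window.

    correlations: dict mapping an anchor id to a list of (rel_start, rel_end).
    """
    return [
        anchor_id
        for anchor_id, partners in correlations.items()
        if any(overlaps_window(s, e, window_from, window_to) for s, e in partners)
    ]


# Made-up ChIP-Chip regions relative to three transcript start sites.
correlations = {
    "TX1": [(-250, -150)],  # fully inside the window
    "TX2": [(-50, 80)],     # too close to the TSS
    "TX3": [(-400, -290)],  # overlaps the upstream edge of the window
}
extracted = extract_anchors(correlations)
```

With the window -300 to -100, the sketch keeps TX1 and TX3 because their regions overlap the window, while TX2 is dropped.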

Example screenshot

A corresponding correlation table might look like this:

Example table

Using the 'GenomeBrowser' link in any of the rows will open a new GenomeBrowser window showing the region listed in the 'Symbol / GeneID / ...' column. If you used BED files in your correlation they will automatically be pre-loaded as tracks:


References

GenomeInspector is described in the following publications: