
Annotation and Statistics



Introduction

This task generates annotation statistics and annotation data for chromosomal regions in BED, bigBed, or BAM format.

This program is highly configurable (see Analysis Options below): users can get anything from a very quick impression of the data (e.g. by using the classification option only) to a very detailed look at each input region (using the next neighbour and detailed overlap options).

Depending on the parameters, an overview table shows the number and percentage of input regions overlapping with genomic features like loci, exons, introns, repeats, microRNAs etc. (all data from ElDorado).
Additionally, details for each input region or a selected subset of input regions can be viewed, including information like the flanking genes ("next transcripts"), extent of promoter overlap or exon/intron overlap. This data can also be exported to a Microsoft Excel™ file.
Optionally, a MatInspector search shows TFBS match numbers for each of the regions or the presence of certain transcription factor binding sites within the input region (for large scale TF analysis please see the task Overrepresented TFs).

Note: here, overlap is defined in a strand-independent manner, i.e. if an input region overlaps with e.g. a repeat on the opposite strand, it is counted as overlapping with a repeat.
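
A minimal sketch of such a strand-independent overlap test, assuming BED-style coordinates (0-based start, exclusive end). The `Region` type and the `overlaps` function are hypothetical names used only for illustration; they are not part of the Genomatix software:

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int         # 0-based, inclusive (assumed BED convention)
    end: int           # exclusive
    strand: str = "."  # stored, but deliberately ignored by overlaps()


def overlaps(a: Region, b: Region) -> bool:
    """Return True if a and b share at least one base, regardless of strand."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


input_region = Region("chr1", 100, 200, "+")
repeat = Region("chr1", 150, 250, "-")  # annotated on the opposite strand
print(overlaps(input_region, repeat))   # True
```

Because the strand field never enters the comparison, a feature on the opposite strand counts as overlapping, as described in the note above.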


Parameters

Input

Input data are accepted in BED/bigBed or BAM file format. For some tasks, BAM support might not be available.
The maximum number of input regions and their maximum length can differ between tasks. The limits are usually shown at the top of the input pages.

Within this section you can either
  • choose from previously uploaded BED/BAM files
  • or add a new BED or BAM file to the list (by clicking "Add BED/BAM file...")
For tasks that allow replicate data as input, you can use the Shift/Ctrl keys to select multiple files from the list. All selected files will then be treated as replicates.

When adding a new file, a new window will open, asking you to either

  • upload one or several BED/BAM files from your local computer
  • or import one or several BED/BAM files from the GMS (see more details)
  • or import one or several BED/BAM files from the GGA (see more details)
For new BED/BAM files, you will have to select the correct organism, as the organism and the genome build are associated with the file for future use (the default is your latest choice in the current session).
Note that files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a file. You can see the list of genomes available in ElDorado.

Note that almost all browsers have a general upload limit of 2 GB, i.e. files larger than this should be compressed (zipped) before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.

Optionally you can specify a name for saving uploaded files on the server, otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as prefix for each file name.

If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.

Once one or several BED/BAM files have been uploaded successfully and the popup window has been closed, the list of available BED/BAM files is updated automatically.

Uploaded BED or BAM files can be deleted from the project anytime via the project management.

Example input:

track description="sample treatment-control analysis with 3 treatments and 3 controls"
chr1       26519270        26519623
chr1       39723904        39724119
chr2       10841542        10841853
chr2       88937859        88938309
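
As an illustration, input like the example above could be read with a short script. This is only a sketch: `parse_bed` is a hypothetical helper, it reads just the first three BED columns, and it assumes that header lines start with `track`, `browser`, or `#`:

```python
EXAMPLE = """\
track description="sample treatment-control analysis with 3 treatments and 3 controls"
chr1       26519270        26519623
chr1       39723904        39724119
chr2       10841542        10841853
chr2       88937859        88938309
"""


def parse_bed(text):
    """Return a list of (chromosome, start, end) tuples from BED-formatted text."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines and header/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()  # BED fields are separated by tabs or spaces
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


for chrom, start, end in parse_bed(EXAMPLE):
    print(chrom, start, end)
```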

Statistics
Statistics and Classification
When this option is selected, general statistics for the input BED/bigBed file are computed (example output below):
  • Total number of Regions
  • Total basepairs
  • Minimum Region length
  • Maximum Region length
  • Average Region length
Also, a classification of input regions as exons, introns, promoters or intergenic is done, i.e. the percentage of regions overlapping with these classes is given in the output.
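
The general statistics listed above can be sketched as follows. The function `region_statistics` is a hypothetical helper, and the region length is assumed to be `end - start` (the BED convention of a 0-based start and an exclusive end); the task itself may count positions differently:

```python
def region_statistics(regions):
    """Compute general statistics for a list of (chromosome, start, end) tuples."""
    if not regions:
        raise ValueError("at least one region is required")
    # Assumption: BED-style half-open coordinates, so length = end - start.
    lengths = [end - start for _chrom, start, end in regions]
    return {
        "total_regions": len(lengths),
        "total_basepairs": sum(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "average_length": sum(lengths) / len(lengths),
    }


regions = [
    ("chr1", 26519270, 26519623),
    ("chr1", 39723904, 39724119),
    ("chr2", 10841542, 10841853),
    ("chr2", 88937859, 88938309),
]
stats = region_statistics(regions)
print(stats["total_regions"], stats["total_basepairs"])  # 4 1329
print(stats["min_length"], stats["max_length"])          # 215 450
print(stats["average_length"])                           # 332.25
```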

If the detailed classification for each input region is of interest, the checkbox at "Include this classification for each input region in the output" can be checked. The resulting tab-separated file can then be downloaded from the output page (see below for format details).

If only the statistics are selected (and no detailed analysis below) the Annotation & Statistics task is very fast.
Note: Classification of regions is only available for ElDorado versions ≥ "ElDorado 07-2008".

Analysis Options
Detailed Region Analysis
Please note that these detailed analysis options are limited to a maximum of 300000 input regions of at most 250000 bp each, since they are more time-consuming than the general statistics option above.
  • Optionally, a next neighbor analysis can be performed. If selected, the next up- and downstream transcripts/genes for each input region are searched and are displayed in a table together with the distance to the neighboring transcript start.

  • Additionally, each input region can be checked in detail for overlap with various genomic elements.
    • MicroRNAs
    • Transcriptional Start Regions
    • Exons/Introns
    • Promoters
    • Repeats
    Here, in contrast to the statistics analysis above, the analysis determines exactly which elements each region overlaps with and to what extent (percentage).
    The more elements are selected here, the longer the analysis will take.
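
The limits and the overlap extent described above can be sketched as follows. Both functions are hypothetical, and the sketch assumes half-open coordinates and that the percentage refers to the fraction of the input region covered by the element; the help text does not state which length is used as the reference:

```python
MAX_REGIONS = 300_000        # limit on the number of input regions (from the text above)
MAX_REGION_LENGTH = 250_000  # limit on the length of each region in bp


def within_limits(regions):
    """Check a list of (start, end) pairs against the detailed-analysis limits."""
    return len(regions) <= MAX_REGIONS and all(
        end - start <= MAX_REGION_LENGTH for start, end in regions
    )


def overlap_percentage(region_start, region_end, elem_start, elem_end):
    """Percentage of the input region that is covered by the genomic element."""
    overlap = min(region_end, elem_end) - max(region_start, elem_start)
    if overlap <= 0:
        return 0.0  # the intervals do not overlap
    return 100.0 * overlap / (region_end - region_start)


print(within_limits([(0, 1_000), (5_000, 300_000)]))  # False: second region is too long
print(overlap_percentage(100, 200, 150, 300))         # 50.0
```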
TF Analysis
The sequences of the input regions can be analyzed with MatInspector to show transcription factor binding site numbers in the output.
  • Search for specific transcription factor binding sites:
    Match numbers for each selected TFBS family will be shown for each sequence, as well as in the overlap statistics table. A maximum of 10 different matrix families can be selected from the list.

  • Library
    Here you can select a previous version of the matrix library. This can be helpful for reproducing old results. By default, the latest matrix library is selected (please see the Library Statistics and the Library Release Notes).

  • Matrix similarity
    The matrix similarity is calculated as described in the MatInspector papers.
    A perfect match to the matrix gets a score of 1.00 (each sequence position corresponds to the most highly conserved nucleotide at that position in the matrix); a "good" match to the matrix usually has a similarity of > 0.80. Mismatches in highly conserved positions of the matrix decrease the matrix similarity more than mismatches in less conserved regions.
    Increasing the matrix similarity will find fewer matches in your sequence, and might miss matches that do have a "mismatch" compared to the matrix.
    Decreasing the matrix similarity will find more matches in your sequence.
    Optimized matrix similarity: Thresholds that minimize false positives for each individual matrix are supplied with our library and can be selected from the pull-down menu (example).

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
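
The effect of the matrix similarity threshold on match numbers can be illustrated with a small sketch. It does not compute matrix similarities (that calculation is described in the MatInspector papers); the matches, the family names, and the `count_matches` helper are invented for illustration, and treating the threshold as inclusive is an assumption:

```python
from collections import Counter

# Hypothetical matches: (matrix family, position in sequence, matrix similarity)
matches = [
    ("FAMILY_A", 12, 0.97),
    ("FAMILY_A", 85, 0.82),
    ("FAMILY_B", 40, 0.78),
    ("FAMILY_B", 131, 0.91),
]


def count_matches(matches, min_similarity):
    """Count matches per family with a similarity of at least min_similarity."""
    return Counter(
        family for family, _position, similarity in matches
        if similarity >= min_similarity
    )


print(sum(count_matches(matches, 0.75).values()))  # 4
print(sum(count_matches(matches, 0.90).values()))  # 2
```

Raising the threshold from 0.75 to 0.90 drops the two weaker matches, which mirrors the trade-off described above.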
Output
Result
Here, you can edit the default name of the result file.
Email address
Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    With this option, the URL of the result is shown directly in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case the results will not be available; please restart the analysis using the option "Send the URL of the result to" below.

  • Send the URL of the result via email
    With this option, an email with the URL of the results will be sent to the email address provided by the user when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!

We recommend using the email option for more than ca. 1,000 input regions.


Output

The output has up to four sections (depending on the analysis options).

  1. Analysis Parameters (always shown)
  2. Statistics and Region Classification (if region classification option was selected)
  3. Overlap Statistics (for overlap with genomic elements and for transcription factor binding site search)
  4. Annotation of Regions (for next neighbor analysis, overlap with genomic elements and transcription factor binding site search)
    In this section, a number of subsequent viewing and analysis options are available for selected subsets or all regions (see below).

1. Analysis Parameters

[Image: analysis parameters]

2. Statistics and Region Classification

In the general statistics table, the number of input regions, their total length in basepairs as well as the minimum, maximum, and average region length are listed.

If region classification has been selected, a table with the number of regions contained in genomic elements is given. Each input region is assigned to exactly one of the classes intergenic, exon, intron, or partial (overlapping with an exon). In addition to the above, a region can be classified as promoter.
A third table shows the distribution of the regions across the chromosomes of the genome. The content of this table is hidden by default, but can be shown by clicking the "Show details" link in the header.

[Image: general statistics]

If the detailed classification for each input region was selected, the classification details can be downloaded as a tab-separated file.

The file contains 7 columns (tab-separated) for each region:

  1 : read id
  2 : contig/chromosome accession number
  3 : chromosome
  4 : strand
  5 : start position of the read
  6 : end position of the read
  7 : genomic elements the read is associated with
      intergenic (intergenic region)
      exon
      intron
      partial (overlapping with exon)
      promoter
      An individual read is assigned to exactly one of the four classes intergenic, exon, intron, partial and can additionally be assigned to the class promoter.

1428	NC_000001	chr1	0	1348237	1348455	intergenic
1429	NC_000001	chr1	0	2311642	2311953	promoter intron
1430	NC_000001	chr1	0	2450272	2450768	exon
1431	NC_000001	chr1	0	2469265	2469512	intergenic
1432	NC_000001	chr1	0	3556424	3556796	promoter partial
1433	NC_000001	chr1	0	3614623	3614831	exon
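
A downloaded classification file with the seven columns described above could be processed along these lines. This is a sketch only: `parse_classification` and the column keys are hypothetical names, and the example rows are copied from the listing above:

```python
from collections import Counter

COLUMNS = ("read_id", "accession", "chromosome", "strand", "start", "end", "elements")

EXAMPLE = (
    "1428\tNC_000001\tchr1\t0\t1348237\t1348455\tintergenic\n"
    "1429\tNC_000001\tchr1\t0\t2311642\t2311953\tpromoter intron\n"
    "1430\tNC_000001\tchr1\t0\t2450272\t2450768\texon\n"
    "1431\tNC_000001\tchr1\t0\t2469265\t2469512\tintergenic\n"
    "1432\tNC_000001\tchr1\t0\t3556424\t3556796\tpromoter partial\n"
    "1433\tNC_000001\tchr1\t0\t3614623\t3614831\texon\n"
)


def parse_classification(text):
    """Parse the 7-column tab-separated classification file into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        # Column 7 may hold several classes separated by spaces, e.g. "promoter intron".
        row["elements"] = row["elements"].split()
        rows.append(row)
    return rows


rows = parse_classification(EXAMPLE)
class_counts = Counter(element for row in rows for element in row["elements"])
print(class_counts["exon"], class_counts["promoter"])  # 2 2
```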

3. Overlap Statistics

The table can be expanded/collapsed to show the detailed numbers of overlaps with exons and introns. By clicking on the table headers you can change the sort order within the table and sort by different columns. These features work only if JavaScript is enabled in your browser.

[Image: overlap statistics]

4. Annotation of Regions

Selection of Regions for further analysis

Subsets of regions can be selected for the following tasks. The selection criteria available depend on the analysis options selected by the user.

The selection can also be inverted, thus allowing e.g. extraction of all regions not overlapping with any exons (by selecting "overlap with at least one exon" and then "invert selection").

If nothing is selected here, all regions are used for the tasks below.

[Image: region selection]

Available tasks for selected regions

  1. Export regions to BED file format
  2. Download Details in EXCEL format (only if fewer than 65536 regions)
  3. Download Details in tab-separated format
  4. Show table with details for all regions
  5. Extract GeneIDs of genes whose promoter overlaps with the regions
    (only available if promoter overlap option is checked)
  6. Extract GeneIDs of neighboring genes with a certain maximum distance to the selected regions
    (only available if next neighbor analysis is checked)

These tasks are described below:

Export regions to BED file format

The content of the exported BED file will look like this:

[Image: content of an exported BED file]

Download Details in EXCEL / tab-separated format

This will extract detailed information for the selected regions into a Microsoft Excel™ file (only if fewer than 65536 regions were selected) or into a file in tab-separated format.
The information contained in the extracted file depends on the user selection. The file in tab-separated format can be opened, e.g., with a text editor or with newer Excel versions.

The content of the exported annotation data table will look like this in Microsoft Excel™:

[Image: annotation data table in Microsoft Excel™]

Show table with details for each region

This will open a new page containing a table with details for each region selected, together with extensive selection and extraction options. Which data the table contains depends on the user selection.

Note: The definition of next flanking transcript is as follows:

the next transcript starting up/downstream (+/-) of the input region and, for upstream transcripts, those that "come closest" to the region (i.e. end next to the region or even overlap)

Gene names printed in bold indicate an overlap with the input region, i.e. the region overlaps either with at least one exon or at least one intron of this locus.

A detailed MatInspector analysis of any region can be started by clicking the respective Start MatInspector link in the output.
Gene information overview is accessible for each flanking gene via GeneID links.
The ElDorado link will start an ElDorado analysis, e.g. to view a graphical display of the input region and the surrounding genomic region.

[Image: annotation table]

Regions can be selected here by the criteria described above, and additionally the user can select or deselect single regions via the checkboxes.

Data export options are:

All export options refer only to those regions that are selected in the annotation table above, i.e. all input regions or certain subsets can be exported. If the sequence export options are selected, up to 3000 bp flanking the input region can be extracted.

Sequences in GenBank or FASTA format can be saved to a local file system or in your personal sequence directory.

[Image: sequence export in FASTA format]

Extract GeneIDs of genes whose promoter overlaps with the regions

A list of GeneIDs is extracted and can be downloaded. The GeneIDs belong to genes whose promoter was found to overlap with the input regions. This exported list can be used for further analysis, e.g. with GeneRanker or GePS.

Extract GeneIDs of neighboring genes with a certain maximum distance to the selected regions

A sorted list of GeneIDs is extracted and can be downloaded. The GeneIDs belong to genes that neighbor the input regions. The maximum distance of the genes to extract can be defined by the user.
If the option "keep region assignment" is selected here, the output will also contain the region belonging to each GeneID, e.g.

780 Region_4
2968 Region_4
3796 Region_3
6242 Region_1
7841 Region_1
9235 Region_5

Note that only the nearest neighbour is extracted, and the distance is calculated using the next transcript start.
The exported list (without region assignment) can be used for further analysis, e.g. with GeneRanker or GePS.
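
A list exported with the "keep region assignment" option could be grouped by region as sketched below. The helper `genes_by_region` is hypothetical, and the sketch assumes that each line holds a GeneID and a region name separated by whitespace, as in the example above:

```python
from collections import defaultdict

EXAMPLE = """\
780 Region_4
2968 Region_4
3796 Region_3
6242 Region_1
7841 Region_1
9235 Region_5
"""


def genes_by_region(text):
    """Group GeneIDs by the region they were assigned to."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        gene_id, region = line.split()
        grouped[region].append(gene_id)
    return dict(grouped)


for region, gene_ids in sorted(genes_by_region(EXAMPLE).items()):
    print(region, ",".join(gene_ids))
```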