Annotation and Statistics
This task generates annotation statistics and annotation data for chromosomal regions
in BED, bigBed, or BAM format.
This program is highly configurable (see
Analysis Options below),
allowing anything from a very quick impression of the data (e.g. by using the classification option only)
to a very detailed look at each input region (using the next neighbour and detailed overlap options).
Depending on the parameters, an overview table shows the number and percentage of input regions overlapping
with genomic features like
loci, exons, introns, repeats, microRNAs etc. (all data from
ElDorado).
Additionally, details for each input region or a selected subset of input regions can be viewed, including
information like the flanking genes ("next transcripts"), extent of promoter overlap or exon/intron overlap.
This data can also be exported to a Microsoft Excel™ file.
Optionally, a MatInspector search shows TFBS match numbers for each of the regions or
the presence of certain transcription factor binding sites within the input region
(for large scale TF analysis please see the task
Overrepresented TFs).
Note: here, overlap is defined in a strand-independent manner,
i.e. if an input region overlaps with e.g. a repeat on opposite strand, it is counted as overlapping with a repeat.
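The strand-independent overlap rule in the note above can be illustrated with a minimal sketch. The `Interval` type and the function name are hypothetical, and half-open (BED-style) coordinates are an assumption; this is not the tool's actual implementation.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical record for a genomic interval (assumed half-open, BED-style)."""
    chrom: str
    start: int   # first base, 0-based
    end: int     # one past the last base
    strand: str  # "+", "-", or "." if unknown


def overlaps_ignoring_strand(a: Interval, b: Interval) -> bool:
    """Return True if a and b share at least one base on the same chromosome.

    The strand field is deliberately not compared, mirroring the
    strand-independent definition described in the text.
    """
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


region = Interval("chr1", 100, 200, "+")
repeat = Interval("chr1", 150, 300, "-")  # opposite strand
print(overlaps_ignoring_strand(region, repeat))  # True
```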
Input |
Input |
Input data are accepted in
BED / bigBed file format or
BAM file format containing the input regions.
For some tasks BAM support might not be available.
The maximum number of input regions and their maximum length can differ for the various tasks.
The limits are usually shown on top of the input pages.
Within this section you can either
- choose from previously uploaded BED/BAM files
- or add a new BED or BAM file to the list (by clicking "Add BED/BAM file...")
For those tasks that allow replicate data to be chosen as input, you can use the shift/ctrl keys to select multiple files
from the list. All selected files will then be treated as replicates.
When adding a new file, a new window will open, asking you to either
- upload one or several BED/BAM files from your local computer
- or import one or several BED/BAM files from the GGA (see more details)
For the new BED/BAM files, you will have to select the correct organism, as the
organism and the genome build are associated with the BED file for future use
(the default is your latest choice in the current session).
Note that files critically depend on the underlying genome build,
which can be changed by selecting a different ElDorado version on the top right of the page
before uploading a file.
You can see the list of genomes available in ElDorado.
Note that almost all browsers have a general upload limit of 2 GB,
i.e. files bigger than this size should be zipped before uploading from your local computer.
This restriction does not apply when using the direct import from the GGA.
Optionally you can specify a name for saving uploaded files on the server,
otherwise the name of the uploaded file will be used.
If several files are uploaded, the string given here will be used as prefix for each file name.
If any of the regions in the input file cannot be completely assigned to the selected genome
(e.g. wrong chromosome numbering or wrong positions within a chromosome),
an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file,
the complete file will be skipped.
After one or several BED/BAM files were uploaded successfully, and after closing the popup window,
the list of available BED/BAM files will be automatically updated.
Uploaded BED or BAM files can be deleted from the project anytime via the
project management.
Example input:
track description="sample treatment-control analysis with 3 treatments and 3 controls"
chr1 26519270 26519623
chr1 39723904 39724119
chr2 10841542 10841853
chr2 88937859 88938309 |
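To check a file locally before uploading, the example input above can be read with a few lines of Python. This is only a sketch: the function name `parse_bed` is hypothetical, only the first three BED columns are read, and no validation against a genome build is attempted.

```python
from typing import List, Tuple

EXAMPLE_BED = """\
track description="sample treatment-control analysis with 3 treatments and 3 controls"
chr1 26519270 26519623
chr1 39723904 39724119
chr2 10841542 10841853
chr2 88937859 88938309
"""


def parse_bed(text: str) -> List[Tuple[str, int, int]]:
    """Return (chromosome, start, end) tuples from BED-formatted text.

    Track, browser, and comment lines are skipped; columns may be separated
    by tabs or spaces. Extra BED columns (name, score, strand) are ignored.
    """
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


for chrom, start, end in parse_bed(EXAMPLE_BED):
    print(chrom, start, end)
```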
Transcript Options |
Source of transcripts |
The source of the transcripts used for the annotation analysis can be selected here. By default, all non-redundant transcripts available in ElDorado are used. Depending on the organism, several transcript sources are available. For example, human and mouse transcripts are available from
- NCBI RefSeq
- Ensembl
- NCBI GenBank
For plants, additional sources may be available (e.g. Phytozome for Glycine max).
|
Statistics |
Statistics and Classification |
When this option is selected, general statistics values for the
input BED/bigBed file are computed (example output below):
- Total number of Regions
- Total basepairs
- Minimum Region length
- Maximum Region length
- Average Region length
Also, a classification of input regions as exons, introns, promoters or
intergenic is done, i.e. the percentage of regions overlapping
with these classes is given in the output.
If the detailed classification for each input region is of interest, the checkbox at
"Include this classification for each input region in the output" can be checked. The
resulting tab-separated file can then be downloaded from the output page
(see below for format details).
If only the statistics are selected (and no detailed analysis below)
the Annotation & Statistics task is very fast.
Note: Classification of regions is only available for ElDorado versions ≥ "ElDorado 07-2008".
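The general statistics listed above can be reproduced for a small input with the following sketch. It assumes BED-style half-open coordinates, so a region's length is `end - start`; the tool itself may count lengths differently. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def region_statistics(regions: List[Tuple[str, int, int]]) -> Dict[str, float]:
    """Compute count, total, minimum, maximum, and average region length."""
    if not regions:
        raise ValueError("at least one region is required")
    lengths = [end - start for _chrom, start, end in regions]
    return {
        "total_regions": len(lengths),
        "total_bp": sum(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
    }


# The four regions from the example input file.
regions = [
    ("chr1", 26519270, 26519623),
    ("chr1", 39723904, 39724119),
    ("chr2", 10841542, 10841853),
    ("chr2", 88937859, 88938309),
]
print(region_statistics(regions))
```

Under the half-open assumption, the four example regions have lengths 353, 215, 311, and 450 bp, giving a total of 1329 bp and an average of 332.25 bp.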
|
Analysis Options |
Detailed Region Analysis |
Please note that these detailed analysis options
are limited to max. 300000 input regions with at most 250000 bp each,
since they are more time-consuming than the general statistics option above.
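Whether an input set stays within these limits can be checked before submission. The sketch below uses the two limits quoted above; the function name is hypothetical and region length is again assumed to be `end - start`.

```python
from typing import List, Tuple

MAX_REGIONS = 300_000        # maximum number of input regions (from the text)
MAX_REGION_LENGTH = 250_000  # maximum length per region in bp (from the text)


def check_detailed_analysis_limits(regions: List[Tuple[str, int, int]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the limits are met."""
    problems = []
    if len(regions) > MAX_REGIONS:
        problems.append(f"{len(regions)} regions exceed the limit of {MAX_REGIONS}")
    for chrom, start, end in regions:
        if end - start > MAX_REGION_LENGTH:
            problems.append(
                f"{chrom}:{start}-{end} is longer than {MAX_REGION_LENGTH} bp"
            )
    return problems


print(check_detailed_analysis_limits([("chr1", 0, 250_000), ("chr2", 0, 250_001)]))
```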
- Optionally, a
next neighbor analysis can be performed. If selected, the next up-
and downstream transcripts/genes for each input region
are searched and are displayed in a table together with
the distance to the neighboring transcript start.
- Additionally, each input region can be checked in detail for overlap with
various genomic elements.
- MicroRNAs
- Transcriptional Start Regions
- Exons/Introns
- Promoters
- Repeats
Here (in contrast to the statistics analysis above) the regions are checked
with which exact elements they overlap and to which extent (percentage).
The more elements are selected here, the longer the analysis will take.
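The extent of an overlap, expressed as a percentage, can be sketched as follows. The text does not state whether the percentage refers to the input region or to the genomic element; this example assumes it is relative to the input region and uses half-open coordinates. The function name is hypothetical.

```python
def overlap_percent(region_start: int, region_end: int,
                    elem_start: int, elem_end: int) -> float:
    """Percentage of the input region covered by a genomic element.

    Assumes both intervals lie on the same chromosome and that
    region_end > region_start. Returns 0.0 if they do not overlap.
    """
    overlap = min(region_end, elem_end) - max(region_start, elem_start)
    if overlap <= 0:
        return 0.0
    return 100.0 * overlap / (region_end - region_start)


print(overlap_percent(100, 200, 150, 300))  # 50.0
print(overlap_percent(100, 200, 300, 400))  # 0.0
```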
|
TF Analysis |
The sequences of the input regions can be analyzed with
MatInspector to show transcription factor binding site numbers in the output.
Search for specific transcription factor binding sites:
Match numbers for each selected TFBS family will be shown for each sequence,
as well as in the overlap statistics table.
A maximum of 10 different matrix families can be selected from the list.
- Library
Here you can select a previous version of the matrix library.
This can be helpful for reproducing old results. By default, the latest matrix library is selected
(please see the Library Statistics and the Library Release Notes).
- Matrix similarity
The matrix similarity is calculated as described in the
MatInspector papers.
A perfect match to the matrix gets a score of 1.00 (each sequence position corresponds to the highest conserved nucleotide at
that position in the matrix), a "good" match to the matrix usually has a similarity of > 0.80. Mismatches in highly conserved positions of the matrix decrease the matrix similarity more than mismatches in less conserved regions.
Increasing the matrix similarity threshold will find fewer matches in your sequence, but might miss matches that do have a "mismatch" compared to the matrix.
Decreasing the matrix similarity will find more matches in your sequence.
Optimized matrix similarity: Thresholds that minimize false positives for each individual matrix are supplied with our library and can be selected from the pull-down menu (example).
These parameters are hidden by default. Clicking the corresponding icon will reveal them.
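The effect of the matrix similarity threshold on the number of reported matches can be illustrated with made-up scores. The sketch below does not compute matrix similarity itself (see the MatInspector papers for that); it only shows how raising the threshold reduces the match count. The scores and the function name are hypothetical.

```python
from typing import Iterable


def count_matches(scores: Iterable[float], threshold: float) -> int:
    """Count candidate matches whose similarity reaches the threshold."""
    return sum(1 for score in scores if score >= threshold)


# Hypothetical matrix similarity scores for candidate sites in one sequence.
scores = [1.00, 0.95, 0.86, 0.82, 0.78]

for threshold in (0.75, 0.80, 0.90):
    print(threshold, count_matches(scores, threshold))
```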
|
Output |
Result |
Here, you can edit the default name of the result file. |
Email address |
Here you can choose between two methods for receiving
the results:
- Show result directly in browser window
With this option, the URL of the result is shown directly in your browser
window.
Warning: Please use this option
only for analyses which can be performed in a short time.
If the analysis takes longer than the timeout of the webserver, the
connection will be terminated and you will receive an error message
(e.g. "The document contained no data."). In this case, the results will
not be available; please restart the analysis using the option
"Send the URL of the result via email" below.
- Send the URL of the result via email
With this option, an email with the URL of the results will be sent
to the email address provided by the user when the analysis is finished.
The results will be available for a limited time on our server.
For details of how long your results will be kept please see the result-email.
After that period they will be deleted unless protected in the project management!
We recommend using the email option for more than approx. 1,000 input regions. |
The output has up to five sections (depending on the analysis options).
- Analysis Parameters
(always shown)
- Statistics and Region Classification
(if region classification option was selected)
- Overlap Statistics
(for overlap with genomic elements and for transcription factor binding site search)
- Annotation of Regions
(for next neighbor analysis, overlap with genomic elements and transcription factor binding site search)
In this section, a number of
subsequent viewing and analysis options are available for selected subsets or all regions
(see below).
- Name of input file and selected species
- The ElDorado version used
- Result name
- The selected analysis options (e.g. list of genomic elements that were checked for overlap)
- MatInspector settings (if available)

In the general statistics table, the number of input regions, their total length in basepairs as well as the minimum,
maximum, and average region length are listed.
In case the region classification has been checked, a table with the number
of regions contained in genomic elements is given.
Each input region is classified either as
- exonic, complete or
- exonic, partial (i.e. overlapping with an exon) or
- intronic, complete or
- intergenic region
In addition to the above, a region can be classified as promoter.
A third table shows the distribution of the regions on the different
chromosomes of the genome. The content of this table is hidden by
default, but can be shown by clicking the "Show details" link in the
header.

If the detailed classification for each input region was selected,
the classification details can be downloaded as
tab-separated file.
The file contains 7 columns (tab-separated) for each region:
1 : read id
2 : contig/chromosome accession number
3 : chromosome
4 : strand
5 : start position of the read
6 : end position of the read
7 : genomic elements the read is associated with
intergenic (intergenic region)
exon
intron
partial (overlapping with exon)
promoter
An individual read is assigned to one of the four classes
intergenic, exon, intron, partial and can be assigned to
the class promoter in addition.
1428 NC_000001 chr1 0 1348237 1348455 intergenic
1429 NC_000001 chr1 0 2311642 2311953 promoter intron
1430 NC_000001 chr1 0 2450272 2450768 exon
1431 NC_000001 chr1 0 2469265 2469512 intergenic
1432 NC_000001 chr1 0 3556424 3556796 promoter partial
1433 NC_000001 chr1 0 3614623 3614831 exon
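A downloaded classification file can be summarised with a short script. The sketch below splits each line on any whitespace, so it works whether columns are separated by tabs or spaces, and treats everything from column 7 onward as class labels; how multiple labels are delimited inside column 7 in the real file is an assumption. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

SAMPLE = """\
1428 NC_000001 chr1 0 1348237 1348455 intergenic
1429 NC_000001 chr1 0 2311642 2311953 promoter intron
1430 NC_000001 chr1 0 2450272 2450768 exon
1431 NC_000001 chr1 0 2469265 2469512 intergenic
1432 NC_000001 chr1 0 3556424 3556796 promoter partial
1433 NC_000001 chr1 0 3614623 3614831 exon
"""


def parse_classification(text: str) -> List[Dict[str, object]]:
    """Parse classification lines into dictionaries following the 7-column layout."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or incomplete lines
        records.append({
            "read_id": fields[0],
            "accession": fields[1],
            "chromosome": fields[2],
            "strand": fields[3],
            "start": int(fields[4]),
            "end": int(fields[5]),
            "classes": fields[6:],  # e.g. ["promoter", "intron"]
        })
    return records


def class_counts(records: List[Dict[str, object]]) -> Counter:
    """Count how many regions carry each class label."""
    return Counter(label for rec in records for label in rec["classes"])


print(class_counts(parse_classification(SAMPLE)))
```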
- Number and percentages of regions overlapping with annotation from our genome database ElDorado
(promoters, TSRs, genomic repeats, exons, introns, microRNAs).
Note: In contrast to the region classification output described above, a region can be assigned to several genomic elements here. For example, if a region overlaps with an exon of a transcript and with an intron of an alternative transcript, it is annotated as overlapping with exons and introns.
- Statistics of transcription factor binding site matches (if available)
The table can be expanded/collapsed to show detailed numbers of overlaps with exons and introns.
By clicking into the table headers you can change the sort order within the table and sort by different columns.
These features work only if JavaScript is enabled in your browser.

Subsets of regions can be selected for the tasks below. Depending on the options chosen by the user, regions can be selected by the following criteria:
- overlap with at least one exon/intron
- overlap with the nth exon
(in the detailed output page only)
- overlap with promoters, repeats, TSRs, microRNAs
- containing a certain transcription factor binding site
The selection can also be inverted, thus allowing e.g. extraction of all regions not
overlapping with any exons (by selecting "overlap with at least one exon" and then "invert selection").
If nothing is selected here, all regions are used for the tasks below.
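The selection and inversion logic described above can be mimicked on exported data. In this sketch the region records and their `exon_overlaps` field are hypothetical; it only illustrates how "invert selection" turns "overlap with at least one exon" into "no exon overlap".

```python
from typing import Callable, Dict, List

Region = Dict[str, object]

# Hypothetical region records with a count of overlapping exons.
regions: List[Region] = [
    {"id": "Region_1", "exon_overlaps": 2},
    {"id": "Region_2", "exon_overlaps": 0},
    {"id": "Region_3", "exon_overlaps": 1},
]


def select_regions(regions: List[Region],
                   criterion: Callable[[Region], bool],
                   invert: bool = False) -> List[Region]:
    """Keep regions that meet the criterion, or those that do not if invert is True."""
    return [r for r in regions if bool(criterion(r)) != invert]


def overlaps_exon(region: Region) -> bool:
    """Criterion: the region overlaps at least one exon."""
    return region["exon_overlaps"] >= 1


print([r["id"] for r in select_regions(regions, overlaps_exon)])
print([r["id"] for r in select_regions(regions, overlaps_exon, invert=True)])
```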
Available tasks for selected regions
- Export regions to BED file format
- Download Details in EXCEL format (only if fewer than 65536 regions)
- Download Details in tab-separated format
- Show table with details for all regions
- Extract GeneIDs of genes where the regions overlaps with promoter
(only available if promoter overlap option is checked)
- Extract GeneIDs of neighboring genes with a certain maximum distance to the selected regions
(only available if next neighbor analysis is checked)
These tasks are described below:
Extract regions in BED file format
The content of the exported BED file will look like this:

Download Details in EXCEL / tab-separated format
This will extract detailed information for selected regions
into a Microsoft Excel™ file (only if fewer than 65536 regions were selected) or in tab-separated format.
Depending on user selection, the extracted file will contain information on
- the region (Chr, Begin, End, Strand, Id, Score)
- the flanking genes up/downstream on +/-strand (Transcript, GeneId, Symbol, Distance, Overlap (yes/no))
- the overlap with promoters (PromoterId, GeneId, Overlap in %)
- and information on overlapping genomic elements like microRNA and TF sites (if available for the analysis)
The file in tab-separated format can e.g. be opened with an editor or with new Excel versions.
The content of the exported annotation data table will look like this in Microsoft Excel™:

Show table with details for each region
This will open a new page containing a table with details for each region selected, together with
extensive selection and extraction options. Depending on user selection, the table contains the following data:
- Flanking gene loci and next transcripts, up- and downstream, + and - strand
- Overlapping promoters, loci, and transcripts
- Number of TFBS in the region, either for specific TFBS families or summarily for all families
- TSRs, repeats, microRNAs within or overlapping the input region
Note: The definition of next flanking transcript is as follows:
the next transcript starting up/downstream (+/-) of the input region and,
for upstream transcripts, those that "come closest" to the region
(i.e. end next to the region or even overlap)
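As a rough illustration of a neighbor search, the sketch below finds the nearest transcript start on either side of a region by genomic coordinate. It is a simplification: it ignores strand and the special handling of upstream transcripts described in the note, and it assumes half-open coordinates and a sorted list of transcript start positions for one chromosome. All names are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple

Neighbor = Optional[Tuple[int, int]]  # (transcript start position, distance in bp)


def nearest_transcript_starts(region_start: int, region_end: int,
                              tss_positions: List[int]) -> Tuple[Neighbor, Neighbor]:
    """Return the closest transcript start left and right of the region.

    tss_positions must be sorted in ascending order. Distances are measured
    from the region boundary: region_start - position on the left and
    position - region_end on the right. Either side is None if no
    transcript start exists there.
    """
    i = bisect.bisect_left(tss_positions, region_start)  # first position >= region_start
    j = bisect.bisect_left(tss_positions, region_end)    # first position >= region_end
    left = None
    right = None
    if i > 0:
        left = (tss_positions[i - 1], region_start - tss_positions[i - 1])
    if j < len(tss_positions):
        right = (tss_positions[j], tss_positions[j] - region_end)
    return left, right


print(nearest_transcript_starts(4000, 4500, [1000, 5000, 9000]))
```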
Gene names printed in bold indicate an overlap with the input region, i.e. the region overlaps
either with at least one exon or at least one intron of this locus.
A detailed MatInspector analysis of any region can be started by clicking the respective Start MatInspector link in the output.
Gene information overview is accessible for each flanking gene via GeneID links.
The ElDorado link will start an ElDorado analysis,
e.g. to view a graphical display of the input and the surrounding genomic region.

Regions can be selected here by the criteria described above, and
additionally the user can select or deselect single regions via the checkboxes.
Data export options are:
- input regions in BED file format for further analysis
- input regions as sequences (GenBank or FASTA format), optionally with flanking sequences
- annotation data table (Microsoft Excel™ file)
All export options refer only to those regions that are selected in the annotation table above, i.e.
all input regions or certain subsets can be exported. If the sequence export options are selected,
up to 3000 bp flanking the input region can be extracted.
Sequences in GenBank or FASTA format can be saved to a local file system or in your personal sequence directory.

Extract GeneIDs of genes where the regions overlaps with promoter
A list of GeneIds is extracted and can be downloaded.
The GeneIds belong to genes whose promoters were found to overlap with the input regions.
This exported list can be used for further analysis, e.g. with
GeneRanker or GePS.
Extract GeneIDs of neighboring genes with a certain maximum distance to the selected regions
A sorted list of GeneIds is extracted and can be downloaded.
The GeneIds belong to genes that neighbor the input regions. The proximity
of genes to extract can be defined by the user.
If the option "keep region assignment" is selected here,
the output will also contain the regions belonging to the GeneId, e.g.
780 | Region_4 |
2968 | Region_4 |
3796 | Region_3 |
6242 | Region_1 |
7841 | Region_1 |
9235 | Region_5 |
Note that only the nearest neighbour is extracted, and the distance is calculated using
the next transcript start.
The exported list (without region assignment) can be used for further analysis, e.g. with
GeneRanker or
GePS.
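When "keep region assignment" is used, the GeneId/region pairs can be regrouped by region for downstream work. The sketch below parses the pipe-delimited layout as displayed in the example above; the exact layout of the downloaded file is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

SAMPLE = """\
780 | Region_4 |
2968 | Region_4 |
3796 | Region_3 |
6242 | Region_1 |
7841 | Region_1 |
9235 | Region_5 |
"""


def genes_by_region(text: str) -> Dict[str, List[int]]:
    """Map each region name to the list of GeneIds assigned to it."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|") if p.strip()]
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        gene_id, region = parts
        grouped[region].append(int(gene_id))
    return dict(grouped)


for region, gene_ids in sorted(genes_by_region(SAMPLE).items()):
    print(region, gene_ids)
```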
© 2022 Precigen Bioinformatics Germany GmbH - All rights
reserved |