RegionMiner subtask: Annotation and Statistics
This RegionMiner task generates annotation statistics and annotation data for chromosomal regions.
This program is highly configurable (see
Analysis Options below),
allowing anything from a very quick impression of the data (e.g. by using the classification option only)
to a very detailed look at each input region (using the next neighbour and detailed overlap options).
Depending on the parameters, an overview table shows the number and percentage of input regions overlapping
with genomic features like
loci, exons, introns, repeats, microRNAs etc. (all data from
ElDorado).
Additionally, details for each input region or a selected subset of input regions can be viewed, including
information like the flanking genes ("next transcripts"), extent of promoter overlap or exon/intron overlap.
This data can also be exported to a Microsoft Excel™ file.
Optionally, a MatInspector search shows TFBS match numbers for each of the regions or
the presence of certain transcription factor binding sites within the input region
(for large scale TF analysis please see the RegionMiner task
Overrepresented TFs).
Note: here, overlap is defined in a strand-independent manner,
i.e. if an input region overlaps with e.g. a repeat on opposite strand, it is counted as overlapping with a repeat.
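The strand-independent overlap rule can be illustrated with a minimal Python sketch. The `Interval` type and the `overlaps` function are hypothetical names used only for this illustration, and the coordinates are assumed to follow the usual BED convention (0-based start, exclusive end); this is not RegionMiner's actual implementation.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int         # 0-based, inclusive (assumed BED convention)
    end: int           # exclusive
    strand: str = "+"  # stored, but deliberately ignored by overlaps()


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if a and b share at least one base, regardless of strand."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


region = Interval("chr1", 100, 200, "+")
repeat = Interval("chr1", 150, 300, "-")  # opposite strand
print(overlaps(region, repeat))  # True
```

Because the strand field never enters the comparison, a region on the plus strand and a repeat on the minus strand are counted as overlapping, as described in the note above.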
| Input |
| Input |
Input data are accepted as a tab delimited file in BED / bigBed file format containing the input regions specified at
least by chromosome number, start position and end position (in this order).
The maximum number of regions and their maximum length can differ between tasks.
The limits are usually shown on top of the input pages.
Within this section you can either
- choose from previously uploaded BED files
- or add a new bed file to the list (by clicking "Add Bed file...")
When adding a new file, a new window will open, asking you to either
- upload one or several BED files from your local computer
- or import a BED file from the GMS (see more details)
- or import a BED file from the GGA (see more details)
For the new BED files, you will have to select the correct organism, as the
organism and the genome build are associated with the BED file for future use
(the default is your latest choice in the current session).
Note that BED files critically depend on the underlying genome build,
which can be changed by selecting a different ElDorado version on the top right of the page
before uploading a BED file. You can see the list of genomes available in ElDorado.
Note that almost all browsers have a general upload limit of 2 GB,
i.e. BED files bigger than this size should be zipped before uploading from your local computer.
This restriction does not apply when using the direct import from the GGA/GMS.
Optionally you can specify a name for saving uploaded BED files on the server,
otherwise the name of the uploaded file will be used.
If several files are uploaded, the string given here will be used as prefix for each BED file name.
If any of the regions in the input file cannot be completely assigned to the selected genome
(e.g. wrong chromosome numbering or wrong positions within a chromosome),
an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file,
the complete file will be skipped.
After one or several BED files have been uploaded successfully and the popup window has been closed,
the list of available BED files will be automatically updated.
Uploaded BED files can be deleted from the project anytime via the project management.
Example input:
track description="sample treatment-control analysis with 3 treatments and 3 controls"
chr1 26519270 26519623
chr1 39723904 39724119
chr2 10841542 10841853
chr2 88937859 88938309 |
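If you want to check a BED file locally before uploading it, the validation described above (skipping regions that do not fit the selected genome) could be approximated as in the following sketch. The function `parse_bed` and the chromosome lengths in `chrom_sizes` are placeholders invented for this example; the real limits depend on the genome build selected in ElDorado.

```python
def parse_bed(lines, chrom_sizes):
    """Split BED lines into valid (chrom, start, end) tuples and skipped lines."""
    valid, skipped = [], []
    for line in lines:
        line = line.strip()
        # Ignore empty lines and header lines such as "track ..." or "browser ...".
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()  # the first three BED columns contain no spaces
        try:
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            skipped.append(line)
            continue
        size = chrom_sizes.get(chrom)
        # Skip unknown chromosomes and positions outside the chromosome.
        if size is None or not (0 <= start < end <= size):
            skipped.append(line)
        else:
            valid.append((chrom, start, end))
    return valid, skipped


# Placeholder chromosome lengths, NOT taken from any real genome build.
chrom_sizes = {"chr1": 250_000_000, "chr2": 245_000_000}

example = [
    'track description="sample treatment-control analysis with 3 treatments and 3 controls"',
    "chr1\t26519270\t26519623",
    "chr1\t39723904\t39724119",
    "chr2\t10841542\t10841853",
    "chr2\t88937859\t88938309",
    "chr99\t10\t20",  # unknown chromosome, will be skipped
]

valid, skipped = parse_bed(example, chrom_sizes)
print(len(valid), len(skipped))  # 4 1
```

A file for which `valid` stays empty corresponds to the case where the complete file is skipped.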
| Analysis Options |
| Region Analysis |
- The classification of input regions as exons, introns, promoters,
intergenic is selected by default.
If only the classification is selected the task is very quick.
Note: Classification of regions is only available for ElDorado versions ≥ "ElDorado 07-2008".
- Optionally, a more time-consuming
next neighbor analysis can be performed. In this case, the next up-
and downstream genes are shown for each input region together with
the distance to the neighbouring transcript start.
- Additionally, each input region can be checked for overlap with
various genomic elements.
- MicroRNAs
- Transcriptional Start Regions
- Exons/Introns
- Promoters
- Repeats
The more elements are selected here, the longer the analysis will take.
|
| TF Analysis |
The sequences can be analyzed with MatInspector to show TFBS match numbers in the output.
Search for specific transcription factor binding sites:
Match numbers for each selected TFBS family will be shown for each sequence,
as well as in the overlap statistics table.
A maximum of 10 different matrix families can be selected from the list.
-
Library
Here you can select a previous version of the matrix library.
This can be helpful for reproducing old results. By default, the latest matrix library is selected
(please see the Library Statistics and the Library Release Notes).
-
Matrix similarity
The matrix similarity is calculated as described in the
MatInspector papers.
A perfect match to the matrix gets a score of 1.00 (each sequence position corresponds to the most highly conserved nucleotide at
that position in the matrix); a "good" match to the matrix usually has a similarity of > 0.80. Mismatches in highly conserved positions of the matrix decrease the matrix similarity more than mismatches in less conserved positions.
Increasing the matrix similarity threshold yields fewer matches in your sequence and may miss genuine matches that contain a mismatch compared to the matrix.
Decreasing the matrix similarity threshold yields more matches in your sequence.
Optimized matrix similarity: Thresholds that minimize false positives for each individual matrix are supplied with our library and can be selected from the pull-down menu (example).
These parameters are hidden by default. You can use the icon next to the section header to reveal them.
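To give a feeling for how a matrix similarity score behaves, here is a simplified Python sketch. It is only loosely modelled on the description above (conservation-weighted scores, scaled so that a perfect match gets 1.00); the exact formula is given in the MatInspector papers and may differ in detail. The names `conservation`, `matrix_similarity` and `toy_matrix` are made up for this illustration, and the matrix values are invented.

```python
import math


def conservation(column):
    """Weight of one matrix column: 100 for a fully conserved position, lower otherwise."""
    entropy_term = sum(p * math.log(p) for p in column.values() if p > 0)
    return 100.0 / math.log(5) * (entropy_term + math.log(5))


def matrix_similarity(sequence, matrix):
    """Conservation-weighted score, scaled so that the best possible match is 1.0."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence and matrix must have the same length")
    weights = [conservation(col) for col in matrix]
    achieved = sum(w * col.get(base, 0.0)
                   for w, col, base in zip(weights, matrix, sequence))
    best = sum(w * max(col.values()) for w, col in zip(weights, matrix))
    return achieved / best


# Invented toy matrix: positions 1 and 3 are fully conserved, position 2 only weakly.
toy_matrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]

for candidate in ("AAG", "ACG", "CAG"):
    print(candidate, round(matrix_similarity(candidate, toy_matrix), 2))
```

In this toy example "AAG" is the perfect match, "ACG" (mismatch at the weakly conserved position) loses very little, and "CAG" (mismatch at a fully conserved position) drops far below a threshold such as 0.80.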
|
| Output |
| Result |
Here, you can edit the default name of the result file. |
| Email address |
Here you can choose between two methods for receiving
the results:
- Show result directly in browser window
With this option, the URL of the result is shown directly in your browser
window.
Warning: Please use this option
only for analyses which can be performed in a short time.
If the analysis takes longer than the timeout of the webserver, the
connection will be terminated and you will receive an error message
(e.g. "The document contained no data."). In this case, the results will
not be available; please restart the analysis using the option
below "Send the URL of the result to".
- Send the URL of the result via email
With this option, an email with the URL of the results will be sent
to the email address provided by the user when the analysis is finished.
The results will be available for a limited time on our server.
For details of how long your results will be kept please see the result-email.
After that period they will be deleted unless protected in the project management!
We recommend using the email option for more than approx. 1,000 input regions. |
The output has up to five sections (depending on the analysis options).
- Analysis Parameters (always shown)
- General Statistics (always shown)
- Region Classification (if region classification option is set)
- Overlap Statistics (for overlap with genomic elements and for transcription factor binding site search)
- Annotation of Regions (for next neighbor analysis, overlap with genomic elements and transcription factor binding site search)
In the last section a number of
subsequent analysis options are available for selected subsets of regions
(see below).
Analysis Parameters
- Name of input file and selected species
- Result name
- The ElDorado version used
- The selected analysis options (e.g. list of genomic elements that were checked for overlap)
- MatInspector settings (if available)

General Statistics
The number of input regions, their total length as well as the minimum,
maximum, and average length are listed.
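The same figures can be reproduced for your own BED data with a few lines of Python. This is only a sketch: the function name `region_stats` is hypothetical, and the region length is assumed to be `end - start` (BED convention), which may differ by one base from the length reported in the output.

```python
def region_stats(regions):
    """Summarize a list of (chrom, start, end) tuples."""
    if not regions:
        raise ValueError("no regions given")
    lengths = [end - start for _chrom, start, end in regions]
    return {
        "count": len(lengths),
        "total_length": sum(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": sum(lengths) / len(lengths),
    }


# The four regions from the example input above.
regions = [
    ("chr1", 26519270, 26519623),
    ("chr1", 39723904, 39724119),
    ("chr2", 10841542, 10841853),
    ("chr2", 88937859, 88938309),
]
stats = region_stats(regions)
print(stats["count"], stats["total_length"], stats["mean_length"])  # 4 1329 332.25
```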

Region Classification
If the region classification option has been checked, a table with the number
of regions contained in genomic elements is given.
Each input region is classified either as
- exonic, complete or
- exonic, partial (i.e. overlapping with an exon) or
- intronic, complete or
- intergenic region
In addition to the above, a region can be classified as promoter.
A table with the distribution of the regions on the different
chromosomes of the genome is shown. The content of this table is hidden by
default, but can be shown by clicking the "Show details" link in the
header.

The classification details for each region can be downloaded as a
tab-separated file. The file contains 7 columns for each region:
1 : read id
2 : contig/chromosome accession number
3 : chromosome
4 : strand
5 : start position of the read
6 : end position of the read
7 : genomic elements the read is associated with
intergenic (intergenic region)
exon
intron
partial (overlapping with exon)
promoter
An individual read is assigned to one of the four classes
intergenic, exon, intron, partial and can be assigned to
the class promoter in addition.
1428 NC_000001 chr1 0 1348237 1348455 intergenic
1429 NC_000001 chr1 0 2311642 2311953 promoter intron
1430 NC_000001 chr1 0 2450272 2450768 exon
1431 NC_000001 chr1 0 2469265 2469512 intergenic
1432 NC_000001 chr1 0 3556424 3556796 promoter partial
1433 NC_000001 chr1 0 3614623 3614831 exon
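A downloaded classification file can be summarized with a short script. The following Python sketch counts how often each class occurs; `count_classes` is a hypothetical helper, and it assumes that everything from the seventh whitespace-separated field onwards is a class label (so that "promoter intron" yields two labels).

```python
from collections import Counter


def count_classes(lines):
    """Count the class labels found in column 7 (and beyond) of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # not a complete classification line
        counts.update(fields[6:])
    return counts


example = [
    "1428\tNC_000001\tchr1\t0\t1348237\t1348455\tintergenic",
    "1429\tNC_000001\tchr1\t0\t2311642\t2311953\tpromoter intron",
    "1430\tNC_000001\tchr1\t0\t2450272\t2450768\texon",
    "1431\tNC_000001\tchr1\t0\t2469265\t2469512\tintergenic",
    "1432\tNC_000001\tchr1\t0\t3556424\t3556796\tpromoter partial",
    "1433\tNC_000001\tchr1\t0\t3614623\t3614831\texon",
]
counts = count_classes(example)
print(counts["intergenic"], counts["exon"], counts["promoter"])  # 2 2 2
```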
Overlap Statistics
- Number and percentages of regions overlapping with annotation from our genome database ElDorado
(promoters, TSRs, genomic repeats, exons, introns, microRNAs).
Note: In contrast to the region classification output described above, a region can be assigned to several genomic elements here. For example, if a region overlaps with an exon of a transcript and with an intron of an alternative transcript, it is annotated as overlapping with exons and introns.
- Statistics of transcription factor binding site matches (if available)
The table can be expanded/collapsed to show the detailed numbers of overlaps with exons and introns.
By clicking on the table headers you can change the sort order within the table and sort by different columns.
These features work only if JavaScript is enabled in your browser.

Annotation of Regions
Subsets of regions can be selected for the tasks listed below. Depending on the analysis options chosen, regions can be selected by the following criteria:
- overlap with at least one exon/intron
- overlap with the nth exon
(in the detailed output page only)
- overlap with promoters, repeats, TSRs, microRNAs
- containing a certain transcription factor binding site
The selection can also be inverted, thus allowing e.g. extraction of all regions not
overlapping with any exons (by selecting "overlap with at least one exon" and then "invert selection").
If nothing is selected here, all regions are used for the tasks below.
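The selection logic, including the "invert selection" option, can be pictured as a simple filter. The sketch below is purely illustrative: the record fields (`exon_overlaps`, `promoter_overlaps`) and the function names are assumptions and do not correspond to an actual export format.

```python
def select_regions(regions, criterion, invert=False):
    """Return the regions matching criterion, or the non-matching ones if invert is set."""
    return [r for r in regions if bool(criterion(r)) != invert]


# Hypothetical annotation records for three input regions.
regions = [
    {"id": "r1", "exon_overlaps": 2, "promoter_overlaps": 0},
    {"id": "r2", "exon_overlaps": 0, "promoter_overlaps": 1},
    {"id": "r3", "exon_overlaps": 0, "promoter_overlaps": 0},
]


def overlaps_exon(region):
    """Criterion: overlap with at least one exon."""
    return region["exon_overlaps"] >= 1


with_exon = select_regions(regions, overlaps_exon)
without_exon = select_regions(regions, overlaps_exon, invert=True)
print([r["id"] for r in with_exon])     # ['r1']
print([r["id"] for r in without_exon])  # ['r2', 'r3']
```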
Available tasks for selected regions
- Export regions to BED file format
- Download Details in EXCEL format (only if fewer than 65536 regions)
- Download Details in tab-separated format
- Show table with details for all regions
- Extract GeneIDs of genes where the regions overlaps with promoter
(only available if promoter overlap option is checked)
- Extract GeneIDs of neighboring genes with a certain maximum distance to the selected regions
(only available if next neighbor analysis is checked)
These tasks are described below:
Extract regions in BED file format
The content of the exported BED file will look like this:

Download Details in EXCEL / tab-separated format
This will extract detailed information for the selected regions
into a Microsoft Excel™ file (only if fewer than 65536 regions were selected) or into a file in tab-separated format.
Depending on user selection, the extracted file will contain information on
- the region (Chr, Begin, End, Strand, Id, Score)
- the flanking genes up/downstream on +/-strand (Transcript, GeneId, Symbol, Distance, Overlap (yes/no))
- the overlap with promoters (PromoterId, GeneId, Overlap in %)
- and information on overlapping genomic elements like microRNA and TF sites (if available for the analysis)
The file in tab-separated format can be opened, e.g., with a text editor or with recent Excel versions.
The content of the exported annotation data table will look like this in Microsoft Excel™:

Show table with details for each region
This will open a new page containing a table with details for each region selected, together with
extensive selection and extraction options. Depending on user selection, the table contains the following data:
- Flanking gene loci and next transcripts, up- and downstream, + and - strand
- Overlapping promoters, loci, and transcripts
- Number of TFBS in the region, either for specific TFBS families or summarily for all families
- TSRs, repeats, microRNAs within or overlapping the input region
Note: The definition of next flanking transcript in RegionMiner is as follows:
the next transcript starting up/downstream (+/-) of the input region and,
for upstream transcripts, those that "come closest" to the region
(i.e. end next to the region or even overlap)
Gene names printed in bold indicate an overlap with the input region, i.e. the region overlaps
either with at least one exon or at least one intron of this locus.
A detailed MatInspector analysis of any region can be started by clicking the respective Start MatInspector link in the output.
Gene information overview is accessible for each flanking gene via GeneID links.
The ElDorado link will start an ElDorado analysis,
e.g. to view a graphical display of the input and the surrounding genomic region.

Regions can be selected here by the criteria described above, and
additionally the user can select or deselect single regions via the checkboxes.
Data export options are:
- input regions in BED file format for further analysis
- input regions as sequences (GenBank or FASTA format), optionally with flanking sequences
- annotation data table (Microsoft Excel™ file)
All export options refer only to those regions that are selected in the annotation table above, i.e.
all input regions or certain subsets can be exported. If the sequence export options are selected,
up to 3000 bp flanking the input region can be extracted.
Sequences in GenBank or FASTA format can be saved to a local file system or in your personal sequence directory.

Extract GeneIDs of genes where the regions overlaps with promoter
A list of GeneIds is extracted and can be downloaded.
The GeneIds belong to genes whose promoters were found to overlap with the input regions.
This exported list can be used for further analysis, e.g. with GeneRanker.
Extract GeneIDs of neighboring genes with a certain maximum distance to the selected regions
A list of GeneIds is extracted and can be downloaded.
The GeneIds belong to genes neighboring the input regions. The maximum distance
of the genes to extract can be defined by the user.
Note that only the nearest neighbour is extracted, and the distance is calculated using the next transcript start.
The exported list can be used for further analysis, e.g. with
GeneRanker.
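The idea behind the neighbor extraction (nearest transcript start only, subject to a maximum distance) can be sketched as follows. This is a simplified illustration under several assumptions: one transcript start per gene, strand ignored, and the distance measured from the nearest edge of the region (0 if the transcript start lies inside the region). The function name and the gene identifiers are invented.

```python
def genes_within(regions, tss_by_gene, max_distance):
    """For each region, keep its nearest gene if its transcript start is close enough.

    regions: list of (chrom, start, end), end exclusive (assumed)
    tss_by_gene: dict mapping gene id -> (chrom, transcript start position)
    """
    def distance(region, tss):
        _chrom, start, end = region
        if start <= tss < end:
            return 0
        return start - tss if tss < start else tss - (end - 1)

    selected = set()
    for region in regions:
        candidates = [
            (distance(region, pos), gene)
            for gene, (chrom, pos) in tss_by_gene.items()
            if chrom == region[0]
        ]
        if not candidates:
            continue
        dist, gene = min(candidates)  # only the nearest neighbour is considered
        if dist <= max_distance:
            selected.add(gene)
    return sorted(selected)


# Invented example data.
regions = [("chr1", 1000, 1200), ("chr1", 50000, 50100)]
tss_by_gene = {
    "gene_A": ("chr1", 900),
    "gene_B": ("chr1", 5000),
    "gene_C": ("chr1", 80000),
    "gene_D": ("chr2", 1100),
}
print(genes_within(regions, tss_by_gene, max_distance=1000))  # ['gene_A']
```

The second region has no transcript start within 1000 bp, so it contributes no gene to the list.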
| © 1998-2011 Genomatix Software GmbH - All rights
reserved |