
Genomatix: Overrepresented transcription factor binding sites or modules


[Introduction] [Parameters] [Output]

Introduction

This task searches for transcription factor binding sites (TFBS) within the input sequences or BED/bigBed files and generates statistics on single TFBSs and TFBS pairs (modules), together with overrepresentation values and Z-scores.

Here, TFBS pairs are defined as two TFBSs at a distance of 10 to 50 base pairs. The distance distribution of module matches in the input data is displayed, which also makes it possible to assess preferred distance ranges (see distance score below).

The TFBS descriptions used for the analysis are either from MatBase or can be user-defined matrices created with MatDefine.

All occurrences of matches are calculated by MatInspector and displayed in table format.
The overrepresentation values are based on the number of occurrences of the TFBS in the selected background.

Depending on the species of the input, the corresponding sections of the matrix library are selected, i.e. for vertebrate sequences the groups "vertebrates" and "others" are used; for plant input "plants" and "others", etc. Note that C. elegans is currently not available, as there are very few TFBS descriptions for this species.
A match list (i.e. the positions of all matches within the input) can be viewed and extracted for each matrix / matrix family. If the input was a BED file, the match positions can also be extracted as a BED file.
For further details on the found sites, a MatInspector analysis for selected transcription factors can be started from the output.

Parameters

Input

Input data are accepted in BED / bigBed file format or BAM file format containing the input regions. For some tasks BAM support might not be available.
The maximum number of input regions and their maximum length can differ between tasks. The limits are usually shown at the top of the input pages.

Within this section you can either
  • choose from previously uploaded BED/BAM files
  • or add a new BED or BAM file to the list (by clicking "Add BED/BAM file...")
For tasks that accept replicate data as input, you can use the shift/ctrl keys to select multiple files from the list. All selected files will then be treated as replicates.

When adding a new file, a new window will open, asking you to either

  • upload one or several BED/BAM files from your local computer
  • or import one or several BED/BAM files from the GMS (see more details)
  • or import one or several BED/BAM files from the GGA (see more details)
For the new BED/BAM files, you will have to select the correct organism, as the organism and the genome build are associated with the BED file for future use (the default is your latest choice in the current session).
Note that files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a file. You can see the list of genomes available in ElDorado.

Note that almost all browsers have a general upload limit of 2 GB, i.e. files bigger than this size should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.

Optionally you can specify a name for saving uploaded files on the server, otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as prefix for each file name.

If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.
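
The kind of check described above can be illustrated with a minimal sketch. It is not the server's actual implementation; the function name `split_valid_regions` and the chromosome size table are hypothetical, and the sketch assumes standard BED coordinates (0-based start, end not included).

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end) as in a BED line


def split_valid_regions(
    regions: List[Region], chrom_sizes: Dict[str, int]
) -> Tuple[List[Region], List[Region]]:
    """Separate regions that fit the genome from regions to be skipped.

    chrom_sizes is a hypothetical lookup of chromosome name -> length
    for the selected genome build.
    """
    valid: List[Region] = []
    skipped: List[Region] = []
    for chrom, start, end in regions:
        size = chrom_sizes.get(chrom)
        # Unknown chromosome names (e.g. "1" instead of "chr1") and
        # positions outside the chromosome are both rejected.
        if size is not None and 0 <= start < end <= size:
            valid.append((chrom, start, end))
        else:
            skipped.append((chrom, start, end))
    return valid, skipped


# Toy genome with one chromosome of 1000 bp (illustrative values only).
valid, skipped = split_valid_regions(
    [("chr1", 100, 200), ("1", 100, 200), ("chr1", 900, 1100)],
    {"chr1": 1000},
)
```

In this toy example only the first region is kept; the second uses a chromosome name not present in the table, and the third extends beyond the chromosome end.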

After one or several BED/BAM files were uploaded successfully, and after closing the popup window, the list of available BED/BAM files will be automatically updated.

Uploaded BED or BAM files can be deleted from the project anytime via the project management.


Additionally, this task accepts files with sequences in FASTA or GenBank format.

Please check the corresponding radio button to switch between input regions and input sequences.

When sequences are used as input, please select the corresponding organism from the list in the section "Select organism for sequence input", as the set of transcription factors used depends on the species (e.g. plant TFs for sequences from plant organisms). Note that this parameter only appears when a sequence was selected as input.
Analysis
Overrepresentation analysis can be done
  • using all TF binding sites from MatBase
  • using modules, i.e. pairs of TF binding sites within 10 to 50 bp distance (middle to middle) to each other
  • using one or several user-defined TF binding sites generated before with MatDefine.
Matches can be either to matrix families or to individual matrices (see an example for the family concept).
For module overrepresentation analysis, you need to choose one mandatory partner in the pair on a second parameter page. A check for strand-specific modules is optional; if enabled, same-strand modules (+/+ and -/-) are distinguished from different-strand modules (+/- and -/+).
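
The module definition (two sites whose middles are 10 to 50 bp apart, optionally separated by strand orientation) can be sketched as follows. This is an illustration only, not the MatInspector implementation; the `Match` type, the function `find_modules`, and the family names `FAM_A` and `FAM_B` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Match(NamedTuple):
    family: str  # matrix family name (hypothetical labels in this sketch)
    start: int   # 0-based start of the match
    end: int     # end of the match (not included)
    strand: str  # "+" or "-"


def middle(m: Match) -> float:
    """Middle position of a match."""
    return (m.start + m.end) / 2


def find_modules(
    matches: List[Match], mandatory: str, partner: str,
    min_dist: int = 10, max_dist: int = 50,
) -> List[Tuple[Match, Match, float, str]]:
    """Return (mandatory match, partner match, distance, orientation)."""
    modules = []
    for a in (m for m in matches if m.family == mandatory):
        for b in (m for m in matches if m.family == partner):
            # For pairs within one family, keep each pair only once
            # and never pair a match with itself.
            if mandatory == partner and not a < b:
                continue
            dist = abs(middle(a) - middle(b))
            if min_dist <= dist <= max_dist:
                orientation = "same" if a.strand == b.strand else "different"
                modules.append((a, b, dist, orientation))
    return modules


matches = [
    Match("FAM_A", 100, 115, "+"),
    Match("FAM_B", 130, 147, "-"),  # middles 31 bp apart: a module
    Match("FAM_B", 300, 317, "+"),  # too far away: no module
]
modules = find_modules(matches, "FAM_A", "FAM_B")
```

Here `modules` contains a single pair with distance 31.0 and orientation "different". With the strand-specific check, such a pair would be counted separately from same-strand pairs.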
For user-defined matrices, you need to choose one or several user-defined matrices from the list of all your matrices on a second parameter page.
Background
Here you select the background that is used for the calculation of the overrepresentation values and the Z-score (see below). You can select from
  • genomic background
    Genomic background comprises all chromosomes of the selected organism.
  • promoter background
    The promoter background comprises all Genomatix defined promoters of optimized length (about 500/100bp up/downstream of the TSS, details)
  • user-defined background
    If this option is selected, please supply either a sequence file or a BED file with genomic positions. These sequences will then be searched for TFBS to obtain the background match numbers.
Note that the analysis of "user-defined matrices with genomic background" can only be started with the email option, since the calculation of the matches takes some time.
Output
Result
Here, you can edit the default name of the result file.
Email address
Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    In this option the URL of the result is directly shown in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available; please restart the analysis using the option "Send the URL of the result via email" below.

  • Send the URL of the result via email
    With this option, an email with the URL of the results will be sent to the user-provided email address when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!

We recommend using the email option for input regions with more than one million base pairs.

Output

A table shows overrepresentation statistics for each individual matrix, matrix family, or module. Here is an example output for a matrix family analysis:

Example screenshot

The output is initially sorted by the Z-score of overrepresentation against the genomic background.
For easier visual inspection, coloring is used: over-represented TFs (with a Z-score above 2) are colored green and under-represented TFs (Z-score below -2) are colored red.
Click on any column header to sort by that column. Repeated clicking inverts the sorting order.
The result table can be exported to a Microsoft Excel™ file or tab-separated format.

Detailed description of the table columns
TF Families / Modules with Factor x
Depending on the parameters, either all transcription factor families that were analyzed are listed here or all modules (pairs of transcription factor binding sites). Clicking on the link behind the name will show information from MatBase on this family, e.g. the transcription factors binding to the predicted sites.
Distance score
This column is only available if "modules" were selected.
The distance score allows sorting for interesting modules, mainly modules that exhibit a preferred distance range in the found matches. This can be seen in a clear peak in the distance profile that is displayed when clicking the Match detail link (see below).

To calculate the distance score, all distances of the module matches in the input data are computed. The distance score is then calculated as
distance score = (maximum distance - average distance) / standard deviation
Thus, a clear peak within the distance distribution will have a larger score than a uniform distribution of distances.
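
The formula above does not state exactly which quantities are meant. The following minimal sketch makes one possible reading explicit: it assumes that "maximum" and "average" refer to the number of module matches per distance bin in the distance profile (10 to 50 bp), and that the standard deviation is the population standard deviation of these counts. Both points are assumptions of this sketch, not confirmed details of the tool, and the function name `distance_score` is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Iterable


def distance_score(
    distances: Iterable[int], min_dist: int = 10, max_dist: int = 50
) -> float:
    """(max count - mean count) / std. dev. of counts over the profile.

    ASSUMPTION: the score is computed on match counts per integer
    distance bin; the tool's exact definition may differ.
    """
    counts = Counter(distances)
    profile = [counts.get(d, 0) for d in range(min_dist, max_dist + 1)]
    sd = pstdev(profile)
    if sd == 0:
        # Perfectly flat profile: no preferred distance at all.
        return 0.0
    return (max(profile) - mean(profile)) / sd


# One match at every distance, plus 30 extra matches at 20 bp.
peaked = list(range(10, 51)) + [20] * 30
# Two matches at every distance: completely uniform profile.
uniform = list(range(10, 51)) * 2

peaked_score = distance_score(peaked)
uniform_score = distance_score(uniform)
```

Under this reading, the peaked profile scores well above the uniform one (which scores 0.0), matching the statement that a clear peak yields a larger score.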
Promoter association
TF families known to occur more than twice as often in promoters as in genomic sequence. For vertebrate promoters, these are usually G/C-rich binding sites, because vertebrate promoters have a higher G/C content than the rest of the genome. This does not mean that these sites are associated with your input promoter set; it means that they are generally associated with promoters (this often depends on the G/C content of the site).
Note: This is a property of the matrix family or module, and is independent of the input data!
Number of Input Sequences with match
Number of input sequences with at least one match (of the matrix family or the module).
Number of Matches
Number of matches in all input sequences.
Match details
For a matrix family analysis, there are two links for each family:
  • list: A link to a list of all matches with location in the input; this list can be saved in BED format (if the input was given as a BED file)
  • seq: A link to extract all input sequences that contain a match. The sequences are extracted in GenBank format where the matches are annotated as "misc_signal".
For a module analysis, there is one link for each module:
  • list: A link to a distance profile and a list of all module matches with location in the input; this list can be saved in BED format (if the input was given as a BED file). The distance profile shows the number of module matches for each possible distance (10-50 bp) between the two partners within the module (for an example see below).
Expected ± Std.dev.
(genome / promoters / user-defined)
Expected match numbers in an equally sized sample of the selected background and the standard deviation. The number of expected matches is calculated assuming that the matches were equally distributed in the background sequences.
Overrepresentation
(genome / promoters / user-defined)
Overrepresentation against the selected background:
Fold factor of match numbers in regions compared to an equally sized sample of the background (i.e. found versus expected).
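
As a rough illustration of the two quantities above, the sketch below scales the background match count to the size of the input (which is what an even distribution of matches in the background implies) and divides the found matches by that expectation. The function names and all numbers are hypothetical; how the tool measures sequence sizes in detail is not described here.

```python
def expected_matches(
    background_matches: int, background_bp: int, input_bp: int
) -> float:
    """Expected matches in a background sample as large as the input,
    assuming matches are evenly distributed over the background."""
    return background_matches * input_bp / background_bp


def overrepresentation(found: int, expected: float) -> float:
    """Fold factor: found matches versus expected matches."""
    return found / expected


# Illustrative numbers only: 200,000 background matches in 1 Gbp,
# input regions totalling 500 kbp with 150 matches found.
expected = expected_matches(200_000, 1_000_000_000, 500_000)  # 100.0
fold = overrepresentation(150, expected)                      # 1.5
```

A fold factor above 1 means more matches were found in the input than expected from the background; a value below 1 means fewer.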
Z-Score
(genome / promoters / user-defined)
Z-score of overrepresentation against the selected background:
The distance from the population mean in units of the population standard deviation.
Here, the Z-scores are calculated with a continuity correction using the formula z = (x-E-0.5)/S, where x is the number of found matches in the input data, E is the expected value and S is the standard deviation.

Such a formula is also described in:
Sui et al
oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes
NAR 33, 2005, pp 3154-64 (PubMed 15933209)
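
The Z-score formula given above can be written down directly. In this minimal sketch the expected value and the standard deviation are taken as inputs (for example from the "Expected ± Std.dev." column); the function names are hypothetical, and the classification uses the thresholds of 2 and -2 mentioned for the coloring of the result table.

```python
def z_score(found: int, expected: float, std_dev: float) -> float:
    """Z-score with continuity correction: z = (x - E - 0.5) / S."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (found - expected - 0.5) / std_dev


def classify(z: float) -> str:
    """Label a Z-score using the table's coloring thresholds."""
    if z > 2:
        return "over-represented"
    if z < -2:
        return "under-represented"
    return "not significant"


# Illustrative values: 150 matches found, 100 expected, std. dev. 10.
z = z_score(150, 100.0, 10.0)  # (150 - 100 - 0.5) / 10
label = classify(z)            # "over-represented"
```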

A Z-score below -2 or above 2 can be considered statistically significant; it corresponds to a p-value of about 0.05.
Note that statistical significance does not necessarily correspond to biological significance!
Additionally you should take into account the selection and size of the input sequences (sample) and the assumed normal distribution of the background.
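
The relation between the Z-score threshold and the p-value can be checked with a short calculation. The sketch assumes a two-sided test against a standard normal distribution, which is the assumption mentioned above; the function name is hypothetical.

```python
import math


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value of a Z-score under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


p = two_sided_p_value(2.0)  # roughly 0.0455, i.e. about 0.05
```

If the background match counts are not approximately normally distributed, this conversion is only a rough guide.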


Example of a Distance Profile and Match List for a module analysis

Example distance profile