This task searches for transcription factor binding sites (TFBS) within the input sequences and generates statistics on single TFBSs and TFBS pairs (modules), together with overrepresentation values and Z-scores.
The TFBS descriptions used for the analysis are either taken from MatBase or are user-defined matrices created with MatDefine.
All occurrences of matches are calculated by MatInspector and displayed in table format.
The overrepresentation values are based on the occurrences of the TFBS in the selected background (genome, promoters, or user-defined).
Depending on the species of the input, the corresponding sections of the matrix library are selected, i.e. for vertebrate sequences the groups "vertebrates" and "others" are used; for plant input "plants" and "others", etc. Note that C. elegans is currently not available, as there are very few TFBS descriptions available for this species.
A match list (i.e. the positions of all matches within the input) can be viewed and extracted for each matrix / matrix family. If the input was a BED file, the match positions can also be extracted as a BED file.
For further details on the found sites, a MatInspector analysis for selected transcription factors can be started from the output.
| Input | |
|---|---|
| Input |
Input data are accepted as a tab-delimited file in BED / bigBed file format containing the input regions, specified at
least by chromosome number, start position and end position (in this order).
When adding a new file, a new window will open, asking you to either
For the new BED files, you will have to select the correct organism, as the
organism and the genome build are associated with the BED file for future use
(the default is your latest choice in the current session).
Note that BED files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a BED file. You can see the list of genomes available in ElDorado.
Note that almost all browsers have a general upload limit of 2 GB, i.e. BED files bigger than this size should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.
Optionally, you can specify a name for saving uploaded BED files on the server; otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as a prefix for each BED file name.
If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.
After one or several BED files have been uploaded successfully and the popup window has been closed,
the list of available BED files will be automatically updated.
Uploaded BED files can be deleted from the project anytime via the project management. Additionally, this task accepts files with sequences in FASTA or GenBank format. Please check the corresponding radio button to switch between input regions and input sequences. |
| Analysis | Overrepresentation analysis can be done
Matches can be either to matrix families or to individual matrices (see an example for the family concept). For module overrepresentation analysis, you need to choose one mandatory partner in the pair on a second parameter page. A check for strand-specific modules is optional; if it is selected, same-strand modules (+/+ and -/-) are distinguished from different-strand modules (+/- and -/+). For user-defined matrices, you need to choose one or several user-defined matrices from the list of all your matrices on a second parameter page. |
| Background | Here the background is selected, which is used for the calculation of the overrepresentation values and the Z-Score (see below).
You can select from
|
| Output | |
| Result | Here, you can edit the default name of the result file. |
| Email address | Here you can choose between two methods for receiving
the results:
The results will be available for a limited time on our server. For details on how long your results will be kept, please see the result email. After that period they will be deleted unless protected in the project management! We recommend using the email option for input regions with more than one million base pairs. |
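As a rough illustration of the BED input described in the Input section above, the following Python sketch reads tab-delimited text and keeps only regions that provide at least chromosome, start position and end position (in this order). The names `Region` and `parse_bed` are hypothetical, and the checks shown (non-negative start, end not smaller than start, skipping `track`, `browser` and comment lines) are assumptions based on the common BED convention, not a description of the validation performed on the server. In particular, the check against the selected genome build is not reproduced here.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    """One input region: chromosome, start position, end position."""
    chrom: str
    start: int
    end: int


def parse_bed(text: str) -> Tuple[List[Region], List[int]]:
    """Return (valid regions, line numbers of skipped lines).

    Assumption: lines are tab-delimited and the first three columns are
    chromosome, start and end. Additional columns are ignored.
    """
    regions: List[Region] = []
    skipped: List[int] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Ignore empty lines and header/comment lines (assumed convention).
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            skipped.append(line_no)
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            skipped.append(line_no)
            continue
        if start < 0 or end < start:
            skipped.append(line_no)
            continue
        regions.append(Region(fields[0], start, end))
    return regions, skipped


example = "chr1\t100\t200\nchr2\t50\t10\nbad line\nchrX\t0\t5\tname\n"
valid, bad_lines = parse_bed(example)
print(valid)
print(bad_lines)
```

In this sketch, malformed lines are skipped and reported by line number rather than aborting the whole file, which mirrors the behaviour described above where individual regions are skipped.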
A table shows overrepresentation statistics for each individual matrix, matrix family, or module.
The output is initially sorted by overrepresentation against the genomic background. Click on any column header to sort by that column. Repeated clicking inverts the sorting order.
The result table can be exported to a Microsoft Excel™ file or tab-separated format.
| Detailed description of the table columns | |
|---|---|
| TF Families / Modules with Factor x | Depending on the parameters, either all transcription factor families that were analyzed are listed here or all modules (pairs of transcription factor binding sites). Clicking on the link behind the name will show information from MatBase on this family, e.g. the transcription factors binding to the predicted sites. |
| Distance score | This column is only available if "modules" were selected.
The distance score allows sorting for interesting modules, mainly modules that
exhibit a preferred distance range in the found matches. This can be seen in a clear peak
in the distance profile that is displayed when clicking the
Match detail link (see below).
To calculate the distance score, all distances of the module matches in the input data are computed.
The distance score is then calculated as
distance score = (maximum distance - average distance) / standard deviation
Thus a clear peak within the distance distribution will have a larger score
than a uniform distribution of distances.
|
| Promoter association | TF families known to occur more than twice as often in promoters as in genomic sequence.
For vertebrate promoters, these are usually G/C-rich binding sites because vertebrate promoters
have a higher G/C content than the rest of the genome. This does not mean that these sites are associated
with your input promoter set; it means that they are generally associated with promoters (this often
depends on the G/C content of the site). Note: This is a property of the matrix family or module, and is independent of the input data! |
| Number of Input Sequences with match | Number of input sequences with at least one match (of the matrix family or the module). |
| Number of Matches | Number of matches in all input sequences. |
| Match details | For a matrix family analysis, there are two links for each family:
|
| Expected ± Std.dev. (genome / promoters / user-defined) |
Expected match numbers in an equally sized sample of the selected background and the standard deviation. The number of expected matches is calculated assuming that the matches were equally distributed in the background sequences. |
| Overrepresentation (genome / promoters / user-defined) |
Overrepresentation against the selected background: Fold factor of match numbers in regions compared to an equally sized sample of the background (i.e. found versus expected). |
| Z-Score (genome / promoters / user-defined) |
Z-score of overrepresentation against the selected background: The distance from the population mean in units of the population standard deviation. Here, the Z-scores are calculated with a continuity correction using the formula z = (x - E - 0.5) / S, where x is the number of found matches in the input data, E is the expected value and S is the standard deviation. Such a formula is also described in: Sui et al., oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes, NAR 33, 2005, pp. 3154-64 (PubMed 15933209). A Z-score below -2 or above 2 can be considered statistically significant; it corresponds to a p-value of about 0.05. Note that statistical significance does not necessarily correspond to biological significance! Additionally, you should take into account the selection and size of the input sequences (sample) and the assumed normal distribution of the background. |
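
The overrepresentation and Z-score columns described in the table above can be reproduced by hand from the number of found matches, the expected value and the standard deviation. The following Python sketch follows the definitions given there (found versus expected, and z = (x - E - 0.5) / S). The function names and the example numbers are hypothetical and only serve as an illustration; how the expected value and standard deviation are derived from the background is not shown.

```python
def overrepresentation(found: int, expected: float) -> float:
    """Fold factor: found matches versus expected matches."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return found / expected


def z_score(found: int, expected: float, std_dev: float) -> float:
    """Z-score with continuity correction: z = (x - E - 0.5) / S."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (found - expected - 0.5) / std_dev


# Hypothetical values: 30 matches found, 20 expected, standard deviation 4.
x, e, s = 30, 20.0, 4.0
print(overrepresentation(x, e))  # 1.5
print(z_score(x, e, s))          # 2.375
```

With these example values, the Z-score lies above 2 and would therefore fall into the range described above as statistically significant.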

| © 1998-2011 Genomatix Software GmbH - All rights reserved |