
ModelInspector: Search for Sequence Models


[Introduction] [Input] [Parameters] [Output] [References]

Introduction

ModelInspector uses a library of predefined models or models defined with FastM or FrameWorker to scan DNA sequences for matches to these models. A model consists of various individual elements (like transcription factor binding sites, repeats, hairpins), their strand orientation, their sequential order, and their distance ranges.
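
As a rough illustration of what such a model description contains, the following Python sketch represents a model as an ordered list of elements, each with a strand orientation and an allowed distance range to the next element. The class names, field names, model name, and the distance range of 1 to 10 bp are hypothetical and chosen only for illustration; they do not reflect the internal format used by ModelInspector, FastM, or FrameWorker.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ModelElement:
    name: str      # e.g. a matrix family such as "V$YY1F"
    strand: str    # "+", "-", or "both"
    # Allowed (min, max) distance in bp to the next element; None for the last element.
    distance_to_next: Optional[Tuple[int, int]] = None


@dataclass
class Model:
    name: str
    elements: List[ModelElement]   # the sequential order of the elements matters


def distances_ok(model, observed):
    """Check observed distances between consecutive elements against the model's ranges."""
    ranges = [e.distance_to_next for e in model.elements[:-1]]
    if len(observed) != len(ranges):
        return False
    return all(lo <= d <= hi for d, (lo, hi) in zip(observed, ranges))


# Hypothetical two-element model (distance range invented for this example).
example = Model("example_model", [
    ModelElement("V$YY1F", "+", (1, 10)),
    ModelElement("V$SRFF", "+"),
])
```

With this sketch, `distances_ok(example, [1])` accepts a candidate whose two elements are 1 bp apart, while a distance of 11 bp would be rejected.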

ModelInspector uses a proprietary scoring algorithm to allow inclusion of very different element types into the composite scoring of matches. Thus, IUPAC sequence elements can be successfully combined with different types of weight matrices and structural elements (e.g. hairpins) in the assessment of match quality.

The ModelInspector and FastM algorithms are described in Frech et al., 1997 (JMB), and Klingenhoff et al., 1999 (Bioinformatics).


The following predefined libraries are available:

All models of the Genomic Repeat and Long Terminal Repeat Library show a very high specificity.


Input

General: Sequence Formats
Accepted DNA sequence formats: The following formats for DNA sequences are accepted:
The sequence should contain only IUPAC characters; any other characters will be skipped!
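
The skipping of non-IUPAC characters can be pictured with a small Python sketch. It assumes the standard IUPAC nucleotide letters and case-insensitive handling; the function name is hypothetical, and the server's actual treatment of special symbols may differ.

```python
# IUPAC nucleotide codes (assumption: letters only, upper or lower case).
IUPAC_DNA = set("ACGTURYSWKMBDHVN")


def keep_iupac(raw):
    """Return the sequence with every non-IUPAC character dropped."""
    return "".join(ch for ch in raw if ch.upper() in IUPAC_DNA)
```

For example, digits, blanks, line breaks, and symbols such as `*` in a pasted sequence would simply disappear from the analyzed sequence.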
Sequence Input
Choose from your previously uploaded sequences: Select a sequence file from the list of your personal sequence files which were saved in the result management in prior analyses (via "add sequences", see below).
Quick Upload new: Paste your sequence(s) in the form field in one of the accepted formats (see above). Note that sequences pasted in the "quick upload" field are not saved for future use.
Add sequences

Sequences or sequence files uploaded here are automatically saved in the result management for later use:

Enter the formatted DNA sequence(s): Enter your correctly formatted sequence(s) directly into the form, e.g. with copy and paste (see above for accepted formats).
or upload a file containing sequence(s) (max. 100 MB): If your browser supports this option, a sequence file can be uploaded.
If you use this option, the file should contain the sequence(s) in one of the formats listed above.
Please note that the size of uploaded files is limited to 100 MB. If you want to analyze larger sequences, please contact support@genomatix.de. For whole chromosomes you can use the accession number option below (e.g. 'NC_000001' for human chromosome 1).
Accession number(s): If you are interested in one or several specific sequences from a database section, you can supply a list of accession numbers. If you want to select more than one accession number, please separate the accession numbers by commas or spaces.
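
To make the separator rule concrete, here is a minimal Python sketch that splits such an input on commas and/or whitespace. The function name is hypothetical, and the sketch does not check whether an accession number actually exists in any database.

```python
import re


def parse_accessions(text):
    """Split a user-supplied string on commas and/or whitespace, dropping empty items."""
    return [item for item in re.split(r"[,\s]+", text.strip()) if item]
```

An input such as `M65229, NM_000402 GXP_107276,rs1234` would yield four separate accession numbers.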

On the Genomatix server accession numbers from the following databases can be entered:

  • GenBank (sections Bacteria, Invertebrates, Other Mammalian, Other Vertebrates, Plants, Primates, Rodents, Viruses, ESTs) (e.g. 'M65229')
  • Eukaryotic Promoter Database (EPD) (e.g. 'EP30014')
  • NCBI Reference Sequences (mRNA sequences) (e.g. 'NM_000402')
  • Genomatix Promoter Database (e.g. 'GXP_107276')
  • dbSNP (e.g. 'rs1234')
Database input
Select one of these database-sections: On the Genomatix server the following databases are available:
  • Genomatix Promoter Database: Promoters of annotated genes
    Subset of all human, mouse, and rat promoters. Promoters of
    hypothetical proteins (e.g. Loc127262) or genes that are annotated as
    "similar to ..." (e.g. Loc419384) are omitted.
  • Genomatix Promoter Database: Promoters of all genes
    All promoter sequences extracted from ElDorado genomes with "Genomatix optimized length" (1,000 bp upstream of the first TSS and 100 bp downstream of the last TSS).
  • Genomatix ElDorado Genomes
    All genomes available in ElDorado (human, mouse, rat, chimpanzee, rhesus monkey, dog, opossum, platypus, cow, horse, chicken, zebrafish, fruitfly, Anopheles, honeybee, C. elegans, Arabidopsis and rice)
  • Other databases
    • Philipp Bucher's Eukaryotic Promoter Database (EPD)
    • NCBI Reference Sequences (mRNA sequences)
  • GenBank sections
    The sections Bacteria, Invertebrates, Other Mammalian, Other Vertebrates, Plants, Primates, Rodents, and Viral are available.
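
The "Genomatix optimized length" mentioned above (1,000 bp upstream of the first TSS and 100 bp downstream of the last TSS) can be worked through in a short Python sketch. The function name and the coordinate convention (positions on the forward-strand axis, both ends inclusive, no clipping at sequence boundaries) are assumptions made for this example, not the ElDorado implementation.

```python
def promoter_region(tss_positions, strand, upstream=1000, downstream=100):
    """Return (start, end) on the forward-strand coordinate axis, with start <= end."""
    lowest, highest = min(tss_positions), max(tss_positions)
    if strand == "+":
        # First TSS is the lowest coordinate, last TSS the highest.
        return lowest - upstream, highest + downstream
    # On the minus strand transcription runs towards lower coordinates:
    # the first TSS is the highest coordinate and "upstream" means higher values.
    return lowest - downstream, highest + upstream
```

For a plus-strand gene with TSSs at positions 5000 and 5200, this gives the region 4000 to 5300; for a minus-strand gene with the same TSS coordinates it gives 4900 to 6200.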

In case you have selected a section from the GenBank database you may also restrict the analysis to sequences containing user-defined keywords in their annotation. You can enter keywords which will be searched in

  • the keyword line of the annotation
  • the description line of the annotation
  • the complete annotation

The keyword searches can be combined with "AND" or "OR". Please note that the keywords cannot contain blanks (all blanks will be skipped).
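
The combination of keywords can be sketched as follows in Python. The function name is hypothetical, and the case-insensitive substring comparison is an assumption; the help text only states that blanks are skipped and that keywords are combined with "AND" or "OR".

```python
def matches_keywords(annotation, keywords, mode="AND"):
    """Substring search in the annotation text; blanks inside keywords are removed first."""
    cleaned = [k.replace(" ", "").lower() for k in keywords if k.strip()]
    text = annotation.lower()
    hits = [k in text for k in cleaned]
    # "AND" requires every keyword to be found, "OR" at least one.
    return all(hits) if mode == "AND" else any(hits)
```

Because blanks are removed, a keyword entered as `growth factor` would be searched as `growthfactor`, so multi-word phrases should be split into separate keywords.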

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!

ModelInspector Parameters

Model Library Selection
Library version: Here you can select a previous version of the promoter module library. This can be helpful for reproducing old results. By default, the latest promoter module library is selected (please see the Module Library Release Notes). Each version of the module library corresponds to a specific version of the matrix library (the matrix library version with which the module definitions have been generated):

Module Library Version    Matrix Library Version
Module Library 6.1        Matrix Library 10.0
Module Library 6.0        Matrix Library 9.4
Module Library 5.9        Matrix Library 9.3
Module Library 5.8        Matrix Library 9.2
Module Library 5.7        Matrix Library 9.1
Module Library 5.6        Matrix Library 9.0
Module Library 5.5        Matrix Library 8.4
Module Library 5.4        Matrix Library 8.3
Module Library 5.3        Matrix Library 8.2
Module Library 5.2        Matrix Library 8.1
Module Library 5.1        Matrix Library 8.0
Module Library 5.0        Matrix Library 7.1
Module Library 4.5        Matrix Library 7.0
Module Library 4.4        Matrix Library 6.3
Module Library 4.3        Matrix Library 6.2
Module Library 4.2        Matrix Library 6.1

The module library version selected also affects searches for user-defined models (i.e. the matrix library version corresponding to the module library version selected is used for the model searches). It is necessary to change the module library version (and thus the matrix library version) when user-defined models contain matrix families that have been removed or renamed in newer library versions.

Model groups: Please choose one or several of the available Genomatix model libraries.
If you have created your own models with FastM or FrameWorker, they can be found in the "User-defined models" library.

You can decide if you want to

  • use all the models in the chosen libraries
  • use previously defined model subsets or
  • continue with a subset selection.

In the third case, there will be a separate page with a list of all models in the chosen libraries and you can select your model subset by clicking the checkboxes for each model.

If you started ModelInspector directly from a previous FastM session, the list will only contain the model you just created with FastM.

Search Parameters
Max. number of matches: Enter the maximum number of matches in the output file.
In case the output is filtered for matches occurring in selected annotated sequence regions, only the filtered matches are considered for the maximum number of matches.

Hint for user-defined models:
If you find a lot of matches in the first few sequences of a database section (e.g. most sequence names with matches starting with A or B), you might want to change your model to be more specific (e.g. raise the matrix similarities for binding sites, thus defining a higher specificity for the binding site search; or restrict the strand orientation to the sense or antisense strand). In addition, you can supply more specific (i.e. smaller) distance ranges.

Threshold: Enter a threshold for the output of model matches.
This value gives the minimum score that a match has to reach to appear in the output file. The value is given in percent of the number of individual elements of the model. Default is 100 % (i.e. all elements of the model have to be present).
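
The arithmetic behind the threshold can be illustrated with a small Python sketch. It assumes, purely for illustration, that every element contributes equally, so that the score is the percentage of model elements found; the actual ModelInspector scoring is proprietary and may weight elements differently. The function name is hypothetical.

```python
def passes_threshold(elements_found, elements_in_model, threshold_percent=100.0):
    """Compare the percentage of model elements found with the threshold."""
    score_percent = 100.0 * elements_found / elements_in_model
    return score_percent >= threshold_percent
```

Under this assumption, a model with four elements and a threshold of 75 % would report matches in which at least three of the four elements are present, while the default of 100 % requires all four.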

Strand: If this option is checked, only the top strand of the input or database sequences is scanned for model matches. By default, both strands are searched.

Annotation filter: Generally, all matches are listed in the output. Alternatively, the output can be filtered for matches located in
  • regions annotated as 3'UTR
  • regions annotated as 5'UTR
  • regions annotated as exons
  • regions annotated as introns
  • regions annotated as promoters
  • regions annotated as repeats
  • unannotated sequence regions
Note: The annotation filter is only available for databases where the above mentioned features are annotated in the feature table of the database entries, i.e. the filter can be used for ElDorado genomes, EMBL and GenBank database sections.

Ranking: In case one of the Genomatix promoter databases is scanned with a model, the search results are evaluated by calculation of p-values for Gene Ontology groups. This evaluation shows whether genes identified by the model are functionally related.

As the Gene Ontology ranking takes some time, it can be switched off.

Output Parameters
Offset for match positions: Enter an offset (in base pairs; may be negative) that will be added to each position in the output file.
This feature can be used, e.g., in cases where the transcription start site (TSS) is known and positions should be given relative to the TSS. For example, if the TSS is at position 500 in a sequence, the offset should be "-500" for relative positions.
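
As a worked example of the offset, the following Python sketch adds the offset to both ends of a match. The function name is hypothetical; the match position 378 - 397 is taken from the example output further below.

```python
def shift_match(start, end, offset):
    """Add the offset to both ends of a match position."""
    return start + offset, end + offset
```

With the TSS at position 500 and an offset of -500, a match at 378 - 397 would be reported at -122 to -103, i.e. upstream of the TSS.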

Statistics: If the statistics option is checked, only the statistics output is created. In this case the number of matches is unlimited.

This option is useful if you are interested in the total number of matches in a large data set (e.g. the Human Genome) as the number of matches shown in the match overview is limited.

Alternative matches: If the alternative matches option is checked, alternative model matches are displayed additionally in the detailed output.

For overlapping model matches (i.e. model matches where the start and end positions are identical or differ by less than 5 base pairs), only the model match with the highest score of individual elements is shown. The alternative model matches can be displayed optionally in the detailed output in order to check the positions and scores of the individual elements.
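
The handling of overlapping matches can be sketched in Python as follows. The sketch assumes that all matches belong to the same model and strand, that "overlapping" means both the start and the end positions differ by less than 5 bp, and that each match carries a single numeric score; the function name and the scores in the usage note are hypothetical.

```python
def split_alternatives(matches, tolerance=5):
    """matches: list of (start, end, score). Returns (shown, alternatives)."""
    shown, alternatives = [], []
    # Visit the best-scoring matches first so that they are the ones that get shown.
    for m in sorted(matches, key=lambda x: -x[2]):
        overlaps = any(
            abs(m[0] - s[0]) < tolerance and abs(m[1] - s[1]) < tolerance
            for s in shown
        )
        (alternatives if overlaps else shown).append(m)
    return shown, alternatives
```

For three matches at 378 - 397 (score 0.90), 379 - 397 (score 0.95), and 403 - 422 (score 0.80), the sketch would show the second and third match and list the first one as an alternative match.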

Output sorted by: The output can be sorted
  • by the name of your models
  • by the quality of the model matches (best first)
  • or by sequence position (first match on the sequence first)
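
The three sort orders listed above can be sketched in Python as follows. The dictionary layout of a match ('model', 'score', 'start') and the function name are hypothetical and only serve the illustration.

```python
def sort_matches(matches, by="position"):
    """matches: list of dicts with 'model', 'score' and 'start' keys."""
    keys = {
        "model": lambda m: m["model"],       # by model name
        "quality": lambda m: -m["score"],    # best match first
        "position": lambda m: m["start"],    # first match on the sequence first
    }
    return sorted(matches, key=keys[by])
```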
Output filtered for: The output can be filtered for sequences in which at least the specified number of different model matches occur. By default, all sequences with at least one model match are shown in the output.
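
A minimal Python sketch of this filter is shown below. It assumes that "different model matches" means matches to distinct models within one sequence; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def filter_sequences(matches, min_models=1):
    """matches: list of (sequence_name, model_name) pairs.

    Keep the sequences that are matched by at least min_models distinct models.
    """
    models_per_seq = defaultdict(set)
    for seq, model in matches:
        models_per_seq[seq].add(model)
    return sorted(seq for seq, models in models_per_seq.items()
                  if len(models) >= min_models)
```

With `min_models=2`, a sequence that is matched only by a single model would be dropped from the output.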

Email address: Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    With this option, the URL of the result is shown directly in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available. Please restart the analysis using the option "Send the URL of the result via email" below.

  • Send the URL of the result via email
    With this option, an email with the URL of the results will be sent to the email address you provide when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!


ModelInspector Output

ModelInspector generates three output files, the match overview, the detailed output, and the statistics file.

1. Match overview

The first output file of ModelInspector contains:

Example for the match overview:

Sequence / Model Name            Position     Strand   Select Match
ep029015 [E11078] (1 - 600)
  YY1F_SRFF_02                   257 - 275    (+)
humactga [M19283] (1 - 575)
  YY1F_SRFF_01                   378 - 397    (+)
  YY1F_SRFF_02                   396 - 378    (-)
musga [L21996] (1 - 601)
  YY1F_SRFF_01                   403 - 422    (+)
  YY1F_SRFF_02                   421 - 403    (-)

Example for evaluation of results:

For this example, the experimentally verified module "CDEF_CHRF_01" from the Genomatix Promoter Module Library which is involved in cell cycle regulation was searched in all human promoters.

GO ranking example


2. Detailed output

The second output file of ModelInspector contains detailed information for each individual element of the model:

Example for the detailed match list:

Inspecting sequence humactga [M19283] (1 - 575):

Model: YY1F_SRFF_01 (378 - 397 (+))

Matrix element /   Position    Str   Sequence              Core sim. /   Mat. sim. /    Distance to
Model element                                              ---           Model sim.     next element
V$YY1F/YY1.01      378 - 396   (+)   GATCGCCATATATGGACAT   1.000         0.757          1 bp
V$SRFF/SRF.03      379 - 397   (+)   ATCGCCATATATGGACATG   1.000         0.996          ---

3. Statistics

The third output file of ModelInspector contains statistics on the model matches and detailed information for your own models:

Example for the statistics:

Model Name      # matches                       in # seq.
                total   (+) str.   (-) str.
YY1F_SRFF_01    6       2          4            4
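
The columns of the statistics can be derived from a list of matches as in the following Python sketch. The function name and the input layout are hypothetical; the sketch only shows how the total, the per-strand counts, and the number of sequences with matches relate to each other.

```python
def model_statistics(matches):
    """matches: list of (sequence_name, model_name, strand) tuples.

    Returns {model_name: (total, plus_strand, minus_strand, n_sequences)}.
    """
    stats = {}
    for seq, model, strand in matches:
        entry = stats.setdefault(model, {"plus": 0, "minus": 0, "seqs": set()})
        entry["plus" if strand == "+" else "minus"] += 1
        entry["seqs"].add(seq)
    return {
        model: (e["plus"] + e["minus"], e["plus"], e["minus"], len(e["seqs"]))
        for model, e in stats.items()
    }
```

The total is always the sum of the (+) and (-) strand counts, while the number of sequences can be smaller than the total when a sequence contains several matches of the same model.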


References

If you are interested in more details, ModelInspector and FastM are described in