
MatInspector: Search for transcription factor binding sites


[Introduction] [Input] [Parameters] [Output] [Lines of evidence] [References]
[Release Notes] [Library Statistics]

Introduction

MatInspector is a software tool that utilizes a large library of matrix descriptions for transcription factor binding sites to locate matches in DNA sequences. MatInspector is almost as fast as a search for IUPAC strings but has been shown to produce superior results. It assigns a quality rating to matches and thus allows quality-based filtering and selection of matches.

The first version of MatInspector is described in Quandt et al., 1995 (NAR). A paper describing various new features of MatInspector was published in 2005 (Cartharius et al., 2005, Bioinformatics).

Features of MatInspector


More background reading: [Algorithm details] [Matrix data]

Input

Generally, MatInspector can

Gene Name Input
Search promoters by gene

Use the combo box to select a gene. See below for details.
Please note that this option only works in combination with a valid "Genes & Genomes" account.

Sequence Input
Choose from your previously uploaded sequences
Select a sequence file from the list of your personal sequence files.
or enter the formatted DNA sequence(s)
Enter your correctly formatted sequence(s) directly into the form, e.g. by copy and paste.
The following formats are accepted:
The sequence should contain only IUPAC characters; any other characters will be skipped!
or upload a file containing sequence(s) (max. 100 MB)
If your browser supports this option, a sequence file can be uploaded.
If you use this option, the file should contain the sequence(s) in one of the following formats:
Please note that the size of uploaded files is limited to 100 MB. If you want to analyze larger sequences, please contact support@genomatix.de. For whole chromosomes you can use the accession number option below (e.g. 'NC_000001' for human chromosome 1).
or enter accession number(s)
If you are interested in one or several specific sequences from a database section, you can supply a list of accession numbers in the form. To enter more than one accession number, separate the accession numbers by commas or spaces.

On the Genomatix server accession numbers from the following databases can be entered:

  • GenBank (sections Bacteria, Invertebrates, Other Mammalian, Other Vertebrates, Plants, Primates, Rodents, Viruses, ESTs) (e.g. 'M65229')
  • Eukaryotic Promoter Database (EPD) (e.g. 'EP30014')
  • NCBI Reference Sequences (mRNA sequences) (e.g. 'NM_000402')
  • Genomatix Promoter Database (e.g. 'GXP_107276')
  • dbSNP (e.g. 'rs1234')
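
The rule that several accession numbers may be separated by commas or spaces can be illustrated with a short sketch. The function name `split_accessions` is hypothetical and not part of MatInspector; it only shows how such an input field could be tokenized on the client side before submission.

```python
import re


def split_accessions(text: str) -> list[str]:
    """Split user input on commas and/or whitespace, dropping empty tokens.

    Hypothetical helper for illustration only; it does not validate
    that the tokens are accession numbers known to any database.
    """
    return [token for token in re.split(r"[,\s]+", text.strip()) if token]


# Example accession numbers taken from the list above.
accessions = split_accessions("M65229, NM_000402 GXP_107276,rs1234")
print(accessions)
```

Mixed separators (comma, comma plus blank, blank only) are all handled by the same regular expression.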
Search corresponding promoters for your sequence(s)

If you activate this checkbox, your input sequence(s) is/are mapped against the organism which you choose from the drop-down list. See below for details.
Please note that this option only works in combination with a valid "Genes & Genomes" account.

Database input
Select one of these database sections
On the Genomatix server the following databases are available:
  • Genomatix Promoter Database: Promoters of annotated genes
    Subset of all human, mouse, and rat promoters. Promoters of
    hypothetical proteins (e.g. Loc127262) or genes that are annotated as
    "similar to ..." (e.g. Loc419384) are omitted.
  • Genomatix Promoter Database: Promoters of all genes
    All promoter sequences extracted from ElDorado genomes with "Genomatix optimized length" (500 bp upstream of the first TSS and 100 bp downstream of the last TSS).
  • Genomatix ElDorado Genomes
    All genomes available in ElDorado (human, mouse, rat, chimpanzee, rhesus monkey, dog, opossum, platypus, cow, horse, chicken, zebrafish, fruitfly, Anopheles, honeybee, C. elegans, Arabidopsis and rice)
  • Other databases
    • Philipp Bucher's Eukaryotic Promoter Database (EPD)
    • NCBI Reference Sequences (mRNA sequences)
  • GenBank sections
    The sections Bacteria, Invertebrates, Other Mammalian, Other Vertebrates, Plants, Primates, Rodents, and Viral are available.
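
The "Genomatix optimized length" mentioned in the database list above (500 bp upstream of the first TSS and 100 bp downstream of the last TSS) can be sketched as a small coordinate calculation. This is an illustration under stated assumptions, not the ElDorado extraction code: the function name `promoter_region` is hypothetical, coordinates are treated as 1-based genomic positions, and "first" and "last" TSS are interpreted in the direction of transcription.

```python
def promoter_region(tss_positions, strand, upstream=500, downstream=100):
    """Return (start, end) of a promoter region around one or more TSSs.

    Assumption: on the (+) strand the first TSS is the smallest coordinate,
    on the (-) strand it is the largest one, because transcription runs
    towards smaller coordinates there.
    """
    lowest, highest = min(tss_positions), max(tss_positions)
    if strand == "+":
        return lowest - upstream, highest + downstream
    if strand == "-":
        return lowest - downstream, highest + upstream
    raise ValueError("strand must be '+' or '-'")


print(promoter_region([1000, 1200], "+"))
print(promoter_region([1000, 1200], "-"))
```

Whether the TSS position itself counts towards the upstream or the downstream part is not stated in this help page, so the exact region length of this sketch may differ by one base from the sequences delivered by the server.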

In case you have selected a section from the GenBank database you may also restrict the analysis to sequences containing user-defined keywords in their annotation. You can enter keywords which will be searched in

  • the keyword line of the annotation
  • the description line of the annotation
  • the complete annotation

The keyword searches can be combined with "AND" or "OR". Please note that the keywords cannot contain blanks (all blanks will be skipped).

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
HINT: If you want to check a very short oligo, please enter the sequence padded with a few Ns at the beginning and end!
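
The padding recommended in the hint can be prepared with a few lines of code. The helper name `pad_oligo` and the default of five Ns per side are assumptions for illustration; the help text only asks for "a few" Ns.

```python
def pad_oligo(oligo: str, flank: int = 5) -> str:
    """Surround a short oligo with `flank` N characters on both sides."""
    return "N" * flank + oligo.strip().upper() + "N" * flank


print(pad_oligo("tgacgtca", flank=3))
```

Note that positions reported for the padded sequence are shifted by the number of Ns added at the beginning.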

Promoter Finding in MatInspector

Please note that this option only works in combination with a valid "Genes & Genomes" account.

MatInspector 8.0 introduced the possibility to submit promoter sequences from Genomatix' ElDorado database directly to MatInspector.
There are two ways to achieve this:

1. Using the gene input combo box

Gene Name Input
Search promoters by gene

The text field for entering a gene name is a combo box. This means that, while you type a gene name in the field, a drop-down list appears. You can then select your gene of interest from that list by left-clicking the item.

Please note that it may take a few moments before the drop-down list appears. The list items have to be computed and sent to your browser while you are typing. This may depend on your client machine and internet connection.

Should you experience any rendering problems with the drop-down list, please have a look at our Technical FAQs.

Gene combo box

2. Mapping your sequence

Sequence Input
...
Search corresponding promoters for your sequence(s)

To activate the mapping, simply check the box in this field. Of course, you must use one of the sequence input options as well.
Choose an organism from the drop-down list.

Map sequence option

If you submitted a gene, all promoters of this gene are extracted directly from the ElDorado genome database. If you submitted sequences, they are mapped against the selected genome in the ElDorado database. The exon/intron structure of the mapped sequences is compared to all transcripts annotated for the corresponding genomic region. The promoters of all transcripts with at least one exon identical to one of the mapped exons match your query.
On the search result page, you may select the promoters for analysis with MatInspector.
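
The selection rule described above (a transcript matches the query if at least one of its exons is identical to one of the mapped exons) can be sketched as follows. Exons are represented as (start, end) tuples; the function name and the transcript identifiers in the example are hypothetical, and the mapping step itself is not shown.

```python
def transcripts_sharing_exon(mapped_exons, transcripts):
    """Return names of transcripts with at least one exon identical to a mapped exon.

    mapped_exons: iterable of (start, end) tuples from the mapped input sequence
    transcripts:  dict of transcript name -> list of (start, end) exon tuples
    """
    mapped = set(mapped_exons)
    return [name for name, exons in transcripts.items() if mapped & set(exons)]


# Hypothetical annotation of one genomic region.
annotation = {
    "transcript_A": [(100, 200), (300, 400)],
    "transcript_B": [(120, 200), (300, 400)],
    "transcript_C": [(500, 600)],
}
print(transcripts_sharing_exon([(300, 400), (450, 480)], annotation))
```

In this toy example the promoters of transcript_A and transcript_B would be offered for analysis, because both share the exon (300, 400) with the mapped sequence.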

Promoter selection

Some notes on promoter finding:


Matrix Parameters

Depending on the selected MatInspector library a form with more parameters to fill in will appear:

Matrix Search Parameters
Library version
Here you can select a previous version of the matrix library. This can be helpful for reproducing old results.
By default, the latest matrix library is selected (please see the Library Statistics and the Library Release Notes).

Note: Certain parameter settings require the matrix library to be reset automatically, overriding your selection. The assignment of genes to transcription factors (in MatBase) depends on the ElDorado database version, i.e. each matrix library version corresponds to exactly one ElDorado version. The ElDorado version, however, is also taken into account for the literature-based lines of evidence. Literature analysis is available for the following organisms: all vertebrates, both yeasts, fruitfly and thale cress. This means that, whenever the literature-based lines of evidence are available, the matrix library which corresponds to the current ElDorado database must be selected. In particular, this is required if
  • you use the gene input option
  • the input gene is from one of the organisms listed above
  • your matrix selection contains a matrix from a matrix group corresponding to this organism (i.e. the "Vertebrates" section for a vertebrate organism, the "Insects" section for fruitfly, the "Fungi" section for yeast, the "Plants" section for thale cress)
  • you choose the lines of evidence option
These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Matrix group
The MatInspector matrix library consists of carefully selected descriptions for transcription factor binding sites.

The matrix library is divided into the subsections/groups

  • Fungi
  • Insects
  • Plants
  • Vertebrates
  • Miscellaneous (e.g. Bacteria) and
  • General Core Promoter Elements (e.g. TATA box)
You can either select all matrices from one section or select a subset of individual matrices (or matrix families, depending on the "Matrix family"-selection) from each group via another page which will appear after you submit your query.
When selecting a subset of matrices, the core and matrix thresholds can also be set individually for each selected matrix.

A selected subset of matrices can also be saved in a personal directory and can be retrieved via the "use previously defined matrix subsets"-option.

Note that the list of previously defined subsets depends on the "Matrix family"-selection! (There is a difference between matrix family subsets and individual matrix subsets.)


Using the link "Check transcription factor <-> matrix family assignment" in the left column you can look up which transcription factor binding sites are represented by which matrix families. You can either enter the name of a transcription factor or the name of a matrix / matrix family from the current MatInspector library.
  • If you enter a transcription factor (e.g. SP1) you will receive the GeneIDs of the corresponding genes and the matrix families that represent the DNA binding site of this transcription factor.

  • If you enter the name of a matrix / matrix family (e.g. V$SP1F) you will receive the GeneIDs of all transcription factors that are assigned to this matrix family.
Matrix families
Each matrix belongs to a so-called matrix family, in which functionally similar matrices are grouped together, which eliminates redundant matches by MatInspector professional.

All matrices in a family have the same (odd) length and are assigned an anchor position, which is the center position of the matrix. This ensures that matrices of a family match at exactly the same position.

If matrix families are selected, MatInspector will only list the best match from a family for each site. Otherwise (individual matrices selected) different but closely related matrices might match at the same position on the sequence (example).
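
The family filter described above can be sketched as a grouping step: all matches of one family at the same anchor position are collapsed to the one with the highest matrix similarity. This is a minimal illustration, not the MatInspector implementation. The `Match` record, the function name, and the family and matrix names in the example are hypothetical, and treating the two strands as separate sites is an assumption.

```python
from typing import NamedTuple


class Match(NamedTuple):
    family: str      # matrix family name
    matrix: str      # individual matrix name
    anchor: int      # anchor (center) position of the match
    strand: str      # "+" or "-"
    mat_sim: float   # matrix similarity


def best_per_family(matches):
    """Keep only the best-scoring match per (family, anchor, strand)."""
    best = {}
    for m in matches:
        key = (m.family, m.anchor, m.strand)
        if key not in best or m.mat_sim > best[key].mat_sim:
            best[key] = m
    return sorted(best.values(), key=lambda m: (m.anchor, m.family))


hits = [
    Match("V$FAMA", "V$A.01", 42, "+", 0.91),
    Match("V$FAMA", "V$A.02", 42, "+", 0.95),  # same family, same site
    Match("V$FAMB", "V$B.01", 42, "+", 0.88),  # different family, kept
]
for hit in best_per_family(hits):
    print(hit)
```

Because all matrices of a family share the same length and anchor position, grouping by anchor is sufficient to detect matches to the same site.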

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Core similarity
The "core sequence" of a matrix is defined as the (usually 4) most highly conserved positions of the matrix.

The maximum core similarity of 1.0 is only reached when the most highly conserved bases of a matrix match exactly in the sequence.
Only matches that contain the "core sequence" of the matrix with a score higher than the core similarity are listed in the output.

Increasing the core similarity will miss matches that have one or more mismatches in the core region but have a high similarity to the rest of the matrix. (This should only be done to enhance the performance of MatInspector.)

Decreasing the core similarity (while retaining the same matrix similarity) might give a few more matches in the output that have more mismatches in the core region of the matrix.
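
To make the effect of the core similarity threshold more concrete, here is a small sketch. It assumes that the core similarity is the sum of the matrix values of the bases observed at the core positions, divided by the largest sum achievable at those positions; the authoritative definition is given in the algorithm details and the MatInspector papers. The toy matrix, the core positions and the function name are hypothetical.

```python
# Toy nucleotide distribution matrix: one dict of base frequencies per position.
toy_matrix = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.9, "G": 0.0, "T": 0.1},
    {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
]
core_positions = [1, 2, 3, 4]  # 0-based indices of the four most conserved positions


def core_similarity(matrix, core, site):
    """Score of the observed core bases relative to the best possible core score."""
    observed = sum(matrix[i].get(site[i], 0.0) for i in core)
    best = sum(max(matrix[i].values()) for i in core)
    return observed / best


for site in ("ATGACA", "ATGATA"):
    print(site, round(core_similarity(toy_matrix, core_positions, site), 3))
```

The first site carries the most conserved base at every core position and reaches 1.0. The second site has one mismatch in the core and would be discarded by a strict core similarity threshold, regardless of how well the rest of the site fits.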

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Matrix similarity
The matrix similarity is calculated as described in the MatInspector papers.

A perfect match to the matrix gets a score of 1.00 (each sequence position corresponds to the most highly conserved nucleotide at that position in the matrix). A "good" match to the matrix usually has a similarity of > 0.80. Mismatches in highly conserved positions of the matrix decrease the matrix similarity more than mismatches in less conserved regions.

Increasing the matrix similarity will find fewer matches in your sequence, and might miss matches that have a "mismatch" compared to the matrix.

Decreasing the matrix similarity will find more matches in your sequence.
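
The statement that mismatches at highly conserved positions cost more than mismatches at weakly conserved positions can be illustrated with a conservation-weighted score. The sketch below is one plausible weighting, modeled on an information-content style conservation index; it is an assumption for illustration and not a reproduction of the MatInspector code. Please refer to the algorithm details for the actual calculation. All names and the toy matrix are hypothetical.

```python
import math

toy_matrix = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.9, "G": 0.0, "T": 0.1},
    {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
]


def conservation_index(column):
    """Weight of a position: 100 for a fully conserved column, low for a uniform one."""
    entropy_term = sum(p * math.log(p) for p in column.values() if p > 0)
    return 100.0 / math.log(5) * (entropy_term + math.log(5))


def matrix_similarity(matrix, site):
    """Conservation-weighted score of a site relative to the best possible score."""
    if len(site) != len(matrix):
        raise ValueError("site and matrix must have the same length")
    weights = [conservation_index(col) for col in matrix]
    observed = sum(w * col.get(b, 0.0) for w, col, b in zip(weights, matrix, site))
    best = sum(w * max(col.values()) for w, col in zip(weights, matrix))
    return observed / best


for site in ("ATGACA", "ATGACC", "ACGACA"):
    print(site, round(matrix_similarity(toy_matrix, site), 3))
```

In this toy example the perfect site scores 1.0, a mismatch at the weakly conserved last position lowers the score only slightly, and a mismatch at the fully conserved second position pushes it well below the usual 0.80 level.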

The matrix similarity is correlated to the re-value of a matrix: A matrix with a high re-value will find more matches even with a high matrix similarity than a well-defined matrix (low re-value).

Since there are binding sites that are biologically quite "loosely" defined, a high re-value is not necessarily a sign of a "bad" matrix description. A very low re-value might even be a sign of a description that is too strict.

Optimized matrix similarity:
Thresholds that minimize false positives for each individual matrix are supplied with our library and can be selected from the pull-down menu (example).

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!

IUPAC String Parameters

MatInspector can also perform searches for user-defined IUPAC strings or strings from predefined IUPAC-libraries instead of matrices:

IUPAC String Parameters
User-defined IUPAC string:
MatInspector will locate matches to this user-defined IUPAC string. Only the IUPAC symbols ABCDGHKMNRSTUVWY can be used (e.g. R is A or G); all other letters are ignored.
Please specify the maximum number of mismatches that are allowed in matches to the string (these can occur at any position of the string). The number of mismatches should not exceed 50% of the string length.
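
A search for a user-defined IUPAC string with a limited number of mismatches can be sketched as a sliding-window comparison. This is a minimal illustration and not the MatInspector search routine: the function name `find_iupac` is hypothetical, only the given strand is scanned, and a mismatch limit above 50% of the string length is rejected outright, whereas the help text merely advises against it.

```python
# Standard IUPAC nucleotide codes (U is treated like T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_iupac(pattern, sequence, max_mismatches=0):
    """Return (1-based position, matched window, mismatch count) for each hit."""
    # Letters that are not IUPAC symbols are ignored, as described above.
    pattern = "".join(c for c in pattern.upper() if c in IUPAC)
    if not pattern:
        raise ValueError("pattern contains no IUPAC symbols")
    if max_mismatches > len(pattern) / 2:
        raise ValueError("mismatches should not exceed 50% of the string length")
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = sum(base not in IUPAC[sym] for sym, base in zip(pattern, window))
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits


print(find_iupac("CANNTG", "ttCAGCTGaa"))
print(find_iupac("GGATCC", "AAGGATCGAA", max_mismatches=1))
```

The second call shows why the mismatch limit should be kept small: every allowed mismatch widens the set of accepted sites at any position of the string.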
Predefined IUPAC library:
If IUPAC families are selected, MatInspector will only list the best match from a family for each site. Otherwise (individual IUPACs selected) a single site might match different but closely related IUPAC strings.

The IUPAC libraries provided are

  • Retroviral Primer Binding Sites (tRNA segments)
  • Restriction Sites (several libraries)
  • Plant TF sites (based on PLACE)
    PLACE has been constructed and maintained at NIAS (National Institute of Agrobiological Sciences). PLACE is available free of charge at http://www.dna.affrc.go.jp/htdocs/PLACE/. Identical and similar TF binding sites of PLACE have been grouped into IUPAC families and very unspecific binding sites (e.g. CANNTG) have been removed.
You can either select all IUPAC strings from one library or select a subset of individual strings (or IUPAC families, depending on the "IUPAC family"-selection) from each group via another page which will appear after you submit your query.

A selected subset of IUPAC strings can also be saved in a personal directory and can be retrieved via the "use previously defined IUPAC subsets"-option.

Note that the list of previously defined subsets depends on the "IUPAC family"-selection! (There is a difference between IUPAC family subsets and individual string subsets.)


Output Parameters

Output Parameters
Lines of evidence
You may set some options for lines of evidence:
  • Show additional lines of evidence:
    With this option the lines of evidence are computed.

  • Check for user-defined models:
    This option allows any user-defined models to be searched for when the model match line of evidence is computed. If you do not select the "Show additional lines of evidence" option, this option is ignored.
    If you do not have user-defined models, this option is not displayed.

There is a limit for the computation of the lines of evidence. For database searches, or if the combined length of all input sequences is above 1 million base pairs, the lines of evidence are not available.

Statistics

Depending on this option, the output can be reduced to contain only the match summary table (no graphics, no match details). For database searches it can be useful to view only the statistics and not the result list, as the number of matches listed in detail is limited to 5000.

Extra output
The following extra output options are available:
  • Show graphics:
    Together with the list of matches, a graphical representation of the result is displayed in an extra tab (see interactive graphics).

    By default, the graphics are shown for your search results. For database searches or if there are too many sequences (more than 50), this option is not available. If the graphics file is too large, your browser might have problems displaying it and/or the interactive features might be too slow.

  • Show matches aligned with sequence:
    With this option, the TF binding site matches are displayed aligned with the sequence (see map output).

    For database searches or if there are too many sequences (more than 50), this option is not available. Also, if a sequence has a length of more than 5000bp, the map is omitted in the output.

  • Allow extraction of matches:
    With this option an additional column is shown in the result table where individual matches can be selected for extraction. The length of the extracted sequence regions can be entered below the match table.
These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Offset for match positions
You can supply MatInspector with a number of base pairs that will be added to each position in the output (the number can also be negative).

For example:
If you know the position of the transcription start site (or any position you want to use as the "zero-point" of your sequence), MatInspector can give all matches relative to this position:
if the transcription start site is at position 100, just enter -100 as the offset!
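
The arithmetic behind the offset is a simple addition, shown here as a short worked example. The function name `relative_position` is hypothetical.

```python
def relative_position(position: int, offset: int) -> int:
    """Position as it would be reported after adding the offset."""
    return position + offset


tss = 100        # position chosen as the "zero-point"
offset = -tss    # value to enter in the offset field
for position in (34, 100, 150):
    print(position, "->", relative_position(position, offset))
```

With this setting, matches upstream of the chosen zero-point receive negative positions and matches downstream receive positive ones.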

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!
Email address
Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    With this option the URL of the result is shown directly in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available. Please restart the analysis using the option "Send the URL of the result to" below.

  • Send the URL of the result via email
    With this option an email with the URL of the results will be sent to the email address you provide when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!

These parameters are hidden by default. You can use the reveal box next to the section header to reveal them!

Program Output

MatInspector creates an output file that contains (depending on your parameter settings)

For details on the algorithm or how core and matrix similarity are calculated, please see the algorithm details.

The analysis is terminated once 5000 matches have been found, because a larger number of matches would produce a huge output file that may crash your browser when it is displayed.


Graphical View of matches:

Here is an example output for a matrix search using the vertebrate group of matrices:

common transcription factors

The features of the interactive graphics, e.g. filtering for certain matches, are described here.


Table with Match Details:

Result table


Color code for matrix similarity column:

A green background in the matrix similarity column marks a similarity above optimized, a red background marks a similarity below optimized (e.g. if a search was started using "optimized - 0.02").

Color code for sequences:

For displaying the sequences which match a matrix in the MatInspector library the following code is used:

Interactive features of the result table

Note: Interactive handling of the result tables is deactivated if there are more than 50 sequences or more than 2500 matches in one sequence or more than 4000 matches in all sequences in the output.
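
The three limits in the note above can be checked for a given result before relying on the interactive features. The sketch below only restates the documented limits; the function name and the input structure (a mapping from sequence name to number of matches) are assumptions.

```python
def interactive_table_enabled(matches_per_sequence):
    """Return True if none of the documented limits is exceeded.

    matches_per_sequence: dict of sequence name -> number of matches
    """
    counts = list(matches_per_sequence.values())
    too_many_sequences = len(counts) > 50
    too_many_in_one = any(count > 2500 for count in counts)
    too_many_overall = sum(counts) > 4000
    return not (too_many_sequences or too_many_in_one or too_many_overall)


print(interactive_table_enabled({"seq1": 1200, "seq2": 900}))
print(interactive_table_enabled({"seq1": 2600}))
```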


Map:

The map output is displayed if you selected the "Show matches aligned with sequence" option from the output parameters.

(   4)      +MKCCCSCNGGCGn(V$AP2.01(0.932))
(  30)                                +WTGCGTGGGCGKnnn(V$EGR1.01(0.810))
(  43)                                             +nnnnNNTGACGTGnnnnnnnn(V$ATF6.02(0.886))
(  47)                                                 +GNTGACGTGKNNNWT(V$XBP1.01(0.908))
(  45)                                               +ARTNMCYNCNGYSTCAGCWGNTn(V$BEL1.01(0.815))
(  45)                                               +nnnnnnnnRTGASTCAGCAnnnnnn(V$NFE2.01(0.884))
    1     CTGCGCCCTCCGGCCGCCGGTGGCCCTCTGTGCGGTGGGGGAAGGGGTCGACGTGGCTCA
(  34)                                    -nNNGGGGGNGGNNnn(V$ZBP89.01(0.931))
(  59)                                                             -nRNCGYRRTGCATKNTGGGWAAN(V$STAF.01(0.772))


(  84)                          +nNNGTGGGAAANNnn(V$RBPJK.02(0.949))
(  90)                                +nG.AAAGYGAAASYnnnnn(V$IRF2.01(0.805))
( 100)                                          +nnnNNAGKKCCAGGNNMGn(V$PAX6.02(0.955))
   61     GCTTTTTGGATTCAGGGAGCTCGGGGGTGGGAAGAGAGAAATGGAGTTCCAGGGGCGTAA
(  80)                      -nNNGGGGGNGGNNnn(V$ZBP89.01(0.966))
(  96)                                      -nNNWTATTGAYTTNN(V$HNF6.01(0.846))

For each matrix match the IUPAC consensus sequence of the matrix is displayed aligned with the sequence. Matches on the (+) strand are shown above the sequence, (-) strand matches are shown below the sequence.

If a sequence has a length of more than 5000bp, the map output is omitted.
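
If the map output is to be processed further, the match lines can be read with a regular expression. The following sketch is based only on the example lines shown above; the pattern, the function name and the interpretation of the number in parentheses as the printed match position are assumptions, and other output variants may need a different pattern.

```python
import re

# "(  30)   +WTGCGTGGGCGKnnn(V$EGR1.01(0.810))"
MAP_LINE = re.compile(
    r"\(\s*(\d+)\)\s*([+-])([A-Za-z.]+)\(([^()\s]+)\(([\d.]+)\)\)"
)


def parse_map_line(line):
    """Parse one match line of the map output; return None for other lines."""
    found = MAP_LINE.match(line.strip())
    if found is None:
        return None
    position, strand, consensus, matrix, similarity = found.groups()
    return {
        "position": int(position),
        "strand": strand,
        "consensus": consensus,
        "matrix": matrix,
        "mat_sim": float(similarity),
    }


print(parse_map_line("(  30)          +WTGCGTGGGCGKnnn(V$EGR1.01(0.810))"))
print(parse_map_line("    1     CTGCGCCCTCCGGCCGCCGGTGGCCCTCTGTGCGGTGG"))
```

Sequence lines and empty lines do not start with a position in parentheses and are therefore returned as None.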


Export of annotated sequence (GenBank format):

The transcription factor binding site matches identified by MatInspector can be exported to GenBank sequence files. MatInspector matches are annotated in the feature table with the feature key "misc_signal".

LOCUS       GXP_287091    766 bp    DNA
DEFINITION  loc=GXL_241328|sym=GCG|geneid=2641|acc=GXP_287091|
            taxid=9606|spec=Homo sapiens|chr=2|ctg=NC_000002|str=(-)|
            start=162716900|end=162717665|len=766|tss=663|
            descr=glucagon|
            comm=GXT_2817146/NM_002054/663/bronze;
             GXT_22755335/ENST00000375497/663/bronze
ACCESSION   GXP_287091
COMMENT     Matrix matches determined by MatInspector (Genomatix)
            Matrix Family Library Version 8.0 (November 2008)
FEATURES             Location/Qualifiers
     misc_signal     complement(14..30)
                     /note="V$FKHD/HNF3B.02, mat_sim: 0.912"
     misc_signal     20..36
                     /note="V$HNF1/HNF1.03, mat_sim: 0.806"
     misc_signal     complement(53..73)
                     /note="V$CART/RHOX6.01, mat_sim: 0.878"
     misc_signal     54..76
                     /note="V$LHXF/ISL2.01, mat_sim: 0.885"
BASE COUNT    268 a  135 c  149 g  214 t
ORIGIN
        1 AGCATCAGCT ATCTTGGATG TTTAATCTTC ATTTTGCTCC ATCCTTTCTG CCTGAATTCC
       61 ATTTATTAAA ACAGAACACA TAGGGGTTTA ATCAATATCC TTAAATTTTC CACAAACATA
      121 ACATAAATAA ACTCCACGTT GTGAGGAAGA GAGGATTTTT AATACATATG TGTTGAATGA
      181 ATGATCATTA TTTAGATAAA TGAATGACTG AAGTGATTGT TATATTCAGG TAAATTCATC
      241 ATGGCTAGGT AGCAAACCAA AGACTTGTAA GAACCTCAAA TGAGGACATG CACAAAACAG
      301 GGATGGCCAT GGGCTACGTA ATTTCAAGGT CTTTTGTCTT CAACGTCAAA ATTCACTTTA
      361 GAGAACTTAA GTGATTTTCA TGCGTGATTG AAAGTAGAAG GTGGATTTCC AAGCTGCTCT
      421 CTCCATTCCC AACCAAAAAA AAAAAAAAAA GATACAAGAG TGCATAAAAA GTTTCCAGGT
      481 CTCTAAGGTC TCTCACCCAA TATAAGCATA GAATGCAGAT GAGCAAAGTG AGTGGGAGAG
      541 GGAAGTCATT TGTAACAAAA ACTCATTATT TACAGATGAG AAATTTATAT TGTCAGCGTA
      601 ATATCTGTGA GGCTAAACAG AGCTGGAGAG TATATAAAAG CAGTGCGCCT TGGTGCAGAA
      661 GTACAGAGCT TAGGACACAG AGCACATCAA AAGTTCCCAA AGAGGGCTTG CTCTCTCTTC
      721 ACCTGCTCTG TTCTACAGCA CACTACCAGA AGGTAAGATG ATTATA
//

Note: Only the currently visible matches are exported, i.e. you can customize the list of matches for export by using the filter feature of the result table(s).
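
The exported "misc_signal" features can be read back with a few lines of code. The sketch below relies only on the layout of the example record shown above; reading the note as "family/matrix, mat_sim: value" is an inference from that example, and the function name is hypothetical. For general GenBank parsing a dedicated library is the more robust choice.

```python
import re

FEATURE = re.compile(
    r'misc_signal\s+(complement\()?(\d+)\.\.(\d+)\)?\s*\n'
    r'\s*/note="([^"/]+)/([^",]+), mat_sim: ([\d.]+)"'
)


def parse_matches(genbank_text):
    """Extract MatInspector matches from the feature table of a GenBank record."""
    matches = []
    for comp, start, end, family, matrix, sim in FEATURE.findall(genbank_text):
        matches.append({
            "family": family,
            "matrix": matrix,
            "start": int(start),
            "end": int(end),
            "strand": "-" if comp else "+",
            "mat_sim": float(sim),
        })
    return matches


record = '''FEATURES             Location/Qualifiers
     misc_signal     complement(14..30)
                     /note="V$FKHD/HNF3B.02, mat_sim: 0.912"
     misc_signal     20..36
                     /note="V$HNF1/HNF1.03, mat_sim: 0.806"
'''
for match in parse_matches(record):
    print(match)
```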


Export of matches to EXCEL™ format:

All information available in the result table (like matrix family name, position, tissue association and lines of evidence) can be exported to a file in Microsoft Excel™ format. You can save this file to your local disk and/or open it directly with Microsoft Excel™ (or other software tools supporting this format, like OpenOffice.org Calc).

Note: Only the currently visible matches are exported, i.e. you can customize the list of matches for export by using the filter feature of the result table(s).


Match Summary Table:

For each matrix (or matrix family) the Match Summary Table lists how many matches were found in total (Match Total), in how many sequences (Common to seq.) and how often it matches in each input sequence. Additionally, a significance value (see p-value) is given for each common TF site.
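
The counting columns of the Match Summary Table can be reproduced from a plain list of matches, as in the following sketch. The function name, the input format (pairs of sequence name and matrix family) and the family names in the example are hypothetical, and the p-value column is not computed here.

```python
from collections import Counter, defaultdict


def match_summary(matches):
    """Summarize matches per family: total, number of sequences, per-sequence counts.

    matches: iterable of (sequence_name, family_name) pairs
    """
    per_family = defaultdict(Counter)
    for sequence, family in matches:
        per_family[family][sequence] += 1
    return {
        family: {
            "match_total": sum(counts.values()),
            "common_to_seq": len(counts),
            "per_sequence": dict(counts),
        }
        for family, counts in per_family.items()
    }


example = [
    ("seq1", "V$FAMA"), ("seq1", "V$FAMA"), ("seq2", "V$FAMA"),
    ("seq2", "V$FAMB"),
]
for family, row in match_summary(example).items():
    print(family, row)
```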

Family Statistics

Clicking on one of the column headers will sort the table by this column (ascending or descending as depicted by the little yellow arrows).

The complete table can also be saved in Microsoft Excel™ format (as at most 30 sequences or columns can be displayed in the interactive table).


Lines of evidence explained

With MatInspector release 8.0 came a new main feature: the support of matrix matches by lines of evidence. The following lines of evidence are available:

Evidence | Description | Required settings
Known interaction

Example: GCG is regulated by V$GATA (TF: GATA4)

Each matrix family is associated with at least one transcription factor. If an input gene and one of these transcription factors are known to interact, the matrix family is marked as having this line of evidence.
The Gene<->TF interactions are Genomatix proprietary, expert-curated information based on literature analysis.
  • search for "Transcription factor binding sites" is active
  • the promoter input option is used
  • the input gene is from a vertebrate organism, yeast, fruitfly or thale cress
  • the matrix selection contains matrices of a matrix group corresponding to this organism, i.e. the "Vertebrates" section for a vertebrate organism, the "Insects" section for fruitfly, the "Fungi" section for yeast, the "Plants" section for thale cress
  • the lines of evidence option is selected
Known cocitation

Example: GCG is cocited with V$PAX (TF: PAX4, PAX6)

Each matrix family is associated with at least one transcription factor. If an input gene and one of these transcription factors are co-cited in PubMed, the matrix family is marked as having this line of evidence. The co-citation has to occur within the same sentence of a PubMed abstract.
The co-citation information is derived by automatic literature analysis (LitInspector).
Promoter module

Example: part of model match of NFAT_SORY_01

Depending on the matrices/matrix group selected for analysis with MatInspector, the input sequences are examined for occurrences of promoter modules of the Genomatix module library, or of your own user-defined modules (if you selected the "Check for user-defined models" option). Each matrix match which is also part of a model match is flagged as supporting this line of evidence.
  • search for "Transcription factor binding sites" is active
  • there must be a Module library corresponding to the selected matrix group, in particular the selection must contain vertebrate, plant, or user-defined matrices.
  • the lines of evidence option is selected
Constrained elements

Example: overlaps with constrained element

This evidence is listed if a transcription factor binding site match overlaps with a region of the human genome that is annotated as an evolutionarily constrained element (PI-elements, according to Kerstin Lindblad-Toh et al., Nature 478: 476-482, 2011).
The original constrained element data was pre-processed by Genomatix to overlap with promoters but not with exons.
ChIP-Seq data

Example: correlates with Chip-Seq peak of V$STAT (TF: STAT3)

Each matrix family is associated with at least one transcription factor. If a match of a matrix family is found to be overlapping with a ChIP-Seq region for the corresponding TF, this is listed as evidence.
For details on ChIP-Seq data from ENCODE and further references for other ChIP-Seq data compiled by Genomatix, please see this page.
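
Both the constrained element evidence and the ChIP-Seq evidence rest on the same basic test: does a match overlap with an annotated region? A minimal sketch of this interval test is given below. It assumes inclusive coordinates on the same sequence and uses hypothetical function names; how Genomatix stores and queries the regions is not described in this help page.

```python
def overlaps(match, region):
    """True if two inclusive (start, end) intervals share at least one position."""
    return match[0] <= region[1] and region[0] <= match[1]


def has_region_evidence(match, regions):
    """True if the match overlaps with any of the annotated regions."""
    return any(overlaps(match, region) for region in regions)


annotated_regions = [(100, 180), (400, 450)]  # e.g. constrained elements or peaks
print(has_region_evidence((170, 190), annotated_regions))
print(has_region_evidence((200, 220), annotated_regions))
```

For the ChIP-Seq line of evidence the regions would additionally have to belong to a transcription factor associated with the matrix family of the match.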


References

MatInspector is described in the following publications:

Reference for PLACE (included as IUPAC library into MatInspector):