
MatDefine: Definition of weight matrices



Introduction

MatDefine is a tool for fully automatic definition and evaluation of weight matrices from a set of short DNA sequences. The resulting weight matrix can be used by MatInspector to scan nucleic acid sequences for matches to the described binding site.

The quality of a matrix is estimated by a value for random expectation (RE-value), which is defined as the number of matches with high matrix similarity (>= 0.85) expected in a random sequence of 1000 bp. This RE-value is assigned to each matrix.

By default, the weight matrix is generated without any user interaction. A protocol describing the matrix definition process is delivered. The following steps are performed:

All default parameters can be changed. The following additional options are available:

In case of unsuitable input sequences, no matrix will be generated.


Input Data

Sequence or Matrix Input
Sequence Input
There are several ways to supply the sequence input data for MatDefine:
  • Choose from your previously uploaded sequences.
  • Enter the sequences directly into the form.
  • If your browser supports this option, an input file can be uploaded.

In all cases the following sequence formats are accepted:

The sequence file may contain at most 5000 sequences.

Matrix Input
Nucleotide distribution matrices can either be entered directly into the input form or uploaded from your computer.

The nucleotide distribution matrix consists of four lines. The first line has to contain the numbers of A at each position of the binding site; the 2nd, 3rd, and 4th lines contain the numbers of C, G, and T, respectively.
The matrix elements can be separated by any character (except '.').

The following is an example of a nucleotide distribution matrix:

 4  0  0  0  1  7  0 22  0  0  0  0  7  5  4  3  4
 5  4  3  9  8  2 22  0  0  0 22 15  5  4  7  4  4
 2  4  4  2  1  7  0  0  0  0  0  0  2  7  3  6  3
 2  7  9 10 12  6  0  0 22 22  0  7  8  6  8  5  1

In this example matrix elements are separated by blanks.
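
The input format described above can be read with a few lines of code. The following is a minimal sketch, not the parser used by MatDefine: the function name `parse_distribution_matrix` is hypothetical, and it assumes that every run of non-digit characters acts as a separator and that a '.' anywhere in a line is an error.

```python
import re

def parse_distribution_matrix(text):
    """Parse a four-line nucleotide distribution matrix (rows: A, C, G, T)."""
    rows = [line for line in text.splitlines() if line.strip()]
    if len(rows) != 4:
        raise ValueError(f"expected 4 non-empty lines, got {len(rows)}")
    matrix = {}
    for base, line in zip("ACGT", rows):
        if "." in line:
            # The help text excludes '.' as a separator.
            raise ValueError("'.' is not allowed in the matrix")
        # Assumption: any run of non-digit characters separates two counts.
        matrix[base] = [int(token) for token in re.findall(r"\d+", line)]
    if len({len(counts) for counts in matrix.values()}) != 1:
        raise ValueError("all four rows must have the same number of positions")
    return matrix

example = """
 4  0  0  0  1  7  0 22  0  0  0  0  7  5  4  3  4
 5  4  3  9  8  2 22  0  0  0 22 15  5  4  7  4  4
 2  4  4  2  1  7  0  0  0  0  0  0  2  7  3  6  3
 2  7  9 10 12  6  0  0 22 22  0  7  8  6  8  5  1
"""
matrix = parse_distribution_matrix(example)
print(len(matrix["A"]))  # number of positions in the binding site
```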

Parameters
Note: The following parameters are only relevant for sequence input files. If your input is a nucleotide distribution matrix these parameters will be ignored.
Strand optimization
The strand optimization option is useful if the orientation of the binding site is unknown. In this case both strands of each input sequence are checked, and either the "+" or the "-" strand is used for matrix definition.

If this option is selected, core-anchored alignment is used automatically.
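
To illustrate what checking both strands means, here is a minimal sketch. It is not MatDefine's implementation: `choose_strand` and the scoring function passed to it are hypothetical, and the example simply prefers the strand that contains a given core tuple.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the "-" strand of a DNA sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def choose_strand(seq, score):
    """Return ("+", seq) or ("-", reverse complement), whichever scores higher."""
    minus = reverse_complement(seq)
    if score(seq) >= score(minus):
        return "+", seq
    return "-", minus

# Hypothetical score: number of occurrences of an assumed core tuple.
core = "CCAT"
strand, oriented = choose_strand("TTATGGAA", lambda s: s.count(core))
print(strand, oriented)
```

In this example the core tuple occurs only on the reverse complement, so the "-" strand is selected.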

Alignment and Tuple Search
Core-anchored alignment

With core-anchored alignment, the best conserved core-tuple is selected and the alignment is anchored at the first position of the core-tuple in each sequence. The tuple selection algorithm is described in the CoreSearch paper.

The following parameters can be modified:

  • Length of core-tuple:
    The minimum and optimum length of the core-tuple to be searched in the input sequences is 4 bp. If a higher tuple length is selected and no common tuple can be found, smaller tuples are searched until a common tuple can be identified or the minimum tuple length is reached. If matrix generation fails using the input tuple length, the tuple length is increased up to the maximum tuple length of 8 bp until a matrix can be created.
    The core-tuple need not be identical in all sequences; it may contain a single mismatch at any position.

  • Minimum number of sequences containing tuple:
    This is the minimum percentage of sequences in which the core-tuple has to be present.
    Sequences for which no core-tuple can be found will be rejected.

  • QuickAlign:
    QuickAlign is an alignment without introducing gaps. It is about three times faster than the default alignment.
    QuickAlign is used automatically if the input consists of more than 100 sequences.
These parameters are hidden by default; they can be revealed in the input form.
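
The idea of a core-tuple that is present in a minimum percentage of sequences, with at most one mismatch, can be sketched as a brute-force search. This is only an illustration under stated assumptions, not the CoreSearch algorithm: the function names are hypothetical, candidates are ranked by the number of sequences matched (approximate matches first, exact matches as tie-breaker), and the fallback to shorter or longer tuples is left out.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def anchor_position(seq, core, max_mismatch=1):
    """Index of the first window matching `core` with at most one mismatch."""
    k = len(core)
    for i in range(len(seq) - k + 1):
        if mismatches(seq[i:i + k], core) <= max_mismatch:
            return i
    return None

def find_core_tuple(seqs, k=4, min_percent=80.0):
    """Return the best-covered k-tuple, or None if coverage is too low."""
    candidates = sorted({s[i:i + k] for s in seqs for i in range(len(s) - k + 1)})
    if not candidates:
        return None

    def coverage(core):
        approx = sum(anchor_position(s, core) is not None for s in seqs)
        exact = sum(core in s for s in seqs)
        return approx, exact

    best = max(candidates, key=coverage)
    if 100.0 * coverage(best)[0] / len(seqs) < min_percent:
        return None
    return best

seqs = ["AACCATGG", "TTCCATAA", "GGCCTTCC"]
core = find_core_tuple(seqs)
anchors = [anchor_position(s, core) for s in seqs]
print(core, anchors)
```

The anchor positions are what a core-anchored alignment would use to line up the sequences; the third sequence is anchored at a tuple with one mismatch.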
Unanchored alignment

Unanchored alignment means that the search for a common core sequence will be omitted. You should use this option if your sequences do not contain a highly conserved core sequence.

Matrix Creation
Cut off matrix ends

MatDefine automatically determines the appropriate length of the matrix by cutting off poorly conserved positions at both matrix ends. For example, if the input sequences differ greatly in length or contain sequence flanking the binding site, it is necessary to reduce the matrix length.
Turning off this feature only makes sense if you are sure that your input sequences are confined to the binding sites.
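
Cutting off poorly conserved ends can be pictured as trimming a list of per-position conservation scores from both sides. The sketch below is an assumption-laden illustration: the function name, the input scores, and the cut-off `min_ci` are hypothetical, and the criterion MatDefine actually applies is not documented here.

```python
def trim_matrix_ends(ci_values, min_ci):
    """Return (start, end) so that ci_values[start:end] has conserved ends."""
    start, end = 0, len(ci_values)
    # Drop poorly conserved positions from the left end.
    while start < end and ci_values[start] < min_ci:
        start += 1
    # Drop poorly conserved positions from the right end.
    while end > start and ci_values[end - 1] < min_ci:
        end -= 1
    return start, end

# Illustrative conservation values (0 to 100), not taken from a real matrix.
ci_values = [12.0, 18.5, 100.0, 64.0, 25.0, 100.0, 22.0, 9.0]
start, end = trim_matrix_ends(ci_values, min_ci=20.0)
print(start, end, ci_values[start:end])
```

Only the ends are trimmed; a weakly conserved position inside the matrix is kept.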

Remove identical sequences

Identical sequences (i.e. one sequence equals another sequence or is part of another sequence) can be removed to avoid a biased nucleotide distribution matrix. If the sequences differ in length, the shorter sequence will be removed.

Regardless of this option, MatDefine always identifies identical sequences in the output file.
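
The definition of identical sequences used here (equal to, or contained in, another sequence) can be sketched as follows. The function name and the example sequences are hypothetical; when two sequences are exactly equal, this sketch keeps the one that comes first in the input, which is an assumption.

```python
def remove_identical(seqs):
    """seqs: dict mapping name -> sequence.

    Returns (kept_names, removed), where removed is a list of
    (removed_name, identical_to_name) pairs.
    """
    # Longest first, so a shorter sequence is compared against longer ones.
    ordered = sorted(seqs.items(), key=lambda item: len(item[1]), reverse=True)
    kept, removed = [], []
    for name, seq in ordered:
        match = next((k for k, kept_seq in kept if seq in kept_seq), None)
        if match is None:
            kept.append((name, seq))
        else:
            removed.append((name, match))
    return [name for name, _ in kept], removed

seqs = {
    "A": "GCCCATATTTGG",
    "B": "CCATATTT",      # part of A
    "C": "GCCCATATTTGG",  # equal to A
    "D": "TTCCTTACATGG",
}
kept, removed = remove_identical(seqs)
print(kept, removed)
```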

Calculate optimized threshold

The optimized threshold of a weight matrix is the matrix similarity threshold that minimizes false positive matches when the matrix is used to scan sequences with MatInspector. It is defined such that at most 3 matches are found in 10,000 bp of non-regulatory test sequences (i.e. with the optimized threshold, no more than 3 false positives per 10,000 bp are found).

Since the calculation of the optimized threshold requires some computing time it can be omitted for test runs.
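
The definition of the optimized threshold can be turned into a small calculation. The sketch below assumes that the matrix similarities of all matches found in a set of non-regulatory test sequences are already available; the function name and the example scores are hypothetical, and the real procedure may differ in detail.

```python
def optimized_threshold(scores, length_bp, max_per_10kb=3):
    """Lowest observed similarity that allows at most 3 matches per 10,000 bp.

    scores: matrix similarities of all matches in the test sequences.
    length_bp: total length of the test sequences.
    Returns None if no observed score satisfies the limit (or scores is empty).
    """
    allowed = max_per_10kb * length_bp / 10_000
    for threshold in sorted(set(scores)):
        if sum(s >= threshold for s in scores) <= allowed:
            return threshold
    return None

# Hypothetical matches from scanning 20,000 bp: at most 6 matches are allowed.
scores = [0.70, 0.72, 0.75, 0.78, 0.80, 0.81, 0.86, 0.90]
print(optimized_threshold(scores, length_bp=20_000))
```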

Consistency Check
Minimum number of sequences

This is the minimum number of sequences required to define a matrix. If the input file contains fewer sequences, or fewer sequences remain after the rejection process, no matrix will be created.

Minimum matrix similarity

MatDefine generates a weight matrix which is consistent with its training set, i.e. all training sequences have to be identified by the resulting matrix. Therefore, only sequences that reach the minimum matrix similarity are included in the matrix; all other sequences are rejected.

Decreasing the minimum matrix similarity may lead to the inclusion of more sequences but can also affect the quality of the matrix.

Increasing the minimum matrix similarity may lead to rejection of more sequences. If too few sequences are retained, no matrix will be created (see minimum number of sequences).
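
How the two consistency parameters interact can be shown with a single filtering pass. This is a simplified sketch: the function name and both parameter values are hypothetical, the similarities are assumed to be given, and MatDefine may re-evaluate the matrix after sequences have been rejected.

```python
def consistency_check(similarities, min_similarity, min_sequences):
    """similarities: dict mapping sequence name -> matrix similarity.

    Returns (kept, rejected); kept is None if too few sequences remain,
    which corresponds to "no matrix will be created".
    """
    kept = {n: s for n, s in similarities.items() if s >= min_similarity}
    rejected = sorted(n for n in similarities if n not in kept)
    if len(kept) < min_sequences:
        return None, rejected
    return kept, rejected

similarities = {"seq1": 0.988, "seq2": 0.978, "seq3": 0.843, "seq4": 0.854}
kept, rejected = consistency_check(similarities, min_similarity=0.85, min_sequences=3)
print(sorted(kept), rejected)
```

Raising `min_sequences` to 4 in this example would leave too few sequences, so no matrix would be produced.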

Library Comparison
Selection of matrix groups

Here, you can select the matrix groups from the current MatInspector library with which the newly generated matrix should be compared. By default, all matrix groups including your user-defined matrices (if available) are selected.

Please note that the library comparison cannot be completely disabled. If you do not select at least one matrix group, the new matrix will be compared with all available matrix groups.

Check all sequences

If this option is set, all input sequences will be checked against the matrix groups selected above (using optimized matrix similarity threshold).
For each sequence you receive the list of matrix families with at least one match in this sequence.

Your email address
Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    With this option, the URL of the result is shown directly in your browser window.

    Warning: Please use this option only for analyses that can be completed in a short time.
    If the analysis takes longer than the timeout of the web server, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case the results will not be available; please restart the analysis using the option below "Send the URL of the result to".

  • Send the URL of the result via email
    With this option, an email with the URL of the results will be sent to the email address you provide when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!


Program Output

MatDefine creates a protocol detailing each step of the matrix generation, and the weight matrix which is used by MatInspector. The sequence logo of the matrix can be downloaded in various graphics formats, e.g. for inclusion in a scientific publication. The resulting matrix can be saved to your personal matrix library (user-defined library).

The protocol file contains


Example Output

Identical sequences

Sequence     Identical to
HSFOS        MMCFOS
XLACTIN5A    XLACTIN8A


Alignment

Core sequence: CCAT
Number of aligned sequences: 20
Number of rejected sequences: 0

Sequence Name  Position  Str.  Alignment               Matrix Similarity
MMTFEZIF2      4 - 23    (+)   CG CCAT ATAAGGAGCAGGAA  0.855
MMTFEZIF1      4 - 23    (+)   CG CCTT ATATGGAGTGGCCC  0.901
HSACTCA2       4 - 23    (+)   GA CCAA ATAAGGCAAGGTGG  0.865
XLACTCAG3      4 - 23    (+)   TA CCAA ATAAGGGCAGGCTG  0.872
EBV            4 - 23    (+)   AG CCAT ATGTGGACAGATGG  0.941
GGACAREG1      4 - 23    (+)   CG CCTT CTTTGGGCAGCGCG  0.857
GGACAREG2      3 - 22    (+)   AC CCAA ATATGGCGACGGCC  0.904
HSACTBPR       3 - 22    (+)   GT CCTT ATATGGACTCATCT  0.914
HSVLC1         4 - 23    (+)   AT CCTT TTATGGCCCTGTCC  0.866
MMCYR61G       4 - 23    (+)   AC CCAA ATATGGAAATATTG  0.897
XLACTIN8A      4 - 23    (+)   GC CCAT ATTTGGCGATCTTC  0.955
XLACTIN5A      4 - 23    (+)   GC CCAT ATTTGGCGATCTTC  0.955
XLACTCAG1      3 - 22    (+)   AT CCCT ATTTGGCCATCCCT  0.926
HSACTCA3       3 - 22    (+)   CT CCCT ATTTGGCCATCCCC  0.940
HSACTCA4       3 - 22    (+)   TT CCTT ACATGGTCTGGGGG  0.843
XLACTCAG2      3 - 22    (+)   TT CCAT ACATGGGCTAAGGG  0.854
MMCFOS         3 - 22    (+)   GT CCAT ATTAGGACATCTGC  0.945
HSFOS          3 - 22    (+)   GT CCAT ATTAGGACATCTGC  0.945
MMKROX1        3 - 22    (+)   GT CCAT ATATGGGCAGCGAC  0.978
MMTFEZIF       3 - 22    (+)   TC CCAT ATATGGCCATGTAC  0.988


Additional information



Your new matrix:

Matrix U$srf
Matrix Name: U$srf
Description: not yet available
Family: U$NO_NAME
References: ---
Statistical Basis: 20 sequences
Random Expectation (re-value): 0.02 matches per 1000 bp
Promoter Matches: not available
Optimized Matrix Threshold: 0.77
Length: 21 bp
Nucleotide Distribution Matrix:
Pos.      1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20
A         5     2     0     0    13     4    18     0    12     5     0     0     7     2    14     2     4     0     3     1
C         4     5    20    20     2     0     1     2     0     0     0     0     8    13     2     2     8     4     7    10
G         7     4     0     0     0     0     0     0     1     0    20    20     4     5     0     7     8     6     6     7
T         4     9     0     0     5    16     1    18     7    15     0     0     1     0     4     9     0    10     4     2
IUPAC     N     N     C     C     A     T     A     T     W     T     G     G     N     C     A     K     S     K     N     S
Ci     15.6  21.8 100.0 100.0  46.8  68.9  75.5  79.8  48.8  65.1 100.0 100.0  25.1  46.8  50.2  26.2  34.5  36.0  17.0  32.0
Profile: [graphical profile, vertical axis from 25.0 to 100.0]

IUPAC:
n n C C A T a t w t g g n c a k s k n s n
  • Basepairs marked red show a high information content, i.e. the matrix exhibits a high conservation (ci-value > 60) at this position.
  • Basepairs in capital letters denote the core sequence used by MatInspector.
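
The ci-values in the matrix above can be recomputed from the nucleotide counts. The sketch below assumes the Ci definition published for MatInspector, Ci = 100/ln 5 * (sum over bases of p * ln p + ln 5), where p is the relative frequency of a base at the position; this formula is not stated on this page, but it reproduces the Ci row of the example. The function name is hypothetical.

```python
import math

def ci_value(counts):
    """Conservation index (0 to 100) of one matrix position from A/C/G/T counts."""
    total = sum(counts)
    # Bases with a count of 0 contribute nothing (p * ln p tends to 0).
    plogp = sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 100.0 / math.log(5) * (plogp + math.log(5))

# Nucleotide distribution matrix of the example output above.
matrix = {
    "A": [5, 2, 0, 0, 13, 4, 18, 0, 12, 5, 0, 0, 7, 2, 14, 2, 4, 0, 3, 1],
    "C": [4, 5, 20, 20, 2, 0, 1, 2, 0, 0, 0, 0, 8, 13, 2, 2, 8, 4, 7, 10],
    "G": [7, 4, 0, 0, 0, 0, 0, 0, 1, 0, 20, 20, 4, 5, 0, 7, 8, 6, 6, 7],
    "T": [4, 9, 0, 0, 5, 16, 1, 18, 7, 15, 0, 0, 1, 0, 4, 9, 0, 10, 4, 2],
}
columns = zip(matrix["A"], matrix["C"], matrix["G"], matrix["T"])
ci = [round(ci_value(col), 1) for col in columns]

# Positions (1-based) with high conservation, i.e. ci-value > 60.
high = [pos for pos, value in enumerate(ci, start=1) if value > 60]
print(ci[:6])
print(high)
```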
Sequence Logo:
[Image: sequence logo for matrix]

Download this logo as: png, pdf or eps


Save Matrix

If you want to save the resulting matrix to your personal library, some additional information has to be entered:

Matrix Identification
The matrix identification consists of
  • a short name identifying the matrix (matrix name)
    and
  • a more detailed description of the binding site (short description).
Family Information
Each matrix belongs to a so-called matrix family, where functionally similar matrices are grouped together in order to eliminate redundant matches by MatInspector.

You can

  • create a new family for your matrix by entering the name of the family
    or
  • include your matrix into an existing user-defined matrix family by selecting a family from the list of available matrix families.
Extra Information / References
Here you can enter further information which will be stored in the References field of the matrix.


References

MatDefine is described in: