
Matrix


Matrix overview

The matrix overview is what you will get if you select the 'single matrices' category for browsing. It lists all matrices included in MatBase in alphabetical order. The screenshot below shows the first few matrices of the list.

[Screenshot: matrix overview]

Information in the overview:

Matrix name: Lists the names of the matrices in each family. Clicking on the name will take you to a matrix result page.
Matrix information: Similar to the family information, 'Matrix information' is a short description of the specific transcription factor binding site that is found by the matrix.
RE: The re-value (random expectation) of the matrix, a statistical value explained in detail below.
opt: Lists the optimized threshold of the matrix used by MatInspector. Again, please see the explanation below.

Matrix result

The screenshot below shows the result for a single matrix. Depending on the amount of information available for a matrix, the page may look slightly different (e.g. if a matrix has been constructed directly from a weight matrix description instead of an alignment of sites).

[Screenshot: matrix result]


Matrix Name The MatInspector matrices have an identifier that begins with a prefix indicating one of the following seven groups:
  • vertebrates (V$)
  • insects (I$)
  • plants (P$)
  • fungi (F$)
  • nematodes (N$)
  • bacteria (B$)
  • general core promoter elements (O$)
followed by an acronym for the factor the matrix refers to, and a consecutive number discriminating between different matrices for the same factor. Thus, V$OCT1.02 indicates the second matrix for the vertebrate Oct-1 factor.
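The naming scheme can be illustrated with a short parsing sketch. The function name `parse_matrix_name` and the regular expression are assumptions made for illustration (in particular, that the acronym contains only letters, digits and underscores); they are not part of MatBase or MatInspector.

```python
import re

# Group prefixes as listed in the naming scheme above.
GROUPS = {
    "V": "vertebrates",
    "I": "insects",
    "P": "plants",
    "F": "fungi",
    "N": "nematodes",
    "B": "bacteria",
    "O": "general core promoter elements",
}

# Assumed pattern: group letter, '$', factor acronym, '.', consecutive number.
NAME_RE = re.compile(r"^([A-Z])\$([A-Za-z0-9_]+)\.(\d+)$")


def parse_matrix_name(name):
    """Split a matrix identifier such as 'V$OCT1.02' into its three parts."""
    match = NAME_RE.match(name)
    if match is None or match.group(1) not in GROUPS:
        raise ValueError(f"not a recognised matrix name: {name!r}")
    prefix, factor, number = match.groups()
    return {"group": GROUPS[prefix], "factor": factor, "number": int(number)}


print(parse_matrix_name("V$OCT1.02"))
```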
Description Further information for a matrix or matrix family.
Family The matrix family this matrix belongs to.

Clicking on the family name will take you to the 'family result'.

References

References for the original source of sequences/oligonucleotides or weight matrices used for the construction of the matrix with author, title and citation.

Clicking on a reference id will take you to the 'reference result'.

Random expectation (re-value) The re-value for each individual matrix gives an expectation value for the number of matches per 1,000 base pairs of random DNA sequence (that is, it indicates how well a matrix is defined).

Since there are binding sites that are biologically quite "loosely" defined, a high re-value is not necessarily a sign of a "bad" matrix description. A very low re-value might even be a sign of a description that is too strict.
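Because the re-value is expressed per 1,000 base pairs, it can be scaled to a sequence of any length. The sketch below assumes simple linear scaling and uses hypothetical function names and made-up numbers; it is not how MatBase derives the value.

```python
def expected_random_matches(re_value, length_bp):
    """Expected number of chance matches in random DNA of the given length.

    re_value is the expectation per 1,000 bp; linear scaling is assumed.
    """
    return re_value * length_bp / 1000.0


def matches_per_1000bp(match_count, total_bp):
    """Inverse direction: normalise an observed match count to 1,000 bp."""
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return 1000.0 * match_count / total_bp


# Made-up numbers for illustration: a matrix with re-value 0.5
# scanned against 2,000 bp of random sequence.
print(expected_random_matches(0.5, 2000))  # 1.0
print(matches_per_1000bp(3, 6000))         # 0.5
```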

Promoter matches The value given is the percentage of promoters in which a match to the matrix is found with optimized matrix similarity. In order to determine the promoter matches, promoter sequences are extracted from ElDorado. The following promoter sequences are scanned for the different matrix groups:

Starting with MatBase 10.0

  • Vertebrates: 375,000 human, mouse, and rat promoter sequences with an average length of 1184 bp
  • General Core Promoter Elements: 375,000 human, mouse, and rat promoter sequences with an average length of 1184 bp
  • Plants: 82,000 promoter sequences of Arabidopsis thaliana and rice with an average length of 1159 bp
  • Insects: 21,000 promoter sequences of Drosophila melanogaster with an average length of 1120 bp
  • Fungi: 11,800 yeast promoter sequences with an average length of 1105 bp

Up to MatBase 9.4

  • Vertebrates: 366,000 human, mouse, and rat promoter sequences with an average length of 661 bp
  • General Core Promoter Elements: 366,000 human, mouse, and rat promoter sequences with an average length of 661 bp
  • Plants: 70,000 promoter sequences of Arabidopsis thaliana and rice with an average length of 625 bp
  • Insects: 21,000 promoter sequences of Drosophila melanogaster with an average length of 617 bp
  • Fungi: 11,800 yeast promoter sequences with an average length of 603 bp

Matrix matches This table contains the absolute number of matches and the number of matches per 1,000 base pairs of the matrix in the genome and promoter sequences for each species listed. Please note that for some species only the numbers for promoters are given, as there is no completely assembled genome yet.
Optimized matrix threshold The optimized matrix similarity is the threshold at which a minimum number of matches is found in non-regulatory test sequences (i.e. with this matrix similarity the number of false positive matches is minimized).

This matrix similarity is used when the user checks "Optimized" as the matrix similarity threshold for MatInspector.

Length Length of the matrix in base pairs. All matrices in a family are of the same (odd) length.
Nucleotide Distribution Matrix The nucleotide distribution matrix shows the nucleotide frequencies observed in aligned binding sites of the corresponding transcription factor.
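As a minimal sketch of how such a distribution can be tabulated, the following counts each nucleotide per position in a set of equal-length aligned sites. The function name and the example sites are made up for illustration, and gap handling is left out.

```python
from collections import Counter


def nucleotide_distribution(sites):
    """Return per-position counts of A, C, G and T for aligned sites."""
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("aligned sites must have equal length")
    columns = []
    for i in range(length):
        counts = Counter(site[i].upper() for site in sites)
        columns.append({base: counts.get(base, 0) for base in "ACGT"})
    return columns


# Made-up example sites, not taken from MatBase.
example_sites = ["ATGCA", "ATGCT", "ATTCA", "CTGCA"]
for position, column in enumerate(nucleotide_distribution(example_sites), start=1):
    print(position, column)
```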
Profile The profile of a matrix is a graphical representation of the Ci-vector, i.e. the degree of conservation at each position of the matrix.

The IUPAC string consensus is a representation of the matrix based on the following rules (adapted from Cavener, Nucleic Acids Res. 15, 1353-1361, 1987):

  • A single nucleotide (A,C,G,T) is shown if its frequency is greater than 50% and at least twice as high as the second most frequent nucleotide.
  • A double-degenerate code (R,Y,K,M,S,W) indicates that the corresponding two nucleotides occur in more than 75% of the underlying sequences but each of them is present in less than 50%.
  • A triple-degenerate code (B,D,H,V) is shown if only one of the nucleotides does not appear at all.
  • All other frequency distributions are represented by the letter "N".
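The four rules above can be sketched as a small function applied to each column of a nucleotide distribution matrix. This is a literal reading of the rules as stated on this page, using integer arithmetic on counts; the function names are hypothetical and the actual MatBase implementation may differ in edge cases.

```python
# Two-letter codes keyed by the pair of nucleotides they stand for.
DOUBLE = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AT"): "W",
}
# Three-letter codes keyed by the single nucleotide that is absent.
TRIPLE_BY_ABSENT = {"A": "B", "C": "D", "G": "H", "T": "V"}


def iupac_symbol(counts):
    """Map one column of counts ({'A': n, 'C': n, 'G': n, 'T': n}) to a code."""
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("empty column")
    ranked = sorted(((counts[base], base) for base in "ACGT"), reverse=True)
    (c1, b1), (c2, b2) = ranked[0], ranked[1]
    # Rule 1: frequency > 50% and at least twice the runner-up.
    if 2 * c1 > total and c1 >= 2 * c2:
        return b1
    # Rule 2: top two together > 75%, each of them below 50%.
    if 4 * (c1 + c2) > 3 * total and 2 * c1 < total and 2 * c2 < total:
        return DOUBLE[frozenset((b1, b2))]
    # Rule 3: exactly one nucleotide does not appear at all.
    absent = [base for base in "ACGT" if counts[base] == 0]
    if len(absent) == 1:
        return TRIPLE_BY_ABSENT[absent[0]]
    # Rule 4: everything else.
    return "N"


def iupac_consensus(columns):
    """Join the per-column codes into a consensus string."""
    return "".join(iupac_symbol(column) for column in columns)


# Made-up columns for illustration.
columns = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 4, "C": 0, "G": 4, "T": 2},
    {"A": 5, "C": 3, "G": 2, "T": 0},
    {"A": 3, "C": 3, "G": 2, "T": 2},
]
print(iupac_consensus(columns))  # ARVN
```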
Core The core sequence of a matrix is defined as the (usually 4) most highly conserved consecutive positions of the matrix.
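One way to locate such a core, given the conservation value of each position, is a sliding window. The sketch below assumes that "most highly conserved" means the window with the largest summed conservation and that ties go to the leftmost window; both are assumptions, and the function name and example values are made up.

```python
def find_core(conservation, core_length=4):
    """Return (start, end) of the most conserved window, end exclusive.

    conservation holds one value per matrix position, e.g. Ci-values.
    """
    if len(conservation) < core_length:
        raise ValueError("matrix is shorter than the requested core length")
    starts = range(len(conservation) - core_length + 1)
    # max() returns the first maximal element, so ties go to the leftmost window.
    best = max(starts, key=lambda s: sum(conservation[s:s + core_length]))
    return best, best + core_length


# Made-up conservation values for a matrix of length 9.
example = [20, 35, 90, 100, 100, 85, 40, 10, 30]
print(find_core(example))  # (2, 6)
```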
Sequence logo A graphical representation of the matrix consensus, generated using the algorithm described in:
  • Crooks GE, Hon G, Chandonia JM, Brenner SE (2004).
    WebLogo: A sequence logo generator.
    Genome Research 14, 1188-90
  • Schneider TD, Stephens RM (1990).
    Sequence Logos: A New Way to Display Consensus Sequences.
    Nucleic Acids Res. 18, 6097-100
The WebLogo source code is available at http://weblogo.threeplusone.com/
Statistical basis The number of sites the matrix is based on.
Sites used to build the matrix

This shows the alignment of the sites that have been used to construct the matrix. It shows the names of the sites, the alignment, the matrix similarity score for each site and the reference(s) for the site. The matrix is built from the middle part of the alignment; any leading or trailing nucleotides are discarded in the process.

Clicking on a site name will take you to the 'site result'. Clicking on a reference id will take you to the 'reference result'.

Sites rejected during matrix definition

These are sites that have been published as binding sites for a transcription factor but didn't fit into the overall alignment during matrix construction. To obtain a specific weight matrix description, these sites are left out. However, they are listed here for completeness.

Clicking on a site name will take you to the 'site result'. Clicking on a reference id will take you to the 'reference result'.

Identical sites (not used for matrix generation)

These are sites with a sequence identical to one that has already been used in the alignment for the matrix. These sites wouldn't add any information to the matrix and are therefore only listed for completeness.

Clicking on a site name will take you to the 'site result'. Clicking on a reference id will take you to the 'reference result'.

Ci-vector The Ci-vector (consensus index vector) for the matrix represents the degree of conservation of each position within the matrix. The maximum Ci-value of 100 is reached by a position with total conservation of one nucleotide, whereas the minimum value of 0 only occurs at a position with equal distribution of all four nucleotides and gaps.
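The exact formula is not given on this page. The sketch below assumes an entropy-based index over five symbols (A, C, G, T and gap), which reproduces the two bounds described above: 100 for a fully conserved position and 0 for an equal distribution of all five symbols. The function names and the gap symbol "-" are illustrative choices.

```python
import math

SYMBOLS = "ACGT-"  # four nucleotides plus the gap symbol (assumed to be '-')


def ci_value(counts):
    """Conservation of one position on a 0..100 scale.

    Assumed formula: 100 * (1 + sum(p * ln p) / ln 5), where p is the
    relative frequency of each of the five symbols at this position.
    """
    total = sum(counts.get(symbol, 0) for symbol in SYMBOLS)
    if total == 0:
        raise ValueError("empty position")
    entropy_term = 0.0
    for symbol in SYMBOLS:
        p = counts.get(symbol, 0) / total
        if p > 0:  # p * ln(p) tends to 0 as p tends to 0
            entropy_term += p * math.log(p)
    return 100.0 * (1.0 + entropy_term / math.log(5))


def ci_vector(columns):
    """Apply ci_value to every position of a matrix."""
    return [ci_value(column) for column in columns]


# Made-up positions: fully conserved, evenly spread, and mixed.
print(round(ci_value({"A": 10}), 2))
print(round(abs(ci_value({"A": 2, "C": 2, "G": 2, "T": 2, "-": 2})), 2))
print(round(ci_value({"A": 7, "C": 1, "G": 1, "T": 1}), 2))
```

Under this assumption, a position with an even split among the four nucleotides but no gaps scores slightly above 0, because the gap symbol still counts as a fifth possible state.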
