Frequently asked scientific questions


1. General questions

  1. Which elements may be involved in regulation of gene transcription?

    Control of gene transcription is commonly used in biological systems to regulate protein expression. Transcriptional regulation in eukaryotes depends upon a series of complex signal transduction networks that ultimately control gene promoter activity via cis-acting elements like enhancers, matrix attachment regions (MARs), locus control regions (LCRs), and trans-acting elements (transcription factors).

  2. What is the difference between all ElDorado versions?

    Differences are for example the available genomes, the genome builds, and the transcripts that have been mapped to the genomic sequences. For details, please see the "Available Genomes" help page.

2. Promoter identification and extraction

  1. How can I identify the promoter sequence of a gene?

    Polymerase II promoters are generally defined as the region of a few hundred base pairs located directly upstream of the transcription start site. (More distal regions and parts of the 5' UTR may also contain regulatory elements and may be part of the promoter.) Therefore, identifying the transcription start site directly gives the location of the promoter of a gene.

  2. What is the length of a promoter?

    Polymerase II promoters are generally defined as the region of a few hundred base pairs located directly upstream of the transcription start site. More distal regions and parts of the 5' UTR may also contain regulatory elements and may be part of the promoter. The exact length of a promoter can often only be defined experimentally. In ElDorado, promoter sequences are defined as 500 bp upstream of the transcription start site and 100 bp downstream of the transcription start site.
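
    The 500 bp / 100 bp definition above can be turned into genomic coordinates with a few lines of code. The following is a minimal sketch, not ElDorado's implementation: the function name `promoter_region`, the 1-based inclusive coordinates, and the choice to count the transcription start site itself as a separate position are all assumptions made for illustration.

```python
def promoter_region(tss, strand, upstream=500, downstream=100):
    """Return (start, end) of a promoter window around a TSS.

    Assumptions of this sketch (not taken from ElDorado itself):
    - tss is a 1-based genomic position,
    - the returned interval is 1-based and inclusive,
    - the TSS base is counted in addition to the flanks.
    """
    if strand == "+":
        # Upstream means smaller coordinates on the (+)-strand.
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the (-)-strand, upstream means larger coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Do not run off the beginning of the chromosome.
    return max(1, start), end


print(promoter_region(1000, "+"))  # (500, 1100)
print(promoter_region(1000, "-"))  # (900, 1500)
```

    Note that the window is mirrored for genes on the (-)-strand, so the extracted sequence has to be reverse-complemented to read 5'-->3' relative to the gene.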

  3. I have the complete coding sequence (CDS) of a gene. Can I use this information to identify the promoter?

    The start of the CDS corresponds only to the translation start site and gives no indication of the location of the promoter. Eukaryotic genes usually have 5' untranslated regions (5' UTRs) of variable length, ranging from a few base pairs up to several kb. The 5' UTR may be split over several exons. Only the identification of the transcription start site defines the location of the promoter.
    However, if your gene is annotated within a genome available in ElDorado you can use the CDS nucleotide sequence to map it onto the genome sequence. This will directly provide you with information about a promoter for that gene.

  4. What is the exact meaning of gold, silver, and bronze transcripts?

  5. Shall I trust an extracted upstream region as a promoter (from a bronze transcript)?

    This is up to you and your knowledge about the mapped mRNA. The crucial question is whether the underlying cDNA sequence is 5' complete. Further evidence is the conservation of the promoter in orthologous loci. This information is given in the Comparative Genomics output of ElDorado.

3. Identification of transcription factor binding sites

  1. How do I search for transcription factor binding sites?

    MatInspector comes with a large library of weight matrices representing transcription factor binding sites which can be searched in DNA sequences. Start MatInspector from the Genomatix Suite main menu or choose the task "Pattern Search & Analysis -> MatInspector" in the navigation bar.

  2. I have a promoter sequence, but MatInspector did not find the TATA box!

    There are different possibilities:

  3. There is a known transcription factor binding site in my promoter, however, it is not found by MatInspector!

  4. MatInspector also found transcription factor binding sites on the (-)-strand of my promoter sequence. What is the difference between (+)- and (-)-strand matches?

    Transcription factors usually bind in a defined orientation to the DNA double helix. This orientation depends on the orientation of the DNA sequence they recognize, i.e. their transcription factor binding site. Exceptions are factors that recognize symmetric or palindromic sites. In this case the factor can in principle bind in either orientation.

    Some transcription factor binding sites must have a defined orientation relative to the promoter or the transcription start site; an example is the TATA-box. Most transcription factor binding sites can occur in both orientations in promoters or enhancers.

    Therefore, for the TATA-box only the (+)-strand matches should be considered as true positives (if the strand orientation of the promoter sequence analyzed is known and is in 5'-->3' orientation relative to the gene). For most other transcription factor binding sites, both (+)- and (-)-strand matches should be considered equally.

    In addition, there is a technical aspect that has to be considered. Transcription factor binding sites are represented by weight matrices (or IUPAC strings). Each matrix has a strand orientation which depends on the strand orientation of the training sequences used to generate it. Therefore, a matrix match on the (+)-strand only means that the matching sequence has the same strand orientation as the training sequences used for the matrix generation (and vice versa).
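
    The relationship between (+)- and (-)-strand matches can be illustrated with a small sketch. It uses exact string matching instead of weight matrices, which is a deliberate simplification, and the function names and 0-based positions are assumptions of this example rather than anything MatInspector exposes.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_site(sequence, site):
    """Report (position, strand) for exact occurrences of `site`.

    A (+) hit means the sequence contains the site as given; a (-) hit
    means it contains the reverse complement, i.e. the site lies on the
    opposite strand. Positions are 0-based offsets on the input strand.
    """
    hits = []
    rc_site = reverse_complement(site)
    for i in range(len(sequence) - len(site) + 1):
        window = sequence[i:i + len(site)]
        if window == site:
            hits.append((i, "+"))
        if window == rc_site:
            hits.append((i, "-"))
    return hits


print(find_site("AAGATAAGG", "GATAA"))   # [(2, '+')]
print(find_site("CCTTATCTT", "GATAA"))   # [(2, '-')]
print(find_site("AGAATTCA", "GAATTC"))   # [(1, '+'), (1, '-')]
```

    The last call shows the palindromic case: the site equals its own reverse complement, so it is reported on both strands at the same position.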

  5. MatInspector found a transcription factor binding site with a matrix similarity of 0.81. Is this a good match?

    You have to compare the value of 0.81 to the optimized matrix similarity defined for the weight matrix that represents the transcription factor binding site. Highly specific or relatively long (more than 25 bp) matrices usually have a lower value for the optimized matrix similarity (e.g. 0.77) than less specific or shorter matrices (which may have an optimized matrix similarity of e.g. 0.93). Therefore, if your value of 0.81 is higher than the defined optimized matrix similarity you have a good match (and vice versa).
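
    The comparison described above is simple enough to write down directly. In this sketch the matrix names and the dictionary of thresholds are hypothetical; only the two threshold values (0.77 and 0.93) and the score 0.81 come from the example in the text. Treating a score exactly equal to the threshold as a good match is also an assumption.

```python
# Hypothetical per-matrix optimized thresholds (values from the text above).
OPTIMIZED_SIMILARITY = {
    "long_specific_matrix": 0.77,
    "short_unspecific_matrix": 0.93,
}


def is_good_match(matrix_name, matrix_similarity):
    """Compare a match score with the optimized threshold of its matrix."""
    return matrix_similarity >= OPTIMIZED_SIMILARITY[matrix_name]


print(is_good_match("long_specific_matrix", 0.81))     # True
print(is_good_match("short_unspecific_matrix", 0.81))  # False
```

    The same score of 0.81 is therefore a good match for one matrix and a poor match for another.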

  6. How are matrix families defined in MatInspector?

    One feature of the matrix library is the integration of individual matrices into matrix families. A family consists of matrices that represent similar DNA patterns or transcription factor binding sites with a similar biological function. The family concept leads to a significantly reduced output. Redundant matches are eliminated, because only the best match within a family is listed. If you are interested in individual sites for a factor select "matches to individual matrices".
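
    The effect of the family concept on the output can be sketched as follows. This is an illustration only: the record layout, the family and matrix names, and the rule that matches are redundant when they share family, start position, and strand are assumptions of the example, not a description of how MatInspector decides which matches overlap.

```python
def best_match_per_family(matches):
    """Keep only the highest-scoring match per (family, position, strand)."""
    best = {}
    for match in matches:
        key = (match["family"], match["position"], match["strand"])
        if key not in best or match["similarity"] > best[key]["similarity"]:
            best[key] = match
    return list(best.values())


# Hypothetical matches: two matrices of FAM_A hit the same site.
matches = [
    {"family": "FAM_A", "matrix": "A1", "position": 10, "strand": "+", "similarity": 0.85},
    {"family": "FAM_A", "matrix": "A2", "position": 10, "strand": "+", "similarity": 0.92},
    {"family": "FAM_B", "matrix": "B1", "position": 10, "strand": "+", "similarity": 0.80},
    {"family": "FAM_A", "matrix": "A1", "position": 42, "strand": "-", "similarity": 0.88},
]

for match in best_match_per_family(matches):
    print(match["family"], match["matrix"], match["position"], match["similarity"])
```

    Of the two FAM_A matches at position 10, only the higher-scoring one (matrix A2) remains, while the matches of other families and at other positions are untouched.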

  7. Why are there different matrices for the same factor? Which one should I use?

  8. I ran an analysis using the matrix family describing the binding sites of the TF I am interested in. The result does not contain any entry for this TF; only other matrices of this family are shown. Does this mean that no potential binding sites for my TF were found?

    The transcription factors assigned to the same family have very similar binding sites that cannot be distinguished by computational methods like MatInspector. It does not matter which individual matrix is given as best family match in the MatInspector output. All transcription factors assigned to the family can physically bind to the predicted site. Which TF of the family binds in vivo depends on the biological context, e.g. which alternative factors are present and available for binding and in which relative concentrations, or which other interacting factors are present and bind in the vicinity, stabilizing binding at the site of interest. For example, different GATA factors will bind interchangeably to the same sites but are present in different cell lineages.

    Details on MatInspector and the matrix family concept can be found in the publication

  9. What is the difference between core and matrix similarity?

    The matrix similarity is the score of the complete matrix match (the more important value); the core similarity is the score of the most highly conserved positions of a matrix match. Keep the default core similarity unless you are an expert and have a special case. Both thresholds have to be reached for a matrix match.

  10. I have several different binding sites (e.g. footprints, oligos) for a transcription factor. How can I define a pattern description for all these sites?

    Choose the task "Pattern Definition -> MatDefine" in the navigation bar of Genomatix Suite.

  11. I am trying to determine the potential in silico binding sites for my gene of interest and MatInspector found several hundred binding site matches. Are there any resources to help better determine functional binding sites, or at least to try to narrow down this list to more specific matches?

    The "Additional Line of Evidence" column of the MatInspector output is a good place to look for information about functionally significant transcription factor binding sites. Some additional information about this level of output, which includes known TF interactions, known co-citations, experimentally verified promoter modules, constrained elements, and ENCODE ChIPSeq data can be found under "Lines of evidence explained" in the MatInspector Help document.

    Cross-species comparisons of promoters (also known as promoter sets) can help to identify conserved regulatory elements which are more likely to be functional. This can be carried out via extraction of the promoters with Gene2Promoter, followed by analysis with DiAlign TF.

    Functionality is determined by the sequence context. If a binding site is part of a framework of two or more sites, there is stronger evidence that the individual sites may be functional. Such complex promoter models can be identified by analyzing promoters of similarly regulated genes or promoters of orthologous genes with FrameWorker.

  12. I have a consensus binding sequence for a transcription factor. I want to see if this consensus binding site is in the promoter of a handful of genes shown to be up/down-regulated in one of our experiments. What is the best way to go about this with your software?

    The best way is to first extract the promoter sequences of the genes you are interested in with Gene2Promoter. Enter the list of your genes (e.g. gene symbols or gene IDs) and save the promoter sequences to the project management of your account. Then use MatInspector to search for the consensus binding site in the extracted promoter sequences. In MatInspector, select the sequences you have previously saved and select "User-defined IUPAC string" as Library. On the following page you will be asked for the "User-defined IUPAC string". Enter your consensus binding sequence and press "Submit Query".
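
    To make clear what a search with an IUPAC string does, here is a minimal stand-alone sketch that is independent of MatInspector. The function names, the example pattern `WGATAR`, and the test sequence are chosen for illustration only; the IUPAC nucleotide codes themselves are standard.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}
# Complement of each IUPAC code (e.g. R = A/G pairs with Y = T/C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def iupac_to_regex(pattern):
    """Translate an IUPAC string into a regular expression."""
    return "".join(IUPAC[code] for code in pattern.upper())


def reverse_complement(pattern):
    """Reverse-complement an IUPAC string."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def search_iupac(sequence, pattern):
    """Return sorted (position, strand, matched_text) hits on both strands.

    Positions are 0-based offsets on the input strand. A lookahead is
    used so that overlapping occurrences are reported as well.
    """
    sequence = sequence.upper()
    hits = []
    for strand, pat in (("+", pattern), ("-", reverse_complement(pattern))):
        regex = re.compile("(?=(" + iupac_to_regex(pat) + "))")
        for m in regex.finditer(sequence):
            hits.append((m.start(), strand, m.group(1)))
    return sorted(hits)


print(search_iupac("CCAGATAAGGCTTATCTGG", "WGATAR"))
# [(2, '+', 'AGATAA'), (11, '-', 'TTATCT')]
```

    Unlike a weight matrix, an IUPAC string either matches or does not; there is no similarity score, so every position in the consensus carries the same weight.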

    An alternative to the IUPAC search for your consensus binding site would be to check whether the Genomatix matrix library contains a weight matrix description for your transcription factor and use this matrix/matrix family for the search. Enter the name of the transcription factor into the search input field of MatBase and restrict the search to "Matrix families". Then you will get the information which matrix family describes the binding sites of this TF. For the binding site search with MatInspector, use the default Library Selection and set the Matrix group parameter "continue with subset definition from selected groups". This way you are able to select individual matrix families for the binding site search.

4. Identification of promoter models

  1. What is comparative sequence analysis?

    Comparative sequence analysis is based on the hypothesis that functional sites are preserved during evolution and are therefore more highly conserved than the surrounding sequence. This provides an opportunity to identify the functional sites directly from a training data set of at least two (seven to twenty would be better) different but functionally related sequences. A suitable training data set may contain orthologous or homologous promoters or, for instance, promoters from co-regulated genes found by expression array analysis. Genomatix programs for comparative sequence analysis are Common TFs, Overrepresented TFBS, DiAlign TF, and FrameWorker. FrameWorker is able to generate sequence models that can be searched in other sequences with ModelInspector. All these programs are included in the "Gene Regulation" package.

  2. What can I do with a promoter model?

    A promoter model represents a framework of two or more conserved elements (e.g. transcription factor binding sites) with a defined distance (and strand orientation).
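
    As an illustration of what "a framework with a defined distance and strand orientation" means in practice, the following sketch checks a list of binding site matches against a two-element model. It is not the algorithm used by FrameWorker or ModelInspector: the element names, the tuple layout, and measuring distance between start positions are assumptions of the example, and models located on the opposite strand are ignored for brevity.

```python
def find_model_matches(sites, first, second, min_dist, max_dist):
    """Find pairs of sites that fit a two-element model.

    sites:  list of (name, position, strand) tuples.
    first:  (name, strand) of the upstream model element.
    second: (name, strand) of the downstream model element.
    The distance is measured between the start positions of the two
    elements and must lie in the range min_dist..max_dist (inclusive).
    """
    hits = []
    for name_a, pos_a, strand_a in sites:
        if (name_a, strand_a) != first:
            continue
        for name_b, pos_b, strand_b in sites:
            if (name_b, strand_b) != second:
                continue
            if min_dist <= pos_b - pos_a <= max_dist:
                hits.append((pos_a, pos_b))
    return hits


# Hypothetical binding site matches in one promoter.
sites = [
    ("ELEM_A", 100, "+"),
    ("ELEM_B", 130, "+"),  # right element, right distance, right strand
    ("ELEM_B", 300, "+"),  # too far away
    ("ELEM_B", 125, "-"),  # wrong strand
]

print(find_model_matches(sites, ("ELEM_A", "+"), ("ELEM_B", "+"), 20, 40))
# [(100, 130)]
```

    Requiring the correct combination of elements, distance, and strand is what makes a model far more selective than searching for the individual binding sites.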

  3. I only have a single promoter sequence. How can I perform a functional sequence analysis?

    If you have only one promoter sequence you can use MatInspector and ModelInspector to scan for matches to the transcription factor binding site matrix library and the promoter module library. However, if you are able to find additional orthologous or co-regulated promoters you can expand the training data set which opens the way to more sophisticated strategies like comparative sequence analysis.

  4. I have a promoter, a CDS, or expression array data. How can I identify other target genes?

    The basic hypothesis is that co-regulated promoters consist of a similar framework of two or more functionally conserved elements (e.g. transcription factor binding sites). One of our strategies to identify new target genes is the generation of models that specifically describe those frameworks.

    If biological data about transcription factors involved in regulation of the genes are available, you can directly generate models using the GEMS Launcher task "FastM: Definition of models". Alternatively, you can start with a training data set consisting of two or more co-regulated (or orthologous) promoters to perform the GEMS Launcher analysis "FrameWorker: Definition of common frameworks". (Note: if you start with a CDS or expression array data, you first have to identify the promoters of the genes.) The new model can be used to scan the databases for potential new target genes.

5. Design of mutation experiments

  1. How can I specifically delete/insert/modify a transcription factor binding site in my promoter sequence?

    Choose the GEMS Launcher task "SequenceShaper: Design of regulatory sequences".

6. Characterization of gene sets

  1. I want to export the GePS ranking results for my gene set. How can I do this?

    The detailed ranking results can be exported from the GeneRanker result. Click on the yellow X icon in the lower right corner of the GePS output and select "GeneRanker results" from the pop-up menu. A new window opens in which the GeneRanker results are displayed. From there, all ranking results can be downloaded (as an Excel file or in tab-delimited format).