Control of gene transcription is commonly used in biological systems to regulate protein expression. Transcriptional regulation in eukaryotes depends upon a series of complex signal transduction networks that ultimately control gene promoter activity via cis-acting elements like enhancers, matrix attachment regions (MARs), locus control regions (LCRs), and trans-acting elements (transcription factors).
Polymerase II promoters are generally defined as the region of a few hundred base pairs located directly upstream of the site of initiation of transcription. (More distal regions and parts of the 5' UTR may also contain regulatory elements and may be part of the promoter.) Therefore, identification of the transcription start site directly leads to the location of the promoter of a gene.
Polymerase II promoters are generally defined as the region of a few hundred base pairs located directly upstream of the site of initiation of transcription. More distal regions and parts of the 5' UTR may also contain regulatory elements and may be part of the promoter. The exact length of a promoter can often only be defined experimentally. However, for an initial in silico analysis it may be sufficient (and also necessary) to restrict the region to about 300 to 1000 bp upstream of the transcription start site.
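As a rough illustration of restricting the analysis to a fixed window upstream of the transcription start site (TSS), the following Python sketch cuts such a window out of a genomic sequence. The function names, the 0-based coordinate convention, and the default window of 500 bp are assumptions made for this example and are not part of any Genomatix tool.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def upstream_region(genome: str, tss: int, strand: str, length: int = 500) -> str:
    """Return up to `length` bp directly upstream of the TSS.

    Assumptions of this sketch: `genome` is the (+)-strand sequence, `tss`
    is the 0-based index of the first transcribed base, and the result is
    reported in 5'->3' orientation relative to the gene.
    """
    if strand == "+":
        start = max(0, tss - length)
        return genome[start:tss]
    if strand == "-":
        # For a (-)-strand gene, "upstream" lies at higher (+)-strand coordinates.
        end = min(len(genome), tss + 1 + length)
        return reverse_complement(genome[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


# Toy example with a 16 bp "genome"
genome = "AAAACCCCGGGGTTTT"
print(upstream_region(genome, 8, "+", 4))
print(upstream_region(genome, 7, "-", 4))
```

The window is clipped at the sequence ends, so a TSS close to a contig boundary yields a shorter region instead of an error.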
The start of the CDS only corresponds to the translation start site and gives no hint about the location of the promoter. Eukaryotic genes usually have 5' untranslated regions (5' UTRs) of variable length, ranging from a few base pairs up to several kb. The 5' UTR may be split over several exons. Only the identification of the transcription start site defines the location of the promoter.
However, if your gene is annotated within the human genome sequence, you can use the CDS nucleotide sequence to map it onto the genome sequence with ElDorado. This will directly provide you with information about a promoter for that gene.
Genomatix' PromoterInspector is a tool to predict promoter regions in unannotated genomic sequences. PromoterInspector has a high specificity of about 85 %. The sensitivity of PromoterInspector is about 50 % which means that the current version predicts about every second promoter in the genome. Therefore, your promoter may not be found.
PromoterInspector predicts the approximate location of a promoter region and not the exact location of the TSS. The predicted regions may contain the promoter or overlap with the promoter. The strand orientation of the predicted promoter region can only be derived from the location of the corresponding gene.
PromoterInspector predicts promoter regions by identification of the conserved promoter context independently of the occurrence of specific elements like CCAAT or TATA boxes.
To identify transcription factor binding sites in a promoter you can use GEMS Launcher as described in 4.1.
There are several reasons why this may occur. Mapping depends on the quality of the input sequence. For example, EST sequences may pose problems because they are derived from single sequencing runs, which may contain sequence errors or artifacts.
In addition, the assembly and annotation of the human genome is an ongoing process. Your gene may not yet be contained within the currently assembled sequences.
There are indeed cases where overlapping transcripts exist on both DNA strands. You must have additional knowledge about your input to assign it to a transcript. However, there are also cases where a transcript was mislocated by automated genome annotation. Many of the automated locus annotations (LOC ... entries) still await qualified human review.
The ElDorado output gives an overview of the region where a mapping for your input was found. This includes annotated sequences that were found within 5 kb upstream and downstream of the annotated sequence to which the input was mapped. It may be that your input mapped to a provisional gene annotation (i.e. locus ID with LOC... signature) or to a region without annotation with a well annotated gene nearby.
ElDorado can provide additional information for the annotated gene but not for a provisional result from automated genome annotation.
Generally, it is advisable that before starting additional analysis you go to the ElDorado graphical output and identify your input in that map.
This depends on several factors: the current state of the human genome sequence and its annotation, the knowledge about your input sequence, and the quality of your input sequences. If Gene2Promoter finds a promoter for your input it also gives you a quality assessment for it (see question 3.5).
gold = experimentally verified (transcript derived from mapping of full length cDNAs)
silver = supported by PromoterInspector prediction at the 5' end of the transcript
bronze = transcripts without additional evidence for their completeness
This is up to you and your knowledge about the mapped mRNA. The crucial question is whether the underlying cDNA sequence is 5' complete. Many entries in EMBL/GenBank only focus on coding regions and are probably 5' incomplete.
This depends on your knowledge about the mapped input sequence. If you map a known gene you should almost always find the gene name in the output. If you used an unknown sequence fragment, first look for named genes in the output and see how those genes fit your initial questions or experimental conditions. The Gene2Promoter output has links to ElDorado where you can get an overview of, and additional information about, the loci where your input was mapped.
The sequencing of the human genome is not yet complete. There are still gaps within the assembled sequences, which are annotated as runs of 100 Ns. Gene2Promoter avoids extracting those runs of Ns and only extracts putative promoters up to that point. You should also get a warning telling you that promoter extraction was incomplete.
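The truncation at assembly gaps can be pictured with a short Python sketch. It is not Gene2Promoter's actual code: the function name `trim_at_gap` and the returned "incomplete" flag are hypothetical, and the sketch assumes the upstream region is given in 5'->3' orientation with the transcription start site at its 3' end.

```python
import re


def trim_at_gap(upstream: str, min_run: int = 100):
    """Cut an upstream region at the gap closest to the TSS.

    `upstream` is assumed to end at the TSS. If it contains a run of at
    least `min_run` Ns, only the part between the last such run and the
    TSS is kept. Returns (sequence, incomplete), where `incomplete` is
    True when the region had to be shortened.
    """
    gaps = list(re.finditer(r"[Nn]{%d,}" % min_run, upstream))
    if not gaps:
        return upstream, False
    last_gap = gaps[-1]
    return upstream[last_gap.end():], True


# Toy example: a gap of 100 Ns sits between two short sequence stretches.
region = "ACGT" + "N" * 100 + "GGCC"
sequence, incomplete = trim_at_gap(region)
if incomplete:
    print("warning: promoter extraction was incomplete")
print(sequence)
```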
ESTs are derived from mRNAs. However, ESTs are generated by high throughput methods and the sequence information obtained is often erroneous. It may happen that the EST contains too many errors to allow a quality alignment to a genomic sequence. It may also happen that an EST does not overlap with a known annotated transcript. In that case Gene2Promoter cannot extract a promoter.
GEMS Launcher comes with a large library of weight matrices representing transcription factor binding sites which can be searched in DNA sequences. Choose the task "Search for TF sites" in the GEMS Launcher category "Analyze your sequences".
There are different possibilities:
Transcription factors usually bind in a defined orientation to the DNA double helix. This orientation depends on the orientation of the DNA sequence they recognize, i.e. their transcription factor binding site. Exceptions are factors that recognize symmetric or palindromic sites. In this case the factor can, in principle, bind in both orientations.
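Whether a site is palindromic in the DNA sense can be checked by comparing it with its own reverse complement. The following Python sketch illustrates this; the helper names are chosen for this example, and the CRE consensus TGACGTCA is used only as a well-known palindromic site.

```python
def reverse_complement(site: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    table = str.maketrans("ACGT", "TGCA")
    return site.upper().translate(table)[::-1]


def is_palindromic(site: str) -> bool:
    """A site is palindromic if it reads the same on both strands."""
    return site.upper() == reverse_complement(site)


print(is_palindromic("TGACGTCA"))  # CRE consensus, identical on both strands
print(is_palindromic("TATAAA"))    # TATA-like sequence, strand specific
```

For a palindromic site a match on the (+)-strand and a match on the (-)-strand describe the same sequence, so strand information carries no extra meaning there.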
Some transcription factor binding sites must have a defined orientation relative to the promoter or the transcription start site; an example is the TATA-box. Most transcription factor binding sites can occur in both orientations in promoters or enhancers.
Therefore, for the TATA-box only the (+)-strand matches of GEMS Launcher should be considered as true positives (if the strand orientation of the promoter sequence analyzed is known and is in 5'-->3' orientation relative to the gene). For most other transcription factor binding sites, both (+)- and (-)-strand matches of GEMS Launcher should be considered equally.
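If matrix matches have been exported to a simple list, the strand rule can be applied as a post-processing filter. The sketch below is a hypothetical illustration: the `Match` record, the name "TATA", and the set of orientation-dependent sites are assumptions of this example, not GEMS Launcher output fields. It also assumes that the promoter sequence is in 5'->3' orientation relative to the gene.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    name: str        # hypothetical label of the binding site
    position: int    # position of the match in the promoter sequence
    strand: str      # "+" or "-"


# Sites whose orientation relative to the TSS matters (illustrative set).
ORIENTATION_DEPENDENT = {"TATA"}


def filter_by_orientation(matches, orientation_dependent=ORIENTATION_DEPENDENT):
    """Keep (-)-strand matches only for sites that may occur in both orientations."""
    return [
        m for m in matches
        if m.strand == "+" or m.name not in orientation_dependent
    ]


matches = [
    Match("TATA", 470, "+"),
    Match("TATA", 120, "-"),   # dropped: wrong orientation for a TATA-box
    Match("SP1", 400, "-"),    # kept: both orientations are acceptable
]
for m in filter_by_orientation(matches):
    print(m.name, m.position, m.strand)
```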
In addition, there is a technical aspect that has to be considered. Transcription factor binding sites are represented by weight matrices (or IUPAC strings). Each matrix has a strand orientation which depends on the strand orientation of the training sequences used. Therefore, a matrix match on the (+)-strand only means that the matching sequence has the same strand orientation as the training sequences used for the matrix generation (and vice versa).
You have to compare the value of 0.81 to the optimized matrix similarity defined for the weight matrix that represents the transcription factor binding site. Highly specific or relatively long (more than 25 bp) matrices usually have a lower value for the optimized matrix similarity (e.g. 0.77) than less specific or shorter matrices (which may have an optimized matrix similarity of e.g. 0.93). Therefore, if your value of 0.81 is higher than the defined optimized matrix similarity you have a good match (and vice versa).
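The comparison can be written down as a small Python sketch. The matrix names and the table of optimized values are invented for illustration (only the numbers 0.77, 0.93 and 0.81 come from the text above), and treating a score equal to the threshold as a good match is an assumption of this example.

```python
# Optimized matrix similarity per matrix (hypothetical names, illustrative values).
OPTIMIZED_SIMILARITY = {
    "long_specific_matrix": 0.77,
    "short_generic_matrix": 0.93,
}


def is_good_match(matrix_name: str, matrix_sim: float,
                  optimized=OPTIMIZED_SIMILARITY) -> bool:
    """A match is 'good' if it reaches the matrix's own optimized threshold."""
    return matrix_sim >= optimized[matrix_name]


# The same score of 0.81 is judged differently for the two matrices.
for name in OPTIMIZED_SIMILARITY:
    print(name, is_good_match(name, 0.81))
```

The point of the example is that a matrix similarity has no absolute meaning; it has to be read against the threshold of the specific matrix.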
One feature of the matrix library is the integration of individual matrices into matrix families. A family consists of matrices that represent similar DNA patterns or transcription factor binding sites with a similar biological function. The family concept leads to a significantly reduced output. Redundant matches are eliminated, because only the best match within a family is listed. If you are interested in individual sites for a factor select "matches to individual matrices".
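The effect of the family concept on the output can be sketched in Python as follows. This is only one plausible reading: the sketch assumes that "redundant" means matches of the same family that overlap in position, and the `Match` fields as well as the family and matrix names are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Match:
    family: str        # hypothetical matrix family name
    matrix: str        # hypothetical individual matrix name
    start: int
    end: int           # inclusive end position
    matrix_sim: float


def best_per_family(matches):
    """Among overlapping matches of one family, keep the highest scoring one."""
    kept = []
    ordered = sorted(matches, key=lambda m: (m.family, m.start))
    for _, group in groupby(ordered, key=lambda m: m.family):
        best, cluster_end = None, -1
        for m in group:
            if best is not None and m.start <= cluster_end:
                # Overlaps the current cluster: extend it, keep the better match.
                cluster_end = max(cluster_end, m.end)
                if m.matrix_sim > best.matrix_sim:
                    best = m
            else:
                # A new cluster starts: store the winner of the previous one.
                if best is not None:
                    kept.append(best)
                best, cluster_end = m, m.end
        if best is not None:
            kept.append(best)
    return sorted(kept, key=lambda m: m.start)


matches = [
    Match("FAM_A", "a1", 10, 20, 0.90),
    Match("FAM_A", "a2", 12, 22, 0.95),   # overlaps a1 and scores higher
    Match("FAM_A", "a1", 50, 60, 0.85),   # separate site of the same family
    Match("FAM_B", "b1", 11, 21, 0.80),   # different family, always kept
]
for m in best_per_family(matches):
    print(m.family, m.matrix, m.start, m.matrix_sim)
```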
The matrix similarity is the score of the complete matrix match (the more important value); the core similarity is the score of the most highly conserved positions of a matrix match (keep the default core similarities unless you have a special case and are an expert). Both thresholds have to be reached for a matrix match.
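The rule that both thresholds must be reached can be expressed as a two-step check. In the Python sketch below the function name and the default threshold values are assumptions for illustration only; in practice the values configured in the program apply.

```python
def passes_thresholds(core_sim: float, matrix_sim: float,
                      core_threshold: float = 0.75,
                      matrix_threshold: float = 0.80) -> bool:
    """Return True only if both the core and the matrix similarity
    reach their thresholds (default values are illustrative)."""
    if core_sim < core_threshold:
        # The most conserved positions do not fit: reject without
        # looking at the complete matrix.
        return False
    return matrix_sim >= matrix_threshold


print(passes_thresholds(core_sim=1.00, matrix_sim=0.85))  # both reached
print(passes_thresholds(core_sim=0.70, matrix_sim=0.95))  # core too low
print(passes_thresholds(core_sim=1.00, matrix_sim=0.70))  # matrix too low
```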
Choose GEMS Launcher task "Definition of weight matrices" in the category "Patterns/libraries".
The occurrence of a single transcription factor binding site found by GEMS Launcher does not give you any hint that this site may also be functional. Functionality is determined by the sequence context. If a binding site is part of a framework of two or more sites, there is stronger evidence that the individual sites may be functional. Therefore, identification of single matrix matches is usually only a first step in promoter analysis; the subsequent aim is the identification of more complex promoter models.
Comparative sequence analysis is based on the hypothesis that functional sites are preserved during evolution and are therefore more highly conserved than the surrounding sequence context. This provides an opportunity to identify the functional sites directly from a training data set of at least two (seven to twenty would be better) different but functionally related sequences. A suitable training data set may contain orthologous or homologous promoters or, for instance, promoters from co-regulated genes found by expression array analysis.
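A very crude first step of such a comparison is to list the binding site families that occur in all (or most) sequences of the training set. The Python sketch below shows only this step and ignores order, distance and strand, which a real framework analysis would take into account. The function name, the `min_fraction` parameter and the family names are assumptions of this example.

```python
from collections import Counter


def shared_families(families_per_sequence, min_fraction: float = 1.0):
    """Return the families found in at least `min_fraction` of the sequences.

    `families_per_sequence` holds one collection of family names per
    promoter in the training set (at least two sequences are required).
    """
    n = len(families_per_sequence)
    if n < 2:
        raise ValueError("a training set needs at least two sequences")
    # Count each family once per sequence, however often it matches there.
    counts = Counter(f for fams in families_per_sequence for f in set(fams))
    return sorted(f for f, c in counts.items() if c / n >= min_fraction)


training_set = [
    {"FAM_A", "FAM_B"},
    {"FAM_A", "FAM_C"},
    {"FAM_A", "FAM_B"},
]
print(shared_families(training_set))                    # present in all sequences
print(shared_families(training_set, min_fraction=0.6))  # present in most sequences
```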
A promoter model represents a framework of two or more conserved elements (e.g. transcription factor binding sites) with a defined distance (and strand orientation).
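To make the idea of a framework concrete, the following Python sketch describes the smallest possible model, two elements with fixed strands and an allowed distance range, and searches a list of matrix matches for it. All class, field and family names are hypothetical; this is not the model format used by GEMS Launcher, and the sketch does not handle the mirrored arrangement on the opposite strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    family: str      # hypothetical family name of the matched site
    position: int    # position of the match in the sequence
    strand: str      # "+" or "-"


@dataclass(frozen=True)
class TwoElementModel:
    first: str           # family of the upstream element
    first_strand: str
    second: str          # family of the downstream element
    second_strand: str
    min_dist: int        # allowed distance range between the two
    max_dist: int        # elements, in bp (second minus first)


def find_model_matches(matches, model):
    """Return all pairs of matches that satisfy the model."""
    hits = []
    for a in matches:
        if a.family != model.first or a.strand != model.first_strand:
            continue
        for b in matches:
            if b.family != model.second or b.strand != model.second_strand:
                continue
            if model.min_dist <= b.position - a.position <= model.max_dist:
                hits.append((a, b))
    return hits


model = TwoElementModel("FAM_A", "+", "FAM_B", "-", min_dist=10, max_dist=30)
matches = [
    Match("FAM_A", 100, "+"),
    Match("FAM_B", 120, "-"),   # 20 bp downstream, correct strand: hit
    Match("FAM_B", 115, "+"),   # wrong strand
    Match("FAM_B", 200, "-"),   # too far away
]
for a, b in find_model_matches(matches, model):
    print(a.family, a.position, "->", b.family, b.position)
```

Requiring a partner site at a defined distance and orientation is what makes a model far more selective than a single matrix match.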
If you have only one promoter sequence you can use GEMS Launcher to scan for matches to the transcription factor binding site matrix library and the promoter module library. However, if you are able to find additional orthologous or co-regulated promoters you can expand the training data set which opens the way to more sophisticated strategies like comparative sequence analysis.
The basic hypothesis is that co-regulated promoters consist of a similar framework of two or more functionally conserved elements (e.g. transcription factor binding sites). One of our strategies to identify new target genes is the generation of models that specifically describe those frameworks.
If biological data about transcription factors involved in regulation of the genes are available you can directly generate models using GEMS Launcher task "Definition of models". Alternatively, you can start with a training data set consisting of two or more co-regulated (or orthologous) promoters to perform the GEMS Launcher analysis "Definition of common framework". (Note: if you start with a CDS or expression array data, you first have to identify the promoters of the genes.) The new model can be used to scan the databases for potential new target genes.
You can use the database entries to generate a training data set consisting of the gene promoters only if the promoters or the transcription start sites are annotated. How to identify promoters in unannotated sequences is described in 2.1.
No, a generic promoter model cannot exist for biological reasons. Regulation of gene expression is a highly specific task that requires highly specific promoters and, therefore, specific promoter models.
Choose the GEMS Launcher task "Design of regulatory sequences (SequenceShaper)".
© 1998-2010 Genomatix Software GmbH - All rights reserved