CoreSearch is a tool to define unknown common motifs in a set of unaligned DNA sequences. CoreSearch starts with a search for a highly conserved core sequence (called "tuple" in the original publication) which occurs in almost all of the input sequences. In most cases this initial search defines more than one core. Consecutive selection steps are employed in order to reduce the number of core candidates as soon as possible. The selection is based on maximization of the information content (consensus index), first of the core and then of regions around the core.
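
The selection criterion above can be illustrated with a small sketch. It assumes the standard relative-entropy definition of information content (in bits) per alignment column; the exact consensus index formula used by CoreSearch is not given on this page, so the function names and the formula here are illustrative assumptions only.

```python
from collections import Counter
from math import log2

# Assumption: equal a priori frequency for all four nucleotides.
UNIFORM = {base: 0.25 for base in "ACGT"}

def background_from_sequences(sequences):
    """Estimate a priori nucleotide frequencies from the input sequences."""
    counts = Counter(b for seq in sequences for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def column_information(column, background=UNIFORM):
    """Relative entropy (bits) of one alignment column against the background."""
    n = len(column)
    return sum(
        (count / n) * log2((count / n) / background[base])
        for base, count in Counter(column).items()
    )

def alignment_information(aligned, background=UNIFORM):
    """Information content for every column of equal-length aligned sequences."""
    return [column_information(col, background) for col in zip(*aligned)]
```

For example, `alignment_information(["CCAATCA", "CCAATGA", "CCAATCT"])` returns one value per core position. With a uniform background, fully conserved columns reach the maximum of 2 bits and variable columns score lower.
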
CoreSearch has been considerably improved compared to the version published in 1996. The most important improvement is the running time: the new CoreSearch version is 10 times faster than the old version.
New in CoreSearch 6.0 (April 2009):
CoreSearch accepts an unlimited number of input sequences (e.g. from ChIP-Seq experiments). If more than 250 sequences are uploaded to CoreSearch, a new fast algorithm is used to define the common motif. Defining a motif in about 2000 ChIP-Seq regions usually takes less than 1 minute.
The following steps are performed by the "large-scale" CoreSearch version:
| Sequence Input | |
|---|---|
| Choose from your previously uploaded sequences | Select a sequence file from the list of your personal sequence files. |
| or enter the formatted DNA sequence(s) | Enter your correctly formatted sequence(s) directly into the form, e.g. with copy and paste. The following formats are accepted: The sequence should contain only IUPAC characters; any other characters will be skipped! |
| or upload a file containing sequence(s) (max. 100 MB) | If your browser supports this option, a sequence file can be uploaded. If you use this option, the file should contain the sequence(s) in one of the following formats: Please note that the size of uploaded files is limited to 100 MB. If you want to analyze larger sequences, please contact support@genomatix.de. For whole chromosomes you can use the accession number option below (e.g. 'NC_000001' for human chromosome 1). |
| or enter accession number(s) | If you are interested in one or several special sequences from a database section, you can supply a list of correct accession numbers in the form. If you want to select more than one accession number, please separate the accession numbers by commas or spaces. On the Genomatix server accession numbers from the following databases can be entered: |

| CoreSearch Parameters | |
|---|---|
| Length of core | Length of the highly conserved core sequence that is searched for initially. The core sequence need not be identical in all sequences; it may contain a single mismatch at any position. The default length of 7 bp is suitable for most sequence sets. If the sequences are longer than 1000 bp, it may be necessary to increase the length of the core to avoid random matches and to shorten the run time. |
| Minimum number of sequences | The minimum number of sequences in the input set that must contain the core sequence and the final motif. The default is the absolute number of sequences that corresponds to at least 75% of the input sequences. |
| Number of motif matches per sequence | By default, CoreSearch looks for at most one motif match per sequence. In case you assume that the sequences might contain more than one match of the same motif, change this parameter to "any number of repetitions". Note: In case the input set includes more than 250 sequences (i.e. the "large-scale" CoreSearch algorithm is used), this parameter can only be set to "at most one motif match per sequence". |
| Strand | CoreSearch searches for motifs on both the given strand and the reverse complement strand by default. Setting the "Strand" parameter to "search only top strand" will cause CoreSearch to search the given strand only. |
| A priori frequency of nucleotides | By default, CoreSearch uses the nucleotide distribution of the input sequences as a priori frequency. Alternatively, equal distribution of all nucleotides can be used. These parameters are hidden by default. You can use the |
| Min. matrix similarity | CoreSearch defines motifs described as weight matrices. All motif matches have to be identified by the resulting matrix. Therefore, only sequences that reach the minimum matrix similarity are included in the matrix; all other sequences are rejected. If CoreSearch is not able to identify a common motif in the input sequences, it may help to decrease the matrix similarity. These parameters are hidden by default. You can use the |
| Max. number of motifs | CoreSearch will look for up to this number of distinct motifs in the input sequence set. CoreSearch stops when this number of motifs has been found, or when no more motif is found that fulfills the search criteria. Note: In case the input set includes more than 250 sequences (i.e. the "large-scale" CoreSearch algorithm is used), only one motif can be defined. |
| Email address | Here you can choose between two methods for receiving the results: The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management! |
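
The parameters "Length of core", "Minimum number of sequences" and "Strand" can be illustrated with a brute-force sketch of the initial core search. This is not the CoreSearch implementation; the function names, the exhaustive enumeration of candidates, and the Hamming-distance test are assumptions chosen for clarity, not for speed.

```python
from math import ceil

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def contains_core(seq, core, both_strands=True, max_mismatches=1):
    """True if seq holds the core with at most max_mismatches mismatches."""
    k = len(core)
    strands = [seq, reverse_complement(seq)] if both_strands else [seq]
    return any(
        hamming(s[i:i + k], core) <= max_mismatches
        for s in strands
        for i in range(len(s) - k + 1)
    )

def find_cores(sequences, core_length=7, min_fraction=0.75, both_strands=True):
    """Map each candidate core to the number of sequences that contain it."""
    # Absolute number of sequences corresponding to at least min_fraction.
    min_sequences = ceil(min_fraction * len(sequences))
    # Candidates: every k-mer of plain nucleotides seen on a given strand.
    candidates = {
        s[i:i + core_length]
        for s in sequences
        for i in range(len(s) - core_length + 1)
        if set(s[i:i + core_length]) <= set("ACGT")
    }
    hits = {}
    for core in candidates:
        n = sum(contains_core(s, core, both_strands) for s in sequences)
        if n >= min_sequences:
            hits[core] = n
    return hits
```

Calling `find_cores(sequences)` with the defaults mirrors the default settings described in the table (7 bp core, 75% of the input sequences, both strands). The result usually contains several candidate cores, which is why further selection steps are needed.
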
Motif 1: Core CCAATCA detected in 12 sequences, number of final matches: 17
| Sequence Name | Position | Str. | Alignment | Matrix Similarity |
|---|---|---|---|---|
| E-ALPHA | 62 - 76 | (+) | TTAA CCAATCA GAAA | 0.986 |
| MUSMHKBA | 281 - 295 | (+) | AGAA CCAATCA GTGT | 0.989 |
| MSV | 151 - 165 | (+) | CTAA CCAATCA GTTC | 0.977 |
| HUMHB1AZ | 343 - 357 | (+) | CCAG CCAATGA GCGC | 0.956 |
| ADMLP | 82 - 96 | (+) | TAAA CCAATCA CCTT | 0.965 |
| HUMHSP70 | 391 - 405 | (+) | TGGA CCAATCA GAGG | 0.937 |
| rnalobg1 | 110 - 124 | (+) | CGCG CCAATCA GAGT | 0.941 |
| musalbprmt | 65 - 79 | (+) | GGAA CCAATGA AATG | 0.929 |
| E-BETA | 239 - 253 | (+) | GGAG CCAATCA GCAT | 0.986 |
| mmlapglob | 162 - 176 | (+) | CCAG CCAATGA GAAC | 0.949 |
| XLHSP70 | 232 - 246 | (+) | TTAG CCAATCA AGGC | 0.982 |
| SUSH2B1G_2 | 156 - 170 | (+) | CCTA CCAATCA AAGC | 0.924 |
| HUMHSP70 | 307 - 321 | (+) | TGAG CCAATCA CCGA | 0.986 |
| rnalobg1 | 92 - 106 | (-) | ACAG CCAATCT TTGT | 0.873 |
| musalbprmt | 104 - 118 | (-) | TTAA CCAATAA CTGT | 0.937 |
| XLHSP70 | 316 - 330 | (+) | TTAG CCAATCA GCAA | 0.988 |
| SUSH2B1G_2 | 188 - 202 | (+) | GTAG CCAATCA GAAA | 0.987 |
| Conservation profile | | | [conservation histogram over the 15 aligned positions] | |
| IUPAC consensus | | | NNAR CCAATCA GNRN | Re-value: 0.04 |

Please note that CoreSearch detected two motif matches in five of the input sequences (HUMHSP70, rnalobg1, musalbprmt, XLHSP70, and SUSH2B1G_2). The motif found is similar to all matrix families describing the binding sites of CCAAT box binding proteins.
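
The IUPAC consensus can be recomputed from the 17 aligned matches of the CCAAT example. The sketch below assumes a common consensus convention: a single nucleotide is used if it occurs in more than 50% of the sequences and at least twice as often as the second most frequent one; a two-nucleotide IUPAC code is used if the two most frequent nucleotides together exceed 75%; otherwise N. This page does not state which rule CoreSearch applies, so the rule and the function names are assumptions.

```python
from collections import Counter

# Two-letter IUPAC ambiguity codes.
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def consensus_symbol(column):
    """IUPAC symbol for one alignment column under the assumed rule."""
    ranked = Counter(column).most_common()
    n = len(column)
    top_base, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0
    if top_count / n > 0.5 and top_count >= 2 * second_count:
        return top_base
    if len(ranked) > 1 and (top_count + second_count) / n > 0.75:
        return IUPAC_PAIRS[frozenset((top_base, ranked[1][0]))]
    return "N"

def iupac_consensus(aligned):
    """Consensus string for equal-length aligned sequences (A, C, G, T only)."""
    return "".join(consensus_symbol(col) for col in zip(*aligned))

# The 17 aligned matches of Motif 1, written without the spacing around the core.
CCAAT_MATCHES = [
    "TTAACCAATCAGAAA", "AGAACCAATCAGTGT", "CTAACCAATCAGTTC",
    "CCAGCCAATGAGCGC", "TAAACCAATCACCTT", "TGGACCAATCAGAGG",
    "CGCGCCAATCAGAGT", "GGAACCAATGAAATG", "GGAGCCAATCAGCAT",
    "CCAGCCAATGAGAAC", "TTAGCCAATCAAGGC", "CCTACCAATCAAAGC",
    "TGAGCCAATCACCGA", "ACAGCCAATCTTTGT", "TTAACCAATAACTGT",
    "TTAGCCAATCAGCAA", "GTAGCCAATCAGAAA",
]

print(iupac_consensus(CCAAT_MATCHES))  # NNARCCAATCAGNRN
```

Under this assumed rule the result agrees with the consensus NNAR CCAATCA GNRN reported for this example.
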
| Motif | Re-value | IUPAC consensus |
|---|---|---|
| U$s1_STAT1_min100_read | 0.70 | NTTCCAGGAAN... |
| U$s2_STAT1_min100_read | 0.20 | NTTCYRGGAANNGN |
| U$s3_STAT1_min100_read | 1.29 | NTTCCAGGAAN... |
| U$s4_STAT1_min100_read | 0.80 | NTTCCAGGAAN... |
| U$s5_STAT1_min100_read | 0.77 | NTTCCAGGAAN... |
Average similarity of motifs: 0.648
At least one motif match found in 2419 of 2428 sequences.
| Number of aligned sequences: | 2302 |
| Number of rejected sequences: | 117 |
| Sequence Name | Position | Str. | Alignment | Matrix Similarity |
|---|---|---|---|---|
| Region_3 | 274 - 286 | (-) | GCTTCCAGGAAGT | 0.893 |
| Region_4 | 419 - 431 | (+) | ACTTCTCGGAAAT | 0.877 |
| Region_5 | 523 - 535 | (-) | ATTTCCAGTAAAC | 0.923 |
| Region_6 | 587 - 599 | (-) | ATTTCTGGGAAAA | 0.904 |
| Region_7 | 447 - 459 | (+) | TGTTCTGGGAATT | 0.955 |
| Region_8 | 410 - 422 | (-) | GCTTCCGGGAAAT | 0.903 |
| Region_10 | 402 - 414 | (+) | AATTTCAGGAAAT | 0.829 |
| Region_11 | 372 - 384 | (-) | CCGTCCACGAAAG | 0.883 |
| Region_12 | 330 - 342 | (-) | GATTTCAGGAACA | 0.952 |
| Region_14 | 873 - 885 | (-) | GTTTCTAGGAAAC | 0.906 |
| Region_15 | 437 - 449 | (-) | ATGTCCAGGAAAA | 0.844 |
| ... | ... | ... | ... | ... |
| Conservation profile | | | [conservation histogram over the 13 aligned positions] | |
| IUPAC consensus | | | NNTTCCAGGAANN | Re-value: 0.68 |

The motif found correctly describes a STAT1 binding site (two 4 bp half sites separated by a spacer of 1 bp). Furthermore, the motif is similar to the matrix family V$STAT which describes binding sites of STAT factors.
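
The "Matrix Similarity" column and the rejection of sequences below the minimum matrix similarity can be illustrated with a simplified score. The sketch below sums the frequency of the observed nucleotide at each position and divides by the best achievable sum. This is an assumed stand-in and not the Genomatix matrix similarity formula, so its values will not reproduce the numbers in the tables above; all function names are hypothetical.

```python
from collections import Counter

def frequency_matrix(aligned):
    """Per-column nucleotide frequencies of equal-length aligned matches."""
    return [
        {base: count / len(col) for base, count in Counter(col).items()}
        for col in zip(*aligned)
    ]

def simple_similarity(site, matrix):
    """Sum of observed-base frequencies, scaled by the best possible sum."""
    observed = sum(col.get(base, 0.0) for base, col in zip(site, matrix))
    best = sum(max(col.values()) for col in matrix)
    return observed / best

def filter_matches(sites, matrix, min_similarity):
    """Split candidate sites into accepted and rejected by a similarity cutoff."""
    accepted = [s for s in sites if simple_similarity(s, matrix) >= min_similarity]
    rejected = [s for s in sites if simple_similarity(s, matrix) < min_similarity]
    return accepted, rejected
```

Lowering `min_similarity` moves sites from the rejected to the accepted list, which corresponds to the advice to decrease the matrix similarity when no common motif is found.
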
If you are interested in more details, the original CoreSearch algorithm is described in
| © 1998-2011 Genomatix Software GmbH - All rights reserved |