
CoreSearch: Definition of common motifs



Introduction

CoreSearch is a tool to define unknown common motifs in a set of unaligned DNA sequences. CoreSearch starts with a search for a highly conserved core sequence (called "tuple" in the original publication) which occurs in almost all of the input sequences. In most cases this initial search defines more than one core. Consecutive selection steps are employed in order to reduce the number of core candidates as soon as possible. The selection is based on maximization of the information content (consensus index), first of the core and then of regions around the core.
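The selection criterion mentioned above can be illustrated with a small sketch. It assumes that the motif matches are already aligned and of equal length, and it uses the standard Shannon information content (in bits) as a stand-in, since the exact consensus index formula is not given on this page. The function name `column_information` and the example sites are hypothetical.

```python
import math
from collections import Counter


def column_information(sites):
    """Shannon information content (bits) per column of aligned DNA sites.

    Assumption: this is a generic stand-in for the consensus index,
    not the formula used by CoreSearch itself.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    info = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        # 2 bits is the maximum for a four-letter alphabet.
        info.append(2.0 - entropy)
    return info


# Hypothetical aligned matches of one core candidate.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
profile = column_information(sites)
```

Fully conserved columns reach 2 bits, while columns with substitutions score lower; a candidate with a higher total would be preferred under this simplified criterion.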

CoreSearch features:

CoreSearch has been considerably improved compared to the version published in 1996. The most important improvement is the running time: the new CoreSearch version is 10 times faster than the old version.


New in CoreSearch 6.0 (April 2009):

CoreSearch accepts an unlimited number of input sequences (e.g. from ChIP-Seq experiments). If more than 250 sequences are uploaded to CoreSearch, a new fast algorithm is used to define the common motif. Defining a motif in about 2000 ChIP-Seq regions usually takes less than 1 minute.

The following steps are performed by the "large-scale" CoreSearch version:


Input

General: Sequence Formats
Accepted DNA sequence formats
The following formats for DNA sequences are accepted:
There should be only IUPAC characters in the sequence; any other characters will be skipped!
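As an illustration of the rule that non-IUPAC characters are skipped, the following minimal sketch filters a raw sequence string. It is not the server's parser: the set of accepted letters (the IUPAC nucleotide codes including U), the case handling, and the name `keep_iupac` are assumptions made for this example.

```python
# Assumption: the IUPAC nucleotide codes, including U and the ambiguity codes.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def keep_iupac(raw):
    """Drop every character that is not an IUPAC nucleotide code (case-insensitive)."""
    return "".join(ch for ch in raw if ch.upper() in IUPAC_NUCLEOTIDES)


# Digits, whitespace, and symbols are skipped; the letters keep their case.
cleaned = keep_iupac("ACGT 123\nnnRY-xz*TT")
```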
Sequence Input
Choose from your previously uploaded sequences: Select a sequence file from the list of your personal sequence files that were saved in the result management in prior analyses (via "add sequences", see below).
Quick Upload new: Paste your sequence(s) in the form field in one of the accepted formats (see above). Note that sequences pasted in the "quick upload" field are not saved for future use.
Add sequences

Sequences or sequence files uploaded here are automatically saved in the result management for later use:

Enter the formatted DNA sequence(s): Enter your correctly formatted sequence(s) directly into the form, e.g. with copy and paste (see above for accepted formats).
or upload a file containing sequence(s) (max. 100 MB): If your browser supports this option, a sequence file can be uploaded.
If you use this option, the file should contain the sequence(s) in one of the formats listed above.
Please note that the size of uploaded files is limited to 100 MB. If you want to analyze larger sequences, please contact support@genomatix.de. For whole chromosomes you can use the accession number option below (e.g. 'NC_000001' for human chromosome 1).
Accession number(s): If you are interested in one or several specific sequences from a database section, you can supply a list of accession numbers. If you want to select more than one accession number, please separate the accession numbers by commas or spaces.

On the Genomatix server accession numbers from the following databases can be entered:

  • GenBank (sections Bacteria, Invertebrates, Other Mammalian, Other Vertebrates, Plants, Primates, Rodents, Viruses, ESTs) (e.g. 'M65229')
  • Eukaryotic Promoter Database (EPD) (e.g. 'EP30014')
  • NCBI Reference Sequences (mRNA sequences) (e.g. 'NM_000402')
  • Genomatix Promoter Database (e.g. 'GXP_107276')
  • dbSNP (e.g. 'rs1234')
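The accession number field accepts entries separated by commas or spaces. A minimal sketch of how such a list could be split before submission is shown here; the function name `parse_accessions` is hypothetical, and the sketch does not check whether an accession actually exists in one of the databases listed above.

```python
import re


def parse_accessions(text):
    """Split a user-supplied list on commas and/or whitespace, dropping empty entries."""
    return [token for token in re.split(r"[,\s]+", text.strip()) if token]


# Example identifiers taken from the database list above.
accessions = parse_accessions("M65229, NM_000402 GXP_107276,rs1234")
```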

CoreSearch Parameters

Length of core: Length of the highly conserved core sequence that is searched initially. The core sequence need not be identical in all sequences; it may contain a single mismatch at any position.

The default length of 7 bp is suitable for most sequence sets. If the sequences are longer than 1000 bp, it may be necessary to increase the length of the core to avoid random matches and to reduce the run time.

Minimum number of sequences: The minimum number of sequences in the input set that have to contain the core sequence and the final motif. The default is the absolute number of sequences that corresponds to at least 75% of the input sequences.
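How the core length, the single allowed mismatch, and the minimum number of sequences interact can be sketched with a brute-force search. This is only an illustration under simplifying assumptions, not the CoreSearch algorithm: candidates are restricted to k-mers that occur in the input, only the given strand is searched, and the names `candidate_cores`, `core_length`, and `min_fraction` are hypothetical.

```python
import math


def hamming(a, b):
    """Number of mismatching positions between two equally long strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_cores(sequences, core_length=7, min_fraction=0.75):
    """Return {core: support} for k-mers found, with at most one mismatch,
    in at least ceil(min_fraction * number of sequences) sequences."""
    quorum = math.ceil(min_fraction * len(sequences))
    # Distinct k-mers of every sequence (given strand only).
    kmers_per_seq = [
        {seq[i:i + core_length] for i in range(len(seq) - core_length + 1)}
        for seq in sequences
    ]
    candidates = set().union(*kmers_per_seq)
    hits = {}
    for core in candidates:
        # A sequence supports the core if any of its k-mers differs in <= 1 position.
        support = sum(
            any(hamming(core, kmer) <= 1 for kmer in kmers)
            for kmers in kmers_per_seq
        )
        if support >= quorum:
            hits[core] = support
    return hits


# Hypothetical input: three sequences share a core, one does not.
sequences = ["AATGACTCAGG", "CCTGACTCATT", "GGTGAGTCAAA", "TTTTTTTTTTT"]
hits = candidate_cores(sequences)
```

With four input sequences, the 75% default corresponds to a quorum of three, so the core TGACTCA qualifies even though one sequence contains it only with a single mismatch.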
Number of motif matches per sequence: By default, CoreSearch looks for at most one motif match per sequence. If you assume that the sequences might contain more than one match of the same motif, change this parameter to "any number of repetitions".

Note: In case the input set includes more than 250 sequences (i.e. the "large-scale" CoreSearch algorithm is used), this parameter can only be set to "at most one motif match per sequence".

Strand: CoreSearch searches for motifs on both the given strand and the reverse complement strand by default. Setting the "Strand" parameter to "search only top strand" will cause CoreSearch to search the given strand only.
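The effect of the "Strand" setting can be pictured with a short sketch that builds the list of strands to be scanned. The helper names `reverse_complement` and `strands_to_search` are hypothetical, and only the unambiguous bases A, C, G, and T are complemented here; other characters are left unchanged.

```python
# Translation table for the unambiguous bases (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strands_to_search(seq, both_strands=True):
    """Given strand only, or given strand plus its reverse complement."""
    return [seq, reverse_complement(seq)] if both_strands else [seq]


strands = strands_to_search("TGACTCA")
```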
A priori frequency of nucleotides: By default, CoreSearch uses the nucleotide distribution of the input sequences as the a priori frequency. Alternatively, an equal distribution of all nucleotides can be used.
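The two choices for the a priori frequency can be sketched as follows. The function name `background_frequencies` and its `use_input_distribution` flag are hypothetical, and ignoring ambiguity codes when counting is an assumption of this example.

```python
from collections import Counter


def background_frequencies(sequences, use_input_distribution=True):
    """Return {base: frequency} for A, C, G, and T.

    Either estimated from the input sequences or set to an equal distribution.
    """
    if not use_input_distribution:
        return {base: 0.25 for base in "ACGT"}
    # Assumption: ambiguity codes and other characters are not counted.
    counts = Counter(ch for seq in sequences for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A, C, G, or T found in the input sequences")
    return {base: counts[base] / total for base in "ACGT"}


from_input = background_frequencies(["AACG", "TTAA"])
uniform = background_frequencies(["AACG", "TTAA"], use_input_distribution=False)
```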

Min. matrix similarity: CoreSearch defines motifs described as weight matrices. All motif matches have to be identified by the resulting matrix. Therefore, only sequences that reach the minimum matrix similarity are included in the matrix; all other sequences are rejected.

If CoreSearch is not able to identify a common motif in the input sequences, it may help to decrease the minimum matrix similarity.

Max. number of motifs: CoreSearch will look for up to this number of distinct motifs in the input sequence set. CoreSearch stops when this number of motifs has been found, or when no further motif is found that fulfills the search criteria.

Note: In case the input set includes more than 250 sequences (i.e. the "large-scale" CoreSearch algorithm is used), only one motif can be defined.

Email address: Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    With this option, the URL of the result is shown directly in your browser window.

    Warning: Please use this option only for analyses that can be performed in a short time.
    If the analysis takes longer than the timeout of the web server, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available; please restart the analysis using the option "Send the URL of the result via email" below.

  • Send the URL of the result via email
    With this option, an email with the URL of the results will be sent to the email address you provide when the analysis is finished.

The results will be available for a limited time on our server. For details on how long your results will be kept, please see the result email. After that period they will be deleted unless protected in the project management!


CoreSearch Output

For each motif found, CoreSearch displays all motif matches including the following information:
Furthermore, the conservation profile of the motif and the IUPAC consensus sequence are given. The similarity of the motif to transcription factor binding site descriptions in the Genomatix matrix library is shown as additional information. For the "large-scale" version, intermediate analysis results (motifs defined in randomly selected subsets) are also given.

Extraction Options


Example output


References

If you are interested in more details, the original CoreSearch algorithm is described in