
Genomatix: Complete Workflow for ChIP-Seq Analysis
Genomatix: Peak finding in NGS Data
(only available on GGA)


[ Introduction] [ Parameters] [ Output]

Introduction

The ChIPSeq-Workflow task is designed to allow a complete analysis of ChIPSeq data (including replicates), starting with read statistics, followed by a peak finding step and several downstream analysis options like evaluating transcription factor binding site overrepresentation in the resulting peaks or definition of new TF binding sites from the peak regions.

The Peak Finding task is a subset of the ChIPSeq Workflow, but allows more general input data (from ChIP-Seq/RNA-Seq experiments) and stops with the called peaks, i.e. no automatic downstream analyses. It analyses the input tags, finds significant regions, and optionally evaluates peaks with control file(s).
The parameters for the peak finding step of the ChIPSeq-Workflow are the same as in the Peak Finding task and are described on this page.

For the peak finding step three different methods can be used, for details please see the corresponding pages or the parameter details below:

The MACS peak finding algorithm should only be used for ChIPSeq data. MACS2 can be used for ChIPSeq data as well as for histone data (then with the broad option). SICER is recommended for histone modifications, whereas NGSAnalyzer can be used for finding peaks in ChIPSeq and RNASeq data.

If the user supplies a control data set (no replicates), there are different strategies on how the control data is used to evaluate the peaks:

Please, see this help page for a detailed description of the different evaluation strategies.

When using replicate data as input, there are different options on how the data is treated. First all input data ('treatment') is clustered separately and then the peaks from the different replicates can be either merged or intersected (parameters supplied by user, see parameter section below).
When the final peaks are computed, all input reads from the input and control replicates are distributed on the peaks and the counts are used for significance analysis by DESeq or edgeR.
Only significantly enriched peaks are then used for the following downstream steps.
See the table below for an overview of input and parameter combinations.

Most analysis steps are optional (except the peak finding step) and have default parameters, allowing a highly flexible first complete look at ChIPSeq data.

As all intermediate results (like peaks/clusters and their corresponding sequences) can be saved from the result page, they can directly be used for other downstream analyses (e.g. TF analysis, subset selection based on user-defined criteria in the BED file toolbox, or using CoreSearch with all available parameters).

Differential Analysis

If the read counts in peaks for two different conditions (here called "treatment" and "control" for simplicity) are to be compared, the following statistical testing methods for evaluating differences in read abundance are available:

While the Audic-Claverie-method does not handle replicates, 'DESeq2', 'DESeq' and 'edgeR' were developed specifically for replicate data.

Audic and Claverie introduced a formula to compute a conditional probability for observing N reads (treatment) in a class given that M reads were observed before (control). These p-values, in combination with the Genomatix normalized expression (NE) value are used to evaluate differential expression.
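
The conditional probability can be sketched as follows. This is a minimal illustration of the published Audic-Claverie formula, p(y|x) = (N2/N1)^y * (x+y)! / (x! * y! * (1 + N2/N1)^(x+y+1)), and not the Genomatix implementation. The function names and the one-sided tail sum used as a p-value are assumptions made for this example.

```python
import math

def audic_claverie_prob(x, y, n1, n2):
    """P(y | x): probability of seeing y reads in a region in a sample with
    n2 total reads, given x reads in the same region in a sample with n1
    total reads. Computed in log space to avoid overflow of the factorials."""
    r = n2 / n1
    log_p = (y * math.log(r)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log(1 + r))
    return math.exp(log_p)

def enrichment_pvalue(x, y, n1, n2):
    """Assumed one-sided p-value: probability of observing y or more reads
    in the second sample, given x reads in the first one."""
    below = sum(audic_claverie_prob(x, k, n1, n2) for k in range(y))
    return max(0.0, 1.0 - below)
```

Here x would be the read count in the control (M in the text above) and y the read count in the treatment (N).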

The 'DESeq2', 'DESeq' and 'edgeR' methods all model count data (here the number of reads from a ChIP-Seq experiment within a region) by a negative binomial distribution. The parameters of the distribution (mean and dispersion) are estimated from the data, i.e. from the read counts in the input files. All three methods compute a measure of read abundance (called 'base mean' in DESeq/DESeq2, and 'concentration' in edgeR) for each region and apply a hypothesis test to each region to evaluate the difference in read abundance. In particular, each method determines a p-value and a log2 fold change (in read abundance) for each region.

For defining enriched and depleted peaks between two conditions or samples, the following criteria are used (parameters set by the user):

Note that the first input set is regarded as "treatment", whereas the second input file is used as "control", i.e. "enrichment" refers to a higher read abundance in set1 than in set2. Also note that the direction of enrichment and depletion will change if the two data sets are exchanged in the input.


Parameters

Input
Input file(s) with read positions from ChIPSeq ("Sample"/"Treatment")

Input data are accepted in BED / bigBed file format or BAM file format containing the input regions. For some tasks BAM support might not be available.
The maximum amount of input regions and their maximum length can differ for the various tasks. The limits are usually shown on top of the input pages.

Within this section you can either
  • choose from previously uploaded BED/BAM files
  • or add a new BED or BAM file to the list (by clicking "Add BED/BAM file...")
For tasks that allow replicate data as input, you can use the shift/ctrl keys to select multiple files from the list. All selected files will then be treated as replicates.

When adding a new file, a new window will open, asking you to either

  • upload one or several BED/BAM files from your local computer
  • or import one or several BED/BAM files from the GMS (see more details)
  • or import one or several BED/BAM files from the GGA (see more details)
For the new BED/BAM files, you will have to select the correct organism, as the organism and the genome build are associated with the BED file for future use (the default is your latest choice in the current session).
Note that files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a file. You can see the list of genomes available in ElDorado.

Note that almost all browsers have a general upload limit of 2 GB, i.e. files bigger than this size should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.

Optionally you can specify a name for saving uploaded files on the server, otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as prefix for each file name.

If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.

After one or several BED/BAM files were uploaded successfully, and after closing the popup window, the list of available BED/BAM files will be automatically updated.

Uploaded BED or BAM files can be deleted from the project anytime via the project management.

Control file(s)
Optional Control file(s)
for differential analysis

If one or several input control files (replicates) are available, they can be uploaded here. This is an optional field, so if no control files are available the checkbox should be left blank.

The upload options for control files are hidden by default. They appear only when checking the checkbox "Use second set of input files (control files) for differential analysis".
Workflow parameters
Read Classification: When checked, a read classification is done for each BED file from the input data. The number of input reads overlapping genomic elements like exons, introns, promoters and intergenic regions will be given in the result.
Peak Finding / Clustering: The peak finding step is mandatory, as all subsequent analysis steps rely on the results of the peak finding.
For the peak finding / clustering one of three different algorithms can be selected:

Depending on the user selection, further parameters appear for the respective algorithm:

NGS Analyzer parameters
Window size: The window size used for the peak finding algorithm in base pairs. The default window size is 100 bp.
Minimum number of reads per peak: A threshold for the peak finding algorithm. This number can be automatically calculated from the input data by applying a Poisson distribution. Otherwise, values above 3 are allowed here.
Strand specificity: Strand specificity of the sequencing experiment.
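
How a minimum read count could be derived from a Poisson distribution is sketched below. This is only an illustration under stated assumptions: the expected number of reads per window is taken as (number of reads * window size / genome size), and the significance level alpha and all function names are hypothetical. The calculation actually used by NGS Analyzer may differ.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam); intended for small lam and k."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)

def min_reads_per_peak(n_reads, window_size, genome_size, alpha=0.05):
    """Smallest read count per window that is unlikely (probability < alpha)
    to occur if reads were spread randomly over the genome."""
    lam = n_reads * window_size / genome_size   # assumed background rate
    k = 1
    while poisson_tail(k, lam) >= alpha:
        k += 1
    return k
```

With more reads or a larger window the expected background count rises, and so does the resulting threshold.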

MACS/MACS2 parameters
Narrow/broad peaks
(only for MACS2)
When the broad peak flag is on, MACS2 will try to compose broad regions by joining nearby highly enriched regions into a broad region, using a loose cutoff.
When broad peaks are called, the no-model option is automatically set (i.e. this overrides the mfold parameters below).
Suggested for some histone modifications, not for ChIPSeq.
Tag size: The length of the input tags/reads. This value can be determined from the input file by reading the first 100 regions and calculating the average region length.
q-value
(only for MACS2)
The q-value (minimum FDR) cutoff to call significant regions. Default is 0.01.
For broad marks, you can try 0.05 as cutoff.
Q-values are calculated from p-values using Benjamini-Hochberg procedure.
p-value
(only for MACS1.4)
P-value cutoff for peak detection. Default is 1e-5.
Bandwidth: This value is used while building the shifting model. Default is 300.
model fold (mfold) parameter: The upper and lower model fold values for MACS to select the regions with a high-confidence enrichment ratio against background to build a model. If no models are found, the no-model option is used by MACS automatically.
Redundancy threshold: The number of copies of identical reads allowed in a library. Values can be 'auto', 'all' or an integer:
  • 'auto': MACS calculates the maximum tags at the exact same location based on binomial distribution using 1e-5 as p-value cutoff
  • 'all': all input tags are kept
  • integer: at most this number of tags will be kept at the same location
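
The effect of the 'all' and integer settings of the redundancy threshold can be sketched as follows. This is a simplified illustration, not MACS code: tags are assumed to be (chromosome, start, end, strand) tuples, "same location" is assumed to mean an identical tuple, and the function name is hypothetical. The 'auto' setting is not covered here.

```python
from collections import Counter

def limit_duplicates(tags, max_copies):
    """Keep at most max_copies tags per identical location.
    max_copies is an integer, or the string 'all' to keep every tag."""
    seen = Counter()
    kept = []
    for tag in tags:
        seen[tag] += 1
        if max_copies == "all" or seen[tag] <= max_copies:
            kept.append(tag)
    return kept
```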

SICER parameters
Redundancy threshold: The number of copies of identical reads allowed in a library.
Window size: Resolution of the SICER algorithm. For histone modifications, one can use 200 bp.
Note from the SICER manual: The choice of window size and gap size has a large effect on the outcome. In general, the broader the domain, the bigger the gap should be. For histone modification H3K4me3, W=200 and gap = 1 window are suggested. For H3K27me3, W=200 and gap = 3 windows are suggested for a first try. If an even bigger gap size is found to work better, you might also want to try increasing the window size (e.g. window size = 1K and gap = 3 windows).
Fragment size: Determines the amount of shift from the beginning of a read to the center of the DNA fragment represented by the read. FRAGMENT_SIZE=150 means the shift is 75.
Gap size: Needs to be a multiple of the window size. For example, if the window size is 200, the gap size should be 0, 200, 400, 600, ...
FDR: The FDR is calculated using p-values adjusted for multiple testing, following the approach developed by Benjamini and Hochberg.
E-value: Number of islands expected in random background; only used if no control data is supplied.
Note: The E-value is not a p-value. Suggestion for a first try on histone modification data: E-value=100. If you find ~10000 islands using this E-value, an empirical estimate of the FDR is 1E-2.
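
The arithmetic behind the fragment size, gap size and E-value notes can be written out as a small sketch. The function names are hypothetical and only restate the rules given above; they are not part of SICER.

```python
def read_shift(fragment_size):
    """Shift from the read start to the assumed fragment center."""
    return fragment_size // 2

def is_valid_gap(gap_size, window_size):
    """The gap size must be a non-negative multiple of the window size."""
    return gap_size >= 0 and gap_size % window_size == 0

def empirical_fdr(e_value, n_islands):
    """Rough FDR estimate: expected random islands / islands found."""
    return e_value / n_islands
```

For example, a fragment size of 150 gives a shift of 75, a gap of 600 is valid for a window size of 200 whereas 300 is not, and an E-value of 100 with about 10000 islands found gives an estimated FDR of 0.01.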

Replicate parameters
Replicate Treatment: When using replicate data as input, first all input data is clustered separately. Then the peaks from the different replicates from the treatment group are either merged or intersected.
To this end, first a coverage profile is created. This helps to identify contiguous peaks from the different treatment peak files. In particular, for each position within a contiguous region the number of the replicates (from the treatment group) having this position in common is calculated (i.e. how many replicates contain a peak covering this position).
There are two parameters to control how the merging/intersection is done:
  • minimum number of replicates with peak at a position:
    This parameter can be used to set the required fraction of replicates in a peak. If for example there are 3 replicates, and the parameter is set to 60%, it is sufficient that 2 replicates have overlapping peaks. (2/3 ≈ 67%).
  • minimum length of common region:
    Only "candidate regions" which have a common region of the specified length will be kept.
The following example illustrates the effect of the parameters. Let's assume we do have three replicates (r1, r2, r3) with overlapping regions at different positions, as illustrated by the following graphic that also includes the coverage at each position:
r1 ----------     -----          -------------------
r2       -----------    ------     -----        -------
r3    ---------------              -----        -----------------------------
   11122233332222233211011111100011333331111111133332221111111111111111111111 <- coverage profile

With the minimum number of replicates set to 60% and a minimum region length of 10 we would get the following results ('+' denotes positions where the threshold for minimum number of replicates is met):
Merging peaks:
   --++++++++++++++++-- ------   --+++++--------+++++++----------------------
   M1                   M2       M3

Intersecting peaks:
     ++++++++++++++++              +++++        +++++++
     I1                            I2           I3
  • For the peak M1 we have a length of 20 nucleotides, with 15 covered by at least 2 peaks, so the required replicate coverage of 67% (2 of 3 replicates) is reached for 15 out of 20 nucleotides. These regions would therefore be merged into a peak.
  • For M2 we do only have a single region from r2 which has no overlap and thus fails for both thresholds.
  • For M3 we have a length of 44, with at least 2 common regions for 12 nucleotides, satisfying both thresholds.
  • For I1 we have 16 positions where both thresholds are met, therefore I1 is a peak.
  • I2 is only 5 positions long, below the minimum of 10 and I2 is rejected.
  • For I3 there are 7 positions, which is still below the minimum, so I3 is rejected as well.
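
The merging and intersecting logic of the example can be sketched in code. This is one reading of the description above and not the Genomatix implementation. Assumptions: peaks are half-open (start, end) intervals on a single chromosome; an intersected peak is a run of positions where the replicate threshold is met and which has at least the minimum length; a merged peak is a contiguous region covered by at least one replicate that contains such a run. All names are hypothetical.

```python
def coverage_profile(replicates, length):
    """Number of replicates with a peak covering each position."""
    cov = [0] * length
    for peaks in replicates:
        covered = set()
        for start, end in peaks:            # half-open interval [start, end)
            covered.update(range(start, end))
        for pos in covered:
            cov[pos] += 1
    return cov

def runs(flags):
    """Return half-open (start, end) runs of consecutive True values."""
    result, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            result.append((start, i))
            start = None
    if start is not None:
        result.append((start, len(flags)))
    return result

def combine_replicates(replicates, length, min_percent, min_common_len, mode):
    """mode is 'merge' or 'intersect'; min_percent is an integer percentage."""
    # Replicates needed to reach min_percent (integer ceiling division).
    need = max(1, -(-min_percent * len(replicates) // 100))
    cov = coverage_profile(replicates, length)
    common = [(s, e) for s, e in runs([c >= need for c in cov])
              if e - s >= min_common_len]
    if mode == "intersect":
        return common
    # Merge: keep contiguous covered regions that contain a common region.
    return [(s, e) for s, e in runs([c >= 1 for c in cov])
            if any(s <= cs and ce <= e for cs, ce in common)]
```

With three replicates and min_percent set to 60, two replicates are required at a position, as in the example above.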
Afterwards, the reads from all input BED files, treatment and control, are distributed on the new regions (found by merging resp. intersecting the peaks from the treatment as described above). This way, for each region we find the number of reads contained in this region, for each condition and replicate. Based on these count data the chosen statistical test is applied to evaluate the difference in read abundance between treatment and control.
Differential Analysis / Peak Evaluation with control
Differential Analysis parameters
The original algorithms for peak evaluation from MACS/MACS2 / SICER can be selected in this section if
  • MACS/MACS2 or SICER was selected for peak finding and
  • exactly one input and one control file
are used as input for the ChIPSeq workflow.

Additionally there are the following options:

The differential analysis parameter section will only appear, if at least one control file was uploaded in the section above.
There are four available algorithms for calculating the differential expression/enrichment values:

  • DESeq (recommended for replicate data, but does work on non-replicates, too)
    It is possible to select the 'fitting type' parameter for DESeq, i.e. the way the curve is fitted through the dispersion estimates. For details on the meaning of this parameter please refer to the DESeq vignette.
  • DESeq2 (recommended for replicate data, but does work on non-replicates, too)
    As for DESeq, it is possible to select the 'fitting type' parameter for DESeq2, i.e. the way the curve is fitted through the dispersion estimates.
    Additionally, DESeq2 offers two alternative methods for testing for differential expression: the Wald test and the likelihood-ratio test (with the Wald test being the default).
    For details on the meaning of the parameters please refer to the DESeq2 vignette.
  • edgeR (recommended for replicate data, does not work on non-replicates)
  • the procedure introduced by Audic and Claverie (if no replicates are available)
For a short introduction to the different methods, see above in the Introduction to the Differential Analysis.

The thresholds that define a transcript as differentially expressed (or a region as enriched/depleted) can be set here. There are two criteria, that are combined (both must be satisfied for differential expression/enrichment):

  • an adjusted p-value threshold for the significance of observing the detected change
    Note that the p-values calculated by the different methods (DESeq/DESeq2, edgeR, Audic-Claverie) can differ.
    Also note that setting the p-value threshold to 1 effectively disables this criterion.
  • a threshold for the log2 fold change of expression/enrichment level
    A log2 ratio of 1 is a fold change of 2; a log2 ratio of 0.585 is a fold change of 1.5; e.g. if the log2 fold change of expression/enrichment is set to ≥ 1, the expression values must go up by at least 100% to appear in the differentially expressed transcripts/enriched regions list.
    The log2 fold change thresholds can be set separately for up- and down-regulation (enrichment/depletion).
    Note, that by setting the log2 fold change thresholds to 0, fold changes are ignored in the analysis.
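
How the two criteria combine can be sketched as follows. This is a hypothetical helper, not the code used by the workflow: the default threshold values are assumptions, and a region with a log2 fold change of exactly 0 is treated here as neither enriched nor depleted.

```python
def classify_region(adj_p, log2_fc, p_max=0.05,
                    min_log2_up=1.0, min_log2_down=1.0):
    """Apply both criteria: adjusted p-value and log2 fold change."""
    if adj_p > p_max:
        return "not significant"
    if log2_fc > 0 and log2_fc >= min_log2_up:
        return "enriched"
    if log2_fc < 0 and -log2_fc >= min_log2_down:
        return "depleted"
    return "not significant"

def fold_change(log2_ratio):
    """Convert a log2 ratio to a fold change, e.g. 1 -> 2 and 0.585 -> ~1.5."""
    return 2.0 ** log2_ratio
```

Setting p_max to 1 or the fold change thresholds to 0 switches off the respective criterion, as described above.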

Downstream Analysis (only in ChIPSeq-Workflow)
Peak Classification: When this option is checked, a peak classification and statistics is done. The number of peaks (defined in the previous peak finding step) overlapping genomic elements like exons, introns, promoters and intergenic regions will be given in the result.
Sequence Extraction: When this option is checked, the DNA sequence of all peak regions defined in the peak finding step is extracted in FASTA format. The resulting sequence file can be downloaded from the result and be used for further downstream analysis.
TFBS Overrepresentation
When this option is checked, the peak sequences are checked for transcription factor binding sites (TFBS) and statistics on TFBSs together with overrepresentation values and Z-scores are generated.
The binding sites are based on the matrix families as available in MatBase and searched by MatInspector

For more details please see the task Overrepresented TFs.
Definition of new TFBS: If this option is checked, a subset of peak sequences is selected and then the Genomatix program CoreSearch is used to define new common motifs from the peak sequences.
The program automatically sorts the peak sequences by quality, and the number of "best sequences" used for CoreSearch can be set by the user. Here, the definition of "best" peak sequences depends on the selected peak finding algorithm:
  • for NGSAnalyzer (no control) the peaks are sorted by decreasing NE-value
  • for MACS/MACS2 (no control) the peaks are sorted by decreasing MACS score
  • for SICER (no control) the peaks are sorted by decreasing SICER score
  • if an Audic-Claverie evaluation was used, the peaks are sorted by increasing p-values
  • if DESeq/edgeR evaluation was used, the peaks are sorted by increasing p-values
The sequences actually used for CoreSearch can also be downloaded from the result.
Output
Result: Here, you can edit the default name of the result file.
Email address: Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    In this option the URL of the result is directly shown in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available, please restart the analysis using the option below "Send the URL of the result to".

  • Send the URL of the result via email
    In this option an email with the URL of the results will be sent to the user provided email address, when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!

We recommend using the email option for the ChIPSeq Workflow!

Please note, that a job started with the "show result in browser" option, will finish correctly even if the display in the window stops, and can then be found in the result management.

Output

The output generally consists of several sections:
  1. Analysis Parameters
  2. Main Results, depending on user selection
    • Read Classification
    • Peak Finding / Clustering
    • Peak Classification (only in ChIPSeq-Workflow)
    • Sequence Extraction (only in ChIPSeq-Workflow)
    • TFBS Overrepresentation (only in ChIPSeq-Workflow)
    • Definition of new TFBS (only in ChIPSeq-Workflow)
  3. Download of Data Files

1. Analysis Parameters


2. Main Results, depending on user selection

For each selected analysis step, the results are listed in the output:

Sample Read Classification and Statistics

If the read statistics was selected as parameter, a table with the number of reads from the input overlapping genomic elements like exons, introns, promoters and intergenic regions is given.
Additionally a table with the distribution of the reads on the different chromosomes of the genome is shown. The content of this table is hidden by default, but can be shown by clicking the "Show details" link in the header.

If the read classification is included in the output, detailed annotation for each read can be downloaded as a tab-separated file. For a description of the format of this file, please see the cluster classification details.


Peak Finding / Cluster Generation

Here is a short overview of what happens at different input/parameter settings:

one sample data set, no control data set:
  • input data is clustered
  • no control / peak evaluation
one sample data set, one control data set:
  • input data is clustered
  • peak evaluation by original algorithm (MACS/MACS2/SICER)
    or Audic-Claverie for NGSAnalyzer
  • only significantly enriched peaks are used for further analysis
one sample data set, replicates for control data:
  • input data is clustered
  • the first control data set is used for peak evaluation by original algorithm (MACS/MACS2/SICER)
    or Audic-Claverie for NGSAnalyzer
  • only significantly enriched peaks are used for further analysis
replicates for sample data, no control data set:
  • each replicate data set is clustered separately
  • peaks from replicates are merged
  • no control / peak evaluation
replicates for sample data, one control data set or replicates for control data:
  • each replicate data set is clustered separately
  • peaks from replicates are merged
  • reads from sample and control data are distributed on the merged peaks
  • peak evaluation by differential analysis (DESeq/edgeR)
  • only significantly enriched peaks are used for further analysis

Depending on the selected peak finding algorithm and the parameters, the output can look different in this section:

For all settings, the total number of peaks found by the program is shown and the resulting peaks can either be downloaded as a BED file or saved directly to the Genomatix project management, to be used with other tasks. Also, a link to the complete algorithm output is given, including details as described for NGS Analyzer or MACS or SICER respectively.

If "Peak Evaluation with Audic-Claverie Algorithm" was selected, the BED file contains only those peaks, which show a significant enrichment of reads, i.e. a subset of all significant peaks. "Significant" in this context means, that the Audic-Claverie p-value is at most as high as the cut-off specified on the input page. Additionally, a tab-separated file containing the p-values and other details for each input peak can be downloaded (details for p-value file). This file contains all significant peaks, no matter if there is an enrichment or decrease of reads.

Note that the BED file format is zero-based and half-open, whereas numbering in the tab-separated p-value file is based at 1 and includes the end position.
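
The difference between the two coordinate conventions amounts to shifting the start position by one. A minimal sketch (the function names are hypothetical):

```python
def bed_to_one_based(start, end):
    """Zero-based, half-open BED interval -> 1-based, inclusive interval."""
    return start + 1, end

def one_based_to_bed(start, end):
    """1-based, inclusive interval -> zero-based, half-open BED interval."""
    return start - 1, end

def length_one_based(start, end):
    """Length of a 1-based, inclusive interval."""
    return end - start + 1
```

A BED region with start 0 and end 100 therefore corresponds to positions 1 to 100 in the tab-separated file, and both describe 100 bases.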

[Figure: peak results output]

If differential analysis by DESeq/edgeR was conducted, the output will contain a table with the numbers and download links for the analyzed regions, the significant regions (both enriched and depleted), the enriched regions and the depleted regions.

[Figure: cluster results output]

The analysis results are available in two formats, TAB-delimited (.tsv) and BED. Each single file can be downloaded using the link in the table, but all files are also available for download in one tar-archive. The tsv-files have the following columns:
    1  Id: region id, an integer
    2  Contig: contig/chromosome accession number
    3  Chromosome: chromosome (with leading 'chr', e.g. 'chr1', 'chrX')
    4  Strand: strand of the region, '+', '-' or '0' if not strand-specific
    5  Start pos: region start position
    6  End pos: region end position
    7  Length: region length (end - start + 1)
    8  #Read total: sum of the read counts of all replicates
    9  p-value: p-value resulting from the hypothesis test of the selected test method
                for difference in read abundance (DESeq or edgeR)
   10  adj. p-value: Benjamini-Hochberg adjusted p-value (from column 9)
   11  log2(fold-change): logarithmic fold-change in read abundance in treatment and control
                          (> 0 means an enrichment in treatment,  < 0 means a decrease in treatment)
   12  Regulation: keyword indicating increase or decrease of reads in the region;
                   'up' if log2(fold-change) > 0, 'down' if log2(fold-change) < 0, 'no' otherwise
    + read count for each replicate
    + NE value for each replicate
   
i.e. if there are X replicates, there will be 12+2*X columns
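
Column 10 holds Benjamini-Hochberg adjusted p-values. The standard procedure can be sketched as follows; this is a generic illustration of the Benjamini-Hochberg adjustment with a hypothetical function name, not the code used by DESeq or edgeR.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value and enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each adjusted value is at least as large as the raw p-value and never exceeds 1.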
The format of the BED files is as follows:
    1  chromosome: chromosome (with leading 'chr', e.g. 'chr1', 'chrX')
    2  start: start position of the region
    3  end: end position of the region
    4  id: region id
    5  adj. p-value: as in tsv-file column 9 or 10
    6  strand: strand of the region, '+', '-' or '0' if not strand-specific

Plots

Note: The graphics are initially shown as smaller icons, which can be enlarged by clicking on them.
  • 93.fold_change_plot.png
    graphics in PNG-format with a scatter plot of fold-change in the read abundance in treatment versus control (y-axis) against read abundance (x-axis). For edgeR, the measure of abundance is the concentration, for DESeq, it is the base mean, i.e. the values shown in the plot can be found in 92.output_replicate_analysis columns 4 and 5. Each data point represents a region (i.e. merged/intersected peaks from the treatment group); those showing a significant difference in read abundance ((adjusted) p-value below significance level) are colored in red. Regions having zero reads in the control group, i.e. regions in which no reads from any replicate of the control condition reside, are not shown in this plot: DESeq assigns these regions a non-numeric base mean resp. fold-change ('NA' or 'Inf'), edgeR computes very large or small fold-changes resp. very small log-concentration values for these regions, which might cause problems with the scale of the graph. If you specified thresholds for the fold-change, they are shown in the plot as blue dashed lines.
    Fold-change plot
  • 94.fold_change_all_plot.png (only if test method is 'edgeR')
    The same plot as 93.fold_change_plot.png, but containing all transcripts, also those with a total count of zero in (at least) one group
  • 95.volcano_plot.png
    graphics in PNG-format with a volcano plot of adjusted p-value (y-axis) against fold-change in the read abundance of treatment versus control (x-axis). For edgeR, the measure of read abundance is the concentration, for DESeq, it is the base mean, i.e. the values shown in the plot can be found in 92.output_replicate_analysis columns 2 or 3, and 5, or in 10.region_summary.tsv columns 9 or 10, and 11. Each data point represents a region found by merging/intersecting peaks from the treatment group. Regions having zero reads in the control group, i.e. regions in which no reads from any replicate of the control condition reside, are not shown in this plot: DESeq assigns these regions a non-numeric fold-change ('NA' or 'Inf'), edgeR computes very large or small fold-change values for these regions, which might cause problems with the scale of the graph. If you specified thresholds for the fold-change, they are shown in the plot as blue dashed lines. To avoid problems with the plotting routine, p-values < 1e-311 are omitted from the volcano plot.
    Volcano plot
  • 96.volcano_all_plot.png (only if test method is 'edgeR')
    The same plot as 95.volcano_plot.png, but containing all transcripts, also those with a total count of zero in (at least) one group. To avoid problems with the plotting routine, p-values < 1e-311 are omitted from the volcano plot.

Peak Classification

The peak classification provides information for
[Figure: peak statistics output]

Peak Classification: Details for each peak

The file that can be downloaded here contains the classification for each peak:

The file contains 7 columns (tab-separated) for each region:

  1 : read id
  2 : contig/chromosome accession number
  3 : chromosome
  4 : strand
  5 : start position of the read
  6 : end position of the read
  7 : genomic elements the read is associated with
      intergenic (intergenic region)
      exon
      intron
      partial (overlapping with exon)
      promoter
                An individual read is assigned to one of the four classes
                intergenic, exon, intron, partial and can be assigned to
                the class promoter in addition.
1428	NC_000001	chr1	0	1348237	1348455	intergenic
1429	NC_000001	chr1	0	2311642	2311953	promoter intron
1430	NC_000001	chr1	0	2450272	2450768	exon
1431	NC_000001	chr1	0	2469265	2469512	intergenic
1432	NC_000001	chr1	0	3556424	3556796	promoter partial
1433	NC_000001	chr1	0	3614623	3614831	exon
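
Reading one line of this file can be sketched as follows. The field names in the returned dictionary are chosen for this example only; the column order follows the description above, with the classes in column 7 separated by blanks.

```python
def parse_classification_line(line):
    """Split one tab-separated classification line into named fields."""
    fields = line.rstrip("\n").split("\t")
    region_id, contig, chrom, strand, start, end, elements = fields
    return {
        "id": int(region_id),
        "contig": contig,
        "chromosome": chrom,
        "strand": strand,
        "start": int(start),
        "end": int(end),
        "elements": elements.split(),   # e.g. ['promoter', 'intron']
    }
```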

Sequence Extraction (only in ChIPSeq-Workflow)

If this step was selected in the parameter section, the DNA sequences for all peaks are extracted and written to a file in FASTA format with Genomatix annotation.

The number of sequences and the total number of basepairs is given, and the first few lines of the file are displayed. The complete file can be downloaded or saved to the project management for further analysis with other tasks.

[Figure: sequence extraction output]

TFBS Overrepresentation (only in ChIPSeq-Workflow)

When this option is checked, all peak sequences are automatically checked for transcription factor binding sites (TFBS) using MatInspector. The TFBS occurrence is compared to expected values based on the background occurrences of the TFBS.

The most overrepresented TFBS compared to the genomic background as well as to the promoter background are listed in the output.
Additionally a link to the complete output table containing occurrence, overrepresentation values and Z-scores for all analysed TFBSs is supplied (see detailed table description).
[Figure: TF overrepresentation overview]

Definition of new TFBS (only in ChIPSeq-Workflow)

If this option is checked, a subset of peak sequences is selected and then the Genomatix program CoreSearch is used to define new common motifs from those sequences.
The program automatically sorts all peak sequences by quality, and the user-defined number of "best sequences" is used for CoreSearch. Here, the definition of "best" peak sequences depends on the selected peak finding algorithm (see the description of this option in the Parameters section). The motif(s) defined by CoreSearch are listed in the output together with their IUPAC representation and the re-value of the corresponding matrix.
A link to the detailed CoreSearch output is given. From this page, the motif can be saved as a matrix for further downstream analysis, e.g. to search other sequences (e.g. all peak sequences) for matches to this motif.
The sequences actually used for CoreSearch can also be downloaded from the result or saved directly to the Genomatix project management, to be used with other Genomatix tasks.
[Figure: CoreSearch overview]

3. Download of Data Files

All result files that can be downloaded separately from the result page together with the statistics files (in text format) can be downloaded as an archive (tar-file).

Data files if one of the test methods 'edgeR', 'DESeq' or 'DESeq2' was selected