If the read counts in clusters for two different conditions (here called "treatment" and "control" for simplicity) are to be compared, two statistical testing methods for evaluating differences in read abundance are available:
Anders S, Huber W (2010)
Differential expression analysis for sequence count data
Genome Biology 2010;11:R106
Robinson MD, Smyth GK (2007)
Moderated statistical tests for assessing differences in tag abundance
Bioinformatics 2007;23(21):2881-2887
Robinson MD, Smyth GK (2008)
Small-sample estimation of negative binomial dispersion, with applications to SAGE data
Biostatistics 2008;9(2):321-332
For defining enriched and depleted clusters between two conditions or samples, the following criteria are used (parameters set by the user):
Benjamini Y, Hochberg Y (1995)
Controlling the false discovery rate: a practical and powerful approach to multiple testing
J Roy Stat Soc B 1995;57:289-300
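To illustrate the Benjamini-Hochberg adjustment cited above, here is a minimal Python sketch. It is not the workflow's own implementation, and the function name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay
    # monotone: each one is the minimum over its own and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, for illustration only.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

A region would then pass a significance threshold if its adjusted p-value is at or below the cut-off chosen by the user.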
| Input | |||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Input file(s) with read positions from ChIPSeq ("Sample"/"Treatment") |
Input data are accepted as a tab delimited file in BED / bigBed file format containing the input regions specified at
least by chromosome number, start position and end position (in this order).
When adding a new file, a new window will open, asking you to either
For the new BED files, you will have to select the correct organism, as the
organism and the genome build are associated with the BED file for future use
(the default is your latest choice in the current session).
Note that BED files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version at the top right of the page before uploading a BED file. You can see the list of genomes available in ElDorado. Note that almost all browsers have a general upload limit of 2 GB, so BED files larger than this should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS. Optionally, you can specify a name for saving uploaded BED files on the server; otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as a prefix for each BED file name. If any region in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and that region will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped. After one or several BED files have been uploaded successfully and the popup window has been closed,
the list of available BED files is updated automatically.
Uploaded BED files can be deleted from the project at any time via the project management. Example input: track description="sample treatment-control analysis with 3 treatments and 3 controls" |
| Control file(s) | |||||||||||||||||||||||||||||||||
| Optional Control file(s) for differential analysis |
If one or several control files (replicates) are available, they can be uploaded here. This field is optional: if no control files are available, leave the checkbox unchecked. The upload options for control files are hidden
by default and appear only when the checkbox "Use second set of input files (control files) for differential analysis" is checked.
|
| Workflow parameters | |||||||||||||||||||||||||||||||||
| Read Classification | When checked, a read classification is done for each BED file from the input data. The number of input reads overlapping genomic elements like exons, introns, promoters and intergenic regions will be given in the result. |
| Peak Finding / Clustering | This step is mandatory, as all subsequent analysis steps rely on the results of the clustering.
For the peak finding / clustering one of three different algorithms can be selected:
Depending on the user selection, further parameters appear for the respective algorithm:
|
| Replicate Treatment |
When using replicate data as input, each input file is first clustered separately.
Then the clusters from the different replicates of the treatment group are
either merged or intersected. To this end, a coverage profile is created first, which helps to identify contiguous clusters across the different treatment cluster files. For each position within a contiguous region, the number of replicates (from the treatment group) that have this position in common is calculated (i.e. how many replicates contain a cluster covering this position). There are two parameters to control how the merging/intersection is done:
r1 ---------- ----- -------------------
r2 ----------- ------ ----- -------
r3 --------------- ----- -----------------------------
11122233332222233211011111100011333331111111133332221111111111111111111111 <- coverage profile
With the minimum number of replicates set to 60% and a minimum region length of 10, we would get the following results ('+' denotes positions where the threshold for the minimum number of replicates is met):
Merging clusters:
--++++++++++++++++-- ------ --+++++--------+++++++----------------------
M1 M2 M3
Intersecting clusters:
++++++++++++++++ +++++ +++++++
I1 I2 I3
|
| Differential Analysis parameters |
The differential analysis parameter section will only appear
if at least one control file was uploaded in the section above.
For a short introduction to the different methods, see the
Introduction to the Differential Analysis above.
The thresholds that define a transcript as differentially expressed (or a region as enriched/depleted) can be set here. Two criteria are combined; both must be satisfied for differential expression/enrichment:
|
| Cluster Classification | When this option is checked, a cluster classification is performed and statistics are generated. The number of clusters (defined in the previous clustering step) overlapping genomic elements like exons, introns, promoters and intergenic regions will be given in the result. |
| Sequence Extraction | When this option is checked, the DNA sequence of all clusters (peak regions) defined in the clustering step is extracted in FASTA format. The resulting sequence file can be downloaded from the result and be used for further downstream analysis. |
| TFBS Overrepresentation | When this option is checked, the cluster sequences are searched for
transcription factor binding sites (TFBS), and
statistics on TFBSs together with overrepresentation values and Z-scores are generated.
The binding sites are based on the matrix families as available
in
MatBase
and are searched by MatInspector.
For more details, please see the RegionMiner Task Overrepresented TFs. |
| Definition of new TFBS | If this option is checked, a subset of cluster sequences is selected and
then the Genomatix program CoreSearch is
used to define new common motifs from the cluster sequences.
The program automatically sorts the cluster sequences by quality, and the number of "best sequences" used for CoreSearch can be set by the user. Here, the definition of "best" cluster sequences depends on the selected clustering algorithm:
|
| Output | |||||||||||||||||||||||||||||||||
| Result | Here, you can edit the default name of the result file. |
| Email address | Here you can choose between two methods for receiving
the results:
The results will be available for a limited time on our server. For details on how long your results will be kept, please see the result email. After that period they will be deleted unless protected in the project management! We recommend using the email option for the ChIPSeq Workflow! Please note that a job started with the "show result in browser" option will finish correctly
even if the display in the window stops, and can then be found in the result management.
|
||||||||||||||||||||||||||||||||
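The coverage profile described under "Replicate Treatment" above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the workflow's implementation. The function names, the interval representation (0-based start, exclusive end), rounding the percentage up to whole replicates, and applying the minimum region length to each run are all assumptions. The sketch only finds the runs of positions that meet the replicate threshold (the '+' positions in the diagram); the exact rule used for merged clusters is not reproduced.

```python
import math


def coverage_profile(replicates, length):
    """Count, for each position, how many replicates have a cluster there.

    replicates: one list of (start, end) clusters per replicate, with a
    0-based start and an exclusive end (an assumption of this sketch).
    """
    profile = [0] * length
    for clusters in replicates:
        covered = set()  # a replicate counts at most once per position
        for start, end in clusters:
            covered.update(range(max(start, 0), min(end, length)))
        for pos in covered:
            profile[pos] += 1
    return profile


def regions_meeting_threshold(profile, n_replicates, min_fraction, min_length):
    """Return (start, end) runs whose coverage reaches the replicate threshold."""
    # Assumption: a fraction such as 0.6 of 3 replicates is rounded up to 2.
    required = max(1, math.ceil(round(min_fraction * n_replicates, 9)))
    regions, run_start = [], None
    for pos, depth in enumerate(profile + [0]):  # trailing 0 closes the last run
        if depth >= required and run_start is None:
            run_start = pos
        elif depth < required and run_start is not None:
            if pos - run_start >= min_length:
                regions.append((run_start, pos))
            run_start = None
    return regions


# Three hypothetical replicates with one cluster each.
profile = coverage_profile([[(0, 10)], [(3, 12)], [(5, 20)]], 20)
print(profile)
print(regions_meeting_threshold(profile, 3, 0.6, 5))
```

With a threshold of 60% of three replicates, only positions covered by at least two replicates are kept, and runs shorter than the minimum region length are dropped.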
If the read statistics option was selected as a parameter, a table with the
number of input reads overlapping genomic elements like exons, introns, promoters
and intergenic regions is given.
Additionally, a table with the distribution of the reads on the different
chromosomes of the genome is shown. The content of this table is hidden by
default, but can be shown by clicking the "Show details" link in the
header.
If the read classification is included in the output, detailed annotation for each read can be downloaded as a tab-separated file. For a description of the format of this file, please see the cluster classification details.
| | no control data set | one sample control set | replicates for control data |
|---|---|---|---|
| one sample data set | | | |
| replicates for sample data | | | |
Depending on the selected peak finding algorithm and the parameters, the output can look different in this section:
For all settings, the total number of clusters found by the program is shown and the resulting clusters can either be downloaded as a BED file or saved directly to the Genomatix project management, to be used with other RegionMiner tasks. Also, a link to the complete algorithm output is given, including details as described for NGS Analyzer, MACS or SICER, respectively.
If "Peak Evaluation with Audic-Claverie Algorithm" was selected, the BED file contains only those clusters that show a significant enrichment of reads, i.e. a subset of all significant clusters. "Significant" in this context means that the Audic-Claverie p-value is at most as high as the cut-off specified on the input page. Additionally, a tab-separated file containing the p-values and other details for each input cluster can be downloaded (details for p-value file). This file contains all significant clusters, regardless of whether reads are enriched or depleted.
Note that the BED file format is zero-based and half-open, whereas numbering in the tab-separated p-value file is based at 1 and includes the end position.
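The difference between the two coordinate conventions can be made concrete with a small Python sketch. The function names are hypothetical.

```python
def bed_to_one_based(start, end):
    """Zero-based, half-open BED interval -> 1-based, end-inclusive interval."""
    return start + 1, end


def one_based_to_bed(start, end):
    """1-based, end-inclusive interval -> zero-based, half-open BED interval."""
    return start - 1, end


def length_one_based(start, end):
    """Region length for 1-based, end-inclusive coordinates (end - start + 1)."""
    return end - start + 1


# The same 101 bp region in both conventions.
print(bed_to_one_based(99, 200))   # 1-based, inclusive coordinates
print(one_based_to_bed(100, 200))  # BED coordinates
```

Only the start position shifts by one. The end position is numerically the same in both conventions, and the length of a BED interval is simply end minus start.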


1 Id: region id, an integer
2 Contig: contig/chromosome accession number
3 Chromosome: chromosome (with leading 'chr', e.g. 'chr1', 'chrX')
4 Strand: strand of the region, '+', '-' or '0' if not strand-specific
5 Start pos: region start position
6 End pos: region end position
7 Length: region length (end - start + 1)
8 #Read total: sum of the read counts of all replicates
9 p-value: p-value resulting from the hypothesis test of the selected test method
for difference in read abundance (DESeq or edgeR)
10 adj. p-value: Benjamini-Hochberg adjusted p-value (from column 9)
11 log2(fold-change): logarithmic fold-change in read abundance in treatment and control
(> 0 means an enrichment in treatment, < 0 means a decrease in treatment)
12 Regulation: keyword indicating increase or decrease of reads in the region;
'up' if log2(fold-change) > 0, 'down' if log2(fold-change) < 0, 'no' otherwise
+ read count for each replicate
+ NE value for each replicate
i.e. if there are X replicates, there will be 12+2*X columns
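A row of this tab-separated file could be read with a Python sketch like the one below. The dictionary keys and function names are hypothetical. The sketch assumes that all per-replicate read counts come first, followed by all NE values, and that the input contains no header line; neither detail is confirmed by the description above.

```python
FIXED_COLUMNS = [
    "id", "contig", "chromosome", "strand", "start", "end", "length",
    "reads_total", "p_value", "adj_p_value", "log2_fold_change", "regulation",
]


def parse_region_row(line):
    """Split one data row into the 12 fixed columns plus per-replicate values."""
    fields = line.rstrip("\n").split("\t")
    extra = fields[len(FIXED_COLUMNS):]
    if len(fields) < len(FIXED_COLUMNS) or len(extra) % 2 != 0:
        raise ValueError("expected 12 + 2*X columns, got %d" % len(fields))
    row = dict(zip(FIXED_COLUMNS, fields))
    for key in ("id", "start", "end", "length"):
        row[key] = int(row[key])
    for key in ("reads_total", "p_value", "adj_p_value", "log2_fold_change"):
        row[key] = float(row[key])
    n = len(extra) // 2  # number of replicates X
    # Assumed column order: X read counts, then X NE values.
    row["read_counts"] = [float(v) for v in extra[:n]]
    row["ne_values"] = [float(v) for v in extra[n:]]
    return row


def regulation(log2_fold_change):
    """Keyword from column 12, derived from the sign of column 11."""
    if log2_fold_change > 0:
        return "up"
    if log2_fold_change < 0:
        return "down"
    return "no"


# Made-up row with two replicates, for illustration only.
example = "\t".join(["7", "NC_000001", "chr1", "0", "100", "200", "101", "42",
                     "0.001", "0.01", "1.5", "up", "20", "22", "0.9", "1.1"])
row = parse_region_row(example)
print(row["length"], row["read_counts"], regulation(row["log2_fold_change"]))
```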
The format of the BED files is as follows:
1 chromosome: chromosome (with leading 'chr', e.g. 'chr1', 'chrX')
2 start: start position of the region
3 end: end position of the region
4 id: region id
5 (adj.) p-value: as in tsv-file column 9 or 10, depending on the parameter setting for
'p-value adjustment'
6 strand: strand of the region, '+', '-' or '0' if not strand-specific




The file contains 7 columns (tab-separated) for each region:
1 : read id
2 : contig/chromosome accession number
3 : chromosome
4 : strand
5 : start position of the read
6 : end position of the read
7 : genomic elements the read is associated with
intergenic (intergenic region)
exon
intron
partial (overlapping with exon)
promoter
An individual read is assigned to one of the four classes
intergenic, exon, intron, partial and can be assigned to
the class promoter in addition.
1428 NC_000001 chr1 0 1348237 1348455 intergenic
1429 NC_000001 chr1 0 2311642 2311953 promoter intron
1430 NC_000001 chr1 0 2450272 2450768 exon
1431 NC_000001 chr1 0 2469265 2469512 intergenic
1432 NC_000001 chr1 0 3556424 3556796 promoter partial
1433 NC_000001 chr1 0 3614623 3614831 exon
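As a sketch of how such a classification file could be summarized, the following Python code parses the example rows shown above and counts how often each genomic element occurs. The function names are hypothetical. Because the separator between multiple elements in column 7 is not specified, the sketch splits on any whitespace and treats everything after the sixth field as element names.

```python
from collections import Counter


def parse_classification_line(line):
    """Parse one row: six positional fields followed by one or more elements."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected at least 7 fields, got %d" % len(fields))
    return {
        "id": fields[0],
        "contig": fields[1],
        "chromosome": fields[2],
        "strand": fields[3],
        "start": int(fields[4]),
        "end": int(fields[5]),
        "elements": fields[6:],  # e.g. ["promoter", "intron"]
    }


def tally_elements(lines):
    """Count how many rows are associated with each genomic element."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip empty lines
            counts.update(parse_classification_line(line)["elements"])
    return counts


rows = [
    "1428 NC_000001 chr1 0 1348237 1348455 intergenic",
    "1429 NC_000001 chr1 0 2311642 2311953 promoter intron",
    "1430 NC_000001 chr1 0 2450272 2450768 exon",
    "1431 NC_000001 chr1 0 2469265 2469512 intergenic",
    "1432 NC_000001 chr1 0 3556424 3556796 promoter partial",
    "1433 NC_000001 chr1 0 3614623 3614831 exon",
]
print(tally_elements(rows))
```

Since the promoter class is assigned in addition to one of the four other classes, the counts can add up to more than the number of rows.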



All result files that can be downloaded separately from the result page, together with the statistics files (in text format), can also be downloaded as a single archive (tar file).
1. id: region id
2. p-value: p-value resulting from the hypothesis test of the selected test method for difference in read abundance (DESeq or edgeR)
3. adj. p-value: Benjamini-Hochberg adjusted p-value (from column 2)
4. log2FoldChange: logarithmic (base 2) fold-change in read abundance/expression level in treatment
over control (> 0 is enrichment in treatment, < 0 is depletion in treatment); for DESeq, the
fold-change corresponds to the base mean, for edgeR to the concentration
for DESeq the remaining columns are
5. baseMean: mean read abundance across all replicates, treatment and control
   6. baseMean control: mean read abundance within control group
7. baseMean treatment: mean read abundance within treatment group
8. variance ratio control: ratio of the estimate of the base variance of the counts for the control
group and the value predicted with the base variance function; according to the package authors,
a large value may indicate a false hit;
see the R-vignette for details (within the vignette, this value is referred to as 'resVarA')
9. variance ratio treatment: as column 8, just for the treatment group; within the DESeq vignette,
this value is referred to as 'resVarB'
for edgeR the remaining column is
5. log concentration: expression level, logarithmic (base 2)
| © 1998-2011 Genomatix Software GmbH - All rights reserved |