For those RegionMiner tasks that perform clustering and peak finding (ChIP-Seq Workflow, Clustering NGS Data), data from an additional 'input-control' experiment can be provided. Depending on the selected evaluation strategy, this data is used to detect unspecific enrichments and thus to filter the clusters against the input control.
The different strategies are described in detail below.
The program first assigns the reads from the control file to the clusters in the cluster file. A read is assigned to a cluster if it lies completely between the start and end of the cluster. This assignment assumes that the clusters are disjoint. After this step, the number of control reads (rc) within each cluster is known (this number is reported in the result file). Based on this number (rc), the number of original reads in the cluster (re), and the total numbers of reads in the experiment and the control, respectively, the conditional probabilities of the events
Benjamini Y, Hochberg Y (1995)
Controlling the false discovery rate: a practical and powerful approach to multiple testing
J Roy Stat Soc B 1995;57:289-300
5 NC_000001 chr1 0 865557 866465 909 42 0.07502 90 0.319 0

Here the Claverie p-value means that the decrease of reads in the experiment compared to the control would be expected with a probability of 31.9% (i.e. it is not significant), while in

49 NC_000001 chr1 0 2333684 2334183 500 17 0.05520 11 0.00334 1

the enrichment of the experiment compared to the control is clearly significant for a p-value cutoff of 0.05 (0.00334 < 0.05).
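As a rough illustration of the test behind these p-values, the following sketch evaluates the Audic-Claverie conditional probability of observing y reads in one sample given x reads in the other, for total read counts n1 and n2. The function names, the one-sided tail convention and the totals in the usage lines are assumptions made for illustration only; the exact computation performed by the program is not specified on this page.

```python
import math

def audic_claverie_prob(y, x, n1, n2):
    """P(y reads in sample 2 | x reads in sample 1) under Audic-Claverie.

    n1 and n2 are the total read counts of sample 1 and sample 2.
    Evaluated in log space so that large counts do not overflow.
    """
    r = n2 / n1
    log_p = (y * math.log(r)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log1p(r))
    return math.exp(log_p)

def upper_tail(y, x, n1, n2):
    """P(Y >= y | x): probability of at least y reads (assumed tail convention)."""
    lower = sum(audic_claverie_prob(k, x, n1, n2) for k in range(y))
    return max(0.0, 1.0 - lower)

# Usage with the counts of cluster 49 (11 control reads, 17 experiment reads).
# The totals below are hypothetical: the example rows do not state them.
p = upper_tail(17, 11, n1=1_000_000, n2=1_000_000)
print(f"{p:.3g}")
```

Because the real totals are not given, the printed value is not expected to reproduce the p-value shown in the example row.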
Notes:
Reference:
The result file contains all clusters from the input cluster file, together with the p-value (Audic-Claverie distribution) for each cluster. In particular, the following information is annotated for each cluster (TAB-separated, left to right):
| Id | Contig | Chromosome | Strand | Start | End | Length | #Reads Cluster | NE value | #Reads Control | p-value | Enriched |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 14324 | NC_000067 | chr1 | 0 | 4761228 | 4762444 | 1217 | 255 | 0.32136 | 501 | 2.82e-06 | 0 |
| 14325 | NC_000067 | chr1 | 0 | 4766400 | 4766883 | 484 | 166 | 0.52602 | 350 | 3.65e-06 | 0 |
| 14328 | NC_000067 | chr1 | 0 | 4774026 | 4774186 | 161 | 63 | 0.60015 | 115 | 0.0422 | 0 |
| 14329 | NC_000067 | chr1 | 0 | 4775649 | 4775773 | 125 | 28 | 0.34355 | 85 | 9.11e-05 | 0 |
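The column layout above can be read programmatically. The following is a minimal sketch, assuming one TAB-separated cluster per line with exactly the twelve columns listed and no header line; the `ClusterResult` type and `parse_result_line` function are hypothetical names introduced for illustration, not part of any Genomatix tool.

```python
from typing import NamedTuple

class ClusterResult(NamedTuple):
    cluster_id: int
    contig: str
    chromosome: str
    strand: str
    start: int
    end: int
    length: int
    reads_cluster: int
    ne_value: float
    reads_control: int
    p_value: float
    enriched: bool

def parse_result_line(line):
    """Split one TAB-separated result line into typed fields."""
    f = line.rstrip("\n").split("\t")
    return ClusterResult(int(f[0]), f[1], f[2], f[3], int(f[4]), int(f[5]),
                         int(f[6]), int(f[7]), float(f[8]), int(f[9]),
                         float(f[10]), f[11] == "1")

# The four example rows from the result file shown above.
lines = [
    "14324\tNC_000067\tchr1\t0\t4761228\t4762444\t1217\t255\t0.32136\t501\t2.82e-06\t0",
    "14325\tNC_000067\tchr1\t0\t4766400\t4766883\t484\t166\t0.52602\t350\t3.65e-06\t0",
    "14328\tNC_000067\tchr1\t0\t4774026\t4774186\t161\t63\t0.60015\t115\t0.0422\t0",
    "14329\tNC_000067\tchr1\t0\t4775649\t4775773\t125\t28\t0.34355\t85\t9.11e-05\t0",
]
results = [parse_result_line(l) for l in lines]

# Clusters flagged as enriched, and clusters below a p-value cutoff of 0.05.
enriched = [r for r in results if r.enriched]
significant = [r for r in results if r.p_value < 0.05]
```

In these example rows every cluster has a p-value below 0.05 but none is flagged as enriched, since each has fewer reads in the experiment than in the control.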
For a detailed description of the MACS algorithm and the control evaluation see the MACS paper, here is an excerpt:
MACS models the tag distribution along the genome by a Poisson distribution. The advantage of this model is that one parameter, lambdaBG, can capture both the mean and the variance of the distribution.
Instead of using a uniform lambdaBG estimated from the whole genome, MACS uses a dynamic parameter, lambdalocal, defined for each candidate peak as:
lambdalocal = max(lambdaBG, [lambda1k,] lambda5k, lambda10k)
where lambda1k, lambda5k and lambda10k are lambda estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or the ChIP-Seq sample when a control sample is not available (in which case lambda1k is not used).
Starting with version 1.3.7, MACS can also consider the lambda5k and lambda10k from the input data set, which is thought to be more suitable for sharp peaks.
lambdalocal captures the influence of local biases, and is robust against occasional low tag counts at small local regions.
In the control samples, we often observe tag distributions with local fluctuations and biases. MACS uses lambdalocal to calculate the p-value of each candidate peak and removes potential false positives due to local biases (that is, peaks significantly under lambdaBG, but not under lambdalocal). Candidate peaks with p-values below a user-defined threshold p-value (default 10^-5) are called, and the ratio between the ChIP-Seq tag count and lambdalocal is reported as the fold_enrichment.
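The excerpt above can be made concrete with a small sketch. It assumes the lambda values have already been estimated and scaled to the expected tag count for the candidate peak, and it uses a plain Poisson upper tail as the p-value; the function names and the numbers in the usage lines are hypothetical and this is not the MACS implementation itself.

```python
import math

def lambda_local(lambda_bg, lambda_5k, lambda_10k, lambda_1k=None):
    """Largest of the available background estimates (lambda_1k is optional)."""
    candidates = [lambda_bg, lambda_5k, lambda_10k]
    if lambda_1k is not None:
        candidates.append(lambda_1k)
    return max(candidates)

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed in log space."""
    if k <= 0:
        return 1.0
    top = max(k, lam)
    # Terms beyond this bound are negligible for the purposes of this sketch.
    hi = int(top + 10 * math.sqrt(top) + 50)
    logs = [i * math.log(lam) - lam - math.lgamma(i + 1)
            for i in range(k, hi + 1)]
    m = max(logs)
    return min(1.0, math.exp(m) * sum(math.exp(v - m) for v in logs))

# Hypothetical candidate peak: 30 ChIP tags against these background estimates.
tags = 30
lam = lambda_local(lambda_bg=2.0, lambda_5k=1.5, lambda_10k=3.0, lambda_1k=4.5)
p_value = poisson_upper_tail(tags, lam)
called = p_value < 1e-5          # default threshold quoted in the excerpt
fold_enrichment = tags / lam     # ratio of tag count to lambda_local
```

Taking the maximum is what makes the test conservative: a peak has to exceed the highest of the local and genome-wide background rates before it is called.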
For a detailed description of the SICER algorithm and the control evaluation, please refer to the SICER publication:
A clustering approach for identification of enriched domains from histone modification ChIP-Seq data

© 1998-2013 Genomatix Software GmbH - All rights reserved