
Genomatix: Peak Evaluation with a Control File


[Introduction] [Audic-Claverie Algorithm] [MACS original algorithm] [SICER original algorithm]

Introduction

For tasks that perform a clustering and peak-finding step (e.g. the ChIP-Seq Workflow), data from an additional 'input-control' experiment can be provided. Depending on the selected evaluation strategy, this data is used to detect unspecific enrichment and thereby to filter the clusters against the input control.

The different strategies are described in detail below.


Audic-Claverie Algorithm

The program first assigns the reads from the control file to the clusters in the cluster file. A read is assigned to a cluster if it lies completely between the start and the end of the cluster. For this assignment it is assumed that the clusters are disjoint. After this step the number of control reads (rc) within each cluster is known (this number can be found in the result file). Based on this number (rc), the number of original reads in the cluster (re), and the total numbers of reads in the experiment and in the control, the conditional probabilities of the events

given that rc reads were observed, are computed. These p-values are computed with the formula described by Audic and Claverie. The lower of the two p-values is assigned to each cluster, along with a flag indicating an enrichment (= 1) or a decrease (= 0) of reads in the cluster compared to the control. The procedure for multiple testing correction introduced by Benjamini and Hochberg in

Benjamini Y, Hochberg Y (1995)
Controlling the false discovery rate: a practical and powerful approach to multiple testing
J Roy Stat Soc B 1995;57:289-300

is applied to the p-values. In the result file, the (adjusted) p-value and the enrichment flag are the last two values in each line.

Example:

5       NC_000001       chr1    0       865557  866465  909     42      0.07502 90      0.319   0
Here the Audic-Claverie p-value means: a decrease of reads in the experiment compared to the control of this size is expected with a probability of 31.9% (i.e. it is not significant), while in
49      NC_000001       chr1    0       2333684 2334183 500     17      0.05520 11      0.00334 1
the enrichment of experiment compared to control is clearly significant for a p-value cutoff of 0.05 (0.00334 < 0.05).
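
The computation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Genomatix implementation: all function names are hypothetical, and the two events are assumed to be the lower tail (at most re reads in the experiment) and the upper tail (at least re reads in the experiment) of the Audic-Claverie distribution conditioned on rc. The Benjamini-Hochberg step follows the standard step-up procedure.

```python
import math

def ac_log_pmf(y, x, n_y, n_x):
    """Log of the Audic-Claverie probability p(y | x).

    x, y     : read counts in sample 1 (here: control) and sample 2 (experiment)
    n_x, n_y : total numbers of reads in the two samples
    """
    r = n_y / n_x
    return (y * math.log(r)
            + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
            - (x + y + 1) * math.log1p(r))

def ac_pvalue(re, rc, n_exp, n_ctrl):
    """Return (p-value, flag); flag is 1 for enrichment, 0 for decrease.

    Assumption: the p-value is the smaller of P(Y <= re | rc) and
    P(Y >= re | rc). The upper tail is obtained as a complement, which
    limits its resolution to about machine precision.
    """
    at_re = math.exp(ac_log_pmf(re, rc, n_exp, n_ctrl))
    lower = sum(math.exp(ac_log_pmf(y, rc, n_exp, n_ctrl)) for y in range(re + 1))
    upper = 1.0 - (lower - at_re)
    lower = min(1.0, lower)
    upper = min(1.0, max(0.0, upper))
    if upper <= lower:
        return upper, 1
    return lower, 0

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, a call such as `ac_pvalue(re, rc, n_exp, n_ctrl)` would be made once per cluster, and the resulting list of p-values passed to `benjamini_hochberg` before applying a cutoff such as 0.05.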

Notes:

Reference:

Audic S, Claverie JM (1997)
The significance of digital gene expression profiles
Genome Res. 1997;7(10):986-995

Output of Audic-Claverie Algorithm

The result file contains all clusters from the input cluster file, together with the p-value (Audic-Claverie distribution) for each cluster. Specifically, the following information is given for each cluster (TAB-separated, left to right):

Id      Contig    Chromosome Strand Start   End     Length  ...
14324   NC_000067 chr1       0      4761228 4762444 1217    ...
14325   NC_000067 chr1       0      4766400 4766883 484     ...
14328   NC_000067 chr1       0      4774026 4774186 161     ...
14329   NC_000067 chr1       0      4775649 4775773 125     ...

...  #Reads Cluster NE value #Reads Control p-value  Enriched
...  255            0.32136  501            2.82e-06 0
...  166            0.52602  350            3.65e-06 0
...  63             0.60015  115            0.0422   0
...  28             0.34355  85             9.11e-05 0
Note that the corresponding BED file is zero-based and half-open, whereas coordinates in the tab-separated p-value file are one-based and include the end position.
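
The difference between the two coordinate conventions can be illustrated with a short Python sketch. The function names are hypothetical and are not part of any Genomatix tool. The example values are taken from the first example line in the Audic-Claverie section (start 865557, end 866465, length 909).

```python
def cluster_to_bed(chrom, start, end):
    """Convert one-based, end-inclusive coordinates to BED (zero-based, half-open)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return chrom, start - 1, end

def bed_to_cluster(chrom, bed_start, bed_end):
    """Convert BED (zero-based, half-open) back to one-based, end-inclusive."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("expected 0 <= bed_start < bed_end")
    return chrom, bed_start + 1, bed_end

chrom, bed_start, bed_end = cluster_to_bed("chr1", 865557, 866465)
# The length is end - start + 1 in the p-value file and end - start in BED.
length_pvalue_file = 866465 - 865557 + 1
length_bed = bed_end - bed_start
```

Only the start coordinate changes between the two formats; the end coordinate and the cluster length stay the same.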


MACS original algorithm

For a detailed description of the MACS algorithm and the control evaluation, see the MACS paper. Here is an excerpt:

MACS models the tag distribution along the genome by a Poisson distribution. The advantage of this model is that one parameter, lambdaBG, can capture both the mean and the variance of the distribution.

Instead of using a uniform lambdaBG estimated from the whole genome, MACS uses a dynamic parameter, lambdalocal, defined for each candidate peak as:

lambdalocal = max(lambdaBG, [lambda1k,] lambda5k, lambda10k)

where lambda1k, lambda5k and lambda10k are lambda estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or the ChIP-Seq sample when a control sample is not available (in which case lambda1k is not used).
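
The selection of lambdalocal can be written as a small Python function. This is a sketch of the formula above only, not MACS code: the function name and arguments are hypothetical, and the individual lambda estimates are assumed to have been computed elsewhere and scaled to the same unit.

```python
def lambda_local(lambda_bg, lambda_5k, lambda_10k, lambda_1k=None):
    """Return max(lambdaBG, [lambda1k,] lambda5k, lambda10k).

    lambda_1k is optional: it is left out when no control sample is
    available and the estimates come from the ChIP-Seq sample itself.
    """
    candidates = [lambda_bg, lambda_5k, lambda_10k]
    if lambda_1k is not None:
        candidates.append(lambda_1k)
    return max(candidates)
```

Taking the maximum means that a locally elevated background in any window raises the expected tag count, which makes the test for that candidate peak more conservative.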

Starting with version 1.3.7, MACS can also consider the lambda5k and lambda10k from the input data set, which is thought to be more suitable for sharp peaks.

lambdalocal captures the influence of local biases, and is robust against occasional low tag counts at small local regions.

In the control samples, we often observe tag distributions with local fluctuations and biases. MACS uses lambdalocal to calculate the p-value of each candidate peak and removes potential false positives due to local biases (that is, peaks significantly under lambdaBG, but not under lambdalocal). Candidate peaks with p-values below a user-defined threshold p-value (default 10^-5) are called, and the ratio between the ChIP-Seq tag count and lambdalocal is reported as the fold_enrichment.
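
The peak call described in the excerpt can be sketched as follows. This is an illustration, not the MACS implementation: the function names are hypothetical, and the p-value is assumed to be the Poisson upper tail P(X >= tag count) with mean lambdalocal; the exact tail definition used by MACS is not stated in the excerpt.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with k a non-negative integer."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0

    def pmf(i):
        return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))

    if k <= lam:
        # The tail is large here, so the complement is accurate enough.
        return max(0.0, 1.0 - sum(pmf(i) for i in range(k)))

    # For k > lam the terms decrease; sum them until they stop contributing.
    total, i, term = 0.0, k, pmf(k)
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return total

def evaluate_peak(tag_count, lam_local, p_threshold=1e-5):
    """Return (called, p_value, fold_enrichment) for one candidate peak."""
    p_value = poisson_upper_tail(tag_count, lam_local)
    fold_enrichment = tag_count / lam_local
    return p_value < p_threshold, p_value, fold_enrichment
```

Summing the upper tail directly for k > lam avoids the loss of precision that a complement would cause for very small p-values, which matters at thresholds such as 10^-5.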


SICER original algorithm

For a detailed description of the SICER algorithm and the control evaluation, please refer to the SICER publication:

A clustering approach for identification of enriched domains from histone modification ChIP-Seq data
Chongzhi Zang, Dustin E. Schones, Chen Zeng, Kairong Cui, Keji Zhao, and Weiqun Peng
Bioinformatics 25, 1952 - 1958 (2009)