This RegionMiner task allows a complete analysis for Copy Number Variation (CNV) in DNA-Seq data. Such an analysis is composed of two steps. First, the mapped reads are assigned either to a fixed genomic segment (bin) or to a genomic region (locus) depending on the applied algorithm. In a second step, the selected CNV detection algorithm takes these read counts to infer the regions with a copy number different to the expected one for the organism. An integer value representing the change in copy number is assigned to each detected region.
The CNV analysis requires two input datasets. The first input is the sample data set (e.g. condition 1) and the second is the control data set (e.g. condition 2). The input files need to be in BAM format and are taken from your currently selected Genomatix project folder. Preferably, both data sets should have similar coverage and should be derived with the same sequencing and analysis protocol to ensure good output quality.
Note: This CNV analysis cannot be performed on a single data set. A control data set is always required.
Note: Although some of the included algorithms may accept replicates as input data, this task does not currently support analyses based on replicates.
When the read counts for two different conditions (here called treatment and control for simplicity) are to be compared, the following 3 algorithms/methods for analyzing the variation of copy numbers are available:
"cn.mops (Copy Number estimation by a Mixture Of PoissonS) is a data processing pipeline for copy number variations and aberrations (CNVs and CNAs) from next generation sequencing (NGS) data. The package supplies functions to convert BAM files into read count matrices or genomic ranges objects, which are the input objects for cn.mops. cn.mops models the depths of coverage across samples at each genomic position. Therefore, it does not suffer from read count biases along chromosomes. Using a Bayesian approach, cn.mops decomposes read variations across samples into integer copy numbers and noise by its mixture components and Poisson distributions, respectively. cn.mops guarantees a low FDR because wrong detections are indicated by high noise and filtered out."
The RegionMiner CNV analysis incorporates the R package for cn.MOPS. The read counts are calculated outside of cn.MOPS and are passed to the algorithm as read count matrices. The size of the fixed bins is an input parameter for the read count calculation and has to be supplied by the user. Each aligned read is assigned to a bin based on its starting position. This is done for each input sample individually, so that two read count values are obtained for each bin, one for the sample and one for the control data set. This leads to a matrix with one row per bin and two columns, one for each of the two input files. Since cn.MOPS requires a minimum of three sample columns, and in order to give the control data set a higher weight, the control column is duplicated to provide a third column in the input matrix. In a second step, the RegionMiner task runs cn.MOPS for each chromosome independently, so that the statistics are computed per chromosome. The output contains the log fold change of the individual bins (local assessments) and the final CNV call after segmentation.
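The binning and matrix construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RegionMiner implementation: the function names are hypothetical, read start positions are taken as 0-based integers already extracted from the BAM files, and reads starting beyond the given chromosome length are ignored.

```python
from collections import Counter

def bin_counts(read_starts, bin_size, n_bins):
    """Count reads per fixed-size bin, assigning each read by its start position."""
    counts = Counter(start // bin_size for start in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]

def build_count_matrix(sample_starts, control_starts, bin_size, chrom_length):
    """Return one row per bin: [sample, control, control].

    The control column is duplicated so that the matrix has the three
    columns mentioned in the text above.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    sample = bin_counts(sample_starts, bin_size, n_bins)
    control = bin_counts(control_starts, bin_size, n_bins)
    return [[s, c, c] for s, c in zip(sample, control)]

# Toy example: a 30 bp "chromosome" split into three 10 bp bins.
matrix = build_count_matrix([5, 12, 27, 29], [3, 14, 15, 22],
                            bin_size=10, chrom_length=30)
print(matrix)
```

In a real run one such matrix would be built per chromosome, since the task runs cn.MOPS for each chromosome independently.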
Günter Klambauer, Karin Schwarzbauer, Andreas Mayr, Djork-Arné Clevert, Andreas Mitterecker, Ulrich Bodenhofer, Sepp Hochreiter. "cn.MOPS: mixture of Poissons for discovering copy number variations in next generation sequencing data with a low false discovery rate." Nucleic Acids Research 2012 40(2)
FREEC (control-FREE Copy number caller) automatically normalizes and segments copy number profiles (CNPs) and calls CNAs. If ploidy is known, FREEC assigns absolute copy number to each predicted CNA.
The RegionMiner CNV analysis directly takes the binary executable of FREEC without any further preprocessing steps. Then, the algorithm of FREEC performs similar steps as described above for cn.MOPS. First, the read counts are calculated from the input BAM files. Normalization by GC content may be applied to the read counts if the input files were based on the latest genomic build. Ultimately, FREEC calls the CNVs using the normalized read counts for both the sample and the control input. Please note, in this RegionMiner task a control input is always required although FREEC itself does not depend on any control data.
Boeva V, Zinovyev A, Bleakley K, Vert JP, Janoueix-Lerosey I, Delattre O, Barillot E. "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization" Bioinformatics 2011 27(2):268-9
Based on a genomic mapping, the normalized expression (NE) values per locus are calculated:
The number of reads that align to a locus is calculated and normalized for the number of basepairs of this locus.
This is the NE value for this locus.
The CNV value for each locus is calculated by dividing the numerical NE value in the sample input by the NE value in the control input.
If no change is observed between the two conditions, the CNV value of this locus is 1 or near 1.
Consequently, a higher CNV value stands for an increase in copy number of this gene in the genome of condition 1 as compared to condition 2,
while a lower CNV value denotes a loss of copies of this gene in condition 1 as compared to condition 2.
For the sake of clarity, the plot shows all CNV values > 5 as 5.0.
A special case is an NE value of 0 in condition 1, implying that this gene is not present (not mappable) in the genome in condition 1.
This is plotted as a CNV value of 0.
The second special case is an NE value of 0 in condition 2.
This is plotted as a CNV value of 5.0.
If neither condition registers an NE value of >0 for a gene, the gene is not plotted.
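The locus-based calculation and its special cases can be summarized in a short Python sketch. The function names and the `cap` parameter are hypothetical and chosen only for illustration. The rules themselves (length-normalized read counts, the sample/control ratio, the display cap of 5.0 and the two zero cases) follow the description above.

```python
def normalized_expression(read_count, locus_length_bp):
    """NE value: reads aligned to a locus, normalized by the locus length in bp."""
    return read_count / locus_length_bp

def plotted_cnv_value(ne_sample, ne_control, cap=5.0):
    """CNV value as shown in the plot, or None if the locus is not plotted."""
    if ne_sample == 0 and ne_control == 0:
        return None          # no signal in either condition: not plotted
    if ne_sample == 0:
        return 0.0           # locus absent (not mappable) in condition 1
    if ne_control == 0:
        return cap           # division by zero avoided: shown as 5.0
    return min(ne_sample / ne_control, cap)  # values > 5 are shown as 5.0

# A 1000 bp locus with 200 reads in the sample and 100 reads in the control.
ratio = plotted_cnv_value(normalized_expression(200, 1000),
                          normalized_expression(100, 1000))
print(ratio)
```

A ratio near 1 means no change between the two conditions, a higher ratio indicates a gain in condition 1, and a lower ratio indicates a loss.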

There are several parameters that need to be specified to allow fine-tuning of the specific algorithm. Not all parameters are required or supported by all algorithms. If no comment is shown in the following table after the parameter name, it applies to all tools. Otherwise, the tools that require this parameter are explicitly listed.
| Input data | |
|---|---|
| Project folder | Data sets and analyses are organized in PROJECTS for each user. Your default project folder is used for this analysis and can be changed in the result management. You can import or upload files to your current project. |
| Sample | First, select a data set that serves as SAMPLE INPUT for the analysis. The available BAM files from your currently selected project folder are listed. Other files need to be imported first in order to become available in your project. |
| Control | Depending on your input sample, please select a data set that serves as CONTROL INPUT for the analysis. The available BAM files from your currently selected project folder are listed. Here the files are already filtered for matching organism and genome build. Other files need to be imported first in order to become available in your project. |
| Workflow parameters | |
| Read type | This parameter describes the sequence LIBRARY TYPE, which can be set to single-end or paired-end reads. If the read type is unknown, it can be automatically detected based on the mapping in the input file. |
| Bin size | Applies only to cn.MOPS and FREEC. The BIN SIZE or WINDOW LENGTH defines the length of the initial segmentation of the genome in basepairs. The value should be chosen such that on average about 100 reads are present in each segment. |
| Algorithm parameters | |
| Minimum mapping quality | Applies only to cn.MOPS and Genomatix Locus-based. Only mappings that reach a minimum PHRED-SCALED SCORE are considered for the algorithm input. |
| Prior impact | Applies only to cn.MOPS. The parameter PRIOR IMPACT should be optimized for each data set, since it is influenced by the number of samples as well as the noise level. This value reflects how strongly the prior assumption affects the result. The higher the value, the more samples will have copy number 2, and consequently fewer CNVs will be detected. |
| Normalization type | Applies only to cn.MOPS. This option specifies the mode of the NORMALIZATION TECHNIQUE applied to the analysis. |
| Minimum number of bins | Applies only to cn.MOPS and FREEC. This value specifies the minimal number of CONSECUTIVE BINS required to call a CNV. |
| Sex of samples | Applies only to FREEC. If the samples are specified as FEMALE, chrY is excluded from the analysis. If they are specified as MALE, a single copy of chrX and chrY is not annotated as a loss. |
| Breakpoint threshold | Applies only to FREEC. The threshold for the BREAKPOINT controls how many segments are created. Use a value like 0.6 to get more segments (and thus more predicted CNVs). |
| GC normalization | Applies only to FREEC. If enabled, it corrects the read count for GC-CONTENT BIAS and low mappability. This feature requires sequence files for the corresponding build and is currently only supported for the latest genome build version. It will be automatically disabled if the required files are not available on the system. |
| Minimal mappability | Applies only to FREEC. Only windows with a fraction of mappable positions higher than or equal to the MAPPABILITY threshold will be considered. |
| Minimum GC content | Applies only to FREEC. The minimal expected value of the GC-CONTENT distribution. For human input data a value of 0.35 is recommended. |
| Maximal GC content | Applies only to FREEC. The maximal expected value of the GC-CONTENT distribution. For human input data a value of 0.55 is recommended. |
| Analysis output | |
| Result name | Assigning a NAME helps you later to identify the analysis in the project and result management. |
| Notification | The system sends a NOTIFICATION message when the analysis has finished. Thus, you don't have to wait and can continue working on other tasks. The message will be sent to the email address provided by you. |
| Email address | The address that will be used for the notification. You can set your default EMAIL ADDRESS in the user account settings. |
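
The bin size guidance in the table (on average about 100 reads per segment) translates into a simple estimate. The helper below is a hypothetical sketch and not part of RegionMiner. It assumes that reads are spread roughly evenly over the genome and rounds the result up to a multiple of `round_to`, which is an arbitrary convenience choice.

```python
import math

def suggest_bin_size(genome_length_bp, mapped_reads, reads_per_bin=100, round_to=1000):
    """Estimate a bin size that yields about `reads_per_bin` reads per bin on average."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    raw = reads_per_bin * genome_length_bp / mapped_reads
    # Round up so that bins do not end up with fewer reads than intended.
    return max(round_to, math.ceil(raw / round_to) * round_to)

# Example: a 3 Gbp genome with 20 million mapped reads.
print(suggest_bin_size(3_000_000_000, 20_000_000))
```

Since sample and control should have similar coverage, the estimate would typically be made for the data set with fewer mapped reads.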
The output of the RegionMiner CNV analysis contains the following five sections, independent of which algorithm was selected:

The central output is a comprehensive data table that supports convenient browsing and filtering of the result data. Each row in the table represents a single CNV call together with its copy number change. The CNV region may be composed of multiple bins which are combined into one segment. This table lists only changes to the expected copy number, i.e. only values not equal to zero, since a value of zero would be a region with the expected copy number for this organism. For the sake of clarity, copy number changes higher than +5 are truncated and shown as +5. Clicking a row in the table opens the annotation tab below the data table. The annotation tab provides more information on the specific region, in particular a list of annotated genes for that region. The rightmost columns provide entry points to other Genomatix tools like the Genome Browser visualization and the GeneRanker classification.
By default the CNV calls are sorted by genomic position. The following columns are available:

The annotation details appear when you click one of the rows in the data table above. The annotation tab contains a list of all genes that were annotated within the selected region. Hovering over the gene symbol shows the gene preview with a brief description and its corresponding functions. Furthermore, there are links to jump into the Genome Browser for that gene.

This chart gives a detailed view of the binning for the selected region. It contains the normalized read counts for the input data sets: the green line shows the counts for the sample data set and the red line the counts for the control data set. The dotted line marks the local signal value for the individual bins. This value comes directly from the algorithm. The higher this value, the more significant the variation. A value of zero corresponds to an unchanged region. The algorithm joins regions with different signal levels into a combined CNV region whose length is a multiple of the selected bin size.
Another possibility to interact with the result data is to start the Genomatix Pathway System (GePS) for a further analysis of pathways and interactions. The annotated genes for the selected CNV region(s) are then passed to GePS automatically. The button to start a GePS analysis can be found in the export section below. By default it is grayed out and disabled. Selecting rows in the data table by clicking the checkboxes will enable this button.
Result data can be exported into one of three formats: (1) BED file format, (2) TSV file format or (3) VCF file format. While any of the three formats can be downloaded, only the BED file format can be saved directly to your current project to be used as input for other Genomatix analyses.
The BED file format contains 5 columns (tab-separated) for each region:

- 1 : contig/chromosome
- 2 : start position of the region
- 3 : end position of the region
- 4 : cnv identifier/nr
- 5 : copy number variation change

An example output may look like this:

    chr1 106700000 110700000 CNV_1 2
    chr1 112350000 112600000 CNV_2 5
    chr1 114100000 118450000 CNV_3 2
    chr1 120250000 120600000 CNV_4 2
    chr1 144800000 145750000 CNV_5 1
    chr1 146450000 147400000 CNV_6 1
    chr1 149000000 149250000 CNV_7 -1
    chr1 149800000 152450000 CNV_8 1
    chr1 167700000 167900000 CNV_9 -1
    chr1 180850000 181550000 CNV_10 1
    ...

The TSV file format has 7 columns (tab-separated) for each region. Compared to the BED format, the additional columns contain the log2 fold change and the number and identifiers of the annotated genes of this region.

- 1 : contig/chromosome
- 2 : start position of the region
- 3 : end position of the region
- 4 : copy number variation change
- 5 : log2 fold change
- 6 : number of annotated genes
- 7 : comma-separated list of gene ids

An example output may look like this:

    chr1 106700001 110700000 2 0.7697 50 1435,2944,6272,2947,2946,10451,6301,6814,2780,1952,2948,2949,10768,...
    chr1 112350001 112600000 5 1.8074 2 3752,643355
    chr1 114100001 118450000 2 1.0000 44 4803,914,965,26191,7252,51592,64858,4893,270,79679,845,6847,81839,...
    chr1 120250001 120600000 2 1.0000 6 4853,83998,3158,11085,100506528,343505
    chr1 144800001 145750000 1 0.5850 25 10628,148738,5174,10401,9939,11126,9554,9659,388677,8799,27246,...
    chr1 146450001 147400000 1 0.5850 15 2702,2703,2330,607,9557,5565,51205,149013,100509137,100509111,...
    chr1 149000001 149250000 -1 -1.0000 0
    chr1 149800001 152450000 1 0.5850 91 8349,4170,1513,405,1520,9900,9129,7062,6281,51107,5710,6282,126961,...
    chr1 167700001 167900000 -1 -0.9999 3 55811,9019,25874
    chr1 180850001 181550000 1 0.5850 6 10228,777,3140,51278,100509205,57710
    ...

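A downloaded TSV file can be read with a few lines of Python. The sketch below is illustrative only: the names `CnvRow`, `parse_tsv_row` and `expected_log2_fc` are hypothetical. It also rests on an observation rather than a documented rule: most log2 fold changes in the example agree with log2((2 + change) / 2), i.e. a baseline of two copies. The first example row (change 2, log2 fold change 0.7697) shows that the reported value is measured and need not match this idealized figure.

```python
import math
from typing import List, NamedTuple

class CnvRow(NamedTuple):
    chrom: str
    start: int
    end: int
    change: int
    log2_fc: float
    n_genes: int
    gene_ids: List[str]

def parse_tsv_row(line):
    """Parse one tab-separated row of the TSV export."""
    fields = line.rstrip("\n").split("\t")
    # The gene id column may be empty or missing for regions without genes.
    fields += [""] * (7 - len(fields))
    chrom, start, end, change, log2_fc, n_genes, genes = fields[:7]
    gene_ids = [g for g in genes.split(",") if g]
    return CnvRow(chrom, int(start), int(end), int(change),
                  float(log2_fc), int(n_genes), gene_ids)

def expected_log2_fc(change, baseline=2):
    """Idealized log2 fold change for a copy number change (assumed baseline of 2).

    Not defined when baseline + change is zero (complete loss).
    """
    return math.log2((baseline + change) / baseline)

row = parse_tsv_row("chr1\t112350001\t112600000\t5\t1.8074\t2\t3752,643355")
print(row.gene_ids, round(expected_log2_fc(row.change), 4))
```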
For details on the VCF format please refer to the 1000 Genomes Project website. For the CNV analysis, only three columns of a VCF file are used:

- 1 : contig/chromosome
- 2 : start position of the region
- 8 : info column with end position and absolute copy number

An example output may look like this:

    ##fileformat=VCFv4.1
    ##fileDate=20120220
    ##source=cnmopsV1.0.2
    ##reference=NCBIbuild37
    ##contig=<ID=chr1,length=249250621,assembly="NCBI build 37">
    ...
    ##INFO=<ID=CN,Number=1,Type=Integer,Description="Copy number">
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
    ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
    ##ALT=<ID=CNV,Description="Copy number variable region">
    #CHROM POS ID REF ALT QUAL FILTER INFO
    chr1 106700001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=110700000;CN=4
    chr1 112350001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=112600000;CN=7
    chr1 114100001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=118450000;CN=4
    chr1 120250001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=120600000;CN=4
    chr1 144800001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=145750000;CN=3
    chr1 146450001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=147400000;CN=3
    chr1 149000001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=149250000;CN=1
    chr1 149800001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=152450000;CN=3
    chr1 167700001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=167900000;CN=1
    chr1 180850001 . N <CNV> . PASS NS=1;SVTYPE=CNV;END=181550000;CN=3
    ...

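The VCF export stores the absolute copy number (CN) in the INFO column, whereas the BED and TSV exports store the change relative to the expected copy number. Comparing the example outputs (for instance a change of 5 and CN=7 for the same region) suggests an expected copy number of 2. The sketch below relies on that assumption, and the function names are hypothetical. It reads the INFO column as the last field of a record, so it does not depend on the exact number of preceding columns.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'NS=1;SVTYPE=CNV;END=100;CN=4' into a dict."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def cnv_from_vcf_record(line, expected_copies=2):
    """Extract region and copy number change from one VCF data line.

    expected_copies=2 is an assumption inferred from the example outputs.
    """
    fields = line.split()
    info = parse_info(fields[-1])  # INFO is the last column
    copy_number = int(info["CN"])
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(info["END"]),
        "copy_number": copy_number,
        "change": copy_number - expected_copies,
    }

record = cnv_from_vcf_record(
    "chr1 112350001 . N . PASS NS=1;SVTYPE=CNV;END=112600000;CN=7")
print(record)
```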

A genome-wide Circos plot can be found at the bottom of the result page. The purpose of this plot is to get a quick overview of the whole input and output range of the data sets. The plot itself consists of 5 circles (listed from the outermost):
The outermost circle shows the coverage data in the form of read counts used as input for the algorithm. The read count that is common to both input data sets is plotted in black. If the read count was higher in the sample data set, the difference is shown in green. In contrast, if the read count was higher in the control data set, the difference is shown in red. Directly inside this circle, general information such as the genomic position and the cytobands follows. The inner circle shows the output of the CNV analysis and represents the values from column 6 (Copy Number Variation) of the data table described above. The same colors as above are used here to differentiate between a copy number gain and a copy number loss.
The plot was generated by: Circos
Krzywinski, M., J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, and M. A. Marra. "Circos: An Information Aesthetic for Comparative Genomics." Genome Research 2009 19(9):1639-1645.
| © 1998-2013 Genomatix Software GmbH - All rights reserved |