Genomatix: CNV analysis
(only available on GGA)


[Introduction] [Input] [Algorithms] [Parameters] [Output] [Genome]

Introduction

This task performs a complete Copy Number Variation (CNV) analysis of DNA-Seq data. The analysis consists of two steps. First, the mapped reads are assigned either to a fixed genomic segment (bin) or to a genomic region (locus), depending on the applied algorithm. Second, the selected CNV detection algorithm uses these read counts to infer the regions whose copy number differs from the one expected for the organism. An integer value representing the change in copy number is assigned to each detected region.


Input

The CNV analysis requires two input data sets. The first input is the sample data set (e.g. condition 1) and the second one is the control data set (e.g. condition 2). The input files need to be in BAM format and are taken from your currently selected Genomatix project folder. Preferably, both data sets should have similar coverage and should be derived with the same sequencing and analysis protocol to ensure good output quality.

Note: This CNV analysis cannot be performed on a single data set. A control data set is always required.

Note: Although some of the included algorithms can take replicates as input data, this task currently does not support analyses based on replicates.


Algorithms

When the read counts for two different conditions (here called treatment and control for simplicity) are to be compared, the following 3 algorithms/methods for analyzing the variation of copy numbers are available:

  1. cn.MOPS
  2. Control-FREEC
  3. Genomatix Locus-based

1. cn.MOPS

From: Bioconductor version: Release (2.10)
"cn.mops (Copy Number estimation by a Mixture Of PoissonS) is a data processing pipeline for copy number variations and aberrations (CNVs and CNAs) from next generation sequencing (NGS) data. The package supplies functions to convert BAM files into read count matrices or genomic ranges objects, which are the input objects for cn.mops. cn.mops models the depths of coverage across samples at each genomic position. Therefore, it does not suffer from read count biases along chromosomes. Using a Bayesian approach, cn.mops decomposes read variations across samples into integer copy numbers and noise by its mixture components and Poisson distributions, respectively. cn.mops guarantees a low FDR because wrong detections are indicated by high noise and filtered out."

The CNV analysis incorporates the R package for cn.MOPS. The read counts are calculated outside of cn.MOPS and are passed to the algorithm as read count matrices. The size of the fixed bins is an input parameter for the read count calculation and has to be supplied by the user. Each aligned read is assigned to a bin based on its starting position. This is done for each input data set individually, so that two read count values are obtained for each bin, one for the sample and one for the control data set. This leads to a matrix with one row for each bin and two columns, one for each of the two input files. Since cn.MOPS requires a minimum of three sample columns, and in order to give the control data set a higher weight, the control column is duplicated to provide a third column in the input matrix. In a second step the task runs cn.MOPS for each chromosome independently, so that the statistics are calculated per chromosome. The output contains the log fold change of the individual bins (local assessments) and the final CNV call after segmentation.
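
The construction of the read count matrix described above can be sketched as follows. This is a minimal illustration, not the task's actual implementation: the function names `bin_counts` and `build_matrix` are hypothetical, read start positions are assumed to be 1-based, and extracting the positions from the BAM files is left out.

```python
def bin_counts(read_starts, bin_size, chrom_length):
    """Count reads per fixed-size bin, using each read's 1-based start position."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start in read_starts:
        counts[(start - 1) // bin_size] += 1
    return counts


def build_matrix(sample_starts, control_starts, bin_size, chrom_length):
    """One row per bin; columns: sample, control, control (duplicated)."""
    sample = bin_counts(sample_starts, bin_size, chrom_length)
    control = bin_counts(control_starts, bin_size, chrom_length)
    # The control column appears twice so that the matrix has three columns.
    return [[s, c, c] for s, c in zip(sample, control)]


# Toy chromosome of 300 bp, split into three bins of 100 bp.
matrix = build_matrix(
    sample_starts=[1, 50, 120, 130, 250],
    control_starts=[10, 110, 210, 290],
    bin_size=100,
    chrom_length=300,
)
for row in matrix:
    print(row)
```

In the task, one such matrix would be built per chromosome, since cn.MOPS is run for each chromosome independently.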

Publication

Günter Klambauer, Karin Schwarzbauer, Andreas Mayr, Djork-Arné Clevert, Andreas Mitterecker, Ulrich Bodenhofer, Sepp Hochreiter. "cn.MOPS: mixture of Poissons for discovering copy number variations in next generation sequencing data with a low false discovery rate." Nucleic Acids Research 2012 40(2)

2. Control-FREEC

From: Institut Curie

FREEC (control-FREE Copy number caller) automatically normalizes and segments copy number profiles (CNPs) and calls CNAs. If ploidy is known, FREEC assigns absolute copy number to each predicted CNA.

The CNV analysis directly runs the binary executable of FREEC without any further preprocessing steps. FREEC then performs steps similar to those described above for cn.MOPS. First, the read counts are calculated from the input BAM files. Normalization by GC content may be applied to the read counts if the input files are based on the latest genome build. Finally, FREEC calls the CNVs using the normalized read counts for both the sample and the control input. Please note that in this task a control input is always required, although FREEC itself does not depend on any control data.

Publication

Boeva V, Zinovyev A, Bleakley K, Vert JP, Janoueix-Lerosey I, Delattre O, Barillot E. "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization" Bioinformatics 2011 27(2):268-9

3. Genomatix Locus-based

Based on a genomic mapping, the normalized expression (NE) value per locus is calculated: the number of reads that align to a locus is counted and normalized by the number of basepairs of this locus. This is the NE value for this locus. The CNV value for each locus is calculated by dividing the NE value in the sample input by the NE value in the control input.
If no change is observed between the two conditions, the CNV value of this locus is 1 or near 1. Consequently, a higher CNV value stands for an increase in copy number of this gene in the genome of condition 1 as compared to condition 2, while a lower CNV value denotes a loss of copies of this gene in condition 1 as compared to condition 2. For the sake of clarity, the plot shows all CNV values > 5 as 5.0.

A special case is an NE value of 0 in condition 1, implying that this gene is not present (not mappable) in the genome in condition 1. This is plotted as a CNV value of 0.
The second special case is an NE value of 0 in condition 2. This is plotted as a CNV value of 5.0.
If neither condition registers an NE value > 0 for a gene, the gene is not plotted.
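
The locus-based calculation and its special cases can be written down in a few lines. The sketch below only restates the rules given above; the function names are hypothetical and no further normalization (e.g. for sequencing depth) is assumed.

```python
def normalized_expression(read_count, locus_length_bp):
    """Reads aligned to a locus divided by the locus length in basepairs."""
    return read_count / locus_length_bp


def cnv_value(ne_sample, ne_control, cap=5.0):
    """Ratio of sample NE to control NE, with the special cases described above."""
    if ne_sample == 0 and ne_control == 0:
        return None   # neither condition has reads: the locus is not plotted
    if ne_control == 0:
        return cap    # no reads in the control: plotted as 5.0
    # A sample NE of 0 yields 0; values above the cap are shown as 5.0.
    return min(ne_sample / ne_control, cap)


ne_sample = normalized_expression(300, 1500)    # 0.2 reads per bp
ne_control = normalized_expression(150, 1500)   # 0.1 reads per bp
print(cnv_value(ne_sample, ne_control))
```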


Parameters

CNV analysis start

There are several parameters that need to be specified to allow fine-tuning of the specific algorithm. Not all parameters are required or supported by all algorithms. If no comment is shown in the following table after the parameter name, it applies to all tools. Otherwise, the tools that require this parameter are explicitly listed.

Input data
Project folder

Data sets and analyses are organized in PROJECTS for each user.

Your default project folder is used for this analysis and can be changed in the result management. You can import or upload files to your current project.

Sample

First, select a data set that serves as SAMPLE INPUT for the analysis. The available BAM files from your currently selected project folder are listed. Other files need to be imported first in order to become available in your project.

Control

Depending on your input sample, please select a data set that serves as CONTROL INPUT for the analysis. The available BAM files from your currently selected project folder are listed. Here the files are already filtered for matching organism and genome build. Other files need to be imported first in order to become available in your project.

Workflow parameters
Read type

This parameter describes the sequence LIBRARY TYPE, which can be set to single-end or paired-end reads. If the read type is unknown, it can be detected automatically based on the mapping in the input file.

Bin size

Applies only to cn.MOPS and FREEC

The BIN SIZE or WINDOW LENGTH defines the length of the initial segmentation of the genome in basepairs. The value should be chosen such that each segment contains about 100 reads on average.
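
A starting value for the bin size can be estimated from the genome length and the number of mapped reads. The helper below is a hypothetical sketch that assumes the reads are spread evenly over the genome; real coverage is uneven, so the result is only a first guess.

```python
def suggest_bin_size(genome_length_bp, mapped_reads, reads_per_bin=100):
    """Smallest bin size (bp) giving about `reads_per_bin` reads per bin on average."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    # Ceiling division keeps the arithmetic in integers.
    return -(-(reads_per_bin * genome_length_bp) // mapped_reads)


# Example: a 3 Gbp genome with 30 million mapped reads.
print(suggest_bin_size(3_000_000_000, 30_000_000))
```

If the sample and control data sets differ in depth, basing the estimate on the smaller read count is the more cautious choice.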

Algorithm parameters
Minimum mapping quality

Applies only to cn.MOPS and Genomatix Locus-based

Only mappings that have a minimum PHRED-SCALED SCORE are considered for the algorithm input.

Prior impact

Applies only to cn.MOPS

The parameter PRIOR IMPACT should be optimized for each data set, since it is influenced by the number of samples as well as the noise level. This value reflects how strongly the prior assumption affects the result.

The higher the value, the more samples will have copy number 2, and consequently fewer CNVs will be detected.

Normalization type

Applies only to cn.MOPS

This option specifies the mode of the NORMALIZATION TECHNIQUE applied to the analysis.

Minimum number of bins

Applies only to cn.MOPS and FREEC

This value specifies the minimal number of CONSECUTIVE BINS to call a CNV.
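
The effect of this threshold can be illustrated with a simple run-length filter over per-bin copy number changes. This is only a sketch: cn.MOPS and FREEC use their own segmentation methods, and the function name and the input format (one integer change per bin) are assumptions made for this example.

```python
from itertools import groupby


def call_segments(bin_changes, bin_size, min_bins):
    """Merge runs of equal non-zero changes; keep runs of at least `min_bins` bins."""
    segments = []
    index = 0  # index of the first bin of the current run
    for change, run in groupby(bin_changes):
        length = len(list(run))
        if change != 0 and length >= min_bins:
            start = index * bin_size + 1        # 1-based start position
            end = (index + length) * bin_size   # inclusive end position
            segments.append((start, end, change))
        index += length
    return segments


# Nine bins of 50,000 bp; the single-bin loss is dropped when min_bins is 2.
segments = call_segments([0, 2, 2, 2, 0, -1, 0, 1, 1], bin_size=50000, min_bins=2)
print(segments)
```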

Sex of samples

Applies only to FREEC

If the samples are specified as FEMALE, chrY is excluded from the analysis. If they are specified as MALE, a single copy of chrX and chrY is not annotated as a loss.

Breakpoint threshold

Applies only to FREEC

The threshold for the BREAKPOINT controls how many segments are created. Use a value like 0.6 to get more segments (and thus more predicted CNVs).

GC normalization

Applies only to FREEC

If enabled it corrects the read count for GC-CONTENT BIAS and low mappability. This feature requires sequence files for the corresponding build and is currently only supported for the latest genome build version. It will be automatically disabled if the required files are not available on the system.

Minimal mappability

Applies only to FREEC

Only windows with a fraction of mappable positions higher than or equal to the MAPPABILITY threshold will be considered.

Minimum GC content

Applies only to FREEC

The minimal expected value of GC-CONTENT distribution. For human input data a value of 0.35 is recommended.

Maximal GC content

Applies only to FREEC

The maximal expected value of the GC-CONTENT distribution. For human input data a value of 0.55 is recommended.
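
The FREEC-specific parameters above resemble options of Control-FREEC's own configuration file. The sketch below writes such a file with Python's `configparser`. How the task actually passes its parameters to FREEC is not documented here, so the mapping in the comments is an assumption; the file paths and numeric values are placeholders, and the option names should be checked against the manual of the FREEC version in use.

```python
import configparser
import io

config = configparser.ConfigParser()
config.optionxform = str  # keep the mixed-case option names

config["general"] = {
    "chrLenFile": "/path/to/genome.len",   # hypothetical path
    "window": "50000",                     # Bin size
    "minCNAlength": "3",                   # Minimum number of bins
    "sex": "XY",                           # Sex of samples (XX or XY)
    "breakPointThreshold": "0.6",          # Breakpoint threshold
    "minMappabilityPerWindow": "0.85",     # Minimal mappability
    "minExpectedGC": "0.35",               # Minimum GC content
    "maxExpectedGC": "0.55",               # Maximal GC content
}
config["sample"] = {"mateFile": "/path/to/sample.bam", "inputFormat": "BAM"}
config["control"] = {"mateFile": "/path/to/control.bam", "inputFormat": "BAM"}

buffer = io.StringIO()
config.write(buffer)
config_text = buffer.getvalue()
print(config_text)
```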

Analysis output
Result name

Assigning a NAME helps you later to identify the analysis in the project and result management.

Notification

The system sends a NOTIFICATION message when the analysis has finished. Thus, you don't have to wait and can continue working on other tasks. The message will be sent to the email address provided by you.

Email address

The address that will be used for the notification. You can set your default EMAIL ADDRESS in the user account settings.



Output

The output of the CNV analysis contains the following five sections, independent of which algorithm was selected:

  1. On the top page a summary section is printed about the analysis. This includes job properties and analysis parameters.
  2. The second section is a data table that shows all the CNV calls together with a brief annotation preview. This table can be sorted by various criteria and is accompanied by a list of filter settings.
  3. Below this table you find a tabbed view with additional annotation details and binning details for one specific CNV call.
  4. The export section allows you to either download the filtered results to your local computer or save them to your project.
  5. A genome wide Circos plot wraps up the analysis and gives a quick overview of the CNV distribution throughout the genome.

The sections are described in detail below:

Data table

Data table with CNV calls

The central output is a comprehensive data table that supports convenient browsing and filtering of the result data. Each row in the table represents a single CNV call together with its copy number change. The CNV region may be composed of multiple bins which are combined into one segment. This table lists only changes to the expected copy number, i.e. only values not equal to zero, since a value of zero would be a region with the expected copy number for this organism. For the sake of clarity, copy number changes higher than +5 are truncated and shown as +5. A click on a row in the table opens the annotation tab below the data table. The annotation tab provides more information on the specific region, in particular a list of annotated genes for that region. The rightmost columns provide entry points to other Genomatix tools like the Genome Browser visualization and the GeneRanker classification.

By default the CNV calls are sorted by genomic position. The following columns are available:

  1. Nr
    Each CNV call gets a unique number for identification purposes.
  2. Reference
    The chromosome name for the genomic region.
  3. Start
    The starting position of the region in basepairs.
  4. End
    The end position of the region in basepairs.
  5. Length
    The length of the region in basepairs.
  6. Copy Number Variation
    The positive copy numbers represent a gain and the negative values represent a copy number loss.
  7. Ratio
    This number is calculated as the logarithmic fold change for the variant.
  8. Annotation
    A brief annotation extract with a few genes. A click in this row opens a detailed annotation list below the table.
  9. Genes
    This value gives the number of genes that were annotated for this segment.
  10. Browser
    A link to view this region in the Genomatix Genome Browser together with the input files as tracks.
  11. Classification
    Starts a classification task with the Gene Ranker, allowing subsequent pathway analysis.

Annotation details

Annotation details for selected CNV call

The annotation details appear when you click on one of the rows in the above data table. The annotation tab contains a list of all genes that were annotated within the selected region. Hovering over the gene symbol shows the gene preview with a brief description and its corresponding functions. Furthermore, there are links to jump into the Genome Browser for that gene.

Binning details

Binning graph for selected CNV call

This chart gives a detailed view of the binning for the selected region. It contains the normalized read counts for the input data sets: the green line shows the counts for the sample data set and the red line the counts for the control data set. The dotted line marks the local signal value for the individual bins, which comes directly from the algorithm. The higher this value, the more significant the variation; a value of zero corresponds to an unchanged region. The algorithm joins regions with different signal levels into a combined CNV region whose length is a multiple of the selected bin size.

Pathway analysis

Another possibility to interact with the result data is to start the Genomatix Pathway System (GePS) for a further analysis of pathways and interactions. The annotated genes for the selected CNV region(s) are then passed to GePS automatically. The button to start a GePS analysis can be found in the export section below. By default it is grayed out and disabled. Selecting rows in the data table by clicking the checkboxes will enable this button.

Export results

Result data can be exported in one of three formats: (1) BED file format, (2) TSV file format or (3) VCF file format. While any of the three formats can be downloaded, only the BED file format can be saved directly to your current project to be used as input for other Genomatix analyses.

The BED file format contains 5 columns (tab-separated) for each region:

 1 : contig/chromosome
 2 : start position of the region
 3 : end position of the region
 4 : cnv identifier/nr
 5 : copy number variation change

An example output may look like this:
chr1	106700000	110700000	CNV_1	2
chr1	112350000	112600000	CNV_2	5
chr1	114100000	118450000	CNV_3	2
chr1	120250000	120600000	CNV_4	2
chr1	144800000	145750000	CNV_5	1
chr1	146450000	147400000	CNV_6	1
chr1	149000000	149250000	CNV_7	-1
chr1	149800000	152450000	CNV_8	1
chr1	167700000	167900000	CNV_9	-1
chr1	180850000	181550000	CNV_10	1
...
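
The BED export is straightforward to read in a script. The sketch below uses a hypothetical helper `read_cnv_bed` and assumes the usual BED convention of 0-based, half-open coordinates, which fits the example above (its start positions are one less than those in the TSV and VCF examples). Under that assumption the region length is simply end minus start.

```python
import io


def read_cnv_bed(handle):
    """Yield one dict per CNV region from the 5-column BED export."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        chrom, start, end, name, change = fields[:5]
        yield {
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "name": name,
            "change": int(change),
        }


example = io.StringIO(
    "chr1\t106700000\t110700000\tCNV_1\t2\n"
    "chr1\t149000000\t149250000\tCNV_7\t-1\n"
)
regions = list(read_cnv_bed(example))

# Total number of basepairs covered by gains and by losses.
gained_bp = sum(r["end"] - r["start"] for r in regions if r["change"] > 0)
lost_bp = sum(r["end"] - r["start"] for r in regions if r["change"] < 0)
print(gained_bp, lost_bp)
```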

The TSV file format has 7 columns (tab-separated) for each region. Compared to the BED format it omits the CNV identifier and adds the log2 fold change as well as the number and identifiers of the annotated genes of this region.

 1 : contig/chromosome
 2 : start position of the region
 3 : end position of the region
 4 : copy number variation change
 5 : log2 fold change
 6 : number of annotated genes
 7 : comma-separated list of gene ids

An example output may look like this:
chr1	106700001	110700000	2	0.7697	50	1435,2944,6272,2947,2946,10451,6301,6814,2780,1952,2948,2949,10768,...
chr1	112350001	112600000	5	1.8074	2	3752,643355
chr1	114100001	118450000	2	1.0000	44	4803,914,965,26191,7252,51592,64858,4893,270,79679,845,6847,81839,...
chr1	120250001	120600000	2	1.0000	6	4853,83998,3158,11085,100506528,343505
chr1	144800001	145750000	1	0.5850	25	10628,148738,5174,10401,9939,11126,9554,9659,388677,8799,27246,...
chr1	146450001	147400000	1	0.5850	15	2702,2703,2330,607,9557,5565,51205,149013,100509137,100509111,...
chr1	149000001	149250000	-1	-1.0000	0	
chr1	149800001	152450000	1	0.5850	91	8349,4170,1513,405,1520,9900,9129,7062,6281,51107,5710,6282,126961,...
chr1	167700001	167900000	-1	-0.9999	3	55811,9019,25874
chr1	180850001	181550000	1	0.5850	6	10228,777,3140,51278,100509205,57710
...
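
A TSV export can be parsed in much the same way. The sketch below (hypothetical helper names) also computes the log2 fold change one would expect for an integer copy number change on a diploid baseline, log2((2 + change) / 2). Most rows of the example above are close to that value, but not all (the first row reports 0.7697 for a change of 2), so this is an observation about the example and not a documented formula of the task.

```python
import io
import math


def read_cnv_tsv(handle):
    """Yield one dict per CNV region from the 7-column TSV export."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        gene_field = fields[6] if len(fields) > 6 else ""
        yield {
            "chrom": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            "change": int(fields[3]),
            "log2_fc": float(fields[4]),
            "n_genes": int(fields[5]),
            "gene_ids": [g for g in gene_field.split(",") if g],
        }


def expected_log2_fc(change, ploidy=2):
    """log2 ratio implied by an integer copy number change (assumes a fixed ploidy)."""
    copies = ploidy + change
    if copies <= 0:
        return float("-inf")  # complete loss
    return math.log2(copies / ploidy)


example = io.StringIO(
    "chr1\t112350001\t112600000\t5\t1.8074\t2\t3752,643355\n"
    "chr1\t149000001\t149250000\t-1\t-1.0000\t0\t\n"
)
rows = list(read_cnv_tsv(example))
for row in rows:
    print(row["change"], row["log2_fc"], round(expected_log2_fc(row["change"]), 4))
```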

For details on the VCF format please refer to the 1000 Genomes Project website. For the CNV analysis, only three columns of the VCF file are used:

 1 : contig/chromosome
 2 : start position of the region
 8 : INFO column with end position and absolute copy number

An example output may look like this:
##fileformat=VCFv4.1
##fileDate=20120220
##source=cnmopsV1.0.2
##reference=NCBIbuild37
##contig=<ID=chr1,length=249250621,assembly="NCBI build 37">
...
##INFO=<ID=CN,Number=1,Type=Integer,Description="Copy number">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	106700001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=110700000;CN=4
chr1	112350001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=112600000;CN=7
chr1	114100001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=118450000;CN=4
chr1	120250001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=120600000;CN=4
chr1	144800001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=145750000;CN=3
chr1	146450001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=147400000;CN=3
chr1	149000001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=149250000;CN=1
chr1	149800001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=152450000;CN=3
chr1	167700001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=167900000;CN=1
chr1	180850001	.	N		.	PASS	NS=1;SVTYPE=CNV;END=181550000;CN=3
...
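
The VCF export stores the absolute copy number in the INFO column, whereas the BED and TSV exports store the change. In the examples above the two differ by 2 (a change of +2 corresponds to CN=4), which suggests a diploid baseline. The sketch below reads columns 1, 2 and 8 and converts back to a change under that assumption; the helper name and the `ploidy` argument are hypothetical.

```python
import io


def read_cnv_vcf(handle, ploidy=2):
    """Yield one dict per record, using only columns 1, 2 and 8 (INFO)."""
    for line in handle:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip blank or incomplete lines
        info = {}
        for item in fields[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value
        copy_number = int(info["CN"])
        yield {
            "chrom": fields[0],
            "start": int(fields[1]),
            "end": int(info["END"]),
            "copy_number": copy_number,
            "change": copy_number - ploidy,  # assumes a diploid baseline
        }


example = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t106700001\t.\tN\t\t.\tPASS\tNS=1;SVTYPE=CNV;END=110700000;CN=4\n"
    "chr1\t149000001\t.\tN\t\t.\tPASS\tNS=1;SVTYPE=CNV;END=149250000;CN=1\n"
)
records = list(read_cnv_vcf(example))
for record in records:
    print(record["chrom"], record["start"], record["end"], record["change"])
```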

Genome view

Part of the Circos plot

A genome-wide Circos plot can be found at the bottom of the result page. The purpose of this plot is to get a quick overview of the whole input and output range of the data sets. The plot itself consists of 5 circles (listed from the outermost):

  1. Coverage data
  2. Genomic position
  3. Cytobands
  4. Copy Number Variation
  5. Chromosome label

The outermost circle shows the coverage data in the form of read counts used as input for the algorithm. The read count common to both input data sets is plotted in black. If the read count was higher in the sample data set, the difference is shown in green; if it was higher in the control data set, the difference is shown in red. This circle is followed by general information such as the genomic position and the cytobands. The inner circle shows the output of the CNV analysis and represents the values from column 6 (Copy Number Variation) of the data table described above. The same colors are used here to distinguish a copy number gain from a copy number loss.
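
One way to read the coloring of the coverage circle is as a decomposition of the two read counts per bin. The sketch below assumes that the black part is the smaller of the two counts and that the colored part is the surplus of the data set with the higher count; the function name is hypothetical and the actual plotting code may differ.

```python
def coverage_layers(sample_count, control_count):
    """Split the read counts of one bin into black, green and red parts."""
    shared = min(sample_count, control_count)
    return {
        "black": shared,                              # present in both data sets
        "green": max(sample_count - control_count, 0),  # surplus in the sample
        "red": max(control_count - sample_count, 0),    # surplus in the control
    }


print(coverage_layers(120, 80))
print(coverage_layers(40, 90))
```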

The plot was generated by: Circos

Publication

Krzywinski, M., J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, and M. A. Marra. "Circos: An Information Aesthetic for Comparative Genomics." Genome Research 2009 19(9):1639-1645.