
Genomatix: Structural Variant Analysis
(only available on GGA)


[Introduction] [Algorithm] [Parameters] [Output]

Introduction

The structural variant detection software can be used to find genomic structural variants (SVs), including insertions, deletions, inversions, duplications, and intra- and inter-chromosomal translocations.

Detection is based on paired-end reads from a DNA-Seq experiment. The software identifies structural variants from abnormally mapped read pairs, e.g. read pairs whose mates are not within the expected distance on the genome or do not have the expected strand orientation. To obtain multiple lines of evidence for the predicted structural variants, the algorithm additionally integrates both a read-depth-coverage approach and a split-mapping approach.


Algorithm

Identification of structural variant candidates: The paired-end structural variant discovery pipeline consists of four main steps:
  1. Filtering
  2. Clustering
  3. Including splice junction sequences (not yet included)
  4. Scoring

Filtering

Read pairs are selected as candidate structural variants if both reads map uniquely to the genome on the same or different chromosomes. Mates that align within expected distance on the same chromosome with expected strand orientation are removed. In addition, the program discards read pairs with mapping qualities lower than the minimum required mapping quality. Artificial pileup reads are also removed.
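
The filtering rules above can be illustrated with a minimal Python sketch. It is not the Genomatix implementation: the ReadPair record, the function name is_candidate, and the default values are hypothetical, and the sketch assumes that uniqueness of the mapping and pileup removal have already been handled elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record for one aligned read pair."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    mapq1: int
    chrom2: str
    pos2: int
    strand2: str
    mapq2: int


def is_candidate(pair, mean_insert, sigma, sigma_threshold=3,
                 min_mapq=20, expected_orientation=("+", "-")):
    """Return True if the pair is kept as a structural variant candidate."""
    # Discard pairs in which either mate falls below the minimum mapping quality.
    if min(pair.mapq1, pair.mapq2) < min_mapq:
        return False
    # Mates on different chromosomes are always abnormal.
    if pair.chrom1 != pair.chrom2:
        return True
    # Same chromosome: check the distance against the expected window.
    distance = abs(pair.pos2 - pair.pos1)
    within_distance = abs(distance - mean_insert) <= sigma_threshold * sigma
    # Read the strand orientation from the leftmost mate to the rightmost mate.
    if pair.pos1 <= pair.pos2:
        orientation = (pair.strand1, pair.strand2)
    else:
        orientation = (pair.strand2, pair.strand1)
    as_expected = orientation == expected_orientation
    # Normally mapped pairs are removed; everything else is a candidate.
    return not (within_distance and as_expected)
```

With a mean insert size of 300 and a standard deviation of 30, a forward-reverse pair 300 bp apart is removed, whereas a pair 4000 bp apart, a pair on two chromosomes, or a forward-forward pair is kept.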

Clustering

For the identification of structural variant clusters, the following criteria are used:

Scoring

In order to estimate the quality of a candidate deletion, two breakpoint scores and average coverage values for each deletion event are calculated.

The breakpoint scores describe the difference between the sequence coverage of the left and right part of the predicted deletion. A higher breakpoint score indicates a better quality of the deletion event.

To calculate the breakpoint scores for each deletion, in a first step the coverage values of the deletion and its flanking regions (about 100 bp) are calculated. Then, both breakpoint scores (ranging from 0 to 100) will be calculated as

100 * (c1 - delC) / c1
100 * (c2 - delC) / c2

where delC is the coverage value of the deletion and c1, c2 are the coverage values of the left and right flanking regions of the deletion.

Example 1 (clearly defined breakpoint):
Sequence Coverage upstream of breakpoint region: 15
Sequence Coverage downstream of breakpoint region: 0
Breakpoint Score: 100 * (15 - 0) / 15 = 100

Example 2 (no clear breakpoint):
Sequence Coverage upstream of breakpoint region: 15
Sequence Coverage downstream of breakpoint region: 12
Breakpoint Score: 100 * (15 - 12) / 15 = 20
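
The two formulas and both examples can be reproduced with a short Python sketch. The function name breakpoint_scores is hypothetical. Clamping negative values to 0 and returning 0 for an uncovered flank are assumptions made here so that the result stays in the documented range of 0 to 100; the source does not say how these cases are handled.

```python
def breakpoint_scores(c1, c2, del_c):
    """Return the (left, right) breakpoint scores for one deletion.

    c1, c2 : coverage of the left and right flanking regions
    del_c  : coverage inside the predicted deletion
    """
    def score(flank):
        # Assumption: an uncovered flank gives no evidence, so score 0.
        if flank <= 0:
            return 0.0
        # 100 * (c - delC) / c, clamped at 0 (assumption, see lead-in).
        return max(0.0, 100.0 * (flank - del_c) / flank)

    return score(c1), score(c2)


# Example 1: flank coverage 15, deletion coverage 0
print(breakpoint_scores(15, 15, 0))   # (100.0, 100.0)
# Example 2: flank coverage 15, deletion coverage 12
print(breakpoint_scores(15, 15, 12))  # (20.0, 20.0)
```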


Parameters

Input
Input file(s) with read positions
Input data are accepted as BAM files containing the aligned reads. Within this section you can either
  • choose from previously uploaded BAM files
  • or add a new BAM file to the list (by clicking "Add BAM file...")
You can use shift/ctrl-keys to select multiple files from the list. All selected files will then be used as input for the samtools routine.
Parameters for read pair distribution
The values for the expected read pair distance distribution can either be
  • calculated automatically (by first reading all input read pairs)
  • or set to specific values by the user:
    • Expected strand orientation of the mate pairs
      options: forward-reverse, forward-forward, reverse-reverse, or reverse-forward
    • Mean Insert Size
    • 1-sigma Insert Size
      standard deviation of the read pair distance distribution
    • Sigma Threshold
      This is the number of standard deviations used for the detection of insertions and deletions. The recommended value is 3. A lower value yields more structural variants but also more false positives; a higher value yields fewer structural variants. If it is set to -1, this value as well as the mean and the standard deviation of the read pair distance are calculated automatically.
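
How the three distribution parameters interact can be sketched in Python. The function names are hypothetical. The sketch assumes the usual paired-end interpretation, which the source does not spell out: mates lying further apart than expected suggest a deletion, and mates lying closer together suggest an insertion. The automatic mode is approximated here by the sample mean and sample standard deviation; the actual estimation procedure is not documented.

```python
import statistics


def estimate_distribution(distances):
    """Estimate mean and 1-sigma insert size from observed pair distances."""
    return statistics.mean(distances), statistics.stdev(distances)


def insert_size_window(mean, sigma, sigma_threshold=3.0):
    """Return the (lower, upper) bounds of the expected read pair distance."""
    return mean - sigma_threshold * sigma, mean + sigma_threshold * sigma


def classify_distance(distance, mean, sigma, sigma_threshold=3.0):
    """Classify one read pair distance against the expected window."""
    lower, upper = insert_size_window(mean, sigma, sigma_threshold)
    if distance > upper:
        return "deletion candidate"   # assumption: mates too far apart
    if distance < lower:
        return "insertion candidate"  # assumption: mates too close together
    return "normal"


# A distance of 380 is normal at 3 sigma but abnormal at 2 sigma,
# which is why a lower threshold reports more (and noisier) candidates.
print(classify_distance(380, mean=300, sigma=30, sigma_threshold=3))
print(classify_distance(380, mean=300, sigma=30, sigma_threshold=2))
```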
Filters
Minimum Cluster Size (2-1000)
This parameter determines the minimum number of non-redundant read pairs in one 'structural variant' cluster.
Maximum PileUp Size (1-10000)
This parameter determines the maximum allowed pileup size. A maximum pileup size of 1 means that all redundant sequences will be discarded for structural variant discovery, except for one.
A value of n > 1 means that all sequences that occur more than n times in the data will be discarded (except for pileups where the neighbor sequences are also redundant).
If the "ignore / no max." option is checked, no sequences will be removed.
Minimum Mapping Quality (mapQ, 0-255)
This parameter determines the minimum required mapping quality (mapQ). Mapping quality scores, introduced by Heng Li and Richard Durbin in 2008, quantify the probability that a read is misplaced. The score is related to uniqueness: the greater the quality distance between the best and the second-best alignment hit, the more unique the best alignment and the higher its mapping quality. Mapping quality values usually lie between 0 and 60.
For example, a mapping quality of 10 or less indicates that there is at least a 1 in 10 chance that the read truly originated elsewhere. A value of 255 indicates that the mapping quality is not available. For paired-end alignment, the pairing information (distance and strand orientation of the mates) will also be included.
The default setting is 20.
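
The relation between a mapping quality and the probability of a misplaced read follows the Phred scale used in the SAM format, mapQ = -10 * log10(P). The short Python sketch below converts a mapQ value back into that probability; the function name is hypothetical.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled mapping quality into an error probability.

    Returns None for 255, which marks an unavailable mapping quality.
    """
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)


for q in (0, 10, 20, 60):
    print(q, mapq_to_error_probability(q))
```

With the default threshold of 20, every retained read has at most a 1 in 100 chance of being misplaced.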
Coverage Filter for Deletions
If this parameter is set, the algorithm reports only deletions whose coverage value is lower than the average chromosomal coverage.
If it is not set, all deletions will be reported.
Breakpoint Score Threshold for Deletions (0-100%)
This parameter determines the minimum required breakpoint score of deletions to be reported. Both breakpoint scores need to exceed this value for a deletion to be reported.
The default setting is 20%.
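
The two deletion filters can be combined as in the following Python sketch. The function and parameter names are hypothetical, and the strict comparison for the breakpoint scores is an interpretation of "need to exceed".

```python
def report_deletion(score_left, score_right, del_coverage,
                    chrom_mean_coverage, min_score=20.0,
                    use_coverage_filter=True):
    """Decide whether a candidate deletion is reported."""
    # Coverage filter: the deletion must be covered less than the
    # chromosome average.
    if use_coverage_filter and not del_coverage < chrom_mean_coverage:
        return False
    # Breakpoint score threshold: both scores must exceed the minimum.
    return score_left > min_score and score_right > min_score


print(report_deletion(100.0, 80.0, del_coverage=2, chrom_mean_coverage=15))
print(report_deletion(100.0, 15.0, del_coverage=2, chrom_mean_coverage=15))
```

The first call reports the deletion, the second rejects it because only one of the two breakpoint scores exceeds 20%.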
Circos plots
Include Circos plots for each chromosome
If set, the detected structural variants will be graphically visualized with the open source drawing software CIRCOS.
For each chromosome, deletions (blue), insertions (orange), inversions (green) and the average nucleotide coverage (black) will be shown. For the whole genome, all detected inter-chromosomal translocations will be visualized in purple. The darker the color of a link, the more read pairs support the predicted structural variant.
Note that this is a time-consuming option; therefore, the Circos plots are not included by default.
Output
Result Name
Here, you can edit the default name of the result file.
Email address
When the analysis is finished, an email with the URL of the results will be sent to the user provided email address.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!

Output

The output generally consists of several sections:

  1. Analysis Parameters
  2. Hit Distribution
  3. Details & Files
  4. Genes
  5. Cluster Size
  6. Deletion Size
  7. Insertion Size
  8. Circos Plots
  9. Download of Data Files

1. Analysis Parameters

A listing of all parameters, in particular:


2. Hit Distribution

Graphical representation of the number of detected structural variants (insertions, deletions, inversions, duplications, intra- and inter-chromosomal translocations).

The graphics can be downloaded in various formats (PNG, JPEG, PDF, SVG).


3. Details & Files

This section lets you download detailed result files for the different types of structural variants.


4. Genes

This section lists all genes affected by a structural variant. The "GenomeBrowser" link allows you to directly view the reads supporting the structural variant.


5. Cluster Size

The graph shows the distribution of the cluster size (number of read pairs supporting the structural variant).

The graphics can be downloaded in various formats (PNG, JPEG, PDF, SVG).


6. Deletion Size

The graph shows the distribution of the size of the detected deletions (number of read pairs supporting the deletion).

The graphics can be downloaded in various formats (PNG, JPEG, PDF, SVG).


7. Insertion Size

The graph shows the distribution of the size of the detected insertions (number of read pairs supporting the insertion).

The graphics can be downloaded in various formats (PNG, JPEG, PDF, SVG).


8. Circos Plots

Graphical visualization of the structural variants with the open source drawing software CIRCOS.

The Circos plots show the inter-chromosomal translocations in the whole genome. Chromosome identifiers are shown around the outer ring and are arranged clockwise. Other tracks (from outside to inside) contain the logarithmic average nucleotide coverage (black, 2000000 bin size) and inter-chromosomal translocations (purple, associated with the number of read pairs involved in the link).


Intra-chromosomal structural variants are shown individually for each chromosome. Chromosomal positions are shown around the outer ring and are arranged clockwise. Other tracks (from outside to inside) contain the average nucleotide coverage (black, 2000 bin size), insertions (orange), deletions (blue), and inversions (green).


9. Download of Data Files

00_Logfile:
Summarizes the parameters, software versions, input directories and result directory.

01_SV_Statistics:
Contains numbers and information about predicted structural variants, for example the cluster size distribution, strand orientation and chromosomal distribution.

11_SV.bed:
Chromosomal coordinates from read pairs which could be mapped to a 'structural variant' cluster (BED format).

11_SV.bam / 11_SV.bam.bai:
Chromosomal alignments from read pairs which could be mapped to a 'structural variant' cluster (BAM format / BAM index file).

20_Translocation:
Summarizes the detected inter-chromosomal rearrangements. This file is ranked by the number of non-redundant spliced reads.

21_Translocation_Reads:
Lists the sequence identifier for all read pairs which map to an inter-chromosomal translocation.

22_Translocation.bed:
Chromosomal coordinates from read pairs which map to an inter-chromosomal translocation (BED format).

30_InDel:
Summarizes the detected insertion, deletion, and intra-chromosomal translocation candidates. This file is ranked by the number of non-redundant spliced reads.

31_InDel_Reads:
Lists the sequence identifier for all read pairs which map to an insertion, deletion, or intra-chromosomal translocation cluster.

32_InDel.bed:
Chromosomal coordinates from read pairs which map to an insertion, deletion, or intra-chromosomal translocation cluster (BED format).

40_Inversion:
Summarizes the detected inversion and duplication candidates. This file is ranked by the number of non-redundant spliced reads.

41_Inversion_Reads:
Lists the sequence identifier for all read pairs which map to an 'inversion' or 'duplication' cluster.

42_Inversion.bed:
Chromosomal coordinates from read pairs which map to an 'inversion' or 'duplication' cluster (BED format).

50_SV_Spliced_Reads:
Lists the sequence identifier for all spliced reads which map to a 'structural variant' cluster.
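
The BED files listed above can be inspected with a few lines of Python. The sketch below relies only on the three mandatory BED columns (chromosome, 0-based start, end); any further columns in the Genomatix files are not described in this help page and are ignored. The function names and the sample lines are made up for illustration and are not real output.

```python
import io
from collections import Counter


def read_bed_intervals(handle):
    """Yield (chrom, start, end) from the first three columns of a BED file."""
    for line in handle:
        line = line.strip()
        # Skip empty lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        yield fields[0], int(fields[1]), int(fields[2])


def intervals_per_chromosome(handle):
    """Count the BED intervals found on each chromosome."""
    return Counter(chrom for chrom, _start, _end in read_bed_intervals(handle))


# Made-up sample lines standing in for a file such as 11_SV.bed.
example = io.StringIO("chr1\t100\t200\nchr1\t5000\t5100\nchr7\t300\t400\n")
counts = intervals_per_chromosome(example)
print(counts)
```

For a downloaded file, an open file handle, e.g. open("11_SV.bed"), can be passed in place of the StringIO object.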