
Genomatix: Small Variant Detection in NGS data
(only available on GGA)


[Introduction] [Parameters] [Output] [References]

Introduction

This task detects single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels) using SAMtools/BCFtools (Li et al., 2009).

The prediction of SNPs and InDels takes one or more BAM file(s) as input. When multiple samples are supplied, all of them are analysed simultaneously and the result will contain data from ALL samples.

First, samtools mpileup is called; it computes the likelihood of the data for each possible genotype, taking the specified quality parameters into account. The Small Variant Detection workflow then runs bcftools, which applies the prior to these likelihoods and performs the actual variant calling. Finally, the SAMtools script vcfutils.pl is called to filter the resulting variant calls.
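
The three steps above can be sketched as command lines. The following Python snippet is only an illustration modelled on the classic SAMtools 0.1.x documentation; it is not the command line that the Small Variant Detection task actually runs (that one is shown in the "Analysis Parameters" section of the result). The function name, the argument names, and the mapping of task parameters to command-line flags are assumptions.

```python
import shlex


def build_pipeline(bam_files, reference, min_mapq=0, min_baseq=13,
                   extended_baq=False, skip_indels=False,
                   min_depth=2, max_depth=None, multisample=False):
    """Return the three command strings of a SAMtools 0.1.x style workflow.

    Hypothetical helper: nothing is executed, the commands are only built.
    """
    # Step 1: samtools mpileup computes the genotype likelihoods.
    mpileup = ["samtools", "mpileup", "-u", "-f", reference,
               "-q", str(min_mapq),    # minimum mapping quality
               "-Q", str(min_baseq)]   # minimum base quality
    if extended_baq:
        mpileup.append("-E")           # extended BAQ computation
    if skip_indels:
        mpileup.append("-I")           # do not perform InDel calling
    mpileup += list(bam_files)

    # Step 2: bcftools applies the prior and calls the variants.
    call = ["bcftools", "view", "-vcg"]
    if multisample:
        call += ["-m", "0.99"]         # multi-sample calling model
    call.append("-")                   # read from standard input

    # Step 3: vcfutils.pl filters the calls by read depth.
    varfilter = ["vcfutils.pl", "varFilter", "-d", str(min_depth)]
    if max_depth is not None:
        varfilter += ["-D", str(max_depth)]

    return [" ".join(shlex.quote(arg) for arg in cmd)
            for cmd in (mpileup, call, varfilter)]


if __name__ == "__main__":
    # File names are placeholders.
    steps = build_pipeline(["sample1.bam", "sample2.bam"], "ref.fa",
                           min_mapq=20, max_depth=100, multisample=True)
    print(" | ".join(steps) + " > variants.vcf")
```

Joining the three strings with a pipe, as in the usage lines at the end, gives a single shell pipeline whose output would be the filtered VCF file.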

The resulting variants will then be available as a VCF (variant call format) file which is automatically added to the result management. This file can be used e.g. to annotate the variants using the Variant Analysis task.
Additionally, several statistics (quality distribution, chromosome statistics, InDel length distribution) are displayed graphically.


Regarding multi-sample versus single-sample SNP calling, the SAMtools FAQ (https://github.com/samtools/samtools/wiki/FAQ) states:
"In all, if you have deep coverage and need to study each sample separately, you should use single-sample calling. If you have low-coverage data or only care about variants from multiple samples as a whole, you should use multi-sample calling. Understanding the difference between single- and multi-sample calling also helps experimental design: if you only want to get a set of SNPs from many samples or to do association studies, sequencing to deep coverage is a waste. You pay much more only to get marginal reward."

Please also see the "Multisample SNP calling" option below.


Parameters

Input
Input file(s) with read positions: Input data containing the aligned reads are accepted in BAM format. Within this section you can either
  • choose from previously uploaded BAM files
  • or add a new BAM file to the list (by clicking "Add BAM file...")
Use the Shift/Ctrl keys to select multiple files from the list. All selected files are then used as input for the samtools routine.
Read Coverage: Minimum and maximum read coverage required to call a variant.
To allow unlimited maximum read coverage, tick the ignore checkbox.
Remove sequence duplicates: Specifies whether only one read or all reads are considered for the variant call when pileups of identical reads occur.
Extended BAQ computation: Extended computation of base alignment quality (BAQ).
BAQ is a phred-like score representing the probability that a read base is misaligned; it lowers the base quality score of mismatches that are near indels. This helps to avoid false-positive SNP calls caused by alignment artifacts near small indels. Regular BAQ computation is turned on by default.

This option enables a more sensitive BAQ calculation (it improves sensitivity but may reduce specificity, i.e. you might get more false positives).
Minimum mapping quality (mapQ): The minimum required mapping quality (mapQ value in the BAM file). Mapping quality scores, introduced by Heng Li and Richard Durbin in 2008, quantify the probability that a read is misplaced. Mapping quality is related to uniqueness: the greater the quality distance between the best and the second-best alignment hit, the more unique the best alignment, and the higher its mapping quality should be. The mapping quality should usually be between 0 and 60.
For example, a mapping quality of 10 or less indicates that there is at least a 1 in 10 chance that the read truly originated elsewhere. A value of 255 indicates that the mapping quality is not available.
For paired-end alignments, the pairing information (distance and strand orientation of the mates) will also be included.
Allowed values: 0-255
Multisample SNP calling
Implements the multi-sample calling option (bcftools view -m 0.99).
Citing from https://github.com/samtools/samtools/wiki/FAQ:
"Recent versions (tagged as 0.1.19 or newer) implement a new calling model (bcftools view -m) where multi-sample calling comes at no sensitivity cost. It also fixes some of the known issues with multiallelic sites. This is currently the recommended way of calling."
For more details, please see the FAQ page on GitHub.
Minimum nucleotide quality: Minimum phred quality score for mismatching nucleotides.
Allowed values: 0-60
Variant Type: Skip InDel calling; no InDels will be reported in the output.
Data type
(Ion Torrent/454)
Small Variant calling for Ion Torrent and 454 data:
The standard SAMtools settings are tuned toward data that does not include as many insertion/deletion sequencing errors as Ion Torrent/454 data.
When this option is selected, the Small Variant Detection workflow uses specific parameters to reduce the false positive rate for Ion Torrent/454 data.
Output
Result: Here you can edit the default name of the result file.
Email address
When the analysis is finished, an email with the URL of the results will be sent to the user-provided email address.

The results will be available on our server for a limited time. For details on how long your results will be kept, please see the result email. After that period they will be deleted unless they are protected in the project management!
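
As a worked illustration of the phred scale behind the "Minimum mapping quality (mapQ)" and "Minimum nucleotide quality" parameters above: a phred-scaled quality Q corresponds to an error probability of 10^(-Q/10). The Python sketch below is not part of the Genomatix software, and the function names are hypothetical.

```python
import math


def phred_to_error_probability(q):
    """Convert a phred-scaled quality Q into an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) into a phred-scaled quality."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # A few typical thresholds and the error probability they correspond to.
    for q in (10, 20, 30, 60):
        print(q, phred_to_error_probability(q))
```

A mapping quality of 10 thus corresponds to the 1 in 10 chance of a misplaced read mentioned above, while a mapping quality of 60 corresponds to a chance of 1 in 1,000,000.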

Output

The output has three sections:

1. Analysis Parameters

  • Name of input file(s)
  • The ElDorado database version used
  • Result name
  • List of parameters
  • The actual samtools command line call

2. VCF files and numbers of called variants

  • A link to the resulting VCF file
    When following the link, the first few lines of the VCF file are displayed. Additionally, the VCF file can be downloaded from this page.
    Note that the VCF file is automatically available in the result management and can be used for other tasks.
    For details on the VCF format please refer to the 1000 Genomes Project website.
  • Total number of variants
  • Number of SNPs
  • Number of InDels
  • Transition/Transversion ratio
    • Transitions are interchanges of two-ring purines (A ↔ G) or of one-ring pyrimidines (C ↔ T).
    • Transversions are interchanges of purine for pyrimidine bases.
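
The transition/transversion ratio defined above can be computed from the REF and ALT alleles of the called SNPs. The following Python sketch only illustrates the definition; it is not the code used by the task, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine)
    # means transition; a change of class means transversion.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Compute the Ts/Tv ratio for an iterable of (ref, alt) pairs."""
    kinds = [classify_snp(ref, alt) for ref, alt in snps]
    transitions = kinds.count("transition")
    transversions = kinds.count("transversion")
    if transversions == 0:
        return float("inf")  # ratio is undefined without transversions
    return transitions / transversions
```

For instance, the SNPs A>G, C>T and A>T contain two transitions and one transversion, giving a ratio of 2.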

3. Graphics

Three different graphics for the following aspects are available:
  • Quality Distribution
  • Chromosome Statistics
  • InDel length Distribution
Each of the graphics allows zooming in by dragging the mouse within the chart area.
Most points within the graphics feature tool tips with detailed information and values.
Additionally, a complete series of values can be hidden by clicking on the series' label in the legend.
Clicking on the icon at the top right of a chart allows printing or exporting the picture as it is (e.g. zoomed in).

Quality Distribution:

Distribution of quality for detected variants

Chromosome Statistics:

The number of detected variants on each chromosome of the genome, shown together with the number of base pairs of each chromosome.
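
The variant counts behind such a chart can be reproduced from the downloaded VCF file. The Python sketch below is an illustration under the assumption of an uncompressed, tab-separated VCF file; the function name is hypothetical and the code is not taken from the task itself.

```python
from collections import Counter


def variants_per_chromosome(vcf_lines):
    """Count the variant records per chromosome (first VCF column)."""
    counts = Counter()
    for line in vcf_lines:
        # Skip empty lines and header lines (starting with '#').
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        counts[chrom] += 1
    return counts
```

A file can be passed directly, e.g. `with open("result.vcf") as handle: variants_per_chromosome(handle)`, where `result.vcf` is a placeholder file name.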

InDel length Distribution:

Size distribution of the detected InDels.
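
An InDel length distribution of this kind can be derived from the REF and ALT columns of the VCF file: the length difference between the alternative and the reference allele is positive for insertions and negative for deletions. The following Python sketch shows one way to do this; it assumes an uncompressed, tab-separated VCF file, uses a hypothetical function name, and is not the code used by the task.

```python
from collections import Counter


def indel_length_distribution(vcf_lines):
    """Count InDels by length (positive: insertion, negative: deletion)."""
    lengths = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alt_field = fields[3], fields[4]
        # A record may list several alternative alleles, separated by commas.
        for alt in alt_field.split(","):
            if alt.startswith("<") or alt in (".", "*"):
                continue  # symbolic or missing alleles carry no sequence
            difference = len(alt) - len(ref)
            if difference != 0:  # equal lengths: substitution, not an InDel
                lengths[difference] += 1
    return lengths
```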

References

This task uses SAMtools which is described in the following publications:

Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.;
1000 Genome Project Data Processing Subgroup (2009).
The Sequence Alignment/Map format and SAMtools
Bioinformatics 25 (16): 2078-2079.

Li, H. (2011).
A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.
Bioinformatics 27 (21): 2987-2993.