
Variant Analysis


[Introduction] [Input] [Types of Analysis] [Parameters] [Output]

Introduction

This task analyses small variants (SNPs and InDels up to 50 bp long) for their genomic location, their effects on the amino acid sequence of a transcript, and their influence on transcription factor binding sites (see the section "Types of Analysis" for details). Variant Analysis is available both on the GGA and in the Genomatix Software Suite; the two differ in the supported input formats, and the Suite limits the number of input variants.


Input

In the current version, three input formats are supported:

SNP and InDel data generated on the GMS (only available on GGA)

See your Genomatix Mining Station Manual for details on the SNP and InDel calling and output files on the GMS. Both the '.snp' (TAB-separated) and the '.vcf' (VCF format including Genomatix custom 'INFO' field) formats are supported.
Example for a SNP calling result (.snp file) from the GMS:

#Organism: hsa
#SNP alignment file: /raid/GxMap/SOLiD/hg18/Blood/global/align.gaf
#ElDorado: 07-2009
#Read coverage: 10
#Minimum coverage of most frequent nucleotide: 80
#Chr    Accession no.  Position  Annotation     Allele  #A   #C   #G   #T   dbSNP Id   dbSNP allele  dbSNP strand
chr10   NC_000010      61776     intergenic     C/T     A;1  C;6  G;0  T;3  rs61838967 C/T           +
chr10   NC_000010      64110     intergenic     A/G     A;31 C;0  G;1  T;0  -          -             -
chr10   NC_000010      64133     intergenic     A/G     A;31 C;0  G;1  T;0  -          -             -
chr10   NC_000010      72010     intergenic     T/C     A;0  C;2  G;0  T;8  -          -             -
chr10   NC_000010      84026     exon           A/G     A;45 C;0  G;5  T;0  rs10904032 A/G           +
chr10   NC_000010      84426     intron         T/C     A;0  C;5  G;0  T;22 rs10904045 C/T           +
chr10   NC_000010      84545     intron         T/C     A;0  C;1  G;0  T;14 rs10904047 C/T           +
chr10   NC_000010      85074     intron         A/G     A;26 C;0  G;2  T;1  rs6560828  A/G           +
chr10   NC_000010      85949     promoter exon  T/C     A;1  C;0  G;0  T;31 rs10751931 C/T           +
chr10   NC_000010      97938     intergenic     G/A     A;4  C;0  G;36 T;0  -          -             -
...
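
For readers who want to inspect a '.snp' file outside the tool, the layout shown above can be read with a few lines of Python. This is only a sketch: it assumes the columns are TAB-separated as stated, that the '#Chr' line is the column header, and that the four nucleotide count columns sit at positions 6 to 9. The function name parse_snp is made up for this example and is not part of any Genomatix software.

```python
def parse_snp(text):
    """Parse '.snp' content into (metadata, rows). Hypothetical helper."""
    meta, rows, header = {}, [], None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "...":
            continue
        if line.startswith("#Chr"):
            # Column header line: drop the leading '#', keep names such as '#A'.
            header = [h.strip() for h in line[1:].split("\t")]
        elif line.startswith("#"):
            # Metadata line such as '#Organism: hsa'.
            key, _, value = line[1:].partition(":")
            meta[key.strip()] = value.strip()
        else:
            if header is None:
                raise ValueError("data row found before the '#Chr' header line")
            fields = [f.strip() for f in line.split("\t")]
            row = dict(zip(header, fields))
            # Assumption: columns 6-9 hold counts written as 'A;1', 'C;6', ...
            row["counts"] = {
                nuc: int(count)
                for nuc, count in (f.split(";") for f in fields[5:9])
            }
            rows.append(row)
    return meta, rows
```

Splitting on TAB rather than on whitespace matters here because an annotation such as 'promoter exon' contains a blank.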
	
Example for a VCF result from a SNP calling on the GMS:
##fileformat=VCFv4.1
##snp_file_used=/home/gx_sesame/projects/project_123456789/snps/analysis_123456789/_mapAll/60_SNP_Homozygous.snp
##source=GMS_snp_V2.0
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=ED,Number=1,Type=String,Description="ElDorado Annotation">
##INFO=<ID=NC,Number=4,Type=Integer,Description="Nucleotide counts: counts of A, counts of C, counts of G, counts of T">
##GMS_snp_parameter="input_file=[/home/gx_sesame/projects/project_123456789/snps/analysis_123456789/_mapAll/60_SNP_Homozygous.snp] output_directory=[test] organism=hsa eldorado_version=08-2011 minCoverage=10 maxCoverage=150 homozygousThreshold=80 heterozygousThreshold=80 ratioSecFirst=0.700000"
#CHROM  POS     ID           REF  ALT  QUAL  FILTER  INFO                                  FORMAT  SAMPLE
chr10   1255720 rs3817042    T    C    .     .       DP=11;AN=3;NC=0,9,0,2;ED=intron       GT:AD   1/2:12,10,8
chr10   1645160 rs111071594  C    G    .     .       DP=10;AN=3;NC=0,0,10,0;ED=intron      GT:AD   1/2:12,10,8
chr10   1645162 rs111073290  G    C    .     .       DP=10;AN=3;NC=0,10,0,0;ED=intron      GT:AD   1/2:12,10,8
chr10   1645313 rs56957716   C    T    .     .       DP=15;AN=3;NC=0,1,0,14;ED=intron      GT:AD   1/2:12,10,8
chr10   2079738 rs111075138  C    T    .     .       DP=10;AN=3;NC=0,1,1,8;ED=intergenic   GT:AD   1/2:12,10,8
chr10   2949142 rs111063091  G    C    .     .       DP=17;AN=3;NC=0,17,0,0;ED=intergenic  GT:AD   1/2:12,10,8
chr10   5412883 rs111207953  C    T    .     .       DP=10;AN=3;NC=0,1,0,9;ED=intron       GT:AD   1/2:12,10,8
chr10   5545892 rs111166022  T    C    .     .       DP=14;AN=3;NC=0,13,0,1;ED=intergenic  GT:AD   1/2:12,10,8
chr10   5611310 rs9423447    T    C    .     .       DP=24;AN=3;NC=0,24,0,0;ED=intergenic  GT:AD   1/2:12,10,8
chr10   6061277 rs7069976    G    A    .     .       DP=22;AN=3;NC=22,0,0,0;ED=intron      GT:AD   1/2:12,10,8
...
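
The custom 'NC' entry in the INFO column holds the counts of A, C, G and T in that order, as declared in the header above. A minimal Python sketch for extracting it follows; the function names are illustrative only and not part of the GMS.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; entries without '=' map to True."""
    result = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def nucleotide_counts(info):
    """Return the NC counts as a dict keyed by nucleotide (order A, C, G, T)."""
    counts = parse_info(info)["NC"].split(",")
    return dict(zip("ACGT", (int(c) for c in counts)))
```

For the first data line of the example, nucleotide_counts("DP=11;AN=3;NC=0,9,0,2;ED=intron") gives 0 for A, 9 for C, 0 for G and 2 for T.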
	

VCF format

For details on the VCF format please refer to the 1000 Genomes Project website. If the VCF file contains GENOTYPE fields, only the first one is used (to decide if the variant is homozygous or heterozygous).
Example for a file in VCF format:

##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO                                                                            FORMAT          SAMPLE1
chr1    10126   .       T       C       4.13    .       DP=4;VDB=0.0041;AF1=0.5005;AC1=1;DP4=1,0,3,0;MQ=60;FQ=-4.1;PV4=1,0.22,1,0.024   GT:PL:DP:SP:GQ  0/1:32,0,26:4:0:29
chr1    10127   .       A       T       4.13    .       DP=4;VDB=0.0041;AF1=0.5006;AC1=1;DP4=1,0,3,0;MQ=60;FQ=-4.62;PV4=1,0.23,1,0.026  GT:PL:DP:SP:GQ  0/1:32,0,25:4:0:29
chr1    10128   .       A       T       4.13    .       DP=4;VDB=0.0041;AF1=0.5005;AC1=1;DP4=1,0,3,0;MQ=60;FQ=-4.1;PV4=1,0.22,1,0.029   GT:PL:DP:SP:GQ  0/1:32,0,26:4:0:29
chr1    10129   .       C       A       4.14    .       DP=4;VDB=0.0041;AF1=0.5014;AC1=1;DP4=1,0,3,0;MQ=60;FQ=-6.5;PV4=1,0.25,1,0.032   GT:PL:DP:SP:GQ  0/1:32,0,22:4:0:27
chr1    10130   .       C       A       5.46    .       DP=4;VDB=0.0041;AF1=0.5004;AC1=1;DP4=1,0,3,0;MQ=60;FQ=-3.42;PV4=1,0.23,1,0.035  GT:PL:DP:SP:GQ  0/1:34,0,27:4:0:30
chr1    10132   .       T       C       4.77    .       DP=4;VDB=0.0041;AF1=0.5003;AC1=1;DP4=1,0,3,0;MQ=60;FQ=-3.1;PV4=1,0.21,1,0.044   GT:PL:DP:SP:GQ  0/1:33,0,28:4:0:30
chr1    10133   .       A       T       3.54    .       DP=4;VDB=0.0041;AF1=0.5002;AC1=1;DP4=1,0,3,0;MQ=60;FQ=-3.4;PV4=1,0.2,1,0.05     GT:PL:DP:SP:GQ  0/1:31,0,28:4:0:29
chr1    10135   .       C       A       3.55    .       DP=5;VDB=0.0014;AF1=0.5011;AC1=1;DP4=1,0,4,0;MQ=60;FQ=-5.91;PV4=1,0.14,1,0.078  GT:PL:DP:SP:GQ  0/1:31,0,23:5:0:27
chr1    10136   .       C       A       4.13    .       DP=5;VDB=0.0014;AF1=0.5006;AC1=1;DP4=1,0,4,0;MQ=60;FQ=-4.62;PV4=1,0.14,1,0.088  GT:PL:DP:SP:GQ  0/1:32,0,25:5:0:29
chr1    10139   .       A       T       4.78    .       DP=5;VDB=0.0014;AF1=0.5015;AC1=1;DP4=1,0,4,0;MQ=60;FQ=-6.44;PV4=1,0.16,1,0.13   GT:PL:DP:SP:GQ  0/1:33,0,22:5:0:27
chr1    10141   .       C       A       4.14    .       DP=5;VDB=0.0014;AF1=0.5011;AC1=1;DP4=1,0,4,0;MQ=60;FQ=-5.82;PV4=1,0.15,1,0.17   GT:PL:DP:SP:GQ  0/1:32,0,23:5:0:27
...
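
As noted above, only the first genotype column is used to decide between homozygous and heterozygous. How Variant Analysis implements this is not documented here; the following Python sketch shows the usual reading of the VCF 'GT' entry under that rule. The function name zygosity and the return values 'hom' and 'het' are assumptions made for this example.

```python
import re


def zygosity(vcf_line):
    """Return 'hom', 'het' or None based on the first sample column only."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None  # no FORMAT/sample columns present
    keys = fields[8].split(":")
    if "GT" not in keys:
        return None
    # Only the first sample (column 10) is inspected; further samples are ignored.
    genotype = fields[9].split(":")[keys.index("GT")]
    alleles = re.split(r"[/|]", genotype)
    if "." in alleles:
        return None  # missing call
    return "hom" if len(set(alleles)) == 1 else "het"
```

Note that this sketch reports '0/0' as 'hom' as well; whether the tool treats such reference calls differently is not stated.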
	

dbSNP Ids

There are two ways of providing a list of dbSNP ids for analysis:

In either case, the dbSNP ids must have a leading 'rs'.
Only "rs" numbers annotated in the ElDorado database are accepted. Currently, the Genomatix ElDorado database contains only dbSNP ids of SNPs (no InDels). Example: rs78108388, rs76375046, rs9944449
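
Before submitting a list, it can help to check that every id has the required leading 'rs'. The Python sketch below does only this syntactic check; it cannot tell whether an id is annotated in ElDorado. The accepted separators (whitespace, comma, semicolon) are an assumption, and split_rs_ids is a made-up helper name.

```python
import re

RS_ID = re.compile(r"rs\d+")


def split_rs_ids(text):
    """Return (valid, invalid) tokens; valid means 'rs' followed by digits."""
    tokens = [t for t in re.split(r"[\s,;]+", text) if t]
    valid = [t for t in tokens if RS_ID.fullmatch(t)]
    invalid = [t for t in tokens if not RS_ID.fullmatch(t)]
    return valid, invalid
```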


Types of Analysis

Genomic classification

The genomic context of each variant is identified and the variants are assigned to one or several of the following classes:

Note that the categories overlap due to strand specificity and alternative splicing. In particular the transcript-based classes (CDS, UTR, intron, exon) and promoter, as well as the promoter and intergenic categories can overlap. Thus, a variant can be in the classes "promoter" and "5'UTR" at the same time, for example. Additionally, a deletion can affect two adjacent genomic elements, e.g. an exon and an intron. Insertions are classified according to the position of the "anchor" nucleotide, i.e. the base after which the sequence is inserted.

Amino acid effects

Variants within the coding sequence of a transcript can have an effect on the respective amino acid sequence. For each transcript containing a variant within its CDS, the analysis checks whether the exchange of the SNP nucleotides, or the deletion or insertion of a sequence, changes the amino acid sequence.

The amino acid changes are separated into the following categories:

As transcripts may overlap, one variant can affect more than one coding sequence and consequently may fall into several categories (the same SNP can be 'missense' for one transcript, but 'synonymous' for another).
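
To illustrate how a single nucleotide exchange maps to a category, the Python sketch below translates the reference and alternative codon with the standard genetic code and compares the results. It covers only four of the categories and uses their common definitions (same amino acid: 'synonymous'; new stop codon: 'nonsense'; lost stop codon: 'read-through'; otherwise 'missense'), which is an assumption about how the tool assigns them. InDels, start codon changes and splice sites are not handled, and classify_snp_effect is a hypothetical name.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp_effect(ref_codon, alt_codon):
    """Return (ref amino acid, alt amino acid, category) for one codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        category = "synonymous"
    elif alt_aa == "*":
        category = "nonsense"
    elif ref_aa == "*":
        category = "read-through"
    else:
        category = "missense"
    return ref_aa, alt_aa, category
```

Because the codon depends on the reading frame of each transcript, the same genomic SNP can yield different codon pairs, and thus different categories, for overlapping transcripts.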

Transcription Factor binding site effects

For the analysis of TF binding site effects of the variants, MatInspector is used. For each variant, the genomic sequence within 40 bp of the variant position is checked for matches to matrices from the Genomatix matrix library, with each variant allele inserted into the sequence in turn. The two resulting lists of matrix matches are then compared. The matrix library version can be selected. The matrix library subsections used for the analysis are selected automatically based on the input organism, e.g. for human, the vertebrate matrix section is used. Core promoter elements (matrix group 'others') are always used. Default values for core and matrix similarity are applied for the search. See the help pages for details on MatInspector or the matrix library and matrix subsections.
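
The comparison of the two match lists can be pictured as a set difference. The Python sketch below does not run MatInspector; it takes two already computed lists of matches, each written as a hypothetical tuple (matrix, start, end, strand), and reports which matches were lost (0) and which are new (1). How the tool itself decides that two matches are identical is not documented here, so the tuple comparison is an assumption.

```python
def compare_matches(ref_matches, alt_matches):
    """Return [(match, change)] with change 0 for lost and 1 for new matches."""
    ref, alt = set(ref_matches), set(alt_matches)
    lost = sorted(ref - alt)    # present with the reference allele only
    gained = sorted(alt - ref)  # present with the alternative allele only
    return [(m, 0) for m in lost] + [(m, 1) for m in gained]
```

Matches found with both alleles are unaffected by the variant and therefore do not appear in the result.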


Parameters

Variant analysis parameters
Input data file The input data must have one of the formats described above.

You have the option to

  • select from the list of VCF files previously uploaded to the ProjectManagement
  • upload a new file from your local computer
  • import a new file from the GMS (more)
  • import a new file from the GGA (more)
If you upload a VCF file from your local computer or import one from the GMS/GGA, this file will automatically be added to your ProjectManagement file stock.

Whether you upload or import a new file, you have to choose the correct organism for the input file. This must be the same organism that was specified for the SNP or InDel detection.

Your currently selected ElDorado version also applies to the analysis. You can set the ElDorado version in the combo box on the upper right of the input page. Again, you must choose the same setting as in the SNP/InDel calling.

dbSNP id list input Besides the file upload, dbSNP ids can also be uploaded using the text area.

Specifying an organism is NOT required if you use this upload option. Instead, the organism is derived from the first dbSNP id entered into the text area.

On the other hand, your currently selected ElDorado version applies to the analysis, just as when using the file upload. You can set the ElDorado version in the combo box on the upper right of the input page.

Analysis Options You may select which analysis steps should be performed:
  • Classification of variants (exonic, intronic, intergenic, ...)
    Perform the Genomic classification of the input variants
  • Amino Acid changes for exonic variants
    Examine the amino acid effects the variants might have.
    Warning: Checking the "include amino acid sequence in output" option will considerably increase the size of the output file!
  • Analysis of TF binding sites changes
    Check for Transcription Factor binding site effects caused by the variants. You can select the matrix library version used for this analysis step. By default, the latest matrix library is selected (please see the Library Statistics and the Library Release Notes).
The results for each selected analysis type will be printed to a different output file.
Output
Result Here, you can edit the default name of the result file.
Email address An email with the URL of the results will be sent to the provided email address, when the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!


Output

The output generally consists of several sections:

  1. Analysis parameters
  2. General statistics
  3. Genomic classification
  4. Amino acid change analysis
  5. TF binding site analysis
  6. Save BED file to ProjectManagement
  7. Download of Data Files

Notes on the nomenclature

VariantAnalysis assigns an identifier to each variant. If the program input consisted of dbSNP ids, the dbSNP id is used. Otherwise an identifier is created, consisting of the chromosome and position of the variant, separated by an underscore, e.g. '1_68266'. If this chromosome/position combination is not unique (a VCF file may contain different entries at the same position), an underscore character and a number (starting at '2') are appended to the string, e.g. '1_68266_2'. This way, different VariantAnalysis results can be compared if they have the same settings for ElDorado version and organism.
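
The naming rule can be reproduced with a short Python sketch, which may be useful for matching your own variant lists against the output files. Dropping the 'chr' prefix is inferred from the example '1_68266' and is an assumption, as is the function name assign_variant_ids.

```python
from collections import defaultdict


def assign_variant_ids(variants):
    """Build ids like '1_68266' from (chromosome, position) pairs, in input order."""
    seen = defaultdict(int)
    ids = []
    for chrom, pos in variants:
        # Assumption: the identifier uses the chromosome name without 'chr'.
        name = chrom[3:] if chrom.startswith("chr") else chrom
        base = f"{name}_{pos}"
        seen[base] += 1
        # The first occurrence keeps the plain id; repeats get _2, _3, ...
        ids.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return ids
```

Since the suffix depends on the order of entries in the input, two results are only comparable if duplicate positions appear in the same order in both inputs.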

Amino acids are denoted using the 1-letter code. For stop codons, a '*' is used.

The result file for the amino acid change analysis shows two columns, 'HGVS cDNA' and 'HGVS protein'. They follow the Human Genome Variation Society (HGVS) recommendations for the description of DNA sequence variants (version 2.0) on cDNA level and on protein level, respectively:

For details on the HGVS nomenclature, see the HGVS website or the paper describing the HGVS nomenclature.

1. Analysis Parameters

A listing of all parameters, in particular:

2. General statistics

General statistics table

3. Genomic classification

This section is available only if the Classification of variants option was selected from the Analysis Options parameter.

Genomic classification table

The table lists for each of the nine categories the number of variants in absolute and percent (relative to the total number of variants) values. A variant can belong to several categories, so the numbers usually do not sum up to the total number of variants. See also the section Genomic classification.

Result file

You may save the result of the genomic classification by clicking the "Download file with classification details" button.
The result file has the following columns (left to right):
  1. Variant Id: Unique identifier for the variant, consisting of chromosome and position, separated by an underscore; in case of dbSNP id input, this is the dbSNP id; use this value to cross-reference between the different output files.
  2. dbSNP Id: If a dbSNP annotation is available, its id is denoted here ('rs' followed by digits).
  3. Chromosome: 'chr' followed by digits or characters, e.g. 'chr1', 'chrX', 'chrMT'
  4. Type: For each alternative allele, the type of variant of the allele ('SNP', 'deletion' or 'insertion'), the allele position (which can slightly differ from the variant position), optionally followed by the zygosity of the allele, followed by either the SNP alleles or the InDel sequence, e.g. 'SNP [768161] (het) A/C', 'deletion [62255] CA'
  5. Position: position of the variant on the chromosome
  6. Read depth: coverage at the position of the variant
  7. Quality: quality score for the variant call, as given in the input VCF file
  8. Annotation: genomic classification, one of "intergenic", "promoter", "3'UTR", "5'UTR", "CDS", "intron", "intron (canonical splicing)", "exon (no ORF)", "microRNA"
  9. Splice site distance: distance in basepairs from the variant position to the nearest splice site (if there are several alleles for the variant, the minimum of the distances of the alleles is used); in this context, a splice site is defined as the first or last position of an exon; if the splice site is in 3' direction of the variant, the splice site distance is given as a positive number, if the splice site is in 5' direction, the distance is given as a negative number; the column value is only set if 'Annotation' is "3'UTR", "5'UTR", "CDS", "exon (no ORF)", "intron" or "intron (canonical splicing)", otherwise the column is empty
  10. Canonical splicing affected: '1' if the variant resides on (or, in case of a deletion, affects) the leading or trailing two nucleotides of an intron containing the pattern identifying canonical splicing, '0' otherwise; in other words, the value in this column is '1' if 'Annotation' is 'intron (canonical splicing)' and 'Splice site distance' is '1', '-1', '2', '-2' or '0' (in case of a deletion, and the splice site is within the deleted region, i.e. has a distance of 0 to the splice site)
  11. Element Id: not available for intergenic variants; Genomatix promoter id ('GXP_' followed by digits) if 'Annotation' is 'promoter', microRNA id if 'Annotation' is 'microRNA', otherwise Genomatix transcript id ('GXT_' followed by digits)
  12. Accession number: only available if 'Element Id' is a transcript id, in this case the corresponding cDNA accession number
  13. Locus Id: Genomatix locus id ('GXL_' followed by digits)
  14. Symbol: Gene Symbol (this can be several gene symbols, separated by '/')
  15. Gene Id: NCBI Entrez Gene (this can be several gene ids, separated by '/')
The information in this file is given in one line per genomic element, transcript or promoter. This means that the column 'Element Id' contains only one transcript or promoter id (if available, i.e. for non-intergenic variants).
Example output file: Example classification output file
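
The relation between the 'Annotation', 'Splice site distance' and 'Canonical splicing affected' columns described above can be written as a small Python check, for example to verify rows after filtering the downloaded file. This sketch only restates the rule given for column 10; the function name is hypothetical.

```python
def canonical_splicing_affected(annotation, splice_site_distance):
    """Return 1 if the row meets the column 10 rule described above, else 0."""
    if annotation != "intron (canonical splicing)":
        return 0
    if splice_site_distance in ("", None):
        return 0  # the distance column is empty for some annotations
    # Distances of 1 or 2 in either direction, or 0 for a deletion spanning the site.
    return 1 if int(splice_site_distance) in {-2, -1, 0, 1, 2} else 0
```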

4. Amino acid change analysis

This section is available only if the Amino Acid changes for exonic variants option was selected from the Analysis Options parameter.

Amino acid change table

Result file

You may save the result of the amino acid change analysis by clicking the "Download file with amino acid change details" button.
The result file has the following columns (left to right):
  1. Variant Id: Unique identifier for the variant, consisting of chromosome and position, separated by an underscore; in case of dbSNP id input, this is the dbSNP id; use this value to cross-reference between the different output files.
  2. dbSNP Id: If a dbSNP annotation is available, its id is denoted here ('rs' followed by digits).
  3. Transcript Id: Genomatix transcript id ('GXT_' followed by digits) of the transcript affected by the variant or its alternative allele
  4. Transcript source: original source for the transcript, e.g. 'NCBI GenBank', 'ensemble'
  5. Accession number: cDNA accession number corresponding to the transcript
  6. Symbol: Gene symbol for the gene (not available for all transcripts, i.e. the field can be empty; on the other hand, there can be several gene symbols, separated by '/')
  7. Strand: Strand ('+' or '-') of the transcript
  8. Contig position: position of the allele, refers to the first base in the 'Ref allele' column
  9. Ref allele: sequence of the reference allele (for insertions, this is the anchor base) (inverted, if 'Strand' is '-')
  10. Alt allele: sequence of the alternative allele (for deletions, this column is empty) (inverted, if 'Strand' is '-')
    Note: If an insertion has an effect on a transcript on the antisense strand, the alternative allele is inverted, including the anchor base. This means that in this case the (inverted) anchor nucleotide is at the end of the shown sequence.
  11. CDS position: the contig position transformed to the corresponding position within the CDS (empty, if 'Category' is 'splice-site')
  12. Protein position: position within the amino acid sequence, where the effect caused by the variant occurs (empty, if 'Category' is 'splice-site')
  13. Ref amino acid: (1-letter) amino acid sequence of the reference allele ('*' for stop codons; empty, if 'Category' is 'splice-site')
  14. Alt amino acid: (1-letter) amino acid sequence of the alternative allele ('*' for stop codons; empty, if 'Category' is 'splice-site')
  15. HGVS cDNA: see the Notes on the nomenclature
  16. HGVS protein: see the Notes on the nomenclature
  17. Category: one of the categories 'initiating', 'missense', 'nonsense', 'synonymous', 'read-through', 'deletion', 'insertion', 'frameshift', 'splice-site'
  18. Amino acid sequence: contains the complete amino acid sequence for this transcript after the effect caused by the variant (or its alternative allele) has been applied.
    This column is only present if the "include complete amino acid sequence" option has been selected on the parameter page.
This file contains one line for each effect on a coding sequence (or on an intron, if the category is 'splice-site'), i.e. for each allele/transcript combination.
Example output file: Example amino acid change analysis output file

5. TF binding site analysis

This section is available only if the Analysis of TF binding sites changes option was selected from the Analysis Options parameter.

TF binding site table

Result file

You may save the result of the TF binding site analysis by clicking the "Download file with TF analysis details" button.
The result file has the following columns (left to right):
  1. Variant Id: Unique identifier for the variant, consisting of chromosome and position, separated by an underscore; in case of dbSNP id input, this is the dbSNP id; use this value to cross-reference between the different output files.
  2. dbSNP Id: If a dbSNP annotation is available, its id is denoted here ('rs' followed by digits).
  3. Chromosome: 'chr' followed by digits or characters, e.g. 'chr1', 'chrX', 'chrMT'
  4. Annotation: genomic classification, blank-separated list of the strings 'intergenic', 'promoter', 'intron', 'exon'
  5. Ref allele: sequence of the reference allele (for insertions, this is the anchor base)
  6. Alt allele: sequence of the alternative allele (for deletions, this column is empty)
  7. Matrix: name of the matching TF binding site matrix, e.g. 'V$PSE.02'
  8. Matrix family: name of the matrix family of 'Matrix', e.g. 'V$SNAP'
  9. Start: start position of the matrix match (on chromosome)
  10. End: end position of the matrix match (on chromosome)
  11. Strand: strand of the matrix match
  12. Core sim.: core similarity of the matrix match
  13. Matrix sim.: matrix similarity of the matrix match
  14. Change: '1' if the matrix match was found after replacing the 'Ref allele' by the 'Alt allele', or '0' if the match was lost; the value in this column defines how to interpret columns 7 through 13: for new sites (Change = 1), these columns refer to the newly found matrix match, for lost sites (Change = 0), they refer to the lost site (which was present before 'Ref allele' was replaced by 'Alt allele')
Example output file: Example TF binding site analysis output file

6. Save BED file to ProjectManagement

Saving the BED file to the ProjectManagement makes the variant data available e.g. for the Genome Browser.

7. Download of Data Files

Here you can download a tarball containing all result files, in particular: