
Mapping NGS Reads: Genomatix Mapper


[Introduction] [Parameters] [Output]

Introduction

The Genomatix Mapper algorithm provides fast mapping of sequence reads from all established NGS systems to genomic or transcriptome target sequences.
The mapping results can then be used for transcript expression analysis, discovery of splicing events, SNP detection, small InDel detection, CNV detection, structural variant detection, gene fusion detection and small RNA detection. The underlying annotation is derived from the Genomatix ElDorado genome annotation system. The algorithm uses a library of the target sequences, which is kept in main memory for fast access.

Mapping of a sequence read to target sequences with the Genomatix Mapper involves two steps. In the first step, seeds for potential mapping positions in the target sequence are identified via a mapping library (see below). In the second step, alignments of the complete sequence read to the previously identified positions in the target sequences are calculated. Results are ranked by their alignment score.

Mapping library

The mapping library is based on shortest unique subsequences (SUS). A SUS for a position p in the target sequence is defined as the shortest downstream word starting at position p which occurs only once in all target sequences. Thus, each position in the target sequences can be uniquely identified by its SUS, and each SUS identifies exactly one position. The figure below shows the coverage of positions in the human genome against the SUS length. Over 80% of all positions in the human genome are covered by a SUS of at most 25 bp.
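The SUS definition can be illustrated with a minimal Python sketch. It is a brute-force illustration for a single short target sequence, not the Genomatix implementation; the function name is hypothetical, and only the forward strand of one sequence is considered.

```python
def shortest_unique_subsequences(target):
    """Map each position p to the shortest word starting at p that
    occurs exactly once in target, or None if no such word exists."""
    n = len(target)
    result = {}
    for p in range(n):
        result[p] = None
        for length in range(1, n - p + 1):
            word = target[p:p + length]
            # Count all (possibly overlapping) occurrences of the word.
            count = sum(
                1 for i in range(n - length + 1)
                if target.startswith(word, i)
            )
            if count == 1:
                result[p] = word
                break
    return result


sus = shortest_unique_subsequences("ACGTACGA")
for position, word in sus.items():
    print(position, word)
```

In this toy sequence the last position has no SUS, because the single base "A" is repeated and no longer downstream word is available.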

Coverage of the human genome sequence against the length of the shortest unique words.

The mapping library contains sequences of at most 25 bp. If a position in the target sequence is defined by a SUS which is longer than 25 bp, only the first 25 bp of the SUS are stored in the library. If these (non-unique) 25 bp occur fewer than 50 times in the target sequences, all of their positions are stored. If they occur more than 50 times, the word is stored without position coordinates but with an ambiguous flag.
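The storage rule can be sketched as follows. This is a simplified illustration, not the actual library format: the function names and the dictionary layout are hypothetical, the limits are parameters so that a toy example stays readable, and a word occurring exactly max_positions times is assumed to keep its positions, since the text does not specify this case.

```python
def count_occurrences(target, word):
    """Count all (possibly overlapping) occurrences of word in target."""
    return sum(
        1 for i in range(len(target) - len(word) + 1)
        if target.startswith(word, i)
    )


def build_library(target, max_word=25, max_positions=50):
    """Build a toy word -> {'positions', 'ambiguous'} library."""
    library = {}
    for p in range(len(target)):
        sus = None
        # Look for a unique word of at most max_word bases starting at p.
        for length in range(1, min(max_word, len(target) - p) + 1):
            word = target[p:p + length]
            if count_occurrences(target, word) == 1:
                sus = word
                break
        if sus is not None:
            library[sus] = {"positions": [p], "ambiguous": False}
            continue
        # SUS is longer than max_word: store only its first max_word bases.
        prefix = target[p:p + max_word]
        if len(prefix) < max_word:
            continue  # too close to the sequence end for a full word
        entry = library.setdefault(prefix, {"positions": [], "ambiguous": False})
        entry["positions"].append(p)
    # Words that occur too often lose their coordinates and get flagged.
    for entry in library.values():
        if len(entry["positions"]) > max_positions:
            entry["positions"] = []
            entry["ambiguous"] = True
    return library


toy = build_library("ACACACAGT", max_word=2, max_positions=2)
print(toy["AC"])  # repeated word, flagged as ambiguous
print(toy["AG"])  # unique word with its single position
```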

Seed search

Seeds for alignments are found as follows: The sequence read is scanned for SUS sequences via the library. Two search modes are available and can be selected by the user:

  1. fast search
  2. deep search

The fast mode searches for exact hits of SUS in the sequence read, while the deep mode searches for SUS hits with a tolerance of up to one error. The kind of error tolerated depends on the chosen alignment mode (see below). If the alignment is done without insertions/deletions, seeds are searched which fit the sequence read with up to one point mutation. If the mapping allows insertions/deletions, seeds which fit with up to one insertion or one deletion are searched in addition.
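The difference between the two search modes can be sketched in Python. The sketch assumes a hypothetical library given as a plain dictionary from word to target positions and covers only the point mutation case of the deep mode; seeds with one insertion or deletion are not shown.

```python
def one_mismatch_variants(word, alphabet="ACGT"):
    """Yield every word that differs from word by exactly one base."""
    for i, base in enumerate(word):
        for alt in alphabet:
            if alt != base:
                yield word[:i] + alt + word[i + 1:]


def find_seeds(read, library, deep=False):
    """Return sorted (read_offset, target_position, mismatches) seeds."""
    word_lengths = sorted({len(word) for word in library})
    seeds = set()
    for offset in range(len(read)):
        for length in word_lengths:
            window = read[offset:offset + length]
            if len(window) < length:
                break  # longer words do not fit at this offset either
            # Fast mode: exact hit of a library word in the read.
            for pos in library.get(window, []):
                seeds.add((offset, pos, 0))
            if deep:
                # Deep mode: also accept library words one substitution away.
                for variant in one_mismatch_variants(window):
                    for pos in library.get(variant, []):
                        seeds.add((offset, pos, 1))
    return sorted(seeds)


library = {"ACGT": [0], "GGA": [10]}
print(find_seeds("TTACGT", library))             # exact seed
print(find_seeds("TTACCT", library))             # no seed in fast mode
print(find_seeds("TTACCT", library, deep=True))  # seed with one mismatch
```

Enumerating all one-mismatch variants of each window keeps the lookup a plain dictionary access, at the price of roughly three lookups per base of the window.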

Alignment

If the user wants to map sequence reads with insertions and deletions, the alignment of the complete sequence read to the target sequence is computed using the Needleman-Wunsch algorithm. Otherwise the alignment is based on a pairwise comparison of nucleotides without consideration of insertions or deletions.
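The two alignment variants can be contrasted with a short Python sketch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the scoring scheme actually used by the Genomatix Mapper is not described here.

```python
def ungapped_score(read, target_window, match=1, mismatch=-1):
    """Score a position-by-position comparison without insertions/deletions."""
    if len(read) != len(target_window):
        raise ValueError("ungapped comparison needs equal lengths")
    return sum(match if a == b else mismatch
               for a, b in zip(read, target_window))


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b, computed row by row."""
    # prev[j] holds the best score of a[:i-1] aligned against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,           # align a[i-1] with b[j-1]
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]


print(ungapped_score("ACGT", "ACGA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Only the score is computed; recovering the alignment itself would additionally require storing the full matrix and tracing back from the last cell.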

Mapping of bisulfite sequencing data

Bisulfite sequencing uses bisulfite treatment to determine the methylation pattern of the DNA. Bisulfite treatment converts unmethylated C to T, while methylated C remains C. The challenge in mapping bisulfite-treated sequences is the ambiguity of "T" in a sequence read, which can originate from either a "C" or a "T" in the target sequence. Bisulfite data are mapped via an adaptation of the seed search and the alignment strategy that takes the ambiguity of "T" into account. All libraries remain unchanged.
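The ambiguity of "T" can be illustrated with a small Python sketch of a bisulfite-aware comparison. The function names are hypothetical, and only the C to T conversion on the strand shown is modeled; reads from the opposite strand would need the complementary rule.

```python
def bisulfite_mismatches(read, ref_window):
    """Count mismatches, accepting a read T opposite a reference C or T."""
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have equal length")
    return sum(
        1 for r, g in zip(read, ref_window)
        if not (r == g or (r == "T" and g == "C"))
    )


def methylation_calls(read, ref_window, ref_start=0):
    """Map each covered reference C to True (methylated) or False."""
    calls = {}
    for offset, (r, g) in enumerate(zip(read, ref_window)):
        if g == "C" and r in "CT":
            # An unconverted C indicates methylation, a T indicates none.
            calls[ref_start + offset] = (r == "C")
    return calls


read, reference = "TTGTCG", "TCGTCG"
print(bisulfite_mismatches(read, reference))
print(methylation_calls(read, reference))
```

A plain comparison would count the converted base at position 1 as a mismatch, whereas the bisulfite-aware comparison accepts it and reports the position as unmethylated.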

Mapping of paired sequencing data

For paired-end or mate-pair sequencing data, the mapper first estimates the expected distance range and strand orientation of the pairs by mapping a sample of up to 1 million sequence reads. Based on these estimates, pairs are then mapped as follows: If the reads of a pair can be mapped to multiple positions on the target sequence, all resulting combinations of read positions are checked against the distance and strand orientation criteria. If one or more combinations fulfill the criteria, the one with the best alignment quality of its two reads is chosen as the final mapping result for this pair. Otherwise, the combination with the overall best alignment quality is chosen. Distances given in the resulting BAM file are the outer distances of mapped mate pairs (i.e. from the 5'-end of the first mate to the 3'-end of the second mate).
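The selection among multiple candidate positions can be sketched in Python. The Hit class, the use of the summed alignment scores as the quality measure, and the representation of the orientation as a pair of strand symbols are assumptions made for illustration; the distance range is taken as given rather than estimated from a sample.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hit:
    start: int   # leftmost mapped position
    end: int     # position after the last mapped base
    strand: str  # "+" or "-"
    score: int   # alignment score, higher is better


def outer_distance(a, b):
    """Distance spanned by both mates, measured from outer end to outer end."""
    return max(a.end, b.end) - min(a.start, b.start)


def choose_pair(hits1, hits2, min_dist, max_dist, orientation=("+", "-")):
    """Pick the best pair, preferring pairs that meet the pair criteria."""
    pairs = list(product(hits1, hits2))
    if not pairs:
        return None

    def concordant(pair):
        a, b = pair
        return ((a.strand, b.strand) == orientation
                and min_dist <= outer_distance(a, b) <= max_dist)

    # Fall back to all combinations if none fulfills the criteria.
    pool = [pair for pair in pairs if concordant(pair)] or pairs
    return max(pool, key=lambda pair: pair[0].score + pair[1].score)


mate1 = [Hit(100, 150, "+", 50), Hit(5000, 5050, "+", 48)]
mate2 = [Hit(5200, 5250, "-", 49), Hit(90000, 90050, "-", 50)]
print(choose_pair(mate1, mate2, min_dist=200, max_dist=500))
```

In the example the pair at positions 5000 and 5200 is chosen although both mates have higher-scoring hits elsewhere, because it is the only combination within the expected distance range.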

Mapping of small RNA sequencing data

If small RNA sequencing data are mapped to the small RNA library, no linker removal is needed. This is possible because the small RNA library is constructed in such a way that only the overlap of a sequence read with the small RNA is taken into consideration.

De novo detection of splice junctions via spliced alignment

Two approaches for spliced alignments are available: a local approach and a global approach.

Local spliced alignment is based on the Genomatix ExonMapper algorithm. This algorithm discovers exon-intron structures, including splice sites, in a window of 1 million base pairs by mapping RNA sequences to genomic sequences. To detect anchors for a proper discovery of exon-intron structures, a significant overlap of a sequence read with two neighboring exons is required. Thus the results improve with the length of the sequence reads. The method is less sensitive than mapping against the splice junction library, but it can annotate splicing events which are not yet known from the available annotations. Local spliced alignment can discover an arbitrary number of splice events in a sequence. A typical application is therefore the mapping of transcripts of any length to the genome.

Global spliced alignment searches for splice events in a genome-wide manner. Exons are identified via unique seed sequences from the mapping library. In contrast to the local spliced alignment, every exon must contain at least one unique seed sequence from the mapping library. Global spliced alignment is therefore not as sensitive as the local spliced alignment or mapping against the splice junction library, but it identifies genome-wide spliced alignments. A typical application of global spliced alignment is the detection of gene fusions or structural variants.

De novo splice junction detection is done in two steps: read mapping, followed by a run of spliced alignment which can identify previously unknown splice junctions.


Parameters

Please see the parameter section of the Mapping help page for details on the options for starting the Genomatix Mapper.

Output

Generally, a successful mapping run creates a BAM file, which contains either the uniquely mapped hits (parameter 'report unique') or all input reads, regardless of their mapping status (parameter 'report all').
Additional files might be created during the mapping process, depending on parameter settings: