
BAM File Toolbox



Introduction

The BAM file toolbox provides a number of tools that are often needed when handling BAM files. The following actions can be performed:
  1. BAM file statistics
  2. Filtering BAM files by quality
  3. Pileup removal
  4. Merging/concatenating BAM files
  5. Conversion from BAM to BED/bigBed

General Parameters for the BAM file toolbox

Input
Input data (the input regions) are accepted in BAM file format. Within this section you can either
  • choose from previously uploaded BED/BAM files
  • or add a new BAM file to the list (by clicking "Add BAM file...")

When adding a new file, a new window will open, asking you to either

  • upload one or several BAM files from your local computer
  • or import one or several BAM files from the GMS (see more details)
  • or import one or several BAM files from the GGA (see more details)
Note that files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a file.

Available Tasks for BAM files



BAM file statistics

After selecting one or several BAM files, the BAM files are analyzed and all reads are checked. The following analyses will be performed:

Please note that for large BAM files or several BAM files the email option is recommended.

The result will be available from the Project Management, when clicking on the BAM file. The following data will appear in the output:

The data in the tabs (e.g. "Chromosome Distribution") is displayed graphically. It can be downloaded either in various graphic formats (PNG, JPEG, PDF, SVG) or as a tab-separated text file.

Tab: Chromosome Distribution

This graphic shows the relative distribution of the reads on the chromosomes.

Tab: Pileup Distribution

Pileups of sequence reads are clusters of reads in which the exact same sequence is mapped to the same position of the reference. These reads might be artifacts of experimental flaws, such as PCR duplicates, and it is generally recommended to discard them. The distribution gives you an idea of how many pileups have been found and how many sequences they contain. This can help to determine the maximum number of sequences that are allowed at the exact same position without being discarded as a pileup. A widely used threshold is the 0.95 quantile, which is given on the upper right. Any pileups containing more sequences than this number should be discarded.
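
As a rough illustration of how such a threshold could be derived, the following Python sketch groups reads by position and sequence, counts the size of each group, and takes the 0.95 quantile of the group sizes. The tuple layout, the function names, the nearest-rank quantile method, and the choice to include positions holding a single read are all assumptions made for this example; the toolbox may compute the value differently.

```python
import math
from collections import Counter

def pileup_sizes(reads):
    """Count how many reads share the same (chrom, pos, strand, sequence)."""
    groups = Counter(reads)
    return sorted(groups.values())

def quantile_nearest_rank(sorted_values, q):
    """Nearest-rank quantile of an ascending list (assumed method)."""
    if not sorted_values:
        raise ValueError("no values given")
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]

# Hypothetical reads as (chrom, pos, strand, sequence) tuples.
reads = (
    [("chr1", 100, "+", "ACGT")] * 5
    + [("chr1", 200, "+", "TTGA")] * 2
    + [("chr2", 50, "-", "GGCA")]
)

sizes = pileup_sizes(reads)                     # group sizes, ascending
threshold = quantile_nearest_rank(sizes, 0.95)  # candidate maximum pileup size
print("pileup sizes:", sizes, "0.95 quantile:", threshold)
```

With real data the list of group sizes is much longer, so the quantile is far less sensitive to a single large pileup than in this three-group toy input.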

Tab: Mapping Quality

This image shows the mapping quality distribution of mapped reads. The y-axis denotes the proportion of reads, the x-axis the quality of the reads.
Mapping quality scores are a measure of the confidence that the read is correctly placed. For example, at a mapping quality of 20, there is a 1 in 100 chance that the read truly originated elsewhere. A value of 255 indicates that the mapping quality is not available.
Please note that different mappers/aligners use different methods and scores for assigning mapping quality, so files from different aligners cannot easily be compared based on the score.
The distribution itself can be helpful, though, as you can get an idea of the proportion of high quality reads within your file.
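
The relation between a mapping quality and the chance of a misplaced read follows the Phred scale used in the SAM/BAM format, Q = -10 * log10(p). The short Python sketch below shows the conversion in both directions; the function names are chosen for this example only, and, as noted above, individual aligners may assign scores that deviate from this idealized relation.

```python
import math

def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled mapping quality to the probability of a wrong placement."""
    if mapq == 255:
        return None  # 255 means the mapping quality is not available
    return 10 ** (-mapq / 10)

def error_probability_to_mapq(p):
    """Convert a misplacement probability back to a rounded mapping quality."""
    return round(-10 * math.log10(p))

for q in (10, 20, 60, 255):
    print(q, mapq_to_error_probability(q))
```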

Tab: For Paired End Data: Insert Size

The distribution of the distances found between the mates of read pairs. The table on the top right lists some statistical values:

If the mean of the distribution shown differs considerably from the insert size aimed for in your experiment, this can be an indication that something went wrong in the wet-lab experiment.

Tab: For Paired End Data: Strand orientation

A pie chart of the strand orientation of the read pairs: The strand orientation should also reflect what you expect from your wet-lab experiment. If you expect reads to be on the same strand and the majority shows up on different strands here, something might be wrong.

Tab: Optionally: Read Annotation

The distribution of reads according to their genomic annotation (exon, intron, promoter, intergenic region) is displayed graphically. See an example for the graphical representation (Enrichment and General).

Tab: BAM file / Download

The first 100 lines of the BAM file are displayed for control purposes. At the bottom of the page, there is a link to download the complete BAM file to the local computer.


Filtering BAM files by quality

After selecting one or several BAM files, the BAM files are filtered by the given minimum mapping quality (mapQ value in the BAM file).
A reduced BAM file containing only those reads that have the minimum mapping quality will be automatically saved into the result management. The filename will have an extension showing the filter level (e.g. "test_qual_20.bam" for an input file called "test.bam" if the min. quality was set to 20). A click on the filename will display more information on the resulting BAM file, including a download option.

Example output in interactive mode:
Working on test.bam...

   3991384 total regions (3991384 valid regions) in input file
   3464887 regions after filtering to min. mapping quality of 20
    526497 regions were removed
The BAM file was saved as test_qual_20.bam

Mapping quality scores quantify the probability that a read is misplaced and were introduced by Heng Li and Richard Durbin in 2008. Mapping quality is related to uniqueness: the greater the difference in quality between the best alignment hit and the second-best alignment hit, the more unique the best alignment, and the higher its mapping quality should be.
The mapping quality should usually be between 0 and 60. For example, a mapping quality of 10 or less indicates that there is at least a 1 in 10 chance that the read truly originated elsewhere. A value of 255 indicates that the mapping quality is not available.
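
The filtering step can be pictured with the following Python sketch, which works on SAM text records (the plain-text equivalent of BAM) so that it needs no external library. It is not the toolbox's implementation: the function names and the sample records are hypothetical, and only the output naming pattern is taken from the "test_qual_20.bam" example above.

```python
def filtered_name(input_name, min_mapq):
    """Build the output name, e.g. test.bam -> test_qual_20.bam."""
    stem = input_name[:-4] if input_name.endswith(".bam") else input_name
    return f"{stem}_qual_{min_mapq}.bam"

def filter_by_mapq(sam_lines, min_mapq):
    """Return (kept_lines, total_alignments, kept_alignments)."""
    kept_lines, total, kept = [], 0, 0
    for line in sam_lines:
        if line.startswith("@"):         # header lines are always kept
            kept_lines.append(line)
            continue
        total += 1
        mapq = int(line.split("\t")[4])  # MAPQ is the 5th SAM column
        # Assumption: 255 ("not available") is compared like any other number.
        if mapq >= min_mapq:
            kept_lines.append(line)
            kept += 1
    return kept_lines, total, kept

# Hypothetical SAM records: one header line and three alignments.
sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tTTGA\tIIII",
    "r3\t16\tchr1\t300\t20\t4M\t*\t0\t0\tGGCA\tIIII",
]

kept_lines, total, kept = filter_by_mapq(sam_lines, 20)
print(f"{total} regions in input, {kept} after filtering, {total - kept} removed")
print("saved as", filtered_name("test.bam", 20))
```

A read passes when its mapping quality is greater than or equal to the minimum, which matches the description of a "minimum mapping quality"; how the toolbox treats the special value 255 is not stated here.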


Please note that for large BAM files or several BAM files the email option is recommended.
Optionally, the BAM file statistics task can automatically be run on the resulting filtered file to obtain the corresponding graphics.


Pileup removal

This option removes all read duplicates from the selected BAM file(s) (using samtools rmdup). The resulting BAM file(s) are automatically saved to the result management with the suffix "_rmdup.bam".

Example output in interactive mode:
removing duplicates from BAM file SRR504515_NMNAT1.bam...
4807 mapped regions before pileup removal
2765 reads in file after pileup removal
The BAM file was saved as SRR504515_NMNAT1_rmdup.bam

Please note that for large BAM files or several BAM files the email option is recommended.
Optionally, the BAM file statistics task can automatically be run on the resulting file to obtain the corresponding graphics.


Merging BAM files

If two or more BAM files are selected, this option merges the input files into one BAM file, which is saved to the user's project management. A result name can be supplied; if a BAM file with that name already exists, the name gets a suffix (e.g. if "new.bam" exists, "new_1.bam" is created).
Optionally, the BAM file statistics task can automatically be run on the resulting file to obtain the corresponding graphics.
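
The naming rule for merged results can be sketched in Python as follows. Only the "new.bam" to "new_1.bam" case is documented above; the function name and the behavior when "new_1.bam" is also taken (counting up to "_2", "_3", and so on) are assumptions made for this example.

```python
def unique_result_name(requested, existing):
    """Return the requested name, or add _1, _2, ... before the extension until unused."""
    if requested not in existing:
        return requested
    stem, dot, ext = requested.rpartition(".")
    if not dot:                      # name without an extension
        stem = requested
    n = 1
    while True:
        candidate = f"{stem}_{n}.{ext}" if dot else f"{stem}_{n}"
        if candidate not in existing:
            return candidate
        n += 1                       # assumed: keep counting until a free name is found

print(unique_result_name("new.bam", {"new.bam"}))
```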


Conversion from BAM to BED/bigBed format

Upload a BAM file or select an uploaded BAM file to convert it to BED or bigBed format.
The conversion takes care of spliced alignments: Alignments containing skipped regions ('N' in the CIGAR string) result in several data records in the BED file. The 'QNAME' is used as the value for the 'id' (aka 'name') column in the BED file. The mapping quality ('MAPQ') given in the BAM file is used as the 'score' value in the BED file.

Example: The alignment

QNAME   FLAG  RNAME  POS  MAPQ  CIGAR    
ID_XYZ  16    chr10  63   0     19M93N31M
is transformed into two BED file records:
chrom  start  end  id      score  strand
chr10  62     81   ID_XYZ  0      -
chr10  174    205  ID_XYZ  0      -
Note: The strand is taken 'as is' from the BAM file. If a strand-specific sequencing protocol was applied, strand-specificity for one partner of paired-end reads might be lost.
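
The coordinate arithmetic behind the example can be reproduced with the Python sketch below. It is an illustration under stated assumptions, not the converter used by the toolbox: the function name is hypothetical, the strand is read from FLAG bit 16 (reverse strand), and deletions ('D') are assumed to stay inside a block so that only 'N' operations start a new BED record.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def bam_record_to_bed(qname, flag, rname, pos, mapq, cigar):
    """Split one alignment into BED records at every skipped region ('N')."""
    strand = "-" if flag & 16 else "+"   # FLAG bit 16: read maps to the reverse strand
    block_start = pos - 1                # BAM POS is 1-based, BED start is 0-based
    end = block_start
    records = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":                 # consumes reference, stays in the current block
            end += length
        elif op == "N":                  # skipped region: close the block, jump ahead
            if end > block_start:
                records.append((rname, block_start, end, qname, mapq, strand))
            block_start = end + length
            end = block_start
        # I, S, H and P do not consume reference positions
    if end > block_start:
        records.append((rname, block_start, end, qname, mapq, strand))
    return records

for record in bam_record_to_bed("ID_XYZ", 16, "chr10", 63, 0, "19M93N31M"):
    print(*record, sep="\t")
```

For the alignment shown above, the first block starts at 63 - 1 = 62 and ends at 62 + 19 = 81; the skipped region moves the next start to 81 + 93 = 174, and the second block ends at 174 + 31 = 205.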