
Genomatix: Genomatix small RNA library



Source of small RNAs

The Genomatix small RNA library contains more than 250,000 sequences of noncoding RNAs collected from several public databases. The following table lists these sources and the database version used in each ElDorado release:

Code in library  Database name  Version in ElDorado release                                                   Link
                                08-2011      12-2012      12-2013      06-2015      02-2016      12-2016
MRBn             miRBase        V17          V19          V20          V21          V21          V21          www.mirbase.org
RFAM             Rfam           10           11           11           11           11           11           rfam.sanger.ac.uk
TRNA             GtRNAdb        2010         2011         2011         2011         2011         2011         gtrnadb.ucsc.edu
LBME             snoRNAbase     2            3            3            3            3            3            www-snorna.biotoul.fr
JSMP             RNAdb          1.0          2.0 (final)  2.0 (final)  2.0 (final)  2.0 (final)  2.0 (final)  jsm-research.imb.uq.edu.au/rnadb

Duplicate entries were removed, and the sequences were stratified into 10 classes of noncoding RNA according to the Sequence Ontology project (sequenceontology.org). The following table shows the sequence count for each class:

ncRNA sequence type  Number of sequences in ElDorado release
                     08-2011  12-2012  12-2013  06-2015  02-2016  12-2016
mature microRNA       10,668    8,385   10,729   11,716   11,716   11,716
hairpin microRNA       9,057    9,776   10,873   11,695   11,695   11,695
piRNA                176,194  166,829  166,829  166,829  166,829  166,829
rRNA                  83,667    8,572    8,572    8,572    8,572    8,572
snoRNA                27,612   26,851   26,851   26,851   26,851   26,851
snRNA                  9,830    9,086    9,086    9,086    9,086    9,086
transposon             1,555    1,676    1,676    1,676    1,676    1,676
tRNA                  95,650   14,900   14,900   14,900   14,900   14,900
viral RNA              1,888    1,715    1,715    1,715    1,715    1,715

Background

Non-coding RNAs are often very short. MicroRNAs, for instance, are about 21 to 23 nucleotides (nt) long. Hence, reads longer than the RNA itself contain linker (adaptor) sequence at one or both ends. Such linker sequences are usually trimmed from the reads before mapping; in this trimming step, the linker is aligned to the reads to locate it. Here is how The Scripps Research Institute describes the trimming process:

The trimming process is not trivial because variability in the length of the small RNA fragments native in the cellular RNA pool results in sequence reads with the adaptor sequences starting at various positions that make a simple trimming procedure impossible.
Scripps NGS core [http://www.scripps.edu/researchservices/ngscore/analysis.html]

Read trimming complicates automated data analysis in several ways. Trimming requires additional information from the user (e.g., the adaptor sequence) and may not always recover the correct sequence. It creates a second read file, which increases disk space usage. Some mappers cannot handle trimmed reads of different lengths in one run.

To circumvent these problems, we developed a mapping procedure that aligns untrimmed reads directly to the reference. Our approach aligns reads to a non-coding RNA reference while allowing read nucleotides to extend beyond the 5' and 3' ends of the reference. To allow this "linker tolerant" alignment, linkers of N's are added to both ends of the reference sequences. The mapping procedure ignores nucleotides overlapping these N's when calculating the alignment score.

Mapping of long reads against short target sequences. A prefix and a suffix of 200 N's are concatenated to every reference sequence.
When a read is mapped to the reference, nucleotides that overlap the N's are ignored in the scoring.
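The linker-tolerant scoring described above can be illustrated with a minimal Python sketch. It is not the Genomatix mapper: it simply slides the read along one N-padded reference without gaps and skips every position where the reference carries an N. The function names, the score (matches minus mismatches), and the adaptor sequence in the usage lines are assumptions made for illustration only.

```python
PAD = 200  # number of N's added to each end of the reference, as in the text


def pad_reference(seq, pad=PAD):
    """Return the reference flanked by runs of N on both ends."""
    return "N" * pad + seq + "N" * pad


def score_at(read, padded_ref, offset):
    """Count (matches, mismatches) for the read placed at `offset`.

    Read bases that fall on an N of the padded reference are skipped,
    so linker bases hanging over the reference ends are not penalised.
    """
    matches = mismatches = 0
    for i, base in enumerate(read):
        ref_base = padded_ref[offset + i]
        if ref_base == "N":
            continue  # overlaps the artificial linker: ignored in the score
        if base == ref_base:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches


def best_placement(read, reference, pad=PAD):
    """Try every ungapped placement; return (offset, matches, mismatches)
    for the first placement with the highest matches-minus-mismatches."""
    padded = pad_reference(reference, pad)
    best = None
    for offset in range(len(padded) - len(read) + 1):
        m, mm = score_at(read, padded, offset)
        if best is None or (m - mm) > (best[1] - best[2]):
            best = (offset, m, mm)
    return best


# Usage: an untrimmed read = microRNA followed by an illustrative 3' adaptor.
reference = "TTCAAGTAATCCAGGATAGGCT"
read = reference + "TGGAATTCTCGGGTGCCAAGG"
offset, matches, mismatches = best_placement(read, reference)
print(offset - PAD, matches, mismatches)  # position relative to reference start
```

In this sketch the adaptor bases fall on the trailing N's and therefore contribute nothing to the score, so the untrimmed read is placed exactly as the trimmed read would be.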

All sequences in our set of non-coding RNAs are concatenated, with "buffer stretches" of 200 N's between any two reference sequences. This yields one long reference sequence, which is used to generate the library. Any stretch of N's is ignored when building the prefix tree, so every seed is anchored within an original reference sequence. In the subsequent alignment/matching phase, the read may overlap stretches of N's; in that case, mismatches between the read and an N are ignored in the scoring of the alignment. Highly conserved non-coding RNAs are often annotated with different lengths in different organisms, e.g.

hsa-miR-26a: TTCAAGTAATCCAGGATAGGCT
mdo-miR-26: TTCAAGTAATCCAGGATAGGC

This situation results in a multiple hit in the mapping. A subsequent processing step counts the unique and multiple hits and writes 10 files, one for each class of non-coding RNA. In the subsequent comparative microRNA workflow, statistical significance is calculated only after the results have been filtered by taxon.
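The concatenated reference and the resulting multiple hit can be sketched in Python as follows. This is an illustration under stated assumptions, not the Genomatix implementation: a plain k-mer dictionary stands in for the prefix tree, only the seeding step is shown, and the function names, the seed length of 12, and the adaptor sequence are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

BUFFER = "N" * 200  # buffer stretch between reference sequences


def concatenate(references):
    """Join (name, sequence) pairs into one long reference.

    Returns the concatenated text, the start offset of each sequence,
    and the sequence names in the same order.
    """
    text = BUFFER
    starts, names = [], []
    for name, seq in references:
        starts.append(len(text))
        names.append(name)
        text += seq + BUFFER
    return text, starts, names


def build_seed_index(text, k):
    """Index every k-mer that contains no N (a stand-in for the prefix tree),
    so each seed lies entirely within one original reference sequence."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        if "N" not in kmer:
            index[kmer].append(i)
    return index


def references_hit(read, starts, names, index, k):
    """Names of all references sharing at least one exact seed with the read."""
    hit = set()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            # Map the position in the long reference back to its sequence.
            hit.add(names[bisect_right(starts, pos) - 1])
    return sorted(hit)


# Usage with the two annotations shown above and an untrimmed read.
references = [
    ("hsa-miR-26a", "TTCAAGTAATCCAGGATAGGCT"),
    ("mdo-miR-26", "TTCAAGTAATCCAGGATAGGC"),
]
K = 12  # hypothetical seed length
text, starts, names = concatenate(references)
index = build_seed_index(text, K)
read = "TTCAAGTAATCCAGGATAGGCT" + "TGGAATTCTCGGGTGCCAAGG"  # illustrative adaptor
print(references_hit(read, starts, names, index, K))
```

Both reference entries are seeded by the same read, which is the multiple-hit situation described above; filtering by taxon afterwards is what separates them.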

Special attention is paid to the microRNA processing steps. From the miRBase database, short (about 22 nt) sequences of mature, processed microRNAs can be extracted, as well as longer (about 80 nt) pre-miRNA hairpin sequences. During data preparation for the Genomatix non-coding library, each hairpin sequence is mapped not only against the genome of the organism in which it was originally detected, but also against every other genome in the ElDorado database. If no original microRNA hairpin is annotated near the location of such a "trans-mapped" hairpin, the sequence is also annotated for that organism. Hairpin microRNA matches may therefore be found in one organism even though, as their names show, they were originally identified in another.

Example:

The Homo sapiens miR-103b-1 stem-loop (MI0007261; length 63 nt) matches a sequence in the mitochondrial genome of Bos taurus with 100% identity. It will therefore appear in an analysis of hairpin microRNAs from Bos taurus, even though its name, "hsa-miR-103b-1", points to its original identification in Homo sapiens. This "cross-mapping" strategy is applied only to the hairpin sequences of microRNAs.
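The annotation rule for trans-mapped hairpins can be sketched in Python as a simple proximity check. The text does not say how "near" is defined, so the distance threshold below is a hypothetical parameter, as are the type and function names; the hairpin names and coordinates in the usage lines are invented placeholders, not real annotations.

```python
from typing import NamedTuple


class Hairpin(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def is_near(a, b, max_distance):
    """True if two hairpins lie on the same sequence and overlap
    or are separated by at most `max_distance` bases."""
    if a.chrom != b.chrom:
        return False
    gap = max(a.start, b.start) - min(a.end, b.end)  # negative when overlapping
    return gap <= max_distance


def add_trans_mapped(native, trans_hits, max_distance=100):
    """Return the native annotation plus every trans-mapped hairpin
    that has no native hairpin nearby (threshold is an assumption)."""
    annotated = list(native)
    for hit in trans_hits:
        if not any(is_near(hit, n, max_distance) for n in native):
            annotated.append(hit)
    return annotated


# Usage with invented placeholder data.
native = [Hairpin("native-hairpin-1", "chr1", 1000, 1080)]
trans_hits = [
    Hairpin("trans-hairpin-A", "chr1", 1050, 1120),  # overlaps a native hairpin
    Hairpin("trans-hairpin-B", "chr1", 5000, 5070),  # no native hairpin nearby
    Hairpin("trans-hairpin-C", "chrM", 500, 563),    # different sequence
]
annotated = add_trans_mapped(native, trans_hits)
print([h.name for h in annotated])
```

Only the trans-mapped hairpins without a nearby native annotation are added, which is how a hairpin named after one organism can appear in the annotation of another.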