NGSAnalyzer is a program to analyze short sequence reads from next generation sequencing experiments.
Several RegionMiner tasks (Complete ChIPSeq Workflow, Expression Analysis for RNASeq Data, Clustering NGS Data) use NGSAnalyzer for clustering and/or expression analysis.
For the analysis, NGSAnalyzer requires the genomic position of the individual reads. For both types of experiments, the sequence reads are first classified by their association with features from the ElDorado genome annotation (exon, intron, partial, promoter, intergenic region). The results of the classification and the distribution of the reads across the individual chromosomes are summarized in a statistical overview.
In the second step, the reads are analyzed for local enrichments (clusters) representing genomic regions bound by protein (ChIP-Seq) or being expressed (RNA-Seq). The threshold applied by the clustering algorithm reflects the density of the data set, assuming a Poisson distribution. For ChIP-Seq experiments, data from an additional "input-control" experiment can be provided. Unspecific enrichments detected in the "input-control" data are then subtracted from the ChIP experiment. The clusters are classified and summarized in a statistical overview.
In a third step, a normalized expression/enrichment value (NE-value) is calculated for each transcript/cluster. For RNA-Seq, NE-values are provided for the whole transcript and for the most and least expressed exon of the transcript. The results are again summarized in a statistical overview.
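$$\mathrm{NE} = c \cdot \frac{\#\mathrm{reads}_{\mathrm{region}}}{\#\mathrm{reads}_{\mathrm{mapped}} \cdot \mathrm{length}_{\mathrm{region}}}$$
where NE is the normalized expression or enrichment value, #reads_region is the number of reads in the region, #reads_mapped is the number of mapped reads, length_region is the length of the region, and c is a constant.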
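As a minimal sketch, the NE formula can be evaluated as shown below. The function name `normalized_value`, its parameter names, and the example numbers are assumptions made for illustration. The value of the constant c used by NGSAnalyzer is not stated here, so it is left as a parameter.

```python
def normalized_value(reads_in_region, mapped_reads, region_length, c=1.0):
    """Evaluate NE = c * reads_in_region / (mapped_reads * region_length).

    c defaults to 1.0 purely as a placeholder (assumption); the constant
    actually used by NGSAnalyzer is not given in this document.
    """
    if mapped_reads <= 0 or region_length <= 0:
        raise ValueError("mapped_reads and region_length must be positive")
    return c * reads_in_region / (mapped_reads * region_length)


# Hypothetical numbers: 50 reads in a 500 bp region, 1,000,000 mapped reads,
# and an arbitrarily chosen scaling constant of 1e7.
example = normalized_value(50, 1_000_000, 500, c=1e7)
print(example)
```

Dividing by both the number of mapped reads and the region length is what makes the values comparable between data sets of different sequencing depth and between regions of different size.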
Part of the NGSAnalyzer functionality is described in:
Sultan M, et al (2008)
A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome
Science 321 (5891), 956-960
| NGS Analyzer parameters | |
|---|---|
| Window size for clustering: | The window size used by the clustering algorithm, in base pairs. The default window size is 100 bp. |
| Minimum number of reads per cluster | A threshold for the clustering algorithm. By default (value -1), this number is calculated from the data set applying a Poisson distribution. Otherwise, values above 3 are allowed here. |
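How NGSAnalyzer derives the default threshold from the Poisson distribution is not specified here. One common way to obtain such a threshold, shown as a sketch under stated assumptions, is to compute the expected number of reads per window for a uniform read distribution and then pick the smallest read count that is unlikely to occur by chance. The function names, the `genome_size` parameter, and the significance cutoff `alpha` are hypothetical and not taken from NGSAnalyzer.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term             # add P(X = i)
        term *= lam / (i + 1)   # step to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def min_reads_per_window(n_reads, genome_size, window=100, alpha=1e-3):
    """Smallest read count k per window with P(X >= k) < alpha.

    Assumption: reads are spread uniformly over the genome, so the expected
    number of reads per window is n_reads * window / genome_size.
    alpha is an illustrative cutoff, not an NGSAnalyzer parameter.
    """
    lam = n_reads * window / genome_size
    k = 1
    while poisson_tail(k, lam) >= alpha:
        k += 1
    return k


# Hypothetical data set: 1,000,000 reads on a 100 Mbp genome, 100 bp windows,
# which gives an expected value of 1 read per window.
print(min_reads_per_window(1_000_000, 100_000_000, window=100, alpha=1e-3))
```

With this approach a denser data set yields a higher threshold, which matches the statement that the threshold reflects the density of the data set.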
NGSAnalyzer produces a set of files containing the results.
Generally, the log file and the read statistics are shown on the screen,
and the complete set of files can be downloaded as an archive (tar file).
Additionally, the resulting BED files and sequence files can be downloaded
directly, e.g. to be used in other RegionMiner or GEMS Launcher tasks.
Use the download button at the bottom of the page to get the files.

The content and structure of the individual files are described in detail below.
The log file contains the information from the parameter file (input file, window size, threshold). It also provides some basic numbers about the analysis (lines read from the input file, number of reads analyzed, number of clusters detected). If the analysis was interrupted, the reason is stated here.
The read statistics file contains the statistical summary of the individual reads. It provides three types of information:
1. length distribution of the reads
2. correlation of the reads with the genomic elements (from read_classification). The fraction of the genome belonging to each class is provided, too.
3. distribution of the reads across the individual chromosomes
The cluster list describes each cluster in 9 columns:
1 : cluster id
2 : contig/chromosome accession number
3 : chromosome
4 : strand
5 : start position of the cluster
6 : end position of the cluster
7 : length of the cluster
8 : number of reads in the cluster
9 : normalized expression/enrichment value (NE) calculated for the cluster
The numbers in the file name (e.g. 100-11 or 100-7) denote the window size and the threshold used for clustering.
Example (window size: 100 bp, threshold: 11):

| Id | Contig | Chromosome | Strand | Start | End | Length | #Reads | NE |
|---|---|---|---|---|---|---|---|---|
| 0 | NC_000001 | chr1 | 0 | 554310 | 558436 | 4127 | 7228 | 0.80225 |
| 1 | NC_000001 | chr1 | 0 | 558569 | 560173 | 1605 | 2489 | 0.71036 |
| 2 | NC_000001 | chr1 | 0 | 703750 | 703926 | 177 | 17 | 0.04399 |

The corresponding .bed file contains the genomic positions of the clusters in BED format (e.g. for upload in RegionMiner).
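The following sketch reads cluster records such as the example rows into Python and derives BED-style lines from them. It assumes whitespace-separated columns and that data rows start with a numeric cluster id. In the example rows the length equals end - start + 1, which suggests 1-based inclusive positions, so the sketch subtracts 1 from the start to obtain the 0-based start used by the BED format. The names `Cluster`, `parse_clusters`, `to_bed_line`, and the `cluster_<id>` label are hypothetical; the layout of the .bed file actually written by NGSAnalyzer may differ.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    cluster_id: int
    contig: str
    chromosome: str
    strand: str
    start: int
    end: int
    length: int
    reads: int
    ne: float


SAMPLE = """\
Window size: 100bp
Threshold: 11
Id Contig Chromosome Strand Start End Length #Reads NE
0 NC_000001 chr1 0 554310 558436 4127 7228 0.80225
1 NC_000001 chr1 0 558569 560173 1605 2489 0.71036
2 NC_000001 chr1 0 703750 703926 177 17 0.04399
"""


def parse_clusters(text):
    """Parse cluster rows; header and comment lines are skipped."""
    clusters = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 9 columns and start with a numeric id (assumption).
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        clusters.append(Cluster(
            int(fields[0]), fields[1], fields[2], fields[3],
            int(fields[4]), int(fields[5]), int(fields[6]),
            int(fields[7]), float(fields[8]),
        ))
    return clusters


def to_bed_line(cluster):
    """Format a cluster as a 4-column BED line (0-based, half-open)."""
    return "\t".join([
        cluster.chromosome,
        str(cluster.start - 1),  # assumes 1-based inclusive input positions
        str(cluster.end),
        f"cluster_{cluster.cluster_id}",  # hypothetical name column
    ])


clusters = parse_clusters(SAMPLE)
for c in clusters:
    print(to_bed_line(c))
```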
The read classification file contains 7 columns (tab-separated) for each region:
1 : read id
2 : contig/chromosome accession number
3 : chromosome
4 : strand
5 : start position of the read
6 : end position of the read
7 : genomic elements the read is associated with
intergenic (intergenic region)
exon
intron
partial (overlapping with exon)
promoter
An individual read is assigned to one of the four classes
intergenic, exon, intron, or partial, and can additionally be assigned to
the class promoter.
1428 NC_000001 chr1 0 1348237 1348455 intergenic
1429 NC_000001 chr1 0 2311642 2311953 promoter intron
1430 NC_000001 chr1 0 2450272 2450768 exon
1431 NC_000001 chr1 0 2469265 2469512 intergenic
1432 NC_000001 chr1 0 3556424 3556796 promoter partial
1433 NC_000001 chr1 0 3614623 3614831 exon
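To illustrate how such a classification can be tallied, the sketch below counts the genomic element classes in the example rows. It splits on any whitespace, so it treats everything from the seventh field onward as class labels. The function name `count_classes` is hypothetical, and the counts it produces are only a simplified stand-in for the statistics file written by NGSAnalyzer.

```python
from collections import Counter

SAMPLE_READS = """\
1428 NC_000001 chr1 0 1348237 1348455 intergenic
1429 NC_000001 chr1 0 2311642 2311953 promoter intron
1430 NC_000001 chr1 0 2450272 2450768 exon
1431 NC_000001 chr1 0 2469265 2469512 intergenic
1432 NC_000001 chr1 0 3556424 3556796 promoter partial
1433 NC_000001 chr1 0 3614623 3614831 exon
"""


def count_classes(text):
    """Count how often each genomic element class occurs."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip empty or incomplete lines
        # A read carries one main class and optionally "promoter" as well.
        counts.update(fields[6:])
    return counts


counts = count_classes(SAMPLE_READS)
for name, n in sorted(counts.items()):
    print(name, n)
```

Because promoter is assigned in addition to one of the four main classes, the class counts can add up to more than the number of reads.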
The statistical summary of the clusters provides four types of information:
1. how many clusters were detected
2. how many reads are located in clusters
3. statistics about size and density of the clusters
4. correlation of the reads with the genomic elements (from read_classification). The fraction of the genome belonging to each class is provided, too.
The expression profile describes each transcript in 16 columns:
1: transcript ID (ElDorado)
2: accession number of the transcript (external, e.g. RefSeq, GenBank, Ensembl)
3: locus ID (ElDorado)
4: symbol of the gene
5: gene ID (NCBI Entrez Gene, 0 if not available, -2 if ambiguous)
6: contig/chromosome accession number
7: chromosome
8: strand
9: start position of the transcript
10: end position of the transcript (start < end)
11: length of the transcript (sum of exons)
12: number of exons
13: number of reads in all exons
14: normalized expression value for the least expressed exon
15: normalized expression value for the most expressed exon
16: normalized expression value for the whole transcript

| TranscriptId | Accn | LocusId | Symbol | GeneId | ContigAccn | Chromosome | Strand | Start | End | Transcript length | #exons | #reads | min.NE (exon) | max.NE (exon) | NE (transcr.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GXT_2840407 | NM_001961 | GXL_3498 | EEF2 | 1938 | NC_000019 | chr19 | - | 3927054 | 3936461 | 3158 | 15 | 25608 | 1261580 | 2882036 | 2142615 |
| GXT_22752079 | ENST00000377795 | GXL_103329 | CD74 | 972 | NC_000005 | chr5 | - | 149761426 | 149772685 | 1265 | 6 | 9677 | 887983 | 2577712 | 2021301 |
| GXT_22533977 | AK161993 | GXL_3498 | EEF2 | 1938 | NC_000019 | chr19 | - | 3927056 | 3931539 | 1755 | 7 | 13416 | 160425 | 2882036 | 2019885 |

The corresponding .bed file contains the genomic positions of the transcripts in BED format (e.g. for upload in RegionMiner).
The '07_expression_profile_top.list' file contains the GeneIDs of the top 5,000 expressed genes, the normalized expression value of their highest scoring transcript, and the gene symbol. This file can be used for upload in GePS.
The '07_expression_profile_all.list' file contains the GeneIDs of all genes annotated in ElDorado, the normalized expression value of their highest scoring transcript (if existent), and the gene symbol.
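The gene lists report one value per gene, namely that of its highest scoring transcript. The sketch below shows one way to derive such a list from transcript records, using the gene IDs, symbols, and transcript NE values of the example rows. The function name `top_transcript_per_gene`, the `limit` parameter, and the decision to skip gene IDs 0 (not available) and -2 (ambiguous) are assumptions for illustration, not a description of how NGSAnalyzer builds its files.

```python
def top_transcript_per_gene(rows, limit=None):
    """Return (gene_id, ne, symbol) tuples, one per gene, sorted by NE.

    rows: iterable of (gene_id, symbol, transcript_ne) tuples.
    limit: optional maximum number of genes to return (e.g. 5000).
    """
    best = {}
    for gene_id, symbol, ne in rows:
        if gene_id <= 0:
            continue  # assumption: skip unavailable (0) or ambiguous (-2) IDs
        if gene_id not in best or ne > best[gene_id][0]:
            best[gene_id] = (ne, symbol)
    ranked = sorted(
        ((gene_id, ne, symbol) for gene_id, (ne, symbol) in best.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked if limit is None else ranked[:limit]


# Gene ID, symbol, and transcript NE taken from the three example rows.
rows = [
    (1938, "EEF2", 2142615),
    (972, "CD74", 2021301),
    (1938, "EEF2", 2019885),
]
ranked = top_transcript_per_gene(rows)
for gene_id, ne, symbol in ranked:
    print(gene_id, ne, symbol)
```

In this example the two EEF2 transcripts collapse into a single entry carrying the higher of their two values.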
| © 1998-2011 Genomatix Software GmbH - All rights reserved |