RegionMiner

NGS Analyzer: Clustering and Analysis of Tags from ChIP-Seq/DGE experiments
(only available on GGA)



Introduction

NGSAnalyzer is a program to analyze short sequence reads from next generation sequencing experiments.

Several RegionMiner tasks (Complete ChIPSeq Workflow, Expression Analysis for RNASeq Data, Clustering NGS Data) use NGSAnalyzer for clustering and/or expression analysis.

For the analysis, NGSAnalyzer requires the genomic position of the individual reads. In the first step, for both types of experiments, the sequence reads are classified by their association with features from the ElDorado genome annotation (exon, intron, partial, promoter, intergenic region). The results of the classification and the distribution of the reads across the individual chromosomes are summarized in a statistical overview.

In the second step, the reads are analyzed for local enrichments (clusters) representing genomic regions bound by protein (ChIP-Seq) or being expressed (RNA-Seq). The threshold applied by the clustering algorithm reflects the density of the data set, assuming a Poisson distribution. For ChIP-Seq experiments, data from an additional "input-control" experiment can be provided. Unspecific enrichments detected in the "input-control" data are then subtracted from the ChIP experiment. The clusters are classified and summarized in a statistical overview.

In the third step, a normalized expression/enrichment value (NE-value) is calculated for each transcript/cluster. For RNA-Seq, NE-values are provided for the whole transcript and for the most and least expressed exon of the transcript. The results are again summarized in a statistical overview.

Normalized expression/enrichment value (NE-value)
The NE-value is calculated with the following formula:
NE = c * #reads_region / (#reads_mapped * length_region)
where
NE: the normalized expression or enrichment value,
#reads_region: the reads (sum of base pairs) falling into either the transcript or the cluster region,
#reads_mapped: all mapped reads (in base pairs),
length_region: the transcript or cluster length in base pairs,
c: a normalization constant set to 10^7.
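
As a worked illustration of the formula, the following minimal Python sketch computes an NE-value from base-pair counts. The function name, the parameter names and the input numbers are hypothetical; c is a parameter, with a default of 10^7 assumed here.

```python
NORMALIZATION_CONSTANT = 10**7  # assumed default for c


def ne_value(region_read_bp: int, mapped_read_bp: int, region_length_bp: int,
             c: float = NORMALIZATION_CONSTANT) -> float:
    """Return NE = c * #reads_region / (#reads_mapped * length_region).

    region_read_bp   : read bases falling into the transcript or cluster
    mapped_read_bp   : bases of all mapped reads
    region_length_bp : transcript or cluster length in base pairs
    """
    if mapped_read_bp <= 0 or region_length_bp <= 0:
        raise ValueError("mapped read bases and region length must be positive")
    return c * region_read_bp / (mapped_read_bp * region_length_bp)


# Hypothetical numbers: 500 reads of 36 bp in a 2,000 bp region,
# out of 10 million mapped reads of 36 bp each.
example = ne_value(500 * 36, 10_000_000 * 36, 2_000)
print(example)  # 0.25
```

Because both read counts are expressed in base pairs, the read length cancels out when all reads have the same length.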

Part of the NGSAnalyzer functionality is described in:

Sultan M, et al (2008)
A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome
Science 321 (5891), 956-960


Parameters

NGS Analyzer parameters
Window size for clustering: The window size used by the clustering algorithm, in base pairs. The default window size is 100 bp.
Minimum number of reads per cluster: A threshold for the clustering algorithm. By default (value -1), this number is calculated from the data set by applying a Poisson distribution. Otherwise, values above 3 are allowed here.
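
This page does not state how the default threshold is derived from the Poisson distribution. Purely as an illustration of the general idea, and not as NGSAnalyzer's actual procedure, the sketch below assumes that reads are spread uniformly over the genome and picks the smallest read count per window that would occur by chance with a probability below a cutoff. The names genome_size and p_cutoff and all numbers are hypothetical.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term            # add P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def min_reads_per_cluster(total_reads: int, genome_size: int,
                          window_size: int = 100,
                          p_cutoff: float = 1e-5) -> int:
    """Smallest k whose chance probability per window is at most p_cutoff.

    Assumption: the expected number of reads per window is
    total_reads * window_size / genome_size (uniform background).
    """
    lam = total_reads * window_size / genome_size
    k = 1
    while poisson_tail(k, lam) > p_cutoff:
        k += 1
    return k


# Hypothetical data set: 10 million reads on a 3 Gbp genome.
print(min_reads_per_cluster(10_000_000, 3_000_000_000))
```

Under this model a denser data set gives a larger expected count per window and therefore a higher threshold, which matches the statement that the threshold reflects the density of the data set.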

Output ('Complete Clustering Results')

NGS Analyzer produces a set of files containing the results.
Generally, the log-file and the read statistics are shown on the screen, and the complete set of files can be downloaded as an archive (tar-file).
Additionally, the resulting BED files and sequence files can be downloaded directly, e.g. to be used in other RegionMiner or GEMS Launcher tasks.

Use the download button at the bottom of the page to get the files.

The content and the structure of the individual files are described in detail below.

analysis.log (00)

The file contains the information from the parameter file (input file, window size, threshold). It provides some basic numbers about the analysis (lines read from input file, number of reads analyzed, number of clusters detected). If the analysis was interrupted, the reason is stated here.

read_statistics (02)

The file contains the statistical summary of the individual reads. It provides three types of information:

1. length distribution of the reads
2. correlation of the reads with the genomic elements (from read_classification). 
   The information which fraction of the genome belongs to a specific class is provided, too.
3. distribution of the reads to the individual chromosomes.

cluster (03)

The .tsv file contains 9 columns (tab-separated) for each detected cluster.
1 : cluster id
2 : contig/chromosome accession number
3 : chromosome
4 : strand
5 : start position of the cluster
6 : end position of the cluster
7 : length of the cluster
8 : number of reads in the cluster
9 : normalized expression/enrichment value (NE) calculated for the cluster

The numbers in the file name (e.g. 100-11 or 100-7) denote the window size and the threshold used for clustering.

Example:

Window size: 100bp      Threshold: 11
Id   Contig     Chromosome  Strand  Start   End     Length  #Reads  NE
 0   NC_000001  chr1        0       554310  558436  4127    7228    0.80225
 1   NC_000001  chr1        0       558569  560173  1605    2489    0.71036
 2   NC_000001  chr1        0       703750  703926  177     17      0.04399

The corresponding .bed file contains the genomic positions of the clusters in BED-file format (e.g. for upload in RegionMiner).
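
For downstream processing, the cluster table can be read with a few lines of Python. This is a minimal sketch, assuming one tab-separated record per line with the 9 columns listed above; lines whose first field is not an integer (such as the header lines in the example) are skipped. The class and function names are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Cluster:
    cluster_id: int
    contig: str
    chromosome: str
    strand: str
    start: int
    end: int
    length: int
    reads: int
    ne: float


def read_clusters(handle):
    """Yield Cluster records from a tab-separated cluster file."""
    for row in csv.reader(handle, delimiter="\t"):
        fields = [value.strip() for value in row]
        # Skip blank lines and header lines (first field is not an integer).
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        yield Cluster(int(fields[0]), fields[1], fields[2], fields[3],
                      int(fields[4]), int(fields[5]), int(fields[6]),
                      int(fields[7]), float(fields[8]))


# The three example rows from above, written with tab separators.
sample = (
    "Id\tContig\tChromosome\tStrand\tStart\tEnd\tLength\t#Reads\tNE\n"
    "0\tNC_000001\tchr1\t0\t554310\t558436\t4127\t7228\t0.80225\n"
    "1\tNC_000001\tchr1\t0\t558569\t560173\t1605\t2489\t0.71036\n"
    "2\tNC_000001\tchr1\t0\t703750\t703926\t177\t17\t0.04399\n"
)
clusters = list(read_clusters(io.StringIO(sample)))

# In the example rows, start and end are both inclusive:
# length == end - start + 1.
print(all(c.length == c.end - c.start + 1 for c in clusters))  # True
```

To read a real file, pass an open file handle instead of the in-memory sample.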

cluster_classification (04)

The file provides the genomic classification for each cluster.

The file contains 7 columns (tab-separated) for each cluster:

  1 : cluster id
  2 : contig/chromosome accession number
  3 : chromosome
  4 : strand
  5 : start position of the cluster
  6 : end position of the cluster
  7 : genomic elements the cluster is associated with
      intergenic (intergenic region)
      exon
      intron
      partial (overlapping with exon)
      promoter

      An individual cluster is assigned to one of the four classes
      intergenic, exon, intron, partial and can be assigned to
      the class promoter in addition.

Example:

1428	NC_000001	chr1	0	1348237	1348455	intergenic
1429	NC_000001	chr1	0	2311642	2311953	promoter intron
1430	NC_000001	chr1	0	2450272	2450768	exon
1431	NC_000001	chr1	0	2469265	2469512	intergenic
1432	NC_000001	chr1	0	3556424	3556796	promoter partial
1433	NC_000001	chr1	0	3614623	3614831	exon
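
The classification column can be tallied to reproduce a simple per-class count. The sketch below is a minimal example, assuming 7 tab-separated columns per line and a space-separated list of classes in column 7, as in the rows above; the function name is hypothetical.

```python
import io
from collections import Counter


def count_classes(handle):
    """Count how often each genomic class occurs in column 7."""
    counts = Counter()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        # "promoter" may appear in addition to one of the other classes.
        counts.update(fields[6].split())
    return counts


sample = (
    "1428\tNC_000001\tchr1\t0\t1348237\t1348455\tintergenic\n"
    "1429\tNC_000001\tchr1\t0\t2311642\t2311953\tpromoter intron\n"
    "1430\tNC_000001\tchr1\t0\t2450272\t2450768\texon\n"
    "1431\tNC_000001\tchr1\t0\t2469265\t2469512\tintergenic\n"
    "1432\tNC_000001\tchr1\t0\t3556424\t3556796\tpromoter partial\n"
    "1433\tNC_000001\tchr1\t0\t3614623\t3614831\texon\n"
)
class_counts = count_classes(io.StringIO(sample))
for name, count in sorted(class_counts.items()):
    print(name, count)
```

Because promoter is an additional label, the class counts can add up to more than the number of lines.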

cluster_statistics (05)

The file contains the statistical summary of the individual clusters. It provides four types of information:

1. how many clusters were detected
2. how many reads are located in clusters
3. statistics about size and density of the clusters
4. correlation of the clusters with the genomic elements (from cluster_classification).
   The information which fraction of the genome belongs to a specific class is provided, too.

cluster_sequences (06)

The file contains the genomic sequence of each cluster (in FASTA format).

expression_profile (07)

The file contains 16 columns (tab-separated) for each transcript annotated in ElDorado.
1: transcript ID (ElDorado)
2: accession number of the transcript (external, e.g. RefSeq, GenBank, Ensembl)
3: locus ID (ElDorado)
4: symbol of the gene
5: gene ID (NCBI Entrez Gene, 0 if not available, -2 if ambiguous)
6: contig/chromosome accession number
7: chromosome
8: strand
9: start position of the transcript
10: end position of the transcript (start < end)
11: length of the transcript (sum of exons)
12: number of exons
13: number of reads in all exons
14: normalized expression value for the least expressed exon
15: normalized expression value for the most expressed exon
16: normalized expression value for the whole transcript

Example:

TranscriptId   Accn              LocusId     Symbol  GeneId   ...
GXT_2840407    NM_001961         GXL_3498    EEF2      1938   ...
GXT_22752079   ENST00000377795   GXL_103329  CD74       972   ...
GXT_22533977   AK161993          GXL_3498    EEF2      1938   ...



...   ContigAccn   Chromosome   Strand     Start         End  ...
...   NC_000019    chr19        -        3927054     3936461  ...
...   NC_000005    chr5         -      149761426   149772685  ...
...   NC_000019    chr19        -        3927056     3931539  ...



...   Transcript length   #exons   #reads   min.NE (exon)   max.NE (exon)   NE (transcr.)
...                3158       15    25608         1261580         2882036         2142615
...                1265        6     9677          887983         2577712         2021301
...                1755        7    13416          160425         2882036         2019885
The corresponding .bed file contains the genomic positions of the transcripts in BED-file format (e.g. for upload in RegionMiner).

The '07_expression_profile_top.list' file contains the GeneIDs of the top 5,000 expressed genes, the normalized expression value of their highest scoring transcript and the gene symbol. (This file can be used for upload in GePS).

The '07_expression_profile_all.list' file contains the GeneIDs of all genes annotated in ElDorado, the normalized expression value of their highest scoring transcript (if existent) and the gene symbol.
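
The relation between the expression profile and the two .list files can be sketched in Python: for each gene, keep the NE-value of its highest-scoring transcript, then sort the genes by that value. This is only an illustration under stated assumptions, not the program's own code. It assumes 16 tab-separated columns as listed above and skips gene IDs of 0 (not available) and -2 (ambiguous). The function names are hypothetical and the sample rows are adapted from the example above.

```python
import io


def best_transcript_per_gene(handle):
    """Map gene ID -> (NE of the highest-scoring transcript, gene symbol)."""
    best = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 16:
            continue
        try:
            gene_id = int(fields[4])   # column 5: gene ID
            ne = float(fields[15])     # column 16: NE for the whole transcript
        except ValueError:
            continue                   # header or malformed line
        if gene_id <= 0:
            continue                   # 0 = not available, -2 = ambiguous
        if gene_id not in best or ne > best[gene_id][0]:
            best[gene_id] = (ne, fields[3])
    return best


def top_genes(best, n=5000):
    """Return the n genes with the highest NE-value, best first."""
    return sorted(best.items(), key=lambda item: item[1][0], reverse=True)[:n]


rows = [
    ["GXT_2840407", "NM_001961", "GXL_3498", "EEF2", "1938", "NC_000019", "chr19",
     "-", "3927054", "3936461", "3158", "15", "25608", "1261580", "2882036", "2142615"],
    ["GXT_22752079", "ENST00000377795", "GXL_103329", "CD74", "972", "NC_000005", "chr5",
     "-", "149761426", "149772685", "1265", "6", "9677", "887983", "2577712", "2021301"],
    ["GXT_22533977", "AK161993", "GXL_3498", "EEF2", "1938", "NC_000019", "chr19",
     "-", "3927056", "3931539", "1755", "7", "13416", "160425", "2882036", "2019885"],
]
sample = "".join("\t".join(row) + "\n" for row in rows)

best = best_transcript_per_gene(io.StringIO(sample))
for gene_id, (ne, symbol) in top_genes(best):
    print(gene_id, ne, symbol)
```

In the sample, the two EEF2 transcripts collapse into a single entry that carries the larger of their two NE-values.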