
NGS Analyzer: Clustering and Analysis of Tags from ChIP-Seq/DGE experiments
(only available on GGA)



Introduction

NGSAnalyzer is a program to analyze short sequence reads from next generation sequencing experiments.

Several tasks (e.g. Complete ChIPSeq Workflow, Expression Analysis for RNASeq Data) use NGSAnalyzer for clustering and/or expression analysis.

For the analysis, NGSAnalyzer requires the genomic position of the individual reads. For both types of experiments, the sequence reads are first classified by their association with features from the ElDorado genome annotation (exon, intron, partial, promoter, intergenic region). The results of the classification and the distribution of the reads across the individual chromosomes are summarized in a statistical overview.

In the second step, the reads are analyzed for local enrichments (clusters) that represent genomic regions bound by protein (ChIP-Seq) or being expressed (RNA-Seq). The threshold applied by the clustering algorithm reflects the density of the data set, assuming a Poisson distribution. For ChIP-Seq experiments, data from an additional "input-control" experiment can be provided. Unspecific enrichments detected in the "input-control" data are then subtracted from the ChIP experiment. The clusters are classified and summarized in a statistical overview.

In the third step, a normalized expression/enrichment value (NE-value) is calculated for each transcript/cluster. For RNA-Seq, NE-values are provided for the whole transcript and for the most and least expressed exon of the transcript. The results are again summarized in a statistical overview.

Normalized expression/enrichment value (NE-value)
The NE-value is calculated with the following formula:
\[ NE = c \cdot \frac{\#reads_{region}}{\#reads_{mapped} \cdot length_{region}} \]
where
NE: the normalized expression or enrichment value,
#reads_region: the reads (sum of base pairs) falling into either the transcript or the cluster region,
#reads_mapped: all mapped reads (in base pairs),
length_region: the transcript or cluster length in base pairs,
c: a normalization constant set to 10^7.
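
As an illustration, the formula can be written as a small Python function. This is only a sketch: the function and parameter names are made up for this example, the constant c is taken as 10^7, and the numbers in the usage line are hypothetical.

```python
NORMALIZATION_CONSTANT = 10**7  # c in the NE formula (assumed to be 10^7)


def ne_value(region_read_bp, mapped_read_bp, region_length_bp,
             c=NORMALIZATION_CONSTANT):
    """Return c * region_read_bp / (mapped_read_bp * region_length_bp).

    region_read_bp   -- base pairs of reads falling into the region
    mapped_read_bp   -- base pairs of all mapped reads
    region_length_bp -- length of the transcript or cluster in base pairs
    """
    if mapped_read_bp <= 0 or region_length_bp <= 0:
        raise ValueError("mapped_read_bp and region_length_bp must be positive")
    return c * region_read_bp / (mapped_read_bp * region_length_bp)


# Hypothetical numbers: 200 reads of 36 bp in a 1,000 bp region,
# out of 10 million mapped reads of 36 bp each.
example_ne = ne_value(200 * 36, 10_000_000 * 36, 1000)
```

If all reads have the same length, the read length cancels out and the value depends only on the read counts and the region length.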

Part of the NGSAnalyzer functionality is described in:

Sultan M, et al (2008)
A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome
Science 321 (5891), 956-960


Parameters

NGS Analyzer parameters
Window size: The window size used for the peak finding algorithm, in base pairs. The default window size is 100 bp.
Minimum number of reads per peak: A threshold for the peak finding algorithm. This number can be automatically calculated from the input data by applying a Poisson distribution. Otherwise, values above 3 are allowed here.
Strand specificity: Strand specificity of the sequencing experiment.
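
This page does not spell out how the read threshold is derived from the Poisson distribution. The sketch below shows one common way such a threshold can be chosen and is not necessarily what NGS Analyzer does: it assumes reads are spread uniformly over the genome, takes the expected number of reads per window as the Poisson mean, and returns the smallest read count whose tail probability falls below a cutoff. The function name, the uniform-density assumption and the `p_cutoff` parameter are all hypothetical.

```python
import math


def poisson_threshold(total_reads, genome_size_bp, window_size_bp=100,
                      p_cutoff=0.05):
    """Smallest k with P(X >= k) < p_cutoff for X ~ Poisson(lam).

    lam is the expected number of reads per window under a uniform
    read distribution (an assumption of this sketch).
    """
    lam = total_reads * window_size_bp / genome_size_bp
    if not 0 < lam < 700:
        # exp(-lam) underflows for very large lam; keep the sketch simple.
        raise ValueError("expected reads per window must be in (0, 700)")
    k = 0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0              # P(X < k)
    while True:
        tail = 1.0 - cdf   # P(X >= k)
        if tail < p_cutoff:
            return k
        cdf += term
        k += 1
        term *= lam / k    # P(X = k) from P(X = k - 1)


# Hypothetical data set: 1 million reads on a 100 Mbp genome, 100 bp windows,
# i.e. one expected read per window.
threshold = poisson_threshold(1_000_000, 100_000_000, 100, p_cutoff=0.05)
```

A denser data set gives a larger Poisson mean and therefore a higher threshold, which matches the statement that the threshold reflects the density of the data set.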

Output ('Complete Clustering Results')

NGS Analyzer produces a set of files containing the results.
Generally, the log-file and the read statistics are shown on the screen, and the complete set of files can be downloaded as an archive (tar-file).
Additionally, the resulting BED files and sequence files can be downloaded directly, e.g. to be used in other tasks.

Use the download button at the bottom of the page to get the files.

[Figure: download options for NGS Analyzer results]

The content and the structure of the individual files is described in detail below.

analysis.log (00)

The file contains the information from the parameter file (input file, window size, threshold). It provides some basic numbers about the analysis (lines read from the input file, number of reads analyzed, number of clusters detected). If the analysis was interrupted, the reason is stated here.

read_statistics (02)

The file contains the statistical summary of the individual reads. It provides three types of information:

1. length distribution of the reads
2. correlation of the reads with the genomic elements (from read_classification).
   The fraction of the genome that belongs to each class is provided as well.
3. distribution of the reads to the individual chromosomes.

cluster (03)

The .tsv file contains 9 columns (tab-separated) for each detected cluster.
1 : cluster id
2 : contig/chromosome accession number
3 : chromosome
4 : strand
5 : start position of the cluster
6 : end position of the cluster
7 : length of the cluster
8 : number of reads in the cluster
9 : normalized expression/enrichment value (NE) calculated for the cluster

The numbers in the file name (e.g. 100-11 or 100-7) denote the window size and the threshold used for clustering.

Example:

Window size: 100bp      Threshold: 11
Id   Contig     Chromosome  Strand  Start   End     Length  #Reads  NE
 0   NC_000001  chr1        0       554310  558436  4127    7228    0.80225
 1   NC_000001  chr1        0       558569  560173  1605    2489    0.71036
 2   NC_000001  chr1        0       703750  703926  177     17      0.04399
The corresponding .bed file contains the genomic positions of the clusters in BED-file format (e.g. for upload in other tasks)
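
For downstream processing, the nine-column cluster table can be read with a few lines of Python. This is a minimal sketch with made-up names (`Cluster`, `parse_clusters`). It assumes the file is tab-separated as described and simply skips any line that does not have nine fields with a numeric cluster id, so header lines like those in the example are ignored whether or not they are present in the real file.

```python
import csv
from typing import Iterable, List, NamedTuple


class Cluster(NamedTuple):
    cluster_id: int
    contig: str
    chromosome: str
    strand: str
    start: int
    end: int
    length: int
    reads: int
    ne: float


def parse_clusters(lines: Iterable[str]) -> List[Cluster]:
    """Parse tab-separated cluster lines, skipping header or malformed lines."""
    clusters = []
    for row in csv.reader(lines, delimiter="\t"):
        fields = [field.strip() for field in row]
        if len(fields) != 9 or not fields[0].isdigit():
            continue  # header line, title line or blank line
        clusters.append(Cluster(
            int(fields[0]), fields[1], fields[2], fields[3],
            int(fields[4]), int(fields[5]), int(fields[6]),
            int(fields[7]), float(fields[8]),
        ))
    return clusters


# Usage with the first row of the example, written as a tab-separated line.
sample = [
    "Window size: 100bp\tThreshold: 11\n",
    "Id\tContig\tChromosome\tStrand\tStart\tEnd\tLength\t#Reads\tNE\n",
    " 0\tNC_000001\tchr1\t0\t554310\t558436\t4127\t7228\t0.80225\n",
]
clusters = parse_clusters(sample)
```

For a file on disk, the same function can be called with an open file object, since iterating over it yields lines.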

cluster_classification (04)

The file provides the genomic classification for each cluster.

The file contains 7 columns (tab-separated) for each cluster:

  1 : cluster id
  2 : contig/chromosome accession number
  3 : chromosome
  4 : strand
  5 : start position of the cluster
  6 : end position of the cluster
  7 : genomic elements the cluster is associated with
      intergenic (intergenic region)
      exon
      intron
      partial (overlapping with exon)
      promoter
      An individual cluster is assigned to exactly one of the four classes
      intergenic, exon, intron, partial and can additionally be assigned to
      the class promoter.

Example:
1428	NC_000001	chr1	0	1348237	1348455	intergenic
1429	NC_000001	chr1	0	2311642	2311953	promoter intron
1430	NC_000001	chr1	0	2450272	2450768	exon
1431	NC_000001	chr1	0	2469265	2469512	intergenic
1432	NC_000001	chr1	0	3556424	3556796	promoter partial
1433	NC_000001	chr1	0	3614623	3614831	exon
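
The last column can hold one or two labels, as in "promoter intron" above. The following sketch splits it into the main class and a promoter flag. The function name is made up for this example, and the strict check that exactly one of the four main classes is present is an assumption based on the description above.

```python
PRIMARY_CLASSES = {"intergenic", "exon", "intron", "partial"}


def split_classification(field):
    """Return (primary_class, is_promoter) for a classification field."""
    labels = field.split()
    unknown = [label for label in labels
               if label not in PRIMARY_CLASSES and label != "promoter"]
    if unknown:
        raise ValueError(f"unknown class label(s): {unknown}")
    primary = [label for label in labels if label in PRIMARY_CLASSES]
    if len(primary) != 1:
        raise ValueError(f"expected exactly one primary class in {field!r}")
    return primary[0], "promoter" in labels


# Usage with two classification fields from the example rows.
result_a = split_classification("promoter intron")
result_b = split_classification("exon")
```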

cluster_statistics (05)

The file contains the statistical summary of the individual clusters:

1. how many clusters were detected
2. how many reads are located in clusters
3. statistics about size and density of the clusters
4. correlation of the clusters with the genomic elements (from cluster_classification).
   The fraction of the genome that belongs to each class is provided as well.

cluster_sequences (06)

The file contains the genomic sequence of each cluster (in FASTA format).

expression_profile (07)

The file contains 18 columns (tab-separated) for each transcript annotated in ElDorado.
1: transcript ID (ElDorado)
2: accession number of the transcript (external, e.g. RefSeq, GenBank, Ensembl)
3: locus ID (ElDorado)
4: symbol of the gene
5: gene ID (NCBI Entrez Gene, 0 if not available, -2 if ambiguous)
6: contig/chromosome accession number
7: chromosome
8: strand
9: start position of the transcript
10: end position of the transcript (start < end)
11: length of the transcript (sum of exons)
12: number of exons
13: number of reads in all exons
14: normalized expression value for the least expressed exon
15: normalized expression value for the most expressed exon
16: normalized expression value for the whole transcript
17: RPKM value for the whole transcript
18: transcript source (coded as integer value)
     1 = NCBI RefSeq
     5 = Ensembl
     6 = NCBI GenBank
     8 = Genomatix TransMapping
     9 = www.yeastgenome.org (S. cerevisiae)
    10 = VectorBase
    11 = University of Pennsylvania
    12 = Phytozome
    14 = www.maizcdna.org 
    15 = www.maizsequence.org

Example:

TranscriptId    Accn              LocusId       Symbol           GeneId   ContigAccn   ...
GXT_25663036	ENST00000361851	  GXL_1647486	ENST00000361851	 0	  NC_012920    ...
GXT_23216842	AK303788	  GXL_175149	FGB	         2244	  NC_000004    ...
GXT_2831806	NM_001002235	  GXL_101109	SERPINA1	 5265	  NC_000014    ...


...   Chromosome    Strand           Start               End      Transcript length   #exons   ...
...   chrMT	       +	      8366	        8572	           207	        1      ...
...   chr4	       +	 155484163	   155492138	          1770	        9      ...
...   chr14	       -	  94843084	    94857029	          3199	        5      ...


...   #reads   min.NE (exon)   max.NE (exon)   NE (transcr.)           RPKM      transcript source
...     5646	    34.66449	    34.66449	    34.66449	 3365.74316	        5
...    49360	    18.47779	    51.47432	    34.54972	 3441.21899	        6
...    88087	     0.17496	    93.96494	    34.29131	 3397.87988	        1

The corresponding .bed file contains the genomic positions of the transcripts in BED-file format (e.g. for upload in other tasks).

The 07_expression_profile_top.list file contains the GeneIDs of the 5,000 most highly expressed genes, the normalized expression value of their highest scoring transcript and the gene symbol. This file can be uploaded to GePS.

The 07_expression_profile_all.list file contains the GeneIDs of all genes annotated in ElDorado, the normalized expression value of their highest scoring transcript (if existent) and the gene symbol.
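
The relation between the expression profile and the two list files can be sketched as follows: keep the highest NE-value per gene, then rank the genes. This is an illustration only, not the program's actual procedure. The function name is made up, the column positions follow the column description above (4 = symbol, 5 = gene ID, 16 = NE for the whole transcript), and skipping transcripts with gene ID 0 or -2 is an assumption.

```python
def top_genes(rows, n=5000):
    """Return up to n (gene_id, best_ne, symbol) tuples, highest NE first.

    rows -- expression profile rows, each a list of column strings
    """
    best = {}
    for row in rows:
        symbol = row[3]          # column 4: gene symbol
        gene_id = int(row[4])    # column 5: gene ID
        ne = float(row[15])      # column 16: NE for the whole transcript
        if gene_id <= 0:
            continue             # assumption: skip missing (0) or ambiguous (-2) IDs
        if gene_id not in best or ne > best[gene_id][0]:
            best[gene_id] = (ne, symbol)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(gene_id, ne, symbol) for gene_id, (ne, symbol) in ranked[:n]]


# Usage with the three example rows plus one hypothetical second FGB transcript.
rows = [
    ["GXT_25663036", "ENST00000361851", "GXL_1647486", "ENST00000361851", "0",
     "NC_012920", "chrMT", "+", "8366", "8572", "207", "1", "5646",
     "34.66449", "34.66449", "34.66449", "3365.74316", "5"],
    ["GXT_23216842", "AK303788", "GXL_175149", "FGB", "2244",
     "NC_000004", "chr4", "+", "155484163", "155492138", "1770", "9", "49360",
     "18.47779", "51.47432", "34.54972", "3441.21899", "6"],
    ["GXT_2831806", "NM_001002235", "GXL_101109", "SERPINA1", "5265",
     "NC_000014", "chr14", "-", "94843084", "94857029", "3199", "5", "88087",
     "0.17496", "93.96494", "34.29131", "3397.87988", "1"],
    # Hypothetical lower-scoring transcript of the same gene (made-up values).
    ["GXT_0000000", "XX000000", "GXL_175149", "FGB", "2244",
     "NC_000004", "chr4", "+", "155484163", "155492138", "1500", "8", "1000",
     "0.5", "2.0", "1.25", "125.0", "6"],
]
ranking = top_genes(rows, n=5000)
```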