Genomatix: Principal Component Analysis for RNASeq Data
(only available on GGA)


PCA Introduction

Principal component analysis (PCA) is a statistical procedure that can be used for exploratory data analysis. PCA uses linear combinations of the original data (e.g. gene expression values) to define a new set of uncorrelated variables (principal components). These new variables are orthogonal to each other, which avoids redundant information.

Thus, PCA can be used to reduce the dimensions of a data set, allowing data sets and their variance to be described with a reduced number of variables. Since similar data sets lie close together when projected onto the space defined by the principal components, PCA can also be used to identify outliers with respect to the principal components.

It is often sufficient to look at the first two components, as these describe the largest variability.
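The idea can be illustrated with a minimal Python sketch. It assumes NumPy is available and uses a made-up matrix of 4 samples and 3 genes; the function name pca_svd is hypothetical, and the code is not the implementation used by the program (which relies on the R package pcaMethods).

```python
import numpy as np

def pca_svd(data, n_components=2):
    """PCA of a samples x variables matrix via singular value decomposition."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)                     # center each variable (column)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates on the PCs
    loadings = vt[:n_components].T                    # weight of each variable per PC
    explained = s**2 / np.sum(s**2)                   # fraction of variance per PC
    return scores, loadings, explained[:n_components]

# Toy matrix: 4 samples (rows) x 3 genes (columns); the values are made up.
toy = [[2.0, 0.5, 1.0],
       [2.2, 0.4, 1.1],
       [8.0, 3.0, 1.0],
       [8.3, 3.1, 0.9]]
scores, loadings, explained = pca_svd(toy)
```

In this toy example the first two samples and the last two samples end up on opposite sides of the first principal component, which is the kind of grouping a score plot makes visible.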

A more detailed description of PCA is available on Wikipedia.

What the program will do

This task can be used to get an impression of the similarity of RNA-sequencing samples, i.e. to identify subgroups or outliers.

The variance in RNA-Seq data usually grows with the expression mean. PCA on the matrix of normalized read counts will therefore often lead to principal components that are dominated by the variance of a few highly expressed genes. The DESeq2 regularized-logarithm transformation (rlog) transforms the data matrix of read counts per gene (or transcript) to log scale but specifically adapts to the high random noise of low-count data. This is explained in detail in "RNA-Seq workflow: gene-level exploratory analysis and differential expression". The matrix of raw counts is input to the DESeq2 rlog function and the resulting transformed matrix is used as input for the principal component analysis (PCA, using the R package pcaMethods):

Stacklies et al. (2007)
pcaMethods - a Bioconductor package providing PCA methods for incomplete data
Bioinformatics, 23, pp. 1164-1167
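
The dependence of variance on expression level described above can be illustrated with a small Python sketch. The counts are made up, and a plain log2(count + 1) transformation is used only as a simple stand-in: it is not the DESeq2 rlog function and does not include rlog's special handling of low-count noise.

```python
import math
import statistics

# Made-up read counts for two genes across four samples (two groups of two).
counts = {
    "high_gene": [10000, 11000, 20000, 21000],
    "low_gene": [10, 11, 20, 21],
}

def variances(transform):
    """Sample variance of each gene after applying a transformation."""
    return {
        gene: statistics.variance([transform(v) for v in values])
        for gene, values in counts.items()
    }

raw_var = variances(lambda v: v)                 # untransformed counts
log_var = variances(lambda v: math.log2(v + 1))  # simple log stand-in, not rlog

raw_ratio = raw_var["high_gene"] / raw_var["low_gene"]
log_ratio = log_var["high_gene"] / log_var["low_gene"]
```

Both genes show the same two-fold change between the groups, yet on the raw scale the highly expressed gene contributes far more variance and would dominate a PCA. On the log scale the two variances are of similar size.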

Rlog transformation is the default. Although not recommended, it is possible to do PCA directly on normalized expression values. Based on the read distribution in the input files a normalized expression value (NE) will be calculated for each locus (or transcript) for each input file. The NE-value is based on the number of reads located in the exons of the locus/transcript and is normalized to the length of the locus/transcript and the density of the data set. The resulting matrix for NEs is then used as input for the principal component analysis as described.
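
As a rough illustration of such a normalization, the following Python sketch divides the exon read count by the exon length and by the total number of mapped reads. This is an assumption for illustration only: the exact formula, the interpretation of "density" as the total mapped read count, and the scaling constant (the hypothetical parameter scale) are not taken from the program.

```python
def normalized_expression(reads_in_exons, exon_length_bp, total_mapped_reads, scale=1e7):
    """Length- and depth-normalized expression value (illustrative only).

    The scale factor is a placeholder, not the constant used by the program.
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return scale * reads_in_exons / (exon_length_bp * total_mapped_reads)

# Made-up numbers: a locus twice as long, or a data set twice as deep,
# with twice as many reads gives the same normalized value.
ne_short = normalized_expression(200, 1000, 20_000_000)
ne_long = normalized_expression(400, 2000, 20_000_000)
ne_deep = normalized_expression(400, 1000, 40_000_000)
```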


Parameters

Input
Input file(s) with read positions from RNA-Seq

Input data are accepted in BED / bigBed file format or BAM file format containing the input regions. For some tasks BAM support might not be available.
The maximum number of input regions and their maximum length can differ between tasks. The limits are usually shown at the top of the input pages.

Within this section you can either
  • choose from previously uploaded BED/BAM files
  • or add a new BED or BAM file to the list (by clicking "Add BED/BAM file...")
For tasks that accept replicate data as input, you can use the shift/ctrl keys to select multiple files from the list. All selected files will then be treated as replicates.

When adding a new file, a new window will open, asking you to either

  • upload one or several BED/BAM files from your local computer
  • or import one or several BED/BAM files from the GMS (see more details)
  • or import one or several BED/BAM files from the GGA (see more details)
For the new BED/BAM files, you will have to select the correct organism, as the organism and the genome build are associated with each file for future use (the default is your latest choice in the current session).
Note that files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a file. You can see the list of genomes available in ElDorado.

Note that almost all browsers have a general upload limit of 2 GB, i.e. files bigger than this size should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.

Optionally you can specify a name for saving uploaded files on the server, otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as prefix for each file name.

If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.
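
The kind of check described above can be sketched in Python as follows. The chromosome lengths, the function name split_valid_regions, and the example lines are hypothetical; the program's actual validation rules may differ.

```python
# Hypothetical chromosome lengths; a real genome build defines these.
CHROM_SIZES = {"chr1": 1000, "chr2": 500}

def split_valid_regions(bed_lines, chrom_sizes):
    """Return (valid, skipped) regions parsed from BED-style lines."""
    valid, skipped = [], []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3:
            skipped.append(line)          # not enough columns
            continue
        chrom, start_text, end_text = fields[:3]
        try:
            start, end = int(start_text), int(end_text)
        except ValueError:
            skipped.append(line)          # non-numeric coordinates
            continue
        size = chrom_sizes.get(chrom)
        if size is None or start < 0 or end > size or start >= end:
            skipped.append(line)          # unknown chromosome or bad position
        else:
            valid.append((chrom, start, end))
    return valid, skipped

lines = ["chr1\t10\t200", "chr2\t400\t600", "chr9\t1\t50"]
valid, skipped = split_valid_regions(lines, CHROM_SIZES)
file_is_usable = bool(valid)              # a file without valid regions is skipped
```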

After one or several BED/BAM files have been uploaded successfully and the popup window has been closed, the list of available BED/BAM files will be updated automatically.

Uploaded BED or BAM files can be deleted from the project anytime via the project management.

Options
rlog Transformation:
Transform the count data matrix by the DESeq2 rlog function [default]
Alternatively the normalized expression values (NEs) are used for PCA.

Parameters for PCA
Number of Groups:
Here you can select the number of groups for your samples (e.g. the 3 groups control, treatment 1, and treatment 2), with a maximum of 13 groups.

Group properties:
A box will appear for each group: here, you can use drag and drop to assign the samples from the available-files-list above to your groups.
You can also edit the group names by clicking on the little pencil icon, and select the color that will be used for the specified group in the output graphics.

Transcript/Locus
The expression analysis can be based on different units of underlying data:
  • Locus-based expression analysis:
    The exons of all transcripts with the same GeneID within a Genomatix locus are taken together and this "gene body" is used for counting reads (i.e. reads in overlapping exons of transcripts within the same locus are counted once)
  • Transcript-based expression analysis:
    All transcripts are considered separately when counting reads in exons (and reads within overlapping transcripts/exons might be counted several times)
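
The difference between the two counting modes can be sketched in Python. The exon coordinates and read positions are made up, each read is simplified to a single position, and the helper names are hypothetical; this is not the program's counting code.

```python
def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def count_reads(read_positions, exons):
    """Count read positions that fall inside any of the given exons."""
    return sum(any(s <= pos < e for s, e in exons) for pos in read_positions)

# Hypothetical locus with two transcripts that share overlapping exons.
transcripts = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(150, 250), (300, 400)],
}
reads = [120, 160, 180, 240, 350, 500]   # made-up read positions

# Locus-based: exons of all transcripts are merged into one "gene body".
gene_body = merge_intervals([ex for exons in transcripts.values() for ex in exons])
locus_count = count_reads(reads, gene_body)

# Transcript-based: each transcript is counted separately.
transcript_counts = {tx: count_reads(reads, exons) for tx, exons in transcripts.items()}
```

Here the locus-based count is 5, while the two transcripts receive 4 reads each, because reads in the shared exons are counted once per transcript.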
If the transcript-based expression analysis is checked, the transcripts used for expression analysis can additionally be constrained by their source (e.g. NCBI RefSeq). By default, all non-redundant transcripts available in ElDorado are used. Depending on the organism, several transcript sources are available. For example, human and mouse transcripts are available from
  • NCBI RefSeq
  • Ensembl
  • NCBI GenBank
For plants, additional sources may be available (e.g. Phytozome for Glycine max).

Output
Result
Here, you can edit the default name of the result file.
Email address
When the analysis is finished, an email with the URL of the results will be sent to the email address provided by the user.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!


Output

The output has a number of sections, depending on the input (one or two data sets) and parameters:

  1. Analysis Parameters
  2. Overview
  3. PCs (principal components): the top loadings for the first PCs accounting for 90% of variance in the data
  4. 3D Score plot
  5. Download of Data Files

The result sections are described in detail below.


1. Analysis Parameters


2. Overview

The example shown below is based on data published by Roforth et al:

Roforth MM, Atkinson EJ, Levin ER, Khosla S, Monroe DG.
Dissection of estrogen receptor alpha signaling pathways in osteoblasts using RNA-sequencing.
PLoS One. 2014; 9(4).

Overview table

Samples    Number of samples submitted to analysis
PCs        Number of principal components calculated (max 10)
Variables  Number of loci or transcripts considered for analysis
Method     svd = singular value decomposition (details)
R2         The proportion of variance explained by each PC calculated (eigenvalue)
R2cum      The cumulative proportion of the variance explained by the current and all preceding principal components.
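
The relation between R2 and R2cum can be sketched in Python. The eigenvalues are made up, the function name is hypothetical, and the sketch assumes that the total variance is the sum of the eigenvalues of all components.

```python
from itertools import accumulate

def explained_variance(eigenvalues):
    """Return (R2, R2cum) for eigenvalues sorted in decreasing order."""
    total = sum(eigenvalues)                     # total variance in the data
    r2 = [value / total for value in eigenvalues]
    r2cum = list(accumulate(r2))                 # running sum of R2
    return r2, r2cum

eigenvalues = [6.0, 3.0, 1.0]                    # made-up values
r2, r2cum = explained_variance(eigenvalues)
```

With these values, the first component explains 60% of the variance and the first two together explain 90%.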

Example:

example overview table

Score plot

The score plot displays each sample in the data set with respect to the first two principal components and can therefore be used to interpret the relations among the samples. This information can be used to identify outliers.

In the example below, the replicate samples are highly similar with respect to the first two principal components, showing a small within-group variance and a good separation of the groups.

example score plot

Scree plot

The scree plot visualizes which principal components account for which fraction of the total variance in the data. The principal components are listed in decreasing order of their contribution to the total variance. The bars show the proportion of variance represented by each component (R2) and the points show the cumulative variance (R2cum).

Example:

example scree plot

Loadings plot

The loadings plot is a plot of the relationship between original variables (genes) and subspace dimensions. It allows the identification of genes that are most strongly correlated or anti-correlated with the first two principal components.

example loadings plot


3. Principal Components (PCs)

For the top principal components that are needed to account for 90% of the variance in the data (or up to a maximum of 10 PCs) the 40 transcripts/loci with the highest absolute loadings are shown.
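
The selection described above can be sketched in Python. The R2 values, gene names, and loadings are made up, and the function names are hypothetical; the sketch only mirrors the stated rules (90% of variance, at most 10 PCs, highest absolute loadings).

```python
def pcs_to_report(r2, threshold=0.9, max_pcs=10):
    """Number of leading PCs needed to reach the cumulative variance threshold."""
    cumulative = 0.0
    for index, fraction in enumerate(r2[:max_pcs], start=1):
        cumulative += fraction
        if cumulative >= threshold:
            return index
    return min(len(r2), max_pcs)      # threshold not reached within the limit

def top_loadings(loadings, n_top=40):
    """Rank (name, loading) pairs by absolute loading, largest first."""
    ranked = sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n_top]

r2 = [0.55, 0.25, 0.15, 0.05]         # made-up variance fractions per PC
n_pcs = pcs_to_report(r2)

pc1_loadings = {"geneA": 0.10, "geneB": -0.70, "geneC": 0.45, "geneD": -0.05}
top2 = top_loadings(pc1_loadings, n_top=2)
```

Ranking by absolute value means that strongly negative loadings (such as geneB here) are reported alongside strongly positive ones.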

Example:

example loading plot


4. 3D Score plot

This score plot displays each sample in the data set with respect to the first three principal components.

Example:

example 3D score plot


5. Download of Data Files

File                         Description
91.input_replicate_analysis  Expression data used as PCA input (the transformed count data in case of rlog transformation, or normalized expression values (NEs) if rlog was switched off)
Statistics.xml               Information displayed in overview tables
Loadings.xml                 Information displayed in loadings tables (rank, geneID, symbol and loading)
scores.tsv                   Scores for all samples for each PC calculated
loadings.tsv                 Loadings for each transcript/locus for each PC calculated
07.expression_profile.tsv    Expression values for each transcript/locus for each input file (separately) (detailed description)
ScorePlot2D.png              Score plot for the first two PCs
ScorePlot3D.png              3D score plot for the first three PCs (only if more than 2 samples were submitted)
ScreePlot.png                Scree plot for all computed PCs
Loadings.png                 Loadings plot for the first two PCs
Loadings_PCx.png             Plot of the top 40 loadings for PCs contributing to 90% of variance (maximum 10)