Thursday, 22 May 2014

Interactive RNA-Seq analysis using Degust (part 1)

tl;dr  Dynamic MDS plots with degust are very useful - all the cool kids are using them.  Try the interactive demo


Degust is a freely available, web-based tool for analysis of differential gene expression data.  It was primarily designed for RNA-Seq data, but it can also be used with microarray experiments.  Degust focuses on being fast and interactive, while still using sound analysis techniques (the degust backend performs its analysis using voom+limma).

In this post I will briefly outline the advantages of the dynamic and interactive MDS plot feature of degust.   Trying it with your own data is easy (skip to Getting Started).

In a later post I'll describe some of the other features of degust, and how to use it with your own analysis.

Multidimensional Scaling Plot

A Multidimensional Scaling (MDS) plot is a convenient way to visualize how well your replicates have behaved in an RNA-Seq experiment.  These are sometimes called PCA plots.  Although the two are not strictly the same, for our use of MDS here they essentially are.

On an MDS plot we are looking for several things.  One thing we'd like to see is that our replicates cluster tightly together.  That is, we see more variability between our conditions than within our replicate groups.  We may also identify clear outlying samples that may need to be removed from the analysis.

You should also look for any obvious structure in your MDS plot.  For example, older microarray data often had strong batch effects.  If there is any such structure that you are not interested in, then you should consider ways to remove it or model it in your analysis.

An example MDS plot from degust

It is important to remember that a two dimensional MDS plot shows only the 2 largest dimensions of variability.  It can be important to check the other dimensions to see if any of those show some structure from your experiment.

The MDS plot in degust includes a chart showing the magnitude of the first few dimensions so it is immediately obvious if other dimensions account for a large proportion of the variability.  
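To make "proportion of the variability" concrete, here is a minimal sketch.  It assumes the magnitude of each dimension is measured by its squared singular value, as is usual for PCA; the exact scaling used in the degust chart may differ, and the function name and example values are made up for illustration.

```python
def variance_proportions(singular_values):
    """Share of total variability carried by each dimension.

    Assumption: variability per dimension is the squared singular value.
    """
    squared = [s * s for s in singular_values]
    total = sum(squared)
    return [sq / total for sq in squared]


# Hypothetical singular values for four dimensions.
proportions = variance_proportions([6.0, 3.0, 2.0, 1.0])
print([round(p, 2) for p in proportions])
print("first two dimensions:", round(proportions[0] + proportions[1], 2))
```

With these made-up values the first two dimensions carry 90% of the variability, so a two dimensional plot would hide very little.  If the third bar were nearly as tall as the second, that would be a hint to look at dimension 3 as well.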

Adding Interactivity

It is common to use a subset of the genes to produce an MDS plot.  For example, the plotMDS function in the limma analysis package uses the top 500 most variable genes by default.   Using a different number of genes can often produce a significantly different MDS plot and reveal different information about your data.

Adding interactivity is like adding a new dimension to your visualization.  With degust it is possible to quickly change the number of genes used to calculate the MDS and immediately see the results.  So, you can quickly see how the MDS plot changes when considering just the few most variable genes, or the top 500 genes, or every gene.

Further, the degust interface includes a Skip genes option.  Using this, it is possible to ignore the most variable genes and see if the expected structure is still there in your MDS plot.  This can be useful to get a rough idea of how many genes account for the structure you see in your MDS plot.

Interactive MDS plot demo

Gory details

The MDS calculation in degust is performed purely in the browser.  The counts are normalized for library size, then transformed as transformed = log2(10+count).  All the genes are ranked by descending variance (of each gene's transformed values across the samples), then the top-ranked genes are discarded, as many as specified by Skip genes.  The next Num genes genes in the ranking are then selected to compute the MDS from.  This set of genes is shown in the table below the MDS plot.
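The selection steps above can be sketched in a few lines of Python.  This is an illustration only, not the degust source (which runs as JavaScript in the browser).  In particular, the post does not say exactly how library size normalization is done, so the sketch assumes each sample is scaled to the mean library size; the function and parameter names are also made up.

```python
import math
from statistics import variance


def select_genes(counts, skip_genes=0, num_genes=500):
    """Pick the genes that feed the MDS calculation.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    Returns a dict of the selected genes with their transformed values.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    mean_lib = sum(lib_sizes) / n_samples

    # Assumption: normalize by scaling every sample to the mean library size,
    # then apply the transform from the text: log2(10 + count).
    transformed = {
        gene: [math.log2(10 + c * mean_lib / lib) for c, lib in zip(row, lib_sizes)]
        for gene, row in counts.items()
    }

    # Rank genes by the variance of their transformed values across samples,
    # most variable first.
    ranked = sorted(transformed, key=lambda g: variance(transformed[g]), reverse=True)

    # Drop the first `skip_genes`, then keep the next `num_genes`.
    chosen = ranked[skip_genes:skip_genes + num_genes]
    return {gene: transformed[gene] for gene in chosen}


# Tiny made-up data set: 5 genes, 4 samples.
example = {
    "geneA": [100, 100, 100, 100],
    "geneB": [0, 400, 0, 400],
    "geneC": [400, 0, 400, 0],
    "geneD": [90, 110, 110, 90],
    "geneE": [110, 90, 90, 110],
}
print(sorted(select_genes(example, skip_genes=0, num_genes=2)))
print(sorted(select_genes(example, skip_genes=2, num_genes=2)))
```

Setting skip_genes=2 in the second call drops the two most variable genes and moves on to the next two, which is the effect of the Skip genes option.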

The numeric javascript library is used to perform the MDS calculation using the singular value decomposition (SVD) function.  The final values are then divided by sqrt(num genes) so the distance between a pair of samples on the MDS plot may be interpreted as the "typical" root-mean-square log2-fold-change between those samples.
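The reason for dividing by sqrt(num genes) is easiest to see with the full distance between two samples rather than the 2-D plot.  The Euclidean distance over the selected genes, divided by sqrt(num genes), is exactly the root-mean-square of the per-gene log2 differences.  Below is a small sketch of that identity with made-up values.  Bear in mind that the two plotted dimensions only approximate this full distance.

```python
import math


def rms_log2_fold_change(sample_a, sample_b):
    """Root-mean-square difference between two samples.

    Both arguments hold log2-transformed values for the same genes,
    in the same order.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same genes")
    squared = [(a - b) ** 2 for a, b in zip(sample_a, sample_b)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up log2 values for four genes in two samples.
sample_a = [3.0, 5.0, 7.0, 9.0]
sample_b = [4.0, 4.0, 9.0, 7.0]

rms = rms_log2_fold_change(sample_a, sample_b)
scaled_euclidean = math.dist(sample_a, sample_b) / math.sqrt(len(sample_a))
print(rms, scaled_euclidean)  # the two values agree up to rounding
```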

This calculation is inspired by the plotMDS function from limma with the parameter gene.selection="common".

Getting Started with Degust

For the purposes of this post, I'll assume that you have mapped your RNA-Seq reads, and produced per-feature counts (for example using Bowtie and htseq-count).  Format these gene counts as a CSV file with a row per gene and a column per sample and you're ready to use degust!
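As a rough illustration of that layout, the sketch below writes a tiny counts table with Python's csv module.  The gene IDs, sample names and counts are all made up, and the function name is hypothetical; your own file will have whatever columns your counting tool produced.

```python
import csv
import io

# Hypothetical data: one identifier column, then one count column per sample.
header = ["gene_id", "ctrl_rep1", "ctrl_rep2", "treat_rep1", "treat_rep2"]
rows = [
    ["geneA", 120, 98, 310, 295],
    ["geneB", 0, 3, 1, 0],
    ["geneC", 1520, 1610, 1498, 1577],
]


def counts_to_csv(header, rows):
    """Return the counts table as CSV text: a row per gene, a column per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = counts_to_csv(header, rows)
print(csv_text)
```

In this made-up file, gene_id would be the column to use for displaying information on each gene, and the two ctrl_ columns and the two treat_ columns would be the replicates of two conditions.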

Upload your CSV file to our public server.  You'll then see a web page for configuration.  The important fields to complete are:
  • provide a useful Name for your data.
  • specify one or more Info columns that will be used to display information on each gene
  • use the Add condition button for each condition you have in your experiment.  Use the pull-down to select columns from your CSV file that contain read counts for each replicate of that condition
Now Save changes and View.

Degust contains many other useful features for analysis which I'll talk about next time.