Thursday, 22 May 2014

Interactive RNA-Seq analysis using Degust (part 1)

tl;dr  Dynamic MDS plots with degust are very useful - all the cool kids are using them.  Try the interactive demo


Degust is a freely available, web-based tool for the analysis of differential gene expression data.  It was primarily designed for RNA-Seq data, but it can also be used with microarray experiments.  Degust focuses on being fast and interactive while still using sound analysis techniques (the Degust backend performs its analysis using voom+limma).

In this post I will briefly outline the advantages of the dynamic and interactive MDS plot feature of degust.   Trying it with your own data is easy (skip to Getting Started).

In a later post I'll describe some of the other features of degust, and how to use it with your own analysis.

Multidimensional Scaling Plot

A Multidimensional Scaling (MDS) plot is a convenient way to visualize how well your replicates have behaved in an RNA-Seq experiment.  These are sometimes called PCA plots; while MDS and PCA are not strictly the same, for our use of MDS here they are essentially equivalent.

On an MDS plot we are looking for several things.  One thing we'd like to see is that our replicates cluster tightly together - that is, that there is more variability between our conditions than within our replicate groups.  We may also identify clear outlying samples that may need to be removed from the analysis.

You should also look for any obvious structure in your MDS plot.  For example, older microarray data often had strong batch effects.  If there is any such structure that you are not interested in, then you should consider ways to remove it or model it in your analysis.

An example MDS plot from degust

It is important to remember that a two dimensional MDS plot shows only the 2 largest dimensions of variability.  It can be important to check the other dimensions to see if any of those show some structure from your experiment.

The MDS plot in degust includes a chart showing the magnitude of the first few dimensions so it is immediately obvious if other dimensions account for a large proportion of the variability.  

Adding Interactivity

It is common to use a subset of the genes to produce an MDS plot.  For example, the plotMDS function in the limma analysis package uses the top 500 most variable genes by default.   Using a different number of genes can often produce a significantly different MDS plot and reveal different information about your data.

Adding interactivity is like adding a new dimension to your visualization.  With Degust it is possible to quickly change the number of genes used to calculate the MDS and immediately see the results.  So you can quickly see how the MDS plot changes when considering just the few most variable genes, the top 500 genes, or every gene.

Further, the Degust interface includes a Skip genes option.  Using this, it is possible to ignore the most variable genes and see whether the expected structure remains in your MDS plot.  This can be useful for getting a rough idea of how many genes account for the structure you see in your MDS plot.

Interactive MDS plot demo

Gory details

The MDS calculation is performed entirely in the browser in Degust.  The counts are normalized for library size, then transformed as transformed = log2(10 + count).  The genes are ranked by descending variance across the samples; the top-ranked genes are discarded, as defined by Skip genes, and the next genes, as defined by Num genes, are selected to compute the MDS.  This set of genes is shown in the table below the MDS plot.

The Numeric Javascript library is used to perform the MDS calculation, using its singular value decomposition (SVD) function.  The final values are then divided by sqrt(num genes), so the distance between a pair of samples on the MDS plot may be interpreted as the "typical" root-mean-square log2-fold-change between those samples.

This calculation is inspired by the plotMDS function from limma with the parameter gene.selection="common".
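The steps above can be sketched in a few lines of Python with NumPy.  To be clear, Degust itself does this in JavaScript in the browser; the function below is my own illustration rather than Degust's actual code, and it assumes the counts have already been normalized for library size:

```python
import numpy as np

def mds_points(counts, num_genes=500, skip_genes=0):
    """Sketch of the Degust-style MDS calculation.

    counts: genes x samples array of library-size-normalized read counts.
    Returns one row of MDS coordinates per sample.
    """
    x = np.log2(10 + np.asarray(counts, dtype=float))   # transformed = log2(10 + count)
    order = np.argsort(x.var(axis=1))[::-1]             # rank genes by descending variance
    keep = order[skip_genes:skip_genes + num_genes]     # drop "Skip genes", take "Num genes"
    y = x[keep]
    y = y - y.mean(axis=1, keepdims=True)               # center each gene across samples
    u, s, vt = np.linalg.svd(y, full_matrices=False)    # SVD of the centered matrix
    # Divide by sqrt(num genes) so the distance between two samples reads as
    # the "typical" root-mean-square log2-fold-change between them.
    return (vt.T * s) / np.sqrt(len(keep))
```

With this scaling, the Euclidean distance between two rows of the result equals the root-mean-square log2 difference, over the selected genes, between those two samples.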

Getting Started with Degust

For the purposes of this post, I'll assume that you have mapped your RNA-Seq reads, and produced per-feature counts (for example using Bowtie and htseq-count).  Format these gene counts as a CSV file with a row per gene and a column per sample and you're ready to use degust!
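For example, a minimal counts file for a wildtype/mutant experiment with two replicates per condition might look like the following (the gene names, descriptions and counts here are made up for illustration):

```
gene,description,WT-1,WT-2,Mut-1,Mut-2
geneA,hypothetical protein A,1250,1318,455,501
geneB,hypothetical protein B,890,1010,902,873
geneC,hypothetical protein C,15,9,12,18
```

Here the gene and description columns would serve as Info columns, and the remaining columns would be assigned to their respective conditions.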

Upload your CSV file to our public server.  You'll then see a web page for configuration.  The important fields to complete are:
  • Provide a useful Name for your data.
  • Specify one or more Info columns that will be used to display information on each gene.
  • Use the Add condition button for each condition you have in your experiment, then use the pull-down to select the columns from your CSV file that contain read counts for each replicate of that condition.
Now Save changes and View.

Degust contains many other useful features for analysis which I'll talk about next time.

Monday, 10 February 2014

Visualizing Cuffdiff output using Vennt

I recently added a new option to Vennt which makes it easy to directly visualize the output from your Tophat/Cufflinks/Cuffdiff analysis.  This is particularly useful when comparing multiple conditions in a single Cuffdiff run.  Vennt will produce an html page which lets you look at the overlaps between each pairwise comparison and to dynamically alter the thresholds for significance.

As an example, say we have 3 experimental conditions: Wildtype, Mutant 1 and Mutant 2.  Using Vennt it is simple to:

  • View all the significantly differentially expressed genes in Mutant 1 vs Wildtype
  • Dynamically adjust filters for significance using the p-value cut-off, or Fold-change
  • Find the set of genes that are significant in both Mutant 1 vs WT and Mutant 2 vs WT
  • Download any of these gene-sets from the various overlaps as a CSV file
  • Share the results of the analysis in a single HTML file

To try the new cuffdiff feature in Vennt, I found a paper that had shared their Cuffdiff output through GEO.  I was able to download the Cuffdiff results, then run the gene_exp.diff file through Vennt to produce a single HTML file visualizing the analysis results.  This particular experiment has 7 conditions, making for 21 pairwise comparisons!  It is much easier to explore this type of experiment with a tool like this (although I'd prefer to use Degust).  The resulting HTML file is large (39MB) since it essentially embeds the whole expression file.  You can try the resulting HTML file for yourself.

Tuesday, 29 October 2013

Vennt : Venn diagrams for lists of Differentially Expressed Genes

Recently, I have analysed a number of RNA-seq experiments.  Often these experiments have multiple conditions so it is possible to perform several different pair-wise comparisons.  After doing a number of these pair-wise comparisons, my collaborators invariably ask for the overlap between the gene-lists - or better yet a Venn diagram of them.

So, I wrote a web tool, Vennt, that draws Venn diagrams for gene lists.  Venn diagrams for up to 4 lists are drawn, and we can dynamically adjust the thresholds for significance, and get the explicit gene list for the various overlaps.

It's easy to use with your own data - just prepare a CSV file of your gene lists and follow the instructions.  Try it out, and let me know what you think!

Try a live demo

Disclaimer: keep in mind that Venn diagrams of gene lists from differential expression experiments are often misleading, primarily due to the arbitrary thresholds we use.  I have been known to explain this to collaborators at length - perhaps this tool is my attempt at venting...  However, from using this tool I have actually found that it can be a useful way to explore your gene expression data.

Monday, 25 June 2012

Slowdown with GHC when using multiple CPUs

Recently I was writing a tool to perform a simple calculation on several large fasta sequence files. When I had it working, I thought "cool, this is haskell, let's speed it up by simply running on multiple processors". Everything worked beautifully, with a near-linear speedup as I increased the number of cores, up to 4.

However, I later discovered a serious problem. I was running it on a shared server which had one of its 4 cores pegged at 100% by another user. I didn't notice this straight away, and ran my tool telling it to use all available cores, and instead of the nice speed up, it took significantly longer than using a single core!

I have since discovered that this problem is documented in the GHC docs:
Be careful when using all the processors in your machine: if some of your processors are in use by other programs, this can actually harm performance rather than improve it.
I think this is a serious problem, because I just want to be able to provide a tool for people to use. It is not realistic to have the users of the tool estimate how many CPU cores to use, and in fact, it is very difficult to estimate since the tool might be long-running, and the load on the shared server quite variable.

I have since reproduced this problem with code taken directly from the haskell wiki:
    {-# LANGUAGE BangPatterns #-}
    import Data.Digest.Pure.MD5
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import Control.Concurrent
    import Control.Monad (replicateM_)

    -- Fork one hashing thread per file, then print results as they arrive.
    main = do
        files <- getArgs
        str <- newEmptyMVar
        mapM_ (forkIO . hashAndPrint str) files
        printNrResults (length files) str

    printNrResults i var = replicateM_ i (takeMVar var >>= putStrLn)

    -- Strictly force the MD5 hash before handing the result back via the MVar.
    hashAndPrint str f = do
        bs <- L.readFile f
        let !h = show $ md5 bs
        putMVar str (f ++ ": " ++ h)
I compiled this using GHC v7.4.1 on Mac OSX 10.7, using the following command
   ghc --make -rtsopts -threaded -O2 run.hs
I tested this on my desktop, which has 4 real cores, reading in 4 large files simultaneously. There was an almost linear speedup when using 2 or 4 cores:
  ./run +RTS -N1   : 20.4 sec
  ./run +RTS -N2   : 11.0 sec
  ./run +RTS -N4   : 6.7 sec

To demonstrate the problem, I then ran 2 different processes that would just keep 2 of my CPU cores busy. That is, for these tests 2 of my 4 CPU cores already had 100% load.
  ./run +RTS -N1   : 23.5 sec
  ./run +RTS -N2   : 14.1 sec
  ./run +RTS -N4   : 57.8 sec   <---- Blowout...

Notice that when told there were 4 available cores to run on, when in fact 2 of those cores were already busy, the program took over twice as long as when told to use only 1 core. Running the program with "-sstderr" typically showed "%GC time" at around 6%. In the final test case this increased to 52.4%.

This is quite a concern in practice since it is desirable to be able to distribute a binary that is always run with "+RTS -N" (telling GHC to automatically determine the number of capabilities). However, this will work very poorly on machines with unpredictable load, such as shared servers.

Output from ThreadScope

Below I show 2 screenshots from ThreadScope from running the test program with -N4.  First with all CPUs idle, then with 2 of the 4 CPUs already busy.

Clearly the issue is some interaction with garbage collection.  I'd love to know if this is a fundamental problem with the GHC RTS scheduler or if there is something that could be done to improve this situation.

Update (2012-07-02)

After posting to haskell-cafe, Alexander Morozov suggested an explanation: the parallel GC.  This has to stop all threads when doing GC, and if one of the threads can't complete its GC during its time-slice then all threads have to block until that thread gets another time-slice to complete its GC.  He suggested using the single-threaded GC, which performed better on his machine.  I re-ran my tests using the single-threaded GC (-qg):

Single threaded GC, with no load on CPUs:
  ./run +RTS -N1 -qg  : 18.7 sec
  ./run +RTS -N2 -qg  : 10.2 sec
  ./run +RTS -N4 -qg  : 8.0 sec

Single threaded GC, with 2 of 4 CPUs busy:
  ./run +RTS -N1 -qg  : 21.5 sec
  ./run +RTS -N2 -qg  : 11.8 sec
  ./run +RTS -N4 -qg  : 12.1 sec

These numbers are much better - so I think I'd recommend the single-threaded GC on a machine with unpredictable load.

Sunday, 24 June 2012

Awesome shell feature : Process substitution

Just last week I was discussing with colleagues at the VBC that mkfifo is useful, but what you typically want is to have the fifo created and cleaned up for you by the shell.  It turns out this already exists in popular shells... How is it that I've never come across this in all my years using unix!?!

Working in bioinformatics we find many tools that cannot decompress sequence files automatically.   When the tool in question only takes a single sequence file we can typically use a standard shell pipe, for example :
    zcat seq.fa.gz | wc

However, when the tool takes multiple files we need another approach.   Imagine we have a tool, "seqs", which takes two file parameters, "-file1" and "-file2", and we want to give it the compressed files without decompressing them to the filesystem.  We could do this using mkfifo :
    mkfifo pipe1 pipe2
    zcat seq1.fa.gz > pipe1 &
    zcat seq2.fa.gz > pipe2 &
    seqs -file1 pipe1 -file2 pipe2
    rm pipe1 pipe2

There is a much nicer way to do this using Process substitution :
    seqs -file1 <(zcat seq1.fa.gz) -file2 <(zcat seq2.fa.gz)

It is also possible to use this for a program's output.  This is often useful in conjunction with the tee tool to send a program's output to several different programs.  Here's an example, from the tee manual, that demonstrates downloading a file while computing its sha1 hash and md5 hash at the same time :
    wget -O - http://example.com/dvd.iso \
      | tee >(sha1sum > dvd.sha1) \
            >(md5sum > dvd.md5) \
      > dvd.iso

Process substitution is supported by bash and zsh, but not by csh based shells.

One limitation of using process substitution this way is that it uses pipes, so if your tool tries to seek() around in the file it will fail.  zsh has a nice feature to overcome this limitation by creating a temporary file and deleting it afterwards : simply use =(...) instead of <(...), for example seqs -file1 =(zcat seq1.fa.gz).