Sunday, 24 June 2012

Awesome shell feature : Process substitution

Just last week I was discussing with colleagues at the VBC that mkfifo is useful, but what you typically want is to have the fifo created and cleaned up for you by the shell.  It turns out this already exists in popular shells... How is it that I've never come across this in all my years using unix!?!

Working in bioinformatics, we find many tools that cannot decompress sequence files automatically. When the tool in question only takes a single sequence file, we can typically use a standard shell pipe, for example :
    zcat seq.fa.gz | wc

However, when the tool takes multiple files we need another approach.   Imagine we have a tool, "seqs", which takes two file parameters "-file1" and "-file2" but we want to give it the compressed files without decompressing them to the filesystem.  We could do this using mkfifo :
    mkfifo pipe1 pipe2
    zcat seq1.fa.gz > pipe1 &
    zcat seq2.fa.gz > pipe2 &
    seqs -file1 pipe1 -file2 pipe2
    rm pipe1 pipe2

There is a much nicer way to do this using process substitution :
    seqs -file1 <(zcat seq1.fa.gz) -file2 <(zcat seq2.fa.gz)
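
Since "seqs" is only an imaginary tool, here is a small sketch you can actually run in bash. It assumes nothing beyond gzip and paste being installed; the file names and sequence data are made up for the demo, and paste stands in for any tool that wants two file arguments. I use gzip -dc, which does the same job as zcat.

```shell
#!/usr/bin/env bash
# Create two small gzip-compressed files in a scratch directory (made-up data).
workdir=$(mktemp -d)
printf 'ACGT\nTTGA\n' | gzip > "$workdir/seq1.txt.gz"
printf 'ACGT\nCCGA\n' | gzip > "$workdir/seq2.txt.gz"

# paste takes two file arguments. Each <(...) expands to a path
# (typically /dev/fd/N) that paste opens and reads like an ordinary file.
pasted=$(paste <(gzip -dc "$workdir/seq1.txt.gz") <(gzip -dc "$workdir/seq2.txt.gz"))
printf '%s\n' "$pasted"

# Only the scratch directory needs cleaning up; there are no fifos to remove.
rm -r "$workdir"
```

The two decompressed streams end up side by side, one tab-separated pair per line, and nothing uncompressed is ever written to disk.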

It is also possible to use this for a program's output. This is often useful in conjunction with the tee tool to send a program's output to several different programs. Here's an example from the tee manual that demonstrates downloading a file and computing its sha1 and md5 hashes at the same time :
    wget -O - http://example.com/dvd.iso \
      | tee >(sha1sum > dvd.sha1) \
            >(md5sum > dvd.md5) \
      > dvd.iso
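
If you want to try the same pattern without downloading anything, here is a minimal sketch for bash. The input text and file names are invented for the demo, and wc stands in for the checksum tools. One thing to be aware of: as far as I can tell, bash does not wait for the >(...) processes when the pipeline finishes, so the sketch polls briefly before reading their results.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)

# tee copies its input to stdout and to each >(...) path, so both wc
# processes read the same stream while the copy is being written.
printf 'one\ntwo\nthree\n' \
  | tee >(wc -l > "$workdir/lines.txt") \
        >(wc -c > "$workdir/bytes.txt") \
  > "$workdir/copy.txt"

# The >(...) processes run asynchronously: poll (for at most about 5 seconds)
# until both result files are non-empty.
for _ in $(seq 1 50); do
  [ -s "$workdir/lines.txt" ] && [ -s "$workdir/bytes.txt" ] && break
  sleep 0.1
done

# Some wc implementations pad their output with spaces, so strip them.
lines=$(tr -d ' ' < "$workdir/lines.txt")
bytes=$(tr -d ' ' < "$workdir/bytes.txt")
echo "lines=$lines bytes=$bytes"
rm -r "$workdir"
```

In a real script, polling like this is a bit crude. It is only here to make the asynchronous behaviour visible.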

Process substitution is supported by bash and zsh, but not by csh-based shells.

One limitation of using process substitution this way is that it uses pipes, so if your tool tries to seek() around in the file it will fail. zsh has a nice feature to overcome this limitation by creating a temporary file and deleting it afterwards : simply use =(...) instead of <(...).
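
You can check the pipe part for yourself in bash. The sketch below tests what kind of file <(...) hands over, and then shows a manual fallback using mktemp, which is roughly the work that zsh's =(...) does for you. The variable names and the sample data are just for illustration.

```shell
#!/usr/bin/env bash
# <(...) expands to a path; test -p succeeds if that path is a pipe (fifo).
if [ -p <(true) ]; then kind="pipe"; else kind="not a pipe"; fi
echo "process substitution gives a: $kind"

# Manual fallback for bash: write the output to a temporary file, which is a
# regular file and can therefore be seeked, then delete it afterwards.
tmp=$(mktemp)
printf 'ACGT\n' > "$tmp"
if [ -f "$tmp" ]; then tmpkind="regular file"; else tmpkind="other"; fi
echo "mktemp gives a: $tmpkind"
rm "$tmp"
```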
