I like Python. I think it's an ideal language for Bioinformatics, because
it allows us to make new things very quickly. There are first class
libraries for scientific computing like numpy,
pandas and many others that make us more
productive programmers. The Python Package Index
contains thousands of other packages to do all sorts of things, and we can
install and try out these packages in seconds using
pip. The language itself is
very clean and easy to learn.
There is a persistent idea out there, however, that Python is too slow to
be of real use in large-scale Bioinformatics work. For example, if we
were to write a variant caller in Python, it would just be too slow. While
this is certainly true if we were to write all of the application in Python,
it's not necessarily true if we get our layering right.
One of the beauties of the design of Python is that we can easily incorporate
code written in C and C++ as extension modules. There are a variety of
ways of doing this. We can use the Python C API directly, or there are third
party tools such as Cython to simplify the process.
This gives us the best of both worlds: the ease of programming and flexibility
of Python along with the speed of C and C++. However, people seem to
dismiss Python as still being somehow too inefficient.
I want to challenge this idea with an example using my coalescent simulator
msprime. The package provides a simple
Python API to allow
users run simulations and also analyse the resulting trees. The core of the
simulation and analysis code is written in C, but this is entirely hidden
from the user. Because we have a Python API to access the simulation code,
writing command line interface programs is really easy. The
argparse library lets
us write fully POSIX compliant CLIs in a very straightforward way, and
setuptools makes the proper
installation of these binaries incredibly easy.
Being able to write POSIX compliant command line interfaces that can be
installed into the user's system using standard methods is a big advantage over
C and C++ in my view. Writing CLIs in C is a huge pain. It's a surprisingly
time-consuming task because the standard C library functions functions are
quite low-level. Inevitably, everyone has slightly different ideas for how a
command line interface should be structured, and so we have lots of 'quirky'
interfaces for bioinformatics programs. The Python argparse module provides
standardised methods of handling subcommands, arguments, options, help output
and error handling. This means, for example, that I don't have to worry about
what the correct exit status is when the user provides bad input. I can get on
with writing interesting code that hasn't been done thousands of times before
by other people.
Correctly handling program installation is an even bigger gain, in my
opinion. While there are certainly good ways to handle the installation
of programs written in C and C++
and cmake for example), I don't think anyone
would argue that these are quick or easy. Installation must be done
via downloading a tarball, running
make install. Using
Python, we can upload a package to PyPI and the user can then
install it with
pip install <package-name>.
We can use
msprime as an example. After installing the
library dependencies, we can
install the application using
$ sudo pip install msprime
This installs two programs,
into the appropriate location on the system (depending on the platform),
as well as installing the Python modules. We can then immediately
use the programs to run some simulations.
$ msp simulate 1e5 out.hdf5 -m 1e8 -u 1e-3 -r 1e-3
Here we reproduce the simulation of 100,000 samples over a 100 megabase region
used by Ryan Layer and others to test out
tool. The simulation took about 8 minutes and the output is a 102MB HDF5 file. This HDF5 based format is only used by
msprime, and so we provide some tools to convert to other widely used
formats. The command
msp newick takes a HDF5 tree sequence file as an
argument and writes out the encoded genealogies in Newick format.
$ /usr/bin/time -v msp newick out.hdf5 | head -n 10000 > subset.txt
Command terminated by signal 13
Command being timed: "msp newick out.hdf5"
User time (seconds): 15.49
System time (seconds): 5.44
Percent of CPU this job got: 50%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.40
$ ls -lh subset.txt
-rw-r--r-- 1 jk jk 19G Nov 15 16:46 subset.txt
In this example, we convert the the first 10,000 genealogies encoded in
out.hdf5 and write them to the file
subset.txt. This required a
user time of 15 seconds, and gave a 19G output file. Therefore, we output
Newick trees in this example at a rate of around 1.2 GB per second, which it
is very unlikely the average I/O subsystem will be able to sustain for long.
There is therefore no real point in being faster than this, and if the entire
application was written in C from top-to-bottom, it's unlikely we would notice
any performance benefit.
Here is the current full implementation of the
msp newick command:
tree_sequence = msprime.load(args.history_file)
for length, newick in tree_sequence.newick_trees():
While this is quite minimalistic, it serves to illustrate a couple of points:
- Writing command line interfaces in Python is easy. Very little code
will give you a fully functional, POSIX compliant program that is
simple to deploy on a wide range of platforms.
- Python programs do not need to be slow. They can write output as fast
as you can reasonably expect any I/O system to consume it. If your
application is layered appropriately, the difference in performance
between a Python frontend and a full C application is negligible.