When Trivial Becomes Critical: File Transfer and its Role in Genomic Sequencing

May 19, 2011 | Brigitte Raumann

I think I can speak for most biologists when I say I never thought I would be worrying about file transfer. Compute power, yes. Storage space, maybe. But file transfer? Never. Unlike some other scientific disciplines, biology is not a traditionally ‘big data’ science. Generally, biologists produce data on the scale suited to e-mail attachments. However, seemingly overnight, biology has been propelled into the ranks of the big data sciences. Now a biologist can easily find herself confronted with terabytes of data. Why the change? The answer lies in the recent quantum leaps in DNA sequencing technology.

DNA sequencing itself is not new. Molecular biologists have been sequencing DNA for decades using a technique developed in the late ’70s. But in 2005, a series of discoveries culminated in the advent of the Next Generation Sequencing (NGS) technologies. (NGS is actually a collection of several distinct sequencing techniques, but one common characteristic is that, unlike earlier technologies, NGS technologies sequence DNA on a massively parallel scale.) As a result, sequencing costs are plummeting and efficiency is soaring: a human genome can now be sequenced in a week for $1000 in reagent costs, compared with the approximately 13 years and $300 million in reagent costs it took to complete the first human genome in 2003. (Imagine the effect on the airline industry if the cost of a $300 million Airbus A380 dropped to $1000 in the space of eight years.)

Now that DNA sequencing has become relatively fast and cheap, huge data sets are available to a wide spectrum of biologists, from the microbiologist who wants to identify the genomes of all microbes hosted by the human body (did you know that 90% of the genomes in your body are microbial, not human?) to the clinician who wants to know how the genomes differ between her diabetic and healthy patients. In fact, DNA sequencing has become so cheap and so fast that some scientists predict that in less than ten years the number of sequenced genomes will increase from the current count of a few thousand to over 200 million.

Which brings me back to file transfer. By now, most biologists have at least heard about, if not experienced, the compute power and storage crisis precipitated by NGS data. But I suspect few have considered the additional problem of data movement. My husband, a molecular geneticist, summed up a typical attitude when he said to me, “What’s the problem? Just move the files from here to there.”

The problem is that thousands of NGS instruments around the globe are producing massive amounts of data that must be routinely moved off the instruments and onto compute clusters and storage devices. Although the scientists running sequencing centers are highly trained experimentalists, they generally have very little IT experience or support. To many, command line interfaces, checksums, firewalls and the like are awkward, if not totally foreign. In many cases, the usual methods of scp, sftp or rsync are so inefficient, unreliable, or complicated that sequencing centers resort to shipping hard drives. So any file transfer solution universally adopted by sequencing centers will need to be as easy and reliable as dragging and dropping files onto a hard drive, leaving the hard drive at the FedEx box, and walking away feeling confident that you are done. If it’s not that simple, it won’t be adopted.

Based on what I’ve seen so far, Globus Online could very well be this solution. It meets the three key qualifications: it is secure, it is high-speed, and it is easy to use, with a drag-and-drop interface. I won’t be surprised to see Globus Online become part of the computing infrastructure that supports the genomic age.

Although NGS has been around for almost a decade now (in fact, ‘third-generation’ technologies are now coming onto the market), I’m still awestruck by the impact that one leap in technology has had on an entire scientific field and beyond. And I get a shiver of excitement when I think about the growing mountain of data that is just waiting for us to explore!
