Case Study: Transfer of large datasets to Amazon S3

May 14, 2015   |  Vas Vasiliadis

Knowledge Lab is a University of Chicago research center that seeks to leverage insights into the dynamics of knowledge creation and advances in large-scale computation to reimagine the scientific processes of the future. It does so by identifying gaps in the global knowledge landscape and areas of rich potential for breakthroughs, and by automating discovery through the generation of novel, potentially high-impact hypotheses.
As the executive director of Knowledge Lab, I face Big Data challenges with my team on a daily basis. In particular, our lab leverages data migration, storage, and analysis to accelerate scientific progress by conceiving of and implementing revolutionary computational approaches to reading, reasoning, and hypothesis design that transcend the capacity of individual researchers and traditional teams. The data sets upon which our analyses are built are many terabytes in size, and our data often resides in silos that are not easily accessible to our decentralized research network of 40+ field-leading scientists, mathematicians, engineers, and scholars. Meeting our research needs requires novel methods for dissemination and sharing. To that end, we rely on the "centralizing" power of cloud and distributed computational resources.
Given the distributed nature of our center and our frequent use of cloud storage, we were delighted to hear that Globus recently added support for Amazon Simple Storage Service (S3). We use transfers to S3 to facilitate multi-institutional research, to create analysis pipelines between S3 and various computational services (EC2, RDS, EMR) in the AWS cloud, and to provide redundancy.

Over the past several months we have been investigating methods for moving two large datasets from the Open Science Grid (OSG) to S3. The first transfer was the complete full text and all associated metadata of the entire extant IEEE corpus: 6,134,400 files and directories totaling 2.59 TB. Most files were less than 1 MB in size, so this transfer highlights many of the file transfer optimizations available only through Globus. Previous attempts to transfer this corpus from OSG via rsync and parallelized rsync failed because the projected transfer times were on the order of several months. Using Globus S3 support, we were able to complete this transfer in a matter of days. The second corpus is the entire English-language Wikipedia with complete revision history (back to Wikipedia's inception). This was a transfer of 173 files totaling 12 TB, which we were able to complete in approximately 3 days using Globus.
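
As a rough sanity check on the figures above, the average object size and the sustained throughput they imply can be worked out in a few lines of Python. This is a back-of-the-envelope sketch, not a measurement: it assumes decimal units (1 TB = 10^12 bytes) and treats "approximately 3 days" as exactly 72 hours, and the variable names are illustrative only.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
# Assumptions (not from the original post): 1 TB = 10**12 bytes, and
# "approximately 3 days" is taken as exactly 72 hours.

TB = 10**12

# IEEE corpus: 6,134,400 files and directories, 2.59 TB in total.
ieee_bytes = 2.59 * TB
ieee_objects = 6_134_400
avg_object_bytes = ieee_bytes / ieee_objects

# Wikipedia corpus: 173 files, 12 TB, roughly 3 days.
wiki_bytes = 12 * TB
wiki_files = 173
wiki_seconds = 3 * 24 * 60 * 60
avg_wiki_file_bytes = wiki_bytes / wiki_files
wiki_rate_bytes_per_s = wiki_bytes / wiki_seconds

print(f"IEEE: ~{avg_object_bytes / 1e3:.0f} kB per object on average")
print(f"Wikipedia: ~{avg_wiki_file_bytes / 1e9:.1f} GB per file on average")
print(f"Wikipedia: ~{wiki_rate_bytes_per_s / 1e6:.1f} MB/s sustained")
```

Under these assumptions the two transfers sit at opposite extremes: millions of objects averaging well under 1 MB each, versus fewer than two hundred files of tens of gigabytes each. The IEEE average is only approximate, since the object count includes directories as well as files.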

Both transfers drove home the advantages of using Globus for our work, as each was orders of magnitude faster than more conventional methods. In addition, each transfer had 26 faults, all of which were recovered from automatically and without supervision. Moreover, my group and I have confidence in the copied data thanks to Globus's automated integrity checking.

Eamon Duede
Executive Director, University of Chicago Knowledge Lab