Petrel service allows researchers to share data on a grand scale

February 19, 2021 | Argonne ALCF News

Neurons rendered from analysis of serial electron microscopy data ("connectomics"). The inset shows a slice of data from the imaged volume, with colored regions marking individually labeled neurons identified by machine-learning algorithms for automated image segmentation. The magnified view (white box) and the corresponding 3D rendering (red box) illustrate how the segmentation algorithm extracts three-dimensional object information from the EM volume to map the connectivity properties between neurons in the brain. The full dataset, a sample of brain tissue from a 15-week-old mouse, was segmented on hundreds of nodes of the Theta supercomputer as part of a developmental comparison study. (Data courtesy of Gregg Wildenberg and Hanyu Li, Kasthuri Lab, University of Chicago)

By breaking down barriers to large-scale data sharing, the Argonne-developed Petrel service is enabling science that would not otherwise be possible.

As the use and impact of high-performance computing (HPC) systems have penetrated an ever-wider range of scientific disciplines, the quantities of data generated by researchers have more than kept pace with the accelerating processor speeds driving computing-based research. In particular, advances in feeds from scientific instrumentation such as beamlines, colliders, and space telescopes—among many technologies—have increased data output substantially. Users are producing more data and have more capabilities for using these data. Yet the sheer size and scale of the data can make the seemingly simple task of sharing one’s work a daunting problem.

To deal with the logistics of handling vast quantities of data and to support their rapid, reliable and secure sharing, researchers at the U.S. Department of Energy’s (DOE) Argonne National Laboratory collaborated with Globus to develop and deploy the Petrel data service at the Argonne Leadership Computing Facility (ALCF). Petrel is currently being used for several initiatives at the Advanced Photon Source (APS). The ALCF and APS are DOE Office of Science User Facilities located at Argonne.

“There was growing demand for a way to distribute data more widely so as to easily engage multiple institutions,” said Ian Foster, director of Argonne’s Data Science and Learning (DSL) division. “The remedy was to allocate a more or less broadly accessible storage system and, crucially, to make it manageable via Globus protocols, which are mechanisms for controlling the flow of data, as well as who can see it.”

Petrel and accelerating data-intensive COVID-19 research

During the coronavirus pandemic, Petrel has been instrumental in accelerating data-intensive COVID-19 research that leverages ALCF resources.

To elucidate the nature of the virus and identify potential therapeutics, researchers are using supercomputers to screen a massive pool of billions of small molecules and run calculations to estimate their properties. The associated descriptors are used to train machine learning models that identify molecules likely to dock well with viral proteins; candidates are subsequently modeled in molecular dynamics simulations, and the best-performing among them are selected for laboratory synthesis.

“We’ve got hundreds of terabytes of data that have been created by different members of this collaboration, and we’re using Petrel to organize all of those data,” Foster said. “In tandem with a relational database we’ve constructed, Petrel has become the place where we collect, share and access all of our computed results, so it’s become a vital component of this important work.”