Storage Innovation Series: Cloud-Based Storage for Research Use Cases
April 18, 2018 | Steve Tuecke
In the sixth and final article in this series, we turn our attention from on-premises to cloud-based storage. I’ll be addressing four primary types of cloud storage: file storage, object storage, cold storage, and file systems, first defining each and then providing my views on how each might be used for research.
Defining Cloud Storage Options: The Facts
Here's a quick overview of the four types of cloud storage:
- Cloud file storage (Google Drive, Box, Dropbox, OneDrive, etc.) - These storage solutions look somewhat like file systems and have desktop clients that let users sync with stored data (rather than mounting it). They are focused on providing day-to-day storage for desktops, laptops, and mobile devices, with a strong emphasis on convenience from such devices. The pricing model is affordable for such uses, and vendors often offer “unlimited” storage as part of a university-wide subscription.
- Cloud object storage (AWS S3, Google Cloud Storage, Azure Blob Storage, etc.) - The focus with cloud object storage is scale, reliability, and performance. These systems are developer-focused (an object storage system is usually a building block for a larger solution) and are far less end-user-friendly than cloud file storage. The pricing model is pay-per-use, combining a charge for the amount of data stored with a charge for the amount of data moved out of storage (the latter are called egress charges -- more on this below).
- Cloud cold storage (Amazon Glacier, Google Cloud Storage Coldline, Azure Archive Storage Tier) - This is typically a variant of cloud object storage, but with a focus on write once, read seldom use cases (e.g. backup, disaster recovery), usually with much lower storage charges but much higher access charges. Getting access to your data is sometimes not instantaneous.
- Cloud file system (Amazon Elastic File System, Azure Files) - A cloud file system is like a mountable NAS (e.g., NFS file server) or parallel file system in the cloud. The focus is on providing a traditional, shared file system to a set of servers, and it tends to be expensive.
This is, of course, an oversimplification, as there are many variations and gradations. For example, the big public cloud providers are offering an increasing number of object storage options with differing durability, accessibility, and performance, and with varying prices. But this taxonomy will suffice for this discussion.
Assessing Cloud Storage Options for Research
Now that we’ve defined the landscape of options, let’s drill down on how each might be applicable to data management challenges faced by researchers and campuses.
Before we get into this, keep in mind: There is storage, and then there is access. You can pay for the storage space you need, but the ability to easily and cost-effectively access, move, and work with your stored data is another matter. You’ll need both for a successful cloud storage strategy.
1. Cloud File Storage: The Limits of “Unlimited”
The big allure of some of the cloud file storage systems for research is their “unlimited” storage and lack of egress charges. This is true of Google’s G Suite for Education and of Box through Internet2 NET+ Box. However, in practice, these cloud file storage systems run into two challenges for research.
The first is that their desktop sync interfaces are of limited use for research data management workflows. Of course, that is the problem Globus solves, which is why we introduced Globus for Google Drive last fall. As one university customer recently told us, were it not for the Globus for Google Drive connector, Google Drive would be “unusable” for research data storage. We are also seeing a lot of demand for Globus for Box from the many campuses with NET+ Box, for this same reason.
However, the bigger challenge with these cloud file storage systems is that “unlimited” comes with a big caveat in the form of rate limiting. For example, while in theory I can store unlimited data on Google Drive and Box, in practice they have rate limits on their APIs that significantly limit how quickly I can move data in and out of these systems, which in turn sets practical limits on how they can be used for research.
In practice, we have observed that these cloud file storage systems work well for modest scale research data (e.g., less than a terabyte), but not for big data. Fortunately, there are many research use cases for which this amount of storage, and the performance constraints on accessing this storage, is entirely adequate. This probably won’t work for your campus’ biggest users, but it may offer a great solution for the other 90% of your researchers who just need reasonable amounts of reliable, accessible storage. It is Globus’ reliable file transfer and sharing features, more so than fast data transfer, that make Globus critical for research use of cloud file storage.
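To see why rate limits matter more than nominal capacity, a rough back-of-the-envelope calculation helps. The sketch below is illustrative only: the 750 GB per day cap is a placeholder assumption rather than a figure taken from this article or from any vendor's terms, and `days_to_transfer` is a hypothetical helper.

```python
def days_to_transfer(dataset_tb: float, daily_cap_gb: float) -> float:
    """Estimate the days needed to move a dataset under a fixed daily cap.

    Uses decimal units (1 TB = 1000 GB) and ignores retries and bursty
    throttling, so real transfers will usually take longer.
    """
    if dataset_tb < 0 or daily_cap_gb <= 0:
        raise ValueError("dataset size must be >= 0 and the cap must be > 0")
    return dataset_tb * 1000 / daily_cap_gb


# ASSUMPTION: a hypothetical 750 GB/day cap, chosen only for illustration.
ASSUMED_DAILY_CAP_GB = 750.0

for size_tb in (0.5, 10.0, 100.0):
    days = days_to_transfer(size_tb, ASSUMED_DAILY_CAP_GB)
    print(f"{size_tb:>6.1f} TB -> about {days:.1f} days")
```

Under that assumed cap, a sub-terabyte dataset moves in less than a day, while a 100 TB dataset would take more than four months, which is consistent with the observation that these systems suit modest-scale data rather than big data.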
2. Cloud Object Stores: Weighing Cost and Usability
There are two main challenges with cloud object storage in the research world: usability and cost. Unlike cloud file storage, these systems were not designed with end-user ease of use in mind. They were built for developers, with REST APIs for integration but without a robust end-user interface. Therefore, to be useful in a research environment, cloud object stores require solutions like Globus for automated transfer, sharing, etc.
In addition, cost is often an issue for larger research data. There are two elements to cloud object storage cost: cost per GB stored and network egress. Cloud object storage is often considered too expensive for research. In our many conversations with research computing directors, we regularly hear that campuses need storage solutions priced at well under $100 per TB per year, else their researchers will cobble something together themselves. However, most cloud object storage costs several times more than this, though there are some new players such as Wasabi that are hitting that magic storage price point.
The other cost element is network/data “egress” -- fees that cloud object storage vendors charge for moving data out of storage. Rates are usage-based, and for many research use cases where data sharing is required, it can be hard to predict when data will move and how much. Users signing up for cloud object storage therefore need to be sure of their usage profiles, since massive data movement out of storage can drive costs way up. In our experience, though, most users don’t think about this: they are used to paying (at least a little) for on-prem storage, while the network has always been free to them.
Cloud vendors like Google, Amazon Web Services and Microsoft have acknowledged these network egress issues and are working to help campuses keep costs down. For example, Google recently announced a data egress fee discount for Google Cloud Platform for Internet2 members which waives all data egress charges up to 15% of a member’s total bill (e.g. if your bill is $10,000 a month and the data egress charges are $1,500 or less, those charges are dropped from your bill). Microsoft Azure and Amazon Web Services also have programs for network egress waivers for the academic community. While knowledge of these programs seems to be somewhat limited in our community, they should make a substantial difference in cloud object storage adoption in research: as long as your data egress charges are reasonable, the cloud vendor will waive them, leaving only the storage costs to contend with.
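The two cost elements can be combined into a simple estimate. The following is a minimal sketch, not a pricing tool: the per-GB prices are hypothetical placeholders (check your vendor's current price list), the function names are invented for illustration, and only the $100 per TB per year target comes from the discussion above.

```python
TARGET_PER_TB_YEAR = 100.0  # USD per TB per year, the price point cited above


def storage_cost_per_tb_year(price_per_gb_month: float) -> float:
    """Convert a per-GB-per-month price to USD per TB per year (1 TB = 1000 GB)."""
    return price_per_gb_month * 1000 * 12


def annual_cost(stored_tb: float, price_per_gb_month: float,
                egress_tb: float, egress_price_per_gb: float) -> float:
    """Yearly cost: storage for stored_tb plus egress for egress_tb moved out."""
    storage = stored_tb * storage_cost_per_tb_year(price_per_gb_month)
    egress = egress_tb * 1000 * egress_price_per_gb
    return storage + egress


# ASSUMPTION: hypothetical prices, used only to show the shape of the math.
ASSUMED_PRICE_PER_GB_MONTH = 0.02
ASSUMED_EGRESS_PRICE_PER_GB = 0.09

per_tb = storage_cost_per_tb_year(ASSUMED_PRICE_PER_GB_MONTH)
print(f"storage: ${per_tb:,.0f} per TB per year "
      f"(target: under ${TARGET_PER_TB_YEAR:,.0f})")

# The same 100 TB collection under three different sharing patterns.
for egress_tb in (0, 50, 200):
    total = annual_cost(100, ASSUMED_PRICE_PER_GB_MONTH,
                        egress_tb, ASSUMED_EGRESS_PRICE_PER_GB)
    print(f"egress {egress_tb:>3} TB/year -> ${total:,.0f} per year")
```

The storage term is predictable once you know how much data you hold. The egress term depends entirely on how often others download the data, which is the part that is hard to forecast for shared research data.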
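The waiver arithmetic in the Google example can be sketched as follows. This is one reading of the program as described above, not the vendor's actual billing logic: it assumes the total bill includes the egress charges, and that egress above the 15% threshold is waived only up to that cap. `apply_egress_waiver` is a hypothetical helper, and amounts are in whole cents to avoid floating-point rounding.

```python
def apply_egress_waiver(total_bill_cents: int, egress_cents: int,
                        cap_percent: int = 15) -> int:
    """Return the bill (in cents) after waiving egress up to cap_percent of the bill.

    ASSUMPTIONS: total_bill_cents already includes egress_cents, and egress
    beyond the cap is still charged (only the capped amount is waived).
    """
    if not 0 <= egress_cents <= total_bill_cents:
        raise ValueError("egress must be between 0 and the total bill")
    cap_cents = total_bill_cents * cap_percent // 100
    waived_cents = min(egress_cents, cap_cents)
    return total_bill_cents - waived_cents


# The article's example: a $10,000 monthly bill with $1,500 of egress.
after = apply_egress_waiver(1_000_000, 150_000)
print(f"bill after waiver: ${after / 100:,.2f}")
```

With the figures from the example, the full $1,500 of egress is dropped and the bill falls to $8,500. Under the capped reading assumed here, a bill with $3,000 of egress would also have $1,500 waived, leaving the remainder payable.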
Given the combination of falling storage costs, network egress charge waivers, and a good user interface for researchers, I’m bullish on adoption of object storage within the research community. To address the user interface aspect, Globus is investing in connectors for cloud object storage systems -- last year we introduced Globus for AWS S3, and we will be adding support for other cloud object storage systems in the coming year. Cloud object storage is already used in research, particularly as temporary storage associated with use of cloud computing resources, but I think cloud object storage can play an even bigger role in satisfying research storage needs going forward.
3. Cloud Cold Storage: Connectivity is Key
Cloud cold storage is essentially cheaper object storage, based on tape storage or some other technology that offers dramatically lower storage cost at the expense of higher access cost and lower access performance. It is designed for backup and archive use cases, in which data is read infrequently, if ever.
Cloud cold storage has some uses in the research community (for example, for keeping a disaster recovery copy of valuable data). There are also opportunities to use cloud cold storage for research data archives, but only for data that is seldom accessed. Unfortunately, the purpose of many research data archives is to make data publicly accessible, which can lead to uncertainty in how much data will be accessed, and therefore what the access charges will be.
Many cloud object storage systems can also be configured with policies for migrating data automatically from object storage to cold storage (e.g. S3 to Glacier via lifecycle config management). This combination of object + cold storage is also useful to consider for common research use cases in which data is actively used for only a limited time, but needs to be retained longer.
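As a concrete illustration of such a policy, the sketch below builds an S3 lifecycle configuration that transitions objects to the Glacier storage class after a set number of days. The prefix and the 180-day window are hypothetical values, and `archive_after` is an illustrative helper. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, but nothing here contacts AWS.

```python
import json


def archive_after(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that moves objects under `prefix`
    to the GLACIER storage class once they are `days` days old."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": f"archive-after-{days}-days",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


# ASSUMPTION: a hypothetical prefix and retention window for finished projects.
config = archive_after("projects/finished/", 180)
print(json.dumps(config, indent=2))
```

This matches the pattern described above: data sits in regular object storage while it is actively used, then moves to cheaper cold storage for long-term retention without anyone having to copy it by hand.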
4. Cloud File Systems: Too expensive outside specialized use cases
Cloud file systems are a relatively new storage option from cloud vendors, and they differ from the other three options in one major way: they are mountable file systems. For example, Amazon Elastic File System (EFS) offers a mountable NFSv4 file system, and Microsoft Azure Files provides a mountable SMB file system. While this is exactly what many researchers would like, the problem is that they are prohibitively expensive. I have yet to hear of these being used in research.
Another variant of cloud file systems that is offered by third parties or do-it-yourselfers is to run a parallel file system such as Lustre or OrangeFS on cloud block storage, such as Amazon Elastic Block Storage. While these are useful for niche use cases such as scratch storage for an HPC cluster in the cloud, they are also far too expensive for more general research storage uses.
Hopefully cloud file system costs will fall dramatically, so as to make them more useful for research, but I’m not holding my breath.
In summary, cloud file storage is a good option for modest-sized research storage (under a terabyte) with its low cost and reliability, but rate limits constrain its use for larger research data. Cloud object storage and cold storage pricing is falling, particularly with network egress waivers, to a point that should make it far more appealing for research use. Cloud file systems are simply too expensive for anything but narrow niche uses.
With all these cloud storage systems, additional tools are necessary to plug seamlessly into a researcher’s data ecosystem and deliver the capabilities researchers need. Storage is meaningless unless you can easily get to your data and use it.
Find out more about how Globus is meeting these challenges here.