Updates to "Which Cloud" #128

Open
wants to merge 4 commits into
base: main
71 changes: 28 additions & 43 deletions episodes/04-which-cloud.md
@@ -60,37 +60,21 @@ pay for Amazon using grant money, however universities are getting better about

### Open Science Clouds

#### [XSEDE](https://www.xsede.org/)

The Extreme Science and Engineering Discovery Environment (XSEDE) is an NSF funded HPCC, so
it is open to any US-based researcher, and shares most of the same benefits and drawbacks
of a university or corporate HPCC. If your university or corporation doesn't have it's
own HPCC resources, XSEDE will likely be your cheapest option.

Although any US-based researcher can use XSEDE, first [they'll need an account](https://portal.xsede.org/#/guest).
Like the HPCC options described above, XSEDE uses a scheduler to start jobs, and puts limits on
how many resources any one user can utilize at once.

XSEDE can also be a bit intimidating at first because you will need to know what resources
you need, and for how long, before you get started. XSEDE runs like a mini version of the
NSF grant system. In order to qualify to submit large jobs, you'll have to submit a [allocation request](https://portal.xsede.org/allocations/research), in the form of a short proposal.
Also like an NSF grant, if your proposal is accepted, that means you have access to whatever
resources you were approved for, for the time frame you requested.

Don't let that paragraph scare you off though. XSEDE has two different allocation tracks. If
you aren't sure exactly what you'll need for your big project, you can request a [startup allocation](https://portal.xsede.org/allocations/startup) which only requires an abstract
rather than a proposal, and grants you a year to try out your new pipeline or analysis. These
are usually granted in a week or so, and are intended for you to test your pipeline so you
know what to ask for in your allocation proposal.

If that still sounds a little too daunting, XSEDE also has [trial allocations](https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/76149919/Jetstream+Trial+Access+Allocation)
which give you access to only a tiny fraction of XSEDES power, but are plenty large enough to
test your code and see if a larger allocation is worth pursuing. These allocations are granted
more or less immediately by simply filling in a form and agreeing to the usage rules.

If you're interested in using XSEDE, check to see if your workplace has a [Campus Champion](https://www.xsede.org/community-engagement/campus-champions). These are people who
have had extensive training on both the XSEDE system and the allocation program, and can
help you figure out how to apply and what you need.


#### [ACCESS](https://access-ci.org/)

The successor to XSEDE (see: [https://www.xsede.org/](https://www.xsede.org/)), ACCESS (Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support) is an HPCC funded by the US National Science Foundation and open to any US-based researcher. Using the resources requires first making an account and then submitting an allocation request (see: [Getting Started on ACCESS](https://access.qltddev.com/about/get-started/#start)).

While the old XSEDE resource proved intimidating for many users, the new ACCESS organization offers support in the form of a [Knowledge Base](https://support.access-ci.org/knowledge-base), a Support Ticketing System, and community-led support organizations. Help and advice are also available for creating your allocation request. [Allocation requests can be submitted to one of four tiered tracks](https://allocations.access-ci.org/prepare-requests-overview), with each tier awarding a larger maximum amount of computing credits (processor time) and requiring a more in-depth proposal. A small "pilot" project in the "Explore" tier, for example, requires just a single-paragraph overview and awards up to 400,000 credits, suitable for small projects and for testing planned larger workflows.

After receiving approval for resource allocation, users may connect to their resources via a web portal or a terminal. A [searchable and filterable list of the different resources](https://allocations.access-ci.org/resources) may help users in determining whether ACCESS can meet their needs.

##### [Jetstream2](https://jetstream-cloud.org/)

Supported by a National Science Foundation grant, Jetstream2 is one of the main resources of the ACCESS consortium. Users log in with their ACCESS account to a web interface called **Exosphere**. The Exosphere web app provides a graphical user interface for creating a virtual machine instance that users can log into and work on interactively. Extensive [documentation and training](https://jetstream-cloud.org/documentation-training/index.html) is available for new users, including tutorials on [how to create a new instance](https://docs.jetstream-cloud.org/ui/exo/exo/) and [how to attach a storage volume for files](https://docs.jetstream-cloud.org/ui/exo/storage/). As an ACCESS resource, Jetstream2 is free to use but requires an allocation request.

According to the [Jetstream2 overview](https://docs.jetstream-cloud.org/overview/overview-doc/), Jetstream2 is primarily for small-scale on-demand processing: "Jetstream2 may be used for **prototyping**, for creating tailored **workflows** to either use at smaller scale with a handful of CPUs or to port to larger environments after doing your **proof of concept** work at a smaller level."

#### [Open Science Grid](https://opensciencegrid.org)

@@ -121,15 +105,17 @@ resources and when you submit your work, it could run almost anywhere in the ove

The Open Science Data Cloud provides the scientific community with resources for storing, sharing, and analyzing terabyte and petabyte-scale scientific datasets. OSDC's Bionimbus Protected Data Cloud (PDC) is a platform designed with the sole purpose of analysing and sharing protected genomics data.

#### [Atmosphere](https://pods.iplantcollaborative.org/wiki/display/atmman/Getting+Started)
#### [OpenStack](https://www.openstack.org/)

OpenStack is a non-profit alternative to the Commercial Clouds discussed below--that is, OpenStack provides "Infrastructure as a Service" (IaaS). Access is paid for by the hour. However, the infrastructure and resources available are orders of magnitude larger than those of the free cloud services above. If you have prototyped a workflow on a free resource but need to scale up to much larger RAM and CPU instances, OpenStack could be a good choice. You can [read about scientific research stories using OpenStack](https://www.openstack.org/use-cases/science/) to learn more and consider whether an OpenStack implementation would be feasible for your project.

#### [CyVerse (iPlant Collaborative) Atmosphere](https://www.cyverse.org/atmosphere)
##### [CyVerse](https://learning.cyverse.org/)

#### [JetStream](https://jetstream-cloud.org/)
One of the projects based on OpenStack, CyVerse was originally the iPlant Collaborative, an NSF-funded project to provide cloud infrastructure for plant researchers. Since 2015, CyVerse has expanded its mission to include all life sciences researchers. As infrastructure, CyVerse is the foundation of [a number of cloud-based scientific projects](https://cyverse.org/powered-by-cyverse), including the open Galaxy instance at [usegalaxy.org](https://usegalaxy.org/). As a platform, CyVerse offers up to 5GB of data storage for free, with fee-based storage beyond that, as well as the Data Science Workbench, an interactive, web-based GUI for running certain kinds of analyses. Perhaps most importantly, CyVerse has an extensive library of education and training resources that educators can use modularly. Peruse the extensive [Learning Center](https://learning.cyverse.org/) documents to find out more about this resource.

### Commercial Clouds

Computing architecture is moving (albeit at a slow pace) to the Model-to-Data paradigm. This means that scientists should be encouraged to bring their compute to where the data is stored, instead of the the other way around. The following outlines the general differences between the three major commercial cloud providers: Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure.
Computing architecture is moving (albeit at a slow pace) to the **Model-to-Data paradigm**. This means that scientists should be encouraged to bring their compute to where the data is stored, instead of the other way around. The following outlines the general differences between the three major commercial cloud providers: Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure.

Essentially all cloud providers offer extremely similar computing and storage options; you can "rent" or provision computing infrastructure with very similar specifications across all three cloud vendors. Even the costs are highly comparable. In practice, the choice of vendor tends to be opportunistic, driven by (1) funding options, (2) solidarity with collaborating or similar scientific groups, (3) the location of the datasets that a particular research group works with, and (4) familiarity with a cloud vendor's services.
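
One way to make the four criteria above concrete is a simple weighted score per vendor. The sketch below is only an illustration: the function `rank_vendors`, the criterion names, and every weight and score are hypothetical placeholders invented for this example, not a recommendation or a measurement of any real provider.

```python
# Hypothetical criterion names mirroring the four factors listed above.
CRITERIA = ("funding", "collaborators", "data_location", "familiarity")


def rank_vendors(scores, weights):
    """Return (vendor, total) pairs sorted from highest to lowest total.

    scores:  {vendor: {criterion: score}}, higher means a better fit
    weights: {criterion: weight}, higher means the criterion matters more
    """
    totals = {}
    for vendor, vendor_scores in scores.items():
        totals[vendor] = sum(weights[c] * vendor_scores[c] for c in CRITERIA)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up weights and scores for one imaginary research group.
example_weights = {
    "funding": 3,
    "collaborators": 2,
    "data_location": 3,
    "familiarity": 1,
}
example_scores = {
    "AWS": {"funding": 2, "collaborators": 3, "data_location": 1, "familiarity": 3},
    "GCP": {"funding": 3, "collaborators": 1, "data_location": 3, "familiarity": 1},
    "Azure": {"funding": 1, "collaborators": 1, "data_location": 1, "familiarity": 2},
}

for vendor, total in rank_vendors(example_scores, example_weights):
    print(vendor, total)
```

Each group would fill in its own weights and scores; for a different group, the same function could easily produce a different ordering.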

@@ -143,11 +129,11 @@ Essentially all cloud computing and storage
The Amazon Web Services (AWS) product that you've been using is the Elastic Compute Cloud (EC2). There
are actually lots of other cloud and storage solutions under the AWS umbrella, but when most
data scientists say AWS, they mean [EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html).

With EC2, you can rent access to a cloud computing resource as small as your laptop, or as large as a 64-processor
machine with 488GB of memory, and with a number of different operating systems. These instances can
be optimized for jobs that are memory intensive, or require a lot of bandwidth, or [almost any other
specific need](https://aws.amazon.com/ec2/instance-types/). There are so many options that we can't
cover them all here, but these are a few popular ones:
specific need](https://aws.amazon.com/ec2/instance-types/). There are so many options that we can't cover them all here, but these are a few popular ones:

##### On-Demand

@@ -179,21 +165,21 @@ you'll still have to pay for that time.

#### [Google Cloud](https://cloud.google.com/): [getting started](https://cloud.google.com/compute/docs/quickstart)

GCP offers very competitive prices for compute and storage (as of July 2019, their compute pricing is lower than that of AWS and Azure for instances of comparable specifications). If you are looking to dabble in cloud computing but do not need a vast catalog of services, GCP would be a good place to start looking.
GCP offers very competitive prices for compute and storage (as of January 2024, their compute pricing is still lower than that of AWS and Azure for instances of comparable specifications). If you are looking to dabble in cloud computing but do not need a vast catalog of services, GCP would be a good place to start looking. Google Cloud also offers $300 in cloud credits to new users to test and experiment.

Their version of "Spot Intances" are known as pre-emptible instances and offer very competitive pricing. GCP also has TPUs.
GCP's version of "Spot Instances" is known as preemptible instances, which offer very competitive pricing. GCP also has TPUs -- Tensor Processing Unit based instances that are built to handle TensorFlow projects.
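
To get a feel for how a lower hourly rate trades off against interruptions, here is a minimal Python sketch. The helper `estimate_cost`, both hourly rates, and the overhead fraction are assumptions made up for illustration; they are not real GCP prices, so check the provider's current pricing pages before budgeting.

```python
def estimate_cost(hours, hourly_rate, interruption_overhead=0.0):
    """Estimate the cost of a job that needs `hours` of useful compute.

    interruption_overhead is the extra fraction of compute time spent
    redoing work lost to interruptions (0.0 for an on-demand instance).
    """
    if hours < 0 or hourly_rate < 0 or interruption_overhead < 0:
        raise ValueError("hours, hourly_rate and interruption_overhead must be >= 0")
    billed_hours = hours * (1.0 + interruption_overhead)
    return billed_hours * hourly_rate


# Hypothetical prices in dollars per hour (placeholders, not real quotes).
ON_DEMAND_RATE = 1.00
PREEMPTIBLE_RATE = 0.30

on_demand_cost = estimate_cost(100, ON_DEMAND_RATE)
# Assume a quarter of the work has to be repeated after interruptions.
preemptible_cost = estimate_cost(100, PREEMPTIBLE_RATE, interruption_overhead=0.25)

print(f"on-demand:   ${on_demand_cost:.2f}")
print(f"preemptible: ${preemptible_cost:.2f}")
```

Under these made-up numbers the interruptible option stays cheaper even with a quarter of the work repeated, but the comparison only holds for jobs that can be checkpointed and restarted.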

#### [Microsoft Azure](https://azure.microsoft.com/en-us/)

If your software requires Microsoft Windows, it may be cheaper to use MS Azure due to licensing issues. Azure's computing instances are known as Azure Virtual Machines and often come at a slightly higher cost than other cloud computing vendors' offerings. If a lot of your computing pipeline is Windows-dependent, it may make sense to build everything on MS Azure from the get-go.

#### [IBM Cloud](https://www.ibm.com/cloud)

IBM Cloud offers more than 11 million bare metal configurations in virtual mode which are customizable RAM and SSDs on bare metal. They also have an on-demand provisioning for all servers whose management and monitoring included along with the direct and cost-free tech support
IBM Cloud offers more than 11 million bare metal configurations in virtual mode, with customizable RAM and SSDs. It also offers on-demand provisioning for all servers, with management and monitoring included, along with direct, cost-free tech support.

## How to Choose

As you can see, highly managed systems (HPCCs, XSEDE, etc) usually are free or cheap, but
As you can see, highly managed systems (HPCCs, ACCESS, Jetstream2, etc.) are usually free or cheap, but
relatively inflexible. There may be certain programs you can't install, or there may be long
wait times. Commercial systems are generally more flexible because you can make them look
however you want, but they can be quite expensive, especially if you run for a long time, or have a
@@ -249,7 +235,7 @@ Some things to consider:
Note that if you are working with human genomics data there might be ethical and legal
considerations that affect your choice of cloud resources to use. The terms of use, and/or
the legislation under which you are handling the genomic data, might impose heightened information
security measures for the computing environment in which you intend to process it. This is a too broad
security measures for the computing environment in which you intend to process it. This is too broad a
topic to discuss in detail here, but in general terms you should think through the technical and
procedural measures needed to ensure that the confidentiality and integrity of the human data you work
with are not breached. If there are laws that govern these issues in the jurisdiction in which you work,
@@ -278,4 +264,3 @@ Langmead B, Nellore A (2018) **Cloud computing for genomic data analysis and col

::::::::::::::::::::::::::::::::::::::::::::::::::