diff --git a/bih-cluster/docs/storage/storage-locations.md b/bih-cluster/docs/storage/storage-locations.md
index e9d1c7028..7d30c89c6 100644
--- a/bih-cluster/docs/storage/storage-locations.md
+++ b/bih-cluster/docs/storage/storage-locations.md
@@ -1,152 +1,123 @@
 # Storage and Volumes: Locations
-
-On the BIH HPC cluster, there are three kinds of entities: users, groups (*Arbeitsgruppen*), and projects.
-Each user, group, and project has a central folder for their files to be stored.
-
-## For the Impatient
-
-### Storage Locations
-
-Each user, group, and project directory consists of three locations (using `/fast/users/muster_c` as an example here):
-
-- `/fast/users/muster_c/work`:
-  Here, you put your large data that you need to keep.
-  Note that there is no backup or snapshots going on.
-- `/fast/users/muster_c/scratch`:
-  Here, you put your large temporary files that you will delete after a short time anyway.
-  **Data placed here will be automatically removed 2 weeks after last modification.**
-- `/fast/users/muster_c` (and all other sub directories):
-  Here you put your programs and scripts and very important small data.
-  By default, you will have a soft quota of 1GB (hard quota of 1.5GB, 7 days grace period).
-  However, we create snapshots of this data (every 24 hours) and this data goes to a backup.
-
-You can check your current usage using the command `bih-gpfs-report-quota user $USER`
-
-### Do's and Don'ts
-
-First and foremost:
-
-- **DO NOT place any valuable data in `scratch` as it will be removed within 2 weeks.**
-
-Further:
-
-- **DO** set your `TMPDIR` environment variable to `/fast/users/$USER/scratch/tmp`.
-- **DO** add `mkdir -p /fast/users/$USER/scratch/tmp` to your `~/.bashrc` and job script files.
-- **DO** try to prefer creating fewer large files over many small files.
-- **DO NOT** create multiple copies of large data.
-  For sequencing data, in most cases you should not need more than raw times the size of the raw data (raw data + alignments + derived results).
-
-## Introduction
-
-This document describes the third iteration of the file system structure on the BIH HPC cluster.
-This iteration was made necessary by problems with second iteration which worked well for about two years but is now reaching its limits.
+This document describes the fourth iteration of the file system structure on the BIH HPC cluster.
+It was made necessary because the previous file system was no longer supported by the manufacturer and we have since switched to distributed [Ceph](https://ceph.io/en/) storage.
+For now, the third-generation file system is still mounted at `/fast`.
 
 ## Organizational Entities
-
 There are the following three entities on the cluster:
 
-1. normal user accounts ("natural people")
-2. groups *(Arbeitsgruppen)* with on leader and an optional delegate
-3. projects with one owner and an optional delegate.
-
-Their purpose is described in the document "User and Group Management".
-
-## Storage/Data Tiers
-
-The files fall into one of three categories:
-
-1. **Home** data are programs and scripts of which there is relatively few but which is long-lived and very important.
-   Loss of home data requires to redo manual work (like programming).
-
-2. **Work** data is data of potential large size and has a medium life time and important.
-   Examples are raw sequencing data and intermediate results that are to be kept (e.g., a final, sorted and indexed BAM file).
-   Work data can time-consuming actions to be restored, such as downloading large amounts of data or time-consuming computation.
-
-3. **Scratch** data is data that is temporary by nature and has a short life-time only.
-   Examples are temporary files (e.g., unsorted BAM files).
-   Scratch data is created to be removed eventually.
-
-## Snapshots, Backups, Archive
-
-- **A snapshot** stores the state of a data volume at a given time.
-  File systems like GPFS implement this in a copy-on-write manner, meaning that for a snapshot and the subsequent "live" state, only the differences in data need to be store.d
-  Note that there is additional overhead in the meta data storage.
-
-- **A backup** is a copy of a data set on another physical location, i.e., all data from a given date copied to another server.
-  Backups are made regularly and only a small number of previous ones is usually kept.
-
-- **An archive** is a single copy of a single state of a data set to be kept for a long time.
-  Classically, archives are made by copying data to magnetic tape for long-term storage.
-
-## Storage Locations
-
-This section describes the different storage locations and gives an overview of their properties.
-
-### Home Directories
-
-- **Location** `/fast/{users,groups,projects}/` (except for `work` and `scratch` sub directories)
-- the user, group, or project home directory
-- meant for documents, scripts, and programs
-- default quota for data: default soft quota of 1 GB, hard quota of 1.5 GB, grace period of 7 days
-- quota can be increased on request with short reason statement
-- default quota for metadata: 10k files soft, 12k files hard
-- snapshots are regularly created, see Section \ref{snapshot-details}
-- nightly incremental backups are created, the last 5 are kept
-- *Long-term strategy:*
-  users are expected to manage data life time independently and use best practice for source code and document management best practice (e.g., use Git).
-  When users/groups leave the organization or projects ends, they are expected to handle data storage and cleanup on their own.
-  Responsibility to enforce this is with the leader of a user's group, the group leader, or the project owner, respectively.
-
-### Work Directories
-
-- **Location** `/fast/{users,groups,projects}//work`
-- the user, group, or project work directory
-- meant for larger data that is to be used for a longer time, e.g., raw data, final sorted BAM file
-- default quota for data: default soft quota of 1 TB, hard quota of 1.1 TB, grace period of 7 days
-- quota can be increased on request with short reason statement
-- default quota for metadata: 2 Mfile soft, 2.2M files hard
-- no snapshots, no backup
-- *Long-term strategy:*
-  When users/groups leave the organization or projects ends, they are expected to cleanup unneeded data on their own.
-  HPC IT can provide archival services on request.
-  Responsibility to enforce this is with the leader of a user's group, the group leader, or the project owner, respectively.
-
-### Scratch Directories
-
-- **Location** `/fast/{users,groups,projects}//scratch`
-- the user, group, or project scratch directory
-- **files will be removed 2 weeks after their creation**
-- meant for temporary, potentially large data, e.g., intermediate unsorted or unmasked BAM files, data downloaded from the internet for trying out etc.
-- default quota for data: default soft quota of 200TB, hard quota of 220TB, grace period of 7 days
-- quota can be increased on request with short reason statement
-- default quota for metadata: 2M files soft, 2.2M files hard
-- no snapshots, no backup
-- *Long-term strategy:*
-  as data on this volume is not to be kept for longer than 2 weeks, the long term strategy is to delete all files.
-
-## Snapshot Details
-
-Snapshots are made every 24 hours.
-Of these snapshots, the last 7 are kept, then one for each day.
-
-## Backup Details
-
-Backups of the snapshots is made nightly.
-The backups of the last 7 days are kept.
-
-## Archive Details
-
-BIH HPC IT has some space allocated on the MDC IT tape archive.
-User data can be put under archive after agreeing with head of HPC IT.
-The process is as describe in Section \ref{sop-data-archival}.
+1. **Users** *(natural people)*
+2. **Groups** *(Arbeitsgruppen)* with one leader and an optional delegate
+3. **Projects** with one owner and an optional delegate
+
+Each user, group, and project can have storage folders in different locations.
+
+## Data Types and Storage Tiers
+Files stored on the HPC fall into one of three categories:
+
+1. **Home** folders store programs, scripts, and user configuration, which are generally long-lived and very important files.
+Losing home data means manual work (such as programming) has to be redone.
+
+2. **Work** folders store important data of potentially large size with a medium lifetime.
+Examples are raw sequencing data and intermediate results that are to be kept (e.g. sorted and indexed BAM files).
+Work data requires time-consuming actions to be restored, such as downloading large amounts of data or long-running computation.
+
+3. **Scratch** folders store temporary files with a short lifetime.
+Examples are temporary files (e.g. unsorted BAM files).
+Scratch data is created to be removed eventually.
+
+Ceph storage comes in two types which differ in their I/O speed, total capacity, and cost.
+They are called **Tier 1** and **Tier 2**, or sometimes **hot storage** and **warm storage**.
+In the HPC file system they are mounted at `/data/cephfs-1` and `/data/cephfs-2`.
+Tier 1 storage is fast, relatively small, expensive, and optimized for performance.
+Tier 2 storage is slower, much larger, cheaper, and built for keeping large files for longer periods of time.
+Storage quotas are imposed in these locations to restrict the maximum size of folders.
+
+### Home directories
+**Location:** `/data/cephfs-1/home/`
+
+Only users have home directories on Tier 1 storage.
+This is the starting point when opening a new shell or SSH session.
+Important configuration files are stored here, as well as analysis scripts and small user files.
+Home folders have a strict storage quota of 1 GB.
+
+### Work directories
+**Location:** `/data/cephfs-1/work/`
+
+Groups and projects have work directories on Tier 1 storage.
+User home folders contain a symlink to their respective group's work folder.
+Files shared within a group or project are stored here as long as they are in active use.
+Work folders are generally limited to 1 TB per group.
+Project work folders are allocated on an individual basis.
+
+### Scratch space
+**Location:** `/data/cephfs-1/scratch/`
+
+Groups and projects have scratch space on Tier 1 storage.
+User home folders contain a symlink to their respective group's scratch space.
+Scratch space is meant for temporary, potentially large data, e.g. intermediate unsorted or unmasked BAM files or data downloaded from the internet.
+**Files in scratch will be automatically removed 2 weeks after their creation.**
+Scratch space is generally limited to 10 TB per group.
+Projects are allocated scratch space on an individual basis.
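+
+The following is a minimal sketch of how temporary files can be kept on scratch (e.g. from a job script or `~/.bashrc`); the group folder name `ag_doe` and the per-user `tmp` subfolder are placeholders, not fixed conventions:
+
+```bash
+# Keep temporary files on the group scratch space so they are cleaned up
+# automatically (files in scratch are removed 2 weeks after creation).
+# Replace "ag_doe" with your group's folder name.
+export TMPDIR=/data/cephfs-1/scratch/groups/ag_doe/$USER/tmp
+mkdir -p "$TMPDIR"
+```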
+
+### Tier 2 storage
+**Location:** `/data/cephfs-2/`
+
+Groups and projects can be allocated additional storage on the Tier 2 system.
+Storage quotas here can be significantly larger, because Tier 2 is much cheaper and more abundant than Tier 1.
+
+### Overview
+
+| Tier | Function         | Path                                | Default Quota |
+|:-----|:-----------------|:------------------------------------|--------------:|
+| 1    | User home        | `/data/cephfs-1/home/users/`        |          1 GB |
+| 1    | Group work       | `/data/cephfs-1/work/groups/`       |          1 TB |
+| 1    | Group scratch    | `/data/cephfs-1/scratch/groups/`    |         10 TB |
+| 1    | Project work     | `/data/cephfs-1/work/projects/`     |    individual |
+| 1    | Project scratch  | `/data/cephfs-1/scratch/projects/`  |    individual |
+| 2    | Group            | `/data/cephfs-2/mirrored/groups/`   |    On request |
+| 2    | Project          | `/data/cephfs-2/mirrored/projects/` |    On request |
+
+## Snapshots and Mirroring
+Snapshots are incremental copies of the state of the data at a particular point in time.
+They provide safety against various "Oops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files.
+Depending on the location, CephFS creates snapshots at different frequencies and with different retention plans.
+User access to snapshots is documented in [Accessing Snapshots](https://hpc-docs.cubi.bihealth.org/storage/accessing-snapshots).
+
+| Location                 | Path                         | Retention policy                | Mirrored |
+|:-------------------------|:-----------------------------|:--------------------------------|---------:|
+| User homes               | `/data/cephfs-1/home/users/` | Hourly for 48 h, daily for 14 d |      yes |
+| Group/project work       | `/data/cephfs-1/work/`       | Four times a day, daily for 5 d |       no |
+| Group/project scratch    | `/data/cephfs-1/scratch/`    | Daily for 3 d                   |       no |
+| Group/project mirrored   | `/data/cephfs-2/mirrored/`   | Daily for 30 d, weekly for 16 w |      yes |
+| Group/project unmirrored | `/data/cephfs-2/unmirrored/` | Daily for 30 d, weekly for 16 w |       no |
+
+Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the data center.
+This provides an additional layer of security, e.g. in case of physical damage to the servers.
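+
+As a rough sketch of what file recovery can look like (CephFS usually exposes snapshots through a hidden `.snap` directory; the snapshot name and file name below are placeholders, and the linked page above is the authoritative reference):
+
+```bash
+# List the snapshots available for your home directory.
+ls /data/cephfs-1/home/users/$USER/.snap
+
+# Copy a lost file back from one of the listed snapshots
+# ("SNAPSHOT_NAME" and "my_script.sh" are placeholders).
+cp /data/cephfs-1/home/users/$USER/.snap/SNAPSHOT_NAME/my_script.sh ~/
+```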
 
 ## Technical Implementation
-
 As a quick (very) technical note:
-There exists a file system `fast`.
-This file system has three independent file sets `home`, `work`, `scratch`.
-On each of these file sets, there is a dependent file set for each user, group, and project below directories `users`, `groups`, and `projects`.
-`home` is also mounted as `/fast_new/home` and for each user, group, and project, the entry `work` links to the corresponding fileset in `work`, the same for scratch.
-Automatic file removal from `scratch` is implemented using GPFS ILM.
-Quotas are implemented on the file-set level.
+### Tier 1
+- Fast & expensive (flash drive based), mounted on `/data/cephfs-1`
+- Currently 12 nodes with 10 × 14 TB NVMe SSDs each
+  - 1.68 PB raw storage
+  - 1.45 PB erasure coded (EC 8:2)
+  - 1.23 PB usable (85 %, Ceph performance limit)
+- For typical CUBI use cases, 3 to 5 times faster I/O than the old DDN storage
+- Two more nodes are in the purchasing process
+- Example of flexible extension:
+  - Chunk size: 45,000 € for one node with 150 TB, i.e. ca. 300 €/TB
+
+### Tier 2
+- Slower but more affordable (spinning HDDs), mounted on `/data/cephfs-2`
+- Currently ten nodes with 52 HDD slots plus SSD cache each; per node, ca. 40 HDDs with 16 to 18 TB are filled, i.e.
+  - 6.6 PB raw
+  - 5.3 PB erasure coded (EC 8:2)
+  - 4.5 PB usable (85 %; Ceph performance limit)
+- Nine more nodes with 5+ PB are in the purchasing process
+- Very flexible extension possible:
+  - ca. 50 € per TB, 100 € mirrored, starting at small chunk sizes
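+
+The usable capacities above follow from the raw capacity, the EC 8:2 erasure coding (8 data chunks out of every 10), and the 85 % performance limit; a quick back-of-the-envelope check for Tier 2:
+
+```bash
+# Tier 2: raw -> erasure coded -> usable (numbers from the list above)
+echo "6.6 * 8 / 10" | bc -l          # ~5.28 PB after EC 8:2
+echo "6.6 * 8 / 10 * 0.85" | bc -l   # ~4.49 PB usable at the 85 % limit
+```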
+
+### Tier 2 mirror
+A duplicate of similar hardware and size (another 10 nodes, 6+ PB) in a separate fire compartment.
diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md
index b8b978967..baa9a4f2d 100644
--- a/bih-cluster/docs/storage/storage-migration.md
+++ b/bih-cluster/docs/storage/storage-migration.md
@@ -1,3 +1,4 @@
+# Migration from old GPFS to new CephFS
 ## What is going to happen?
 Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a new file system.
 That includes users' home directories, work directories, and work-group directories.
@@ -10,50 +11,19 @@ The company selling it has terminated support which also means buying replacemen
 ## The new storage
 There are *two* file systems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed:
-- **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity. 
+- **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity.
 - **Tier 2** is not as fast, but much larger, almost 3 times the current usable capacity.
 
-The **Hot storage** Tier 1 is reserved for large files, requiring frequent random access.
+The **Hot storage** Tier 1 is reserved for files requiring frequent random access, user homes, and scratch.
 Tier 2 (**Warm storage**) should be used for everything else.
 Both file systems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used.
 Tier 1 or Cephfs-1 uses NVME SSDs and is optimized for performance, Tier 2 or Cephfs-2 used traditional hard drives and is optimized for cost.
 
 So these are the three terminologies in use right now:
-- Cephfs-1 = Tier 1 = Hot storage
-- Cephfs-2 = Tier 2 = Warm storage
-
-### Snapshots and Mirroring
-Snapshots are incremental copies of the state of the data at a particular point in time.
-They provide safety against various "Ops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files.
-
-Depending on the location and Tier, Cephfs utilizes snapshots in differ differently.
-Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the data center to provide an additional layer of security.
-
-| Tier | Location | Path | Retention policy | Mirrored |
-|:-----|:-------------------------|:-----------------------------|:--------------------------------|---------:|
-| 1 | User homes | `/data/cephfs-1/home/users/` | Hourly for 48 h, daily for 14 d | yes |
-| 1 | Group/project work | `/data/cephfs-1/work/` | Four times a day, daily for 5 d | no |
-| 1 | Group/project scratch | `/data/cephfs-1/scratch/` | Daily for 3 d | no |
-| 2 | Group/project mirrored | `/data/cephfs-2/mirrored/` | Daily for 30 d, weekly for 16 w | yes |
-| 2 | Group/project unmirrored | `/data/cephfs-2/unmirrored/` | Daily for 30 d, weekly for 16 w | no |
-
-User access to the snapshots is documented here: https://hpc-docs.cubi.bihealth.org/storage/accessing-snapshots
-
-### Quotas
-
-| Tier | Function | Path | Default Quota |
-|:-----|:---------|:-----|--------------:|
-| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB |
-| 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB |
-| 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB |
-| 1 | Projects work | `/data/cephfs-1/work/projects/` | individual |
-| 1 | Projects scratch | `/data/cephfs-1/scratch/projects/` | individual |
-| 2 | Group mirrored | `/data/cephfs-2/mirrored/groups/` | 4 TB |
-| 2 | Group unmirrored | `/data/cephfs-2/unmirrored/groups/` | On request |
-| 2 | Project mirrored | `/data/cephfs-2/mirrored/projects/` | On request |
-| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/` | individual |
-
-There are no quotas on the number of files.
+- Cephfs-1 = Tier 1 = Hot storage = `/data/cephfs-1`
+- Cephfs-2 = Tier 2 = Warm storage = `/data/cephfs-2`
+
+There are no more quotas on the number of files.
 
 ## New file locations
 Naturally, paths are going to change after files move to their new location.
@@ -195,28 +165,3 @@ Best practice and/or tools will be provided.
 
 !!! note
     The users' `work` space will be moved to the group's `work` space.
-
-## Technical details about the new infrastructure
-### Tier 1
-- Fast & expensive (flash drive based), mounted on `/data/cephfs-1`
-- Currently 12 Nodes with 10 × 14 TB NVME/SSD each installed
-  - 1.68 PB raw storage
-  - 1.45 PB erasure coded (EC 8:2)
-  - 1.23 PB usable (85 %, ceph performance limit)
-- For typical CUBI use case 3 to 5 times faster I/O then the old DDN
-- Two more nodes in purchasing process
-- Example of flexible extension:
-  - Chunk size: 45 kE for one node with 150 TB, i.e. ca. 300 E/TB
-
-### Tier 2
-- Slower but more affordable (spinning HDDs), mounted on `/data/cephfs-2`
-- Currently ten nodes with 52 HDDs slots plus SSD cache installed, per node ca. 40 HDDs with 16 to 18 TB filled, i.e.
-  - 6.6 PB raw
-  - 5.3 PB erasure coded (EC 8:2)
-  - 4.5 PB usable (85 %; Ceph performance limit)
-- Nine more nodes in purchasing process with 5+ PB
-- Very Flexible Extension possible:
-  - ca. 50 Euro per TB, 100 Euro mirrored, starting at small chunk sizes
-
-### Tier 2 mirror
-Similar hardware and size duplicate (another 10 nodes, 6+ PB) in separate fire compartment
diff --git a/bih-cluster/mkdocs.yml b/bih-cluster/mkdocs.yml
index 294448a14..78ab31d8e 100644
--- a/bih-cluster/mkdocs.yml
+++ b/bih-cluster/mkdocs.yml
@@ -113,11 +113,11 @@ nav:
       - "Episode 3": first-steps/episode-3.md
      - "Episode 4": first-steps/episode-4.md
   - "Storage":
+      - "Storage Locations": storage/storage-locations.md
+      - "Automated Cleanup": storage/scratch-cleanup.md
       - "Storage Migration": storage/storage-migration.md
       - "Accessing Snapshots": storage/accessing-snapshots.md
       - "Querying Quotas": storage/querying-storage.md
-      - "Storage Locations": storage/storage-locations.md
-      - "Automated Cleanup": storage/scratch-cleanup.md
   - "Cluster Scheduler":
       - slurm/overview.md
      - slurm/background.md