docs: fixed typos, added machine learning example & link to migration
ericblanc20 committed Mar 11, 2024
1 parent d3884f6 commit 9c86bcd
Showing 2 changed files with 20 additions and 10 deletions.
6 changes: 4 additions & 2 deletions bih-cluster/docs/storage/storage-locations.md
@@ -3,11 +3,13 @@ This document describes the forth iteration of the file system structure on the
It was made necessary because the previous file system was no longer supported by the manufacturer and we have since switched to distributed [Ceph](https://ceph.io/en/) storage.
For now, the third-generation file system is still mounted at `/fast`.

**The old, third-generation filesystem will be decommissioned soon, please consult the [document describing the migration process](storage-migration.md)!**

## Organizational Entities
There are the following three entities on the cluster:

1. **Users** *(natural people)*
2. **Groups** *(Arbeitsgruppen)* with on leader and an optional delegate
1. **Users** *(real people)*
2. **Groups** *(Arbeitsgruppen)* with one leader and an optional delegate
3. **Projects** with one owner and an optional delegate

Each user, group, and project can have storage folders in different locations.
24 changes: 16 additions & 8 deletions bih-cluster/docs/storage/storage-migration.md
@@ -20,14 +20,15 @@ Both file systems are based on the open-source, software-defined [Ceph](https://
Tier 1 or Cephfs-1 uses NVMe SSDs and is optimized for performance; Tier 2 or Cephfs-2 uses traditional hard drives and is optimized for cost.

So these are the three terminologies in use right now:

- Cephfs-1 = Tier 1 = Hot storage = `/data/cephfs-1`
- Cephfs-2 = Tier 2 = Warm storage = `/data/cephfs-2`

There are no more quotas on the number of files.
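
Both tiers can be checked from any login or compute node by querying the mount points listed above (the exact `df` output will of course vary per node; this is only a quick sanity check):

```bash
# Show capacity and current usage of the hot (Tier 1) and warm (Tier 2) CephFS tiers.
df -h /data/cephfs-1 /data/cephfs-2
```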

## New file locations
Naturally, paths are going to change after files move to their new location.
Due to the increase in storage quality options, there will be some more more folders to consider.
Due to the increase in storage quality options, there will be some more folders to consider.

### Users
- Home on Tier 1: `/data/cephfs-1/home/users/<user>`
@@ -36,7 +37,7 @@ Due to the increase in storage quality options, there will be some more more fol

!!! warning
User work & scratch spaces are now part of the user's group folder.
This means, groups should coordinate internally to distribute their allotted quota evenly among users.
This means, groups should coordinate internally to distribute their allotted quota according to each user's needs.

The implementation is done _via_ symlinks created by default when the user account is moved to its new destination.
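
As an illustration only (the actual link targets are created by the administrators and may differ), the resulting layout of a user's home directory could look roughly like this:

```bash
# Hypothetical illustration: after the move, 'work' and 'scratch' in the home
# directory are symlinks into the group's folders on Tier 1 (targets assumed).
ls -l /data/cephfs-1/home/users/$USER
# work    -> ../../work/groups/<group>/users/$USER     (assumed target)
# scratch -> ../../scratch/groups/<group>/users/$USER  (assumed target)
```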

@@ -104,7 +105,7 @@ Your mileage may vary but there is a basic principle that remains true for all p

#### DNA sequencing (WES, WGS)

Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage, Whole Exome Sequencing between 6 and 30 GB.
Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage, Whole Exome Sequencing files occupy between 6 and 30 GB.
These large files require considerable I/O resources for processing, in particular for the mapping step.
A prudent workflow for this kind of analysis would therefore be the following:

@@ -126,10 +127,11 @@ The count files are much smaller than the raw data or the mapped data, so they c
A typical workflow would be:

1. Copy your `fastq` files from Tier 2 to Tier 1.
2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1.
3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example.
4. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2.
5. **Remove raw data, bam & count files from Tier 1.**
2. Perform raw data quality control, and store the outcome on Tier 2.
3. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1.
4. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example.
5. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2.
6. **Remove raw data, bam & count files from Tier 1.**
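
A minimal sketch of steps 1, 3, and 6 in shell form might look like this (group folder, sample names, and index location are placeholders, and the `salmon` options are reduced to the essentials):

```bash
# 1. Stage the raw data from warm storage (Tier 2) onto hot storage (Tier 1).
mkdir -p $HOME/work/rnaseq
cp /data/cephfs-2/<group>/raw_data/sampleA_R1.fastq.gz \
   /data/cephfs-2/<group>/raw_data/sampleA_R2.fastq.gz \
   $HOME/work/rnaseq/

# 3. Quantify expression on Tier 1 (a pre-built salmon index is assumed).
salmon quant -i $HOME/work/refs/salmon_index -l A \
    -1 $HOME/work/rnaseq/sampleA_R1.fastq.gz \
    -2 $HOME/work/rnaseq/sampleA_R2.fastq.gz \
    -p 8 -o $HOME/work/rnaseq/quant/sampleA

# 5./6. Keep the small result files on Tier 2, then clean up Tier 1.
rsync -av $HOME/work/rnaseq/quant/ /data/cephfs-2/<group>/rnaseq_results/
rm $HOME/work/rnaseq/sampleA_R1.fastq.gz $HOME/work/rnaseq/sampleA_R2.fastq.gz
```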

!!! tip
If using `STAR`, don't forget to use your `scratch` area for transient operations.
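
For instance, `STAR`'s transient files can be pointed at the scratch area with `--outTmpDir` (a hedged sketch; index location, sample names, and the remaining options are placeholders, not a recommended parameter set):

```bash
# Hypothetical STAR call keeping temporary files on scratch; the temporary
# directory should be a fresh path, the mapped BAM stays in the work space.
STAR --runThreadN 8 \
     --genomeDir $HOME/work/refs/STAR_index \
     --readFilesIn sampleA_R1.fastq.gz sampleA_R2.fastq.gz \
     --readFilesCommand zcat \
     --outTmpDir $HOME/scratch/star_tmp/sampleA \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix $HOME/work/rnaseq/star/sampleA_
```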
@@ -142,13 +144,19 @@ Large raw files need to be processed once and only the outcome of the processing
Therefore, a typical workflow would be:

1. Copy your `fastq` files from Tier 2 to Tier 1.
2. Perform raw data QC (for example with `fastqc`).
2. Perform raw data QC, and store the results on Tier 2.
3. Get the count matrix, e. g. using `Cell Ranger` or `alevin-fry`, perform count matrix QC and store the results on Tier 1.
4. **Remove raw data, bam & count files from Tier 1.**
5. Downstream analysis with `seurat`, `scanpy`, or `Loupe Browser`.
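
A hedged sketch of steps 1, 3, and 4 (group folder, sample names, and the transcriptome reference are placeholders, and the `cellranger count` options are reduced to the common ones):

```bash
# 1. Stage the raw data on Tier 1.
mkdir -p $HOME/work/scrnaseq/fastq
cp /data/cephfs-2/<group>/raw_data/sampleA_S1_L001_R1_001.fastq.gz \
   /data/cephfs-2/<group>/raw_data/sampleA_S1_L001_R2_001.fastq.gz \
   $HOME/work/scrnaseq/fastq/

# 3. Build the count matrix on Tier 1 (reference package assumed to be available).
cd $HOME/work/scrnaseq
cellranger count --id=sampleA \
    --transcriptome=$HOME/work/refs/refdata-gex-GRCh38-2020-A \
    --fastqs=$HOME/work/scrnaseq/fastq \
    --sample=sampleA

# 4. Keep the filtered matrix on Tier 2, then remove the large intermediates.
rsync -av sampleA/outs/filtered_feature_bc_matrix/ \
    /data/cephfs-2/<group>/scrnaseq_results/sampleA/filtered_feature_bc_matrix/
rm -r $HOME/work/scrnaseq/fastq sampleA
```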

#### Machine learning

There is no obvious workflow that covers most use cases for machine learning.
However,

- Training might be done on `scratch`, where data access is fast and data size is less constrained than in the work space. Keep in mind, however, that files there disappear after 14 days.
- Some models can be updated with new data, without needing to keep the whole dataset on Tier 1.
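
For instance, a training run could stage its dataset on `scratch` and keep only model checkpoints on more permanent storage (script name, dataset, and checkpoint locations below are all assumptions, not a prescribed layout):

```bash
# Stage the training data to scratch (fast, generous space, but purged after 14 days).
rsync -av /data/cephfs-2/<group>/datasets/training_set/ $HOME/scratch/training_set/

# Train, writing checkpoints to the work space so they survive the scratch cleanup.
# 'train_model.py' and its options are hypothetical.
python train_model.py \
    --data-dir $HOME/scratch/training_set \
    --checkpoint-dir $HOME/work/ml_models/run_001

# Archive the final checkpoints on Tier 2; the staged copy on scratch can simply expire.
rsync -av $HOME/work/ml_models/run_001/ /data/cephfs-2/<group>/ml_models/run_001/
```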

## Data migration process from old `/fast` to CephFS
1. Administrative preparations
    1. HPC-Access registration (PIs will receive an invite mail)
