TSCC and Snakemake

Content and Goal

Content:

  1. how to submit jobs in TSCC
  2. how to organize jobs using snakemake

Goal:

  1. present a global picture of jobs in TSCC
  2. provide a relatively complete reference

Installation of snakemake (three ways)

  • Install snakemake using conda (install miniforge (fast and preferred) or conda first)
    1. conda install -n base -c conda-forge mamba
    2. conda activate base
    3. mamba create -c conda-forge -c bioconda -n snakemake snakemake
  • conda install -c conda-forge -c bioconda snakemake in a given conda env
  • run make install in the current directory; see the Makefile for details
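
A quick way to confirm the installation worked (a minimal sketch, assuming the environment created by the first option is named snakemake):

    conda activate snakemake
    snakemake --version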

Prepare the demo scripts

  1. git clone https://github.com/beyondpie/tools
  2. cd tools/useSnakemake

Introduction to TSCC

TSCC is the Triton Shared Computing Cluster at UC San Diego (UCSD).

Login nodes:

  • when you log into TSCC via SSH, you land on one of these nodes
  • SSH keys: fast access without typing a password
    • set up just like GitHub SSH keys
    • See Generating SSH Keys in TSCC
  • Use an SSH config to simplify the ssh command, with a demo below
    • You can paste this into the file ~/.ssh/config
    • Next time, you can use ssh tscc as a short command (see the sketch after this config block).
    • NOTE: replace all the contents within [], including [ and ]
    Host *
        ControlMaster auto
        ServerAliveInterval 5
        ServerAliveCountMax 2
        ControlPath ~/.ssh/control-%h-%p-%r
        ControlPersist yes
    
    Host tscc
        HostName [tscc login node address]
        User [your name]
        IdentityFile ~/.ssh/[your private key file]
        
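A minimal sketch of the key setup plus the shortened login; the key file name is just an example, and ssh-copy-id assumes you can still log in with a password once:

    # on your local machine: generate a key pair (ed25519 is a common choice)
    ssh-keygen -t ed25519 -f ~/.ssh/id_tscc
    # copy the public key to TSCC (asks for your TSCC password one last time)
    ssh-copy-id -i ~/.ssh/id_tscc.pub [your name]@[tscc login node address]
    # with the config above (IdentityFile pointing to ~/.ssh/id_tscc):
    ssh tscc
    # the same alias also works for file transfer
    scp myscript.pbs tscc:~/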

Computing nodes:

  • where your jobs run when you submit them (interactive or not) with qsub
  • TSCC condo program
    • CPU Nodes:
      • The Gold node: 2 CPUs * 18 cores/CPU, 256GB RAM in total (36 cores in total, so about 7GB/core)
      • The Platinum node: 2 CPUs * 32 cores/CPU, 1TB RAM in total (64 cores in total, so about 15GB/core)
      • nodes=1:ppn=16:icelake:mem1024
        • starts your job with 16 CPU cores on a Platinum node (i.e., one with 1TB memory); see the example after this list
        • only the condo / home / glean / gpu-condo queues can use this
          • the condo queue has an 8-hour walltime limit
          • home means the nodes we purchased
        • this does not request 1024GB of RAM; it asks for an Icelake node that has 1024GB of memory
      • hotel nodes are not the Platinum/Gold nodes
    • GPU Nodes:
      • CPU side: 2 CPUs * 32 cores/CPU, 256GB RAM in total (64 cores, so about 4GB/core)
      • NVIDIA Ampere A100, 40GB of GPU memory
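
For example, requesting an interactive session on a Platinum (Icelake, 1TB) node could look like the following; the queue and walltime here are only illustrative:

    # 16 cores on one Icelake node that has 1TB of RAM, via the condo queue
    qsub -I -q condo -l nodes=1:ppn=16:icelake:mem1024 -l walltime=02:00:00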

Data-intensive nodes:

  • named dm1
  • dedicated nodes for data-intensive work, such as copying large datasets
  • No need to submit jobs; just run your commands directly.

Storage

  • OASIS: 800+ TB Data Oasis Lustre-based high-performance parallel file system
    • Each user has 25TB of space.
    • 90-day purge policy: files not touched for 90 days may be removed.
    • Use touch [your file] to update a file's timestamp.
    • Use find . -exec touch {} \; to update all files in the current directory (see the sketch after this list).
    • Efficient for large files; not good at handling many small files.
  • HOME: ~, 100 GB per user
    • Heavy job output should not be written here:
      • This NFS file system has only one server handling all the metadata and storage requirements.
  • PROJECT: condo storage we buy
    • Typically 300 TB shared by all the users in our lab
    • use df -h [project dir] to check the space information
  • TMPDIR
    • node-local scratch for handling many temporary files; it is removed after the job is done
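
A small sketch of the housekeeping commands mentioned above; the Oasis path here is hypothetical:

    # refresh timestamps under an Oasis folder so files are not hit by the 90-day purge
    cd /oasis/tscc/scratch/$USER/myproject   # hypothetical path
    find . -exec touch {} \;
    # check how much condo (PROJECT) storage is left
    df -h [project dir]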

SU: how we pay for TSCC

1 SU: 1 core * 1 hour

  • the core count is fixed by what you request when you submit a job
  • the time is re-estimated from how long the job actually runs
  • for example, a job using 4 cores for 10 hours consumes 40 SUs
  • $250 (10K SUs):
    • 1 SU ~ $0.025; 40 SUs ~ $1; 200 SUs ~ a Starbucks coffee

How to check the remaining balance:

  • gbalance -u [user name]
  • add the line alias mymoney="gbalance -u [user name]" to your ~/.bashrc file and run source ~/.bashrc; after that, mymoney shows your balance.

Queue: assign a queue when you submit a job

  • hotel
    • max walltime: 168 hours (1 week); max cores/user: 128
  • home
    • max walltime: unlimited; max cores/user: unlimited
  • glean: free of charge, but jobs may be stopped by the system at any time
    • max walltime: 8 hours; max cores/user: 1024
  • gpu-related queues:
    • gpu-hotel: like hotel
    • gpu-condo: max walltime: 8 hours; max cores/user: 84

Submitting jobs in TSCC

Job managers/schedulers in HPC (High-Performance Computing) systems

  • TORQUE Resource Manager (or Portable Batch System, PBS)
    • TSCC now uses this
  • Slurm workload manager

Typical PBS script

A draft PBS script:

#! /bin/bash
#PBS -q glean
#PBS -N test_pbs
#PBS -l nodes=1:ppn=1
#PBS -l walltime=[hh:mm:ss]
#PBS -o [output file]
#PBS -e [error file]
#PBS -V
#PBS -M [email address list]
#PBS -m abe
#PBS -A ren-group
[All the shell commands you want to have here]
  • Create a script like the one above, then submit it with qsub [the_script]
  • Use qstat -u [user name] to get the status of the submitted job
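
For reference, a filled-in version of the draft above; the queue, job name, resources, output files, and the commands at the end are made-up examples:

    #!/bin/bash
    #PBS -q hotel
    #PBS -N test_pbs
    #PBS -l nodes=1:ppn=4
    #PBS -l walltime=04:00:00
    #PBS -o test_pbs.out
    #PBS -e test_pbs.err
    #PBS -V
    #PBS -M [email protected]
    #PBS -m abe
    #PBS -A ren-group
    # PBS starts the job in $HOME; move to the submission directory first
    cd $PBS_O_WORKDIR
    echo "running on $(hostname) with 4 cores"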

Interactive job

  • qsub -I -q glean -l nodes=1:ppn=2 -l walltime=08:00:00
  • Add alias myjob="qsub -I -q glean -l nodes=1:ppn=2 -l walltime=08:00:00" to your ~/.bashrc, then source ~/.bashrc. You can then use myjob to quickly start an interactive job without needing to remember the details.

Why we need a workflow manager

  • I want to submit 1,000 jobs.
  • Some of them failed, and I need to rerun only those.
  • I have a pipeline, which means the jobs have dependencies.

A workflow manager can:

  • handle the dependencies between jobs
  • automatically resume the pipeline from where it failed
  • if some intermediate files or scripts are updated, automatically rerun all the later rules that depend on them

Snakemake = Snake[Python] + make[GNU make]

make: a dependency-tracking build utility, first written in 1976

  • Still widely used today, especially GNU make
  • Drawbacks:
    • no native support for configuration files such as YAML or JSON
    • no built-in support for submitting jobs on HPC
  • Snakemake may fade away in the future, but make will most likely still be around.

Snakemake:

  • written in Python, which makes it simple to use
  • supports config files such as YAML and JSON
  • supports HPC schedulers: both PBS and Slurm

What it looks like:

## optional config file
configfile: "config.yaml"
content = "say hi"
samples = ["a", "b"]

## rule all lists all the output files you want to have
rule all:
    input:
        expand("flag/pre_{s}.done", s=samples),
        expand("flag/first_{s}.done", s=samples)

## then set up the rules that generate them
rule pre:
    output:
        # s will be inferred from the outputs requested in rule all;
        # here it will be a or b.
        # snakemake will run s=a and s=b in parallel if possible.
        # touch() automatically generates the flag file
        # once the rule is done.
        touch("flag/pre_{s}.done")
    log:
        "log/{s}.log"
    shell:
        """
        # use wildcards.s to get a / b
        echo "pre:" {content} {wildcards.s} > {log} 2>&1
        """

rule first:
    # snakemake knows this rule depends on the output of pre,
    # so it will run after pre
    input:
        "flag/pre_{s}.done"
    output:
        touch("flag/first_{s}.done")
    log:
        "log/{s}.log"
    shell:
        """
        echo "first:" {content} {wildcards.s} > {log} 2>&1
        """
  • Then save the above into a file named demo.snakefile.
  • Run snakemake --snakefile demo.snakefile --cores 1 to execute it locally (recent Snakemake versions require a --cores argument for local runs).
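
A few standard Snakemake command-line flags that are handy while developing the demo (drawing the DAG needs Graphviz installed for the dot command):

    snakemake --snakefile demo.snakefile --cores 1 -n -p              # dry run, print the shell commands
    snakemake --snakefile demo.snakefile --cores 2                    # run locally, up to 2 rules in parallel
    snakemake --snakefile demo.snakefile --cores 1 --dag | dot -Tpdf > dag.pdf   # draw the rule graph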

Summary

  • snakefile
    • just Python syntax + snakemake keywords, such as config, rule, expand, wildcards and so on
    • [OPTIONAL but good in practice] a config.yaml to declare variables
  • The input of rule all declares all the final outputs
  • The input and output of the rules organize the dependencies between tasks
  • wildcards
    • inferred based on file names
    • the key mechanism to run multiple jobs in parallel

Use Snakemake to control the jobs in TSCC

Use a profile to set up the HPC-specific environment

Demo for PBS

  • mkdir profile
  • under profile, we create two files.
    • one is config.yaml.
      cluster-config: "profile/cluster.yaml"
      cluster: "qsub -N {cluster.jobname} -l nodes={cluster.nodes}:ppn={cluster.ppn},walltime={cluster.walltime} -A {cluster.account} -q {cluster.queue} -M {cluster.email} -m {cluster.mailon} -j {cluster.jobout} -e {cluster.log} -V "
      jobs: 100
              
    • the other is cluster.yaml:
      __default__:
          jobname: "{rule}.{wildcards}"
          nodes: 1
          ppn: 1
          walltime: "02:00:00"
          account: "ren-group"
          queue: "glean"
          email: "[email protected]"
          mailon: "ae"
          jobout: "oe"
          log: "{rule}.{wildcards}.tscc.log"
      pre:
          ppn: 1
          queue: "glean"
          walltime: 00:10:00
      first:
          ppn: 1
          queue: "glean"
          walltime: 00:10:00
              
    • Then run snakemake --snakefile demo.snakefile --profile profile to submit the jobs to PBS, as sketched below.
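
Putting it together, a typical session could look like the sketch below; qstat works the same as for hand-written PBS jobs:

    # check what would be submitted, without submitting anything
    snakemake --snakefile demo.snakefile --profile profile -n
    # submit for real; snakemake keeps at most 100 jobs in the PBS queue
    # (the "jobs: 100" setting in profile/config.yaml)
    snakemake --snakefile demo.snakefile --profile profile
    # monitor the submitted jobs
    qstat -u [user name]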

Tips

  • Use screen on a login node, then start snakemake inside it (see the sketch after this list)
  • Use glean to test your pipeline free of charge
  • Use glean to run your snakemake jobs free of charge
    • as long as each job can finish within 8 hours
    • if jobs fail for unknown reasons (glean jobs can be stopped by the system), just rerun snakemake
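
A minimal screen workflow for the first tip; the session name smk is arbitrary:

    # on a login node: start a named screen session and launch snakemake inside it
    screen -S smk
    snakemake --snakefile demo.snakefile --profile profile
    # detach with Ctrl-a d, log out, and later re-attach with:
    screen -r smk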