# ausarg/pipesnake: Usage

## Introduction

To use and customise the workflow efficiently, we recommend that users learn more about Nextflow, workflow configuration, and parameter prioritisation.

In general, most workflow parameters have default values that work as-is for most cases. These parameters are configured in the configuration file (`base.config`). All of them can be customised either by providing a new config file and passing it to the `nextflow run` command with `-c` or `-config` (a core Nextflow parameter, described below), or by passing individual parameters with a double hyphen `--`. Details about the workflow parameters and their default values are given below.
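For example, a minimal custom config that overrides two of the defaults described below might look like the following sketch (the overridden values are illustrative):

```groovy
// my_params.config -- override selected defaults from base.config
params {
    tree_method   = 'raxml'  // use RAxML instead of the default iqtree
    batching_size = 100      // process alignments in smaller batches
}
```

It can then be passed to the run command:

```bash
nextflow run ausarg/pipesnake -c my_params.config --input sample_sheet.csv --outdir results
```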

## Pipeline parameters

You can get the full list of pipeline parameters and their default values with the following command:

```bash
nextflow run ausarg/pipesnake --help --show_hidden_params
```

The most commonly used parameters are described below. They are passed to the pipeline with a double hyphen `--`, followed by the parameter name and its value, e.g. `--input /scratch/testdata/sample_sheet.csv --disable_filter true`.

### Main options

| Parameter | Type | Description |
| --- | --- | --- |
| `--input` | string | Path to a comma-separated file containing information about the samples in the experiment. |
| `--outdir` | string | The output directory where the results will be saved. You must use absolute paths to storage on cloud infrastructure. |
| `--disable_filter` | boolean | Default: `true`. Disables the bbmap filtering process, which speeds up the run. When filtering is enabled, the `--reference_genome` parameter is required. |
| `--reference_genome` | string | Path to the filter sequences FASTA file. |
| `--blat_db` | string | Path to the target sequences FASTA file. |
| `--tree_method` | string | Default: `iqtree`. Supported options: `iqtree` or `raxml`. |
| `--trim_alignment` | boolean | Default: `false`. Trim the initial MAFFT alignments. |
| `--batching_size` | integer | Default: `250`. Number of alignment files to process sequentially per batch, to avoid submitting a very large number of jobs on HPC systems. |
| `--trinity_scratch_tmp` | boolean | Default: `true`. Trinity generates a large number of intermediate files, which can be a problem on HPC systems that limit the number of files per user. With this option, Trinity writes to the `/tmp` directory on the compute node and then copies the compressed output directory (not the FASTA) to the working directory. |
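Putting a few of these options together, a run that supplies its own filter and target databases might look like the following sketch (all paths are illustrative). Because filtering is enabled (`--disable_filter false`), `--reference_genome` is supplied:

```bash
nextflow run ausarg/pipesnake \
    --input /scratch/testdata/sample_sheet.csv \
    --outdir /scratch/results \
    --disable_filter false \
    --reference_genome /scratch/db/filter_sequences.fasta \
    --blat_db /scratch/db/target_sequences.fasta \
    --tree_method iqtree \
    -profile singularity
```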

### Tool arguments for each stage of the pipeline

Variables shown in *italics* in the default values refer to other pipeline parameters.

| Parameter | Type | Description |
| --- | --- | --- |
| `--phylogeny_make_alignments_minsamp` | integer | Default: `4`. Minimum number of samples required to constitute an alignment. Some phylogeny-building methods rely on a minimum number of samples in an alignment; to estimate bootstraps for a gene tree or to include a gene tree in an ASTRAL analysis, the minimum is 4. Suggested usage: `4`. |
| `--phylogeny_make_alignments_args` | string | Default: `--minsamp` *phylogeny_make_alignments_minsamp*. The arguments to be passed to the process (phylogeny_make_alignments). The value is taken from the `--phylogeny_make_alignments_minsamp` parameter. |
| `--trinity_postprocessing_args` | string | Default: None. |
| `--trimmomatic_clean_pe_args` | string | Default: `-phred33 LEADING`. The arguments to be passed to the process (trimmomatic_clean_pe). |
| `--trimmomatic_clean_se_args` | string | Default: `-phred33 LEADING`. The arguments to be passed to the process (trimmomatic_clean_se). |
| `--prepare_samplesheet_args` | string | The arguments to be passed to the process (prepare_samplesheet). |
| `--blat_parser_evalue` | number | Default: `1e-10`. Minimum required E-value for a match between contig and target. E-value scores estimate the number of hits expected by chance; the lower the score, the fewer (but better) the matches. Matches with an E-value greater than this threshold are discarded; values closer to 0 require greater similarity. Suggested usage: `1e-10`. |
| `--blat_parser_match` | number | Default: `80`. Minimum required percent (0-100) match between contig and target. Contig-to-target matches below this threshold are discarded; values closer to 100 require greater similarity. Suggested usage: `80`. |
| `--parse_blat_results_args` | string | Default: `--evalue` *blat_parser_evalue* `--match` *blat_parser_match*. The arguments to be passed to the process (parse_blat_results). |
| `--quality_2_assembly_args` | string | The arguments to be passed to the process (quality_2_assembly). |
| `--samplesheet_check_args` | string | The arguments to be passed to the process (samplesheet_check). |
| `--prepare_adaptor_args` | string | The arguments to be passed to the process (prepare_adaptor). |
| `--bbmap_reformat_minconsecutivebases` | number | Default: `100.0`. |
| `--bbmap_reformat_dotdashxton` | boolean | Default: `true`. |
| `--bbmap_reformat_fastawrap` | number | Default: `32000.0`. |
| `--bbmap_reformat_args` | string | The arguments to be passed to the process (bbmap_reformat). |
| `--bbmap_reformat2_args` | string | Default: `minconsecutivebases=`*bbmap_reformat_minconsecutivebases* `dotdashxton=`*bbmap_reformat_dotdashxton* `fastawrap=`*bbmap_reformat_fastawrap*. The arguments to be passed to the process (bbmap_reformat2). |
| `--convert_phyml_args` | string | The arguments to be passed to the process (convert_phyml). |
| `--preprocessing_args` | string | Default: None. |
| `--bbmap_dedupe_args` | string | Default: None. |
| `--bbmap_filter_minid` | number | Default: `0.75`. Minimum identity to the reference sequence required to retain a read. Reads that do not map to a reference target with at least this identity score are discarded; this was designed to quickly filter off-target reads such as mtDNA. Note that the application of `minid` is more complicated than it looks: BBMap also has an `idfilter=X` option which is more literal. Read the BBMap documentation and the forum response by Brian Bushnell for more nuance. Suggested usage: `0.75`. |
| `--bbmap_filter_mem` | integer | Default: `2`. Memory to use for mapping/filtering (GB). |
| `--bbmap_filter_args` | string | Default: `minid=`*bbmap_filter_minid*. The arguments to be passed to the process (bbmap_filter). |
| `--perl_cleanup_args` | string | Default: `-pi -w -e "s/!/N/g;"`. The arguments to be passed to the process (perl_cleanup). |
| `--concatenate_args` | string | The arguments to be passed to the process (concatenate). |
| `--merge_trees_args` | string | The arguments to be passed to the process (merge_trees). |
| `--trimmomatic_clean_minlength` | integer | Default: `36`. Minimum read length to retain after trimming. Reads shorter than this threshold after adaptor/barcode trimming are discarded. Suggested usage: `36`. |
| `--trimmomatic_clean_trail` | integer | Default: `3`. Remove trailing bases below the indicated quality threshold. Suggested usage: `3`. |
| `--trimmomatic_clean_head` | integer | Default: `3`. Remove leading bases below the indicated quality threshold. Suggested usage: `3`. |
| `--trimmomatic_clean_qual` | integer | Default: `15`. Minimum quality score of a 4-base sliding window; the read is cut when the window quality drops below this threshold. Suggested usage: `15`. |
| `--trimmomatic_args` | string | Default: `-phred33`. The arguments to be passed to the process (trimmomatic). |
| `--make_rgb_kept_tags` | string | Default: `easy_recip_match,complicated_recip_match`. |
| `--make_prg_args` | string | Default: `--kept-tags` *make_rgb_kept_tags*. The arguments to be passed to the process (make_prg). |
| `--gblocks_b1` | number | Default: `0.5`. Minimum number of sequences for a position to be identified as a conserved site. The value must be greater than half the number of sequences, e.g. the minimum value is 0.5 and it is rounded up. |
| `--gblocks_b2` | number | Default: `0.85`. Minimum number of sequences for a position to be identified as a flanking site. Flanking sites are assessed until they make a series of conserved positions at both flanks relative to the contiguous nonconserved sites. This value must be equal to or greater than `b1`. |
| `--gblocks_b3` | integer | Default: `8`. Maximum number of contiguous nonconserved sites allowed. Stretches of contiguous nonconserved sites longer than `b3` are rejected; greater `b3` values increase the number of selected positions. |
| `--gblocks_b4` | number | Default: `10`. Minimum length of a sequence block after gap cleaning. After gap cleaning, sequence blocks shorter than this value are rejected. |
| `--gblocks_args` | string | Default: `-t=DNA -b3=`*gblocks_b3* `-b4=`*gblocks_b4* `-b5=h -p=n`. The arguments to be passed to the process (gblocks). |
| `--testing_args` | string | Default: None. |
| `--trinity_normalize_reads` | boolean | Default: `false`. Normalize the read pool, discarding excess coverage. Depending on the sequencing effort, there may be excess reads (above the desired coverage) that slow down computation by requiring additional memory. New versions of Trinity normalize reads by default, and doing so is highly recommended here. |
| `--trinity_processed_header` | string | Default: `contig`. Prefix (naming convention) for contigs of assembled reads. Suggested usage: `contig`. |
| `--trinity_args` | string | Default: `--seqType fq --NO_SEQTK`. The arguments to be passed to the process (trinity); check the wiki. |
| `--iqtree_args` | string | Default: `--quiet -B 1000`. The arguments to be passed to the process (iqtree). |
| `--aster_args` | string | The arguments to be passed to the process (aster). |
| `--macse_stop` | number | Default: `10.0`. |
| `--macse_program` | string | Default: `refineLemmon`. |
| `--macse_refine_alignment_optim` | number | Default: `1.0`. |
| `--macse_refine_alignment_local_realign_init` | number | Default: `0.1`. |
| `--macse_refine_alignment_local_realign_dec` | number | Default: `0.1`. |
| `--macse_refine_alignment_fs` | number | Default: `10.0`. |
| `--macse_args_refine` | string | Default: `-stop` *macse_stop* `-prog refineAlignment`. The arguments to be passed to the process (macse). |
| `--macse_args_export` | string | Default: `-prog exportAlignment -stop` *macse_stop*. The arguments to be passed to the process (macse). |
| `--macse_args_align` | string | Default: `-prog alignSequences -stop_lr` *macse_stop*. The arguments to be passed to the process (macse). |
| `--macse_args_refineLemmon` | string | Default: `-prog refineAlignment -optim` *macse_refine_alignment_optim* `-local_realign_init` *macse_refine_alignment_local_realign_init* `-local_realign_dec` *macse_refine_alignment_local_realign_dec* `-fs` *macse_refine_alignment_fs*. The arguments to be passed to the process (macse). |
| `--mafft_maxiterate` | integer | Default: `1000`. Number of cycles of iterative refinement. Iterative refinement improves the alignment at the cost of additional time; 1000 iterations is sufficient for most exercises. |
| `--mafft_args` | string | Default: `--maxiterate` *mafft_maxiterate* `--globalpair --adjustdirection --quiet`. The arguments to be passed to the process (mafft). |
| `--raxml_runs` | integer | Default: `100`. |
| `--raxml_args` | string | Default: `-m GTRCAT -f a -n`. The arguments to be passed to the process (raxml). |
| `--blat_args` | string | Default: `-out=blast8`. The arguments to be passed to the process (blat). |
| `--pear_args` | string | The arguments to be passed to the process (pear). |
| `--sed_args` | string | The arguments to be passed to the process (sed). |
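Any of the `*_args` parameters can be overridden on the command line; quote the value so that the embedded flags reach the tool intact. A sketch (the values shown are illustrative):

```bash
nextflow run ausarg/pipesnake \
    --input sample_sheet.csv \
    --outdir results \
    --iqtree_args '--quiet -B 5000' \
    --mafft_maxiterate 500 \
    -profile docker
```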

## Resources for each stage of the pipeline

You can customise the resources requested for each stage of the pipeline, including CPUs, memory, and walltime, using the parameters `<process-name>_cpus`, `<process-name>_memory`, and `<process-name>_walltime`, respectively.

`<process-name>` can be any one of the following processes:

```
perl_cleanup, phylogeny_make_alignments, preprocessing, trimmomatic_clean_pe, trinity_postprocessing, blat, bbmap_reformat, gblocks, parse_blat_results, aster, bbmap_reformat2, convert_phyml, trimmomatic, iqtree, concatenate, trinity, saved_output, bbmap_filter, bbmap_dedupe, prepare_adaptor, mafft, sed, pear, merge_trees, raxml, trimmomatic_clean_se, make_prg, macse, quality_2_assembly
```

Examples:

```bash
--perl_cleanup_cpus 4 --blat_memory 8.GB --mafft_walltime 9.h
```
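The same overrides can also be kept in a custom config file and passed with `-c` (a sketch, assuming the resource parameters accept the same values as on the command line):

```groovy
// resources.config -- per-process resource overrides
params {
    perl_cleanup_cpus = 4
    blat_memory       = '8.GB'
    mafft_walltime    = '9.h'
}
```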

## Running the pipeline

The typical command for running the pipeline is as follows:

```bash
nextflow run ausarg/pipesnake --input samplesheet.csv --outdir <OUTDIR> -profile docker
```

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.

Note that the pipeline will create the following files in your working directory:

```
work                # Directory containing the Nextflow working files
<OUTDIR>            # Finished results in the specified location (defined with --outdir)
.nextflow.log       # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs.
```

### Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. Subsequent runs will always use this cached version if available, even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, regularly update the cached copy:

```bash
nextflow pull ausarg/pipesnake
```

### Reproducibility

It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.

First, go to the ausarg/pipesnake releases page and find the latest version number (numeric only, e.g. `1.3.1`). Then specify this version when running the pipeline with `-r` (one hyphen), e.g. `-r 1.3.1`.

This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.
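For example, to pin a run to a specific release:

```bash
nextflow run ausarg/pipesnake -r 1.3.1 --input samplesheet.csv --outdir results -profile docker
```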

## Core Nextflow arguments

> **NB:** These options are part of Nextflow and use a single hyphen (pipeline parameters use a double hyphen).

### `-profile`

Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.

Several generic profiles are bundled with the pipeline that instruct it to use software packaged with different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda); see below. When using Biocontainers, most of these software packaging methods pull Docker containers from quay.io (e.g. FastQC), except for Singularity, which directly downloads Singularity images hosted by the Galaxy project via HTTPS, and Conda, which downloads and installs software locally from Bioconda.

We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility; however, when this is not possible, Conda is also supported.

The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.

Note that multiple profiles can be loaded, for example: `-profile test,docker`. The order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.

If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is not recommended.

- `docker`
  - A generic configuration profile to be used with Docker
- `singularity`
  - A generic configuration profile to be used with Singularity
- `podman`
  - A generic configuration profile to be used with Podman
- `shifter`
  - A generic configuration profile to be used with Shifter
- `charliecloud`
  - A generic configuration profile to be used with Charliecloud
- `conda`
  - A generic configuration profile to be used with Conda. Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
- `test`
  - A profile with a complete configuration for automated testing
  - Includes links to test data so needs no other parameters
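For example, to run the bundled test configuration inside Docker containers (only `--outdir` is supplied, since the test profile provides its own inputs):

```bash
nextflow run ausarg/pipesnake -profile test,docker --outdir test_results
```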

### `-resume`

Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For inputs to be considered the same, not only the file names but also the files' contents must be identical. For more info about this parameter, see this blog post.

You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.
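A minimal sketch (the run name shown is illustrative; take a real one from the `nextflow log` output):

```bash
# List previous runs and their generated names
nextflow log

# Resume a specific earlier run by name
nextflow run ausarg/pipesnake --input samplesheet.csv --outdir results -profile docker -resume boring_euler
```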

### `-c`

Specify the path to a specific config file (this is a core Nextflow option). See the nf-core website documentation for more information.

### `-work-dir`

Specify the path to your preferred working directory, instead of your current working directory.

## Custom configuration

### Updating containers

The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. If for some reason you need to use a different version of a particular tool with the pipeline, you just need to identify the process name and override the Nextflow container definition for that process using the `withName` declaration. For example, in the nf-core/viralrecon pipeline, a tool called Pangolin has been used during the COVID-19 pandemic to assign lineages to SARS-CoV-2 genome sequenced samples. Given that lineage assignments change quite frequently, it doesn't make sense to re-release nf-core/viralrecon every time a new version of Pangolin is released. Instead, you can override the default container used by the pipeline by creating a custom config file and passing it as a command-line argument via `-c custom.config`.

1. Check the default version used by the pipeline in the module file for Pangolin

2. Find the latest version of the Biocontainer available on Quay.io

3. Create the custom config accordingly:

   - For Docker:

     ```groovy
     process {
         withName: PANGOLIN {
             container = 'quay.io/biocontainers/pangolin:3.0.5--pyhdfd78af_0'
         }
     }
     ```

   - For Singularity:

     ```groovy
     process {
         withName: PANGOLIN {
             container = 'https://depot.galaxyproject.org/singularity/pangolin:3.0.5--pyhdfd78af_0'
         }
     }
     ```

   - For Conda:

     ```groovy
     process {
         withName: PANGOLIN {
             conda = 'bioconda::pangolin=3.0.5'
         }
     }
     ```

> **NB:** If you wish to periodically update individual tool-specific results (e.g. Pangolin) generated by the pipeline, then you must ensure to keep the `work/` directory, otherwise the `-resume` ability of the pipeline will be compromised and it will restart from scratch.

## Running in the background

Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.

The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
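For example (paths illustrative):

```bash
# Launch in the background, detached from the terminal; logs are saved to a file
nextflow -bg run ausarg/pipesnake --input samplesheet.csv --outdir results -profile docker
```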

Alternatively, you can use `screen`/`tmux` or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).

## Nextflow memory requirements

In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~/.bash_profile`):

```bash
NXF_OPTS='-Xms1g -Xmx4g'
```
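One way to make this setting persistent (a sketch; adjust the profile file to your shell):

```bash
# Append the setting to your shell profile and reload it
echo "export NXF_OPTS='-Xms1g -Xmx4g'" >> ~/.bashrc
source ~/.bashrc
```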