Single-sample pipeline testing on Biowulf #1

@LauraEgolf

Independent user test of GATK-SV single-sample pipeline on Biowulf

Ben previously configured the GATK-SV single-sample pipeline for Biowulf and tested it with a COVNET WGS sample. I ran an independent test with one of the osteosarcoma WGS samples, CCSS_1000278_A.

The single-sample pipeline processes a single test sample jointly with a reference panel. It reduces computational time somewhat by using certain precomputed inputs, but this mode is still much less computationally efficient than the cohort/batch mode (best used for 100+ samples). Here we used the reference panel of 156 samples from 1000 Genomes, which is provided by GATK-SV (the same panel is used in the example Terra workspace).

Note that the single-sample pipeline generally will only work for PCR-free WGS samples, in my experience.

Working directory: /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/

Build inputs

Help message for build_inputs.py:

positional arguments:  
    input_values_directory Directory containing input value map JSON files
    template_path          Path to template directory or file (directories will be processed recursively)
    output_directory       Directory to create output files in

optional arguments:
    -h, --help          show this help message and exit
    -a ALIASES          Aliases for input value bundles
    --log-info          Show INFO-level logging messages. Use for troubleshooting.

Example from GATK-SV documentation:

# Build test files for the single-sample workflow
python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/test/GATKSVPipelineSingleSample \
    inputs/build/NA19240/test_my_ref_panel \
    -a '{ "single_sample" : "test_single_sample_NA19240", "ref_panel" : "my_ref_panel" }'

Ben's example:

BASE_DIR=/data/COVID_WGS/StructuralVariantCalling/gatk-sv 
scripts/inputs/build_inputs.py \
	${BASE_DIR}/inputs/values \
	${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
	${BASE_DIR}/inputs/build/SC695914/test \
	-a '{ "single_sample" : "test_single_sample_SC695914.json", "ref_panel" : "ref_panel_1kg" }'

Setup for my test run:

BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv 
${BASE_DIR}/scripts/inputs/build_inputs.py \
	${BASE_DIR}/inputs/values_20240405 \
	${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
	${BASE_DIR}/inputs/build_20240405/CCSS_1000278_A \
	-a '{ "single_sample" : "test_single_sample_CCSS_1000278_A", "ref_panel" : "ref_panel_1kg" }'
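Before running build_inputs.py, it can help to confirm that every alias points at a bundle that actually exists. The sketch below assumes that an alias value such as ref_panel_1kg resolves to a file named ref_panel_1kg.json inside the values directory; that mapping is my assumption, not something stated in the help message, and check_alias_files is a hypothetical helper, not part of GATK-SV.

```shell
#!/usr/bin/env bash
# Sketch: check that each alias bundle has a JSON file in the values directory.
# Assumption: alias "foo" resolves to <values_dir>/foo.json.
# check_alias_files is a hypothetical helper, not part of GATK-SV.

check_alias_files() {
    local values_dir="$1"
    shift
    local missing=0 name
    for name in "$@"; do
        if [[ -f "${values_dir}/${name}.json" ]]; then
            echo "found:   ${values_dir}/${name}.json"
        else
            echo "missing: ${values_dir}/${name}.json" >&2
            missing=1
        fi
    done
    # Non-zero exit status if any bundle file was not found
    return "${missing}"
}

# Example, using the paths from the setup for this test run:
# check_alias_files "${BASE_DIR}/inputs/values_20240405" \
#     test_single_sample_CCSS_1000278_A ref_panel_1kg
```

The function only reads the directory, so it is safe to run on a login node before submitting anything.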

Run pipeline and troubleshoot

First test run:

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405
swarm run_gatk-sv_single_sample_no_melt.swarm

Got error with GATK jar.
Added GATK to the modules list (in swarm command) and re-ran - still got the same error.

Commented out this line in all GATK-SV WDLs (under /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv):
export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override}
This resolved the GATK jar error.
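For anyone repeating this step, the edit can be scripted rather than done by hand. This is a minimal sketch, not the exact commands I ran: comment_out_gatk_jar is a hypothetical helper, and it assumes the export line sits on its own line inside a WDL command block (where a leading # is a shell comment). It keeps a .bak copy of each file it changes.

```shell
#!/usr/bin/env bash
# Sketch: comment out the GATK_LOCAL_JAR export in every WDL under a directory.
# comment_out_gatk_jar is a hypothetical helper name.
# Assumption: the export is on its own line inside a WDL command block.

comment_out_gatk_jar() {
    local wdl_dir="$1"
    # List only the WDL files that still contain an uncommented export line
    grep -rlE --include='*.wdl' '^[[:space:]]*export GATK_LOCAL_JAR=' "${wdl_dir}" |
    while IFS= read -r wdl; do
        # Preserve indentation and prefix the line with "# "; original saved as <file>.bak
        sed -i.bak -E 's/^([[:space:]]*)(export GATK_LOCAL_JAR=.*)$/\1# \2/' "${wdl}"
        echo "patched: ${wdl}"
    done
}

# Example:
# comment_out_gatk_jar /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv
```

Lines that are already commented out do not match the pattern, so running it a second time leaves the files unchanged.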

Got error with Google container registry (ubuntu image); tested again after initializing gcloud:

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405
sinteractive  # Can't load google-cloud-sdk on login nodes
module load google-cloud-sdk
gcloud init  # Follow prompts - login ([email protected]), select project (nih-nci-dceg-covnet-wgs)
swarm run_gatk-sv_single_sample_no_melt.swarm

Same error.

Update Docker inputs to avoid using GCR ubuntu container

Updated values file to avoid the problematic container:

/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv/inputs/values_20240405/dockers.json

Replaced:
"linux_docker": "marketplace.gcr.io/google/ubuntu1804",
With:
"linux_docker": "ubuntu:18.04",

Re-ran build_inputs.py and submitted a new test run, which succeeded.
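The same swap can be scripted so it is easy to reapply to a fresh copy of the values directory. This is a sketch under the assumption that the entry appears in dockers.json exactly as quoted above; use_dockerhub_ubuntu is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: point linux_docker at the Docker Hub ubuntu image instead of the GCR one.
# use_dockerhub_ubuntu is a hypothetical helper name.
# Assumption: the entry appears exactly as quoted in this issue.

use_dockerhub_ubuntu() {
    local dockers_json="$1"
    # Replace only the linux_docker value; original file saved as <file>.bak
    sed -i.bak -E \
        's#("linux_docker"[[:space:]]*:[[:space:]]*)"marketplace\.gcr\.io/google/ubuntu1804"#\1"ubuntu:18.04"#' \
        "${dockers_json}"
    # Print the resulting entry so the change can be checked by eye
    grep '"linux_docker"' "${dockers_json}"
}

# Example:
# use_dockerhub_ubuntu "${BASE_DIR}/inputs/values_20240405/dockers.json"
```

The built input JSONs are generated from the values files, so build_inputs.py has to be re-run after this change for it to take effect.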

Results

Runtime

Runtime estimate from the example Terra workspace (I assume this is based on GCP preemptible instances):

[image: runtime estimate from the example Terra workspace]

Note that the 18 GB file size is based on a ~30x CRAM file, whereas here we used a 36x BAM file (77 GB).

Runtime on Biowulf: 9 hours 13 minutes (based on the main swarm job runtime, and the timestamps on the output directory).

Output structure

An explanation of the output can be found on the Terra workspace for the GATK-SV single-sample pipeline:

[image: output description from the Terra workspace]

Results from this run: /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/gatk-sv-results

QC checks

The GATK-SV single-sample pipeline has several built-in QC checks. This sample mostly passed the QC checks, which is a good sign (I've tried running this pipeline in the past, and my PCR+ samples failed the QC checks horribly). However, it was flagged for certain SV counts being outside the 'normal' ranges defined by the pipeline developers:

[image: QC results with the flagged SV counts]

In particular, the count of deletions >100 kb is extremely high. I will investigate this further.
