Independent user test of GATK-SV single-sample pipeline on Biowulf
Ben previously configured the GATK-SV single-sample pipeline for Biowulf and tested using a COVNET WGS sample. I ran an independent test using one of the osteosarcoma WGS samples, CCSS_1000278_A.
The single-sample pipeline processes a single test sample jointly with a reference panel. It reduces computational time somewhat by using certain precomputed inputs, but this mode is still much less computationally efficient than the cohort/batch mode (best used for 100+ samples). Here we used the reference panel of 156 samples from 1000 Genomes that is provided by GATK-SV (the same panel is used in the example Terra workspace).
Note that, in my experience, the single-sample pipeline generally works only for PCR-free WGS samples.
Working directory: /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/
Build inputs
Help message for build_inputs.py:
positional arguments:
  input_values_directory  Directory containing input value map JSON files
  template_path           Path to template directory or file (directories will be processed recursively)
  output_directory        Directory to create output files in
optional arguments:
  -h, --help              show this help message and exit
  -a ALIASES              Aliases for input value bundles
  --log-info              Show INFO-level logging messages. Use for troubleshooting.
Example from GATK-SV documentation:
# Build test files for the single-sample workflow
python scripts/inputs/build_inputs.py \
inputs/values \
inputs/templates/test/GATKSVPipelineSingleSample \
inputs/build/NA19240/test_my_ref_panel \
-a '{ "single_sample" : "test_single_sample_NA19240", "ref_panel" : "my_ref_panel" }'
Ben's example:
BASE_DIR=/data/COVID_WGS/StructuralVariantCalling/gatk-sv
scripts/inputs/build_inputs.py \
${BASE_DIR}/inputs/values \
${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
${BASE_DIR}/inputs/build/SC695914/test \
-a '{ "single_sample" : "test_single_sample_SC695914.json", "ref_panel" : "ref_panel_1kg" }'
Setup for my test run:
BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv
${BASE_DIR}/scripts/inputs/build_inputs.py \
${BASE_DIR}/inputs/values_20240405 \
${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
${BASE_DIR}/inputs/build_20240405/CCSS_1000278_A \
-a '{ "single_sample" : "test_single_sample_CCSS_1000278_A", "ref_panel" : "ref_panel_1kg" }'
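The alias argument is a JSON string embedded in a shell command, which is easy to misquote. A minimal sketch of assembling the same command from Python is shown here; the helper name `build_inputs_argv` and its parameters are hypothetical, and the only flag used is `-a`, as in the commands above. It assumes the `test_single_sample_<sample>` naming pattern seen in these examples.

```python
import json
import shlex


def build_inputs_argv(base_dir, values_dir, sample_id, ref_panel,
                      build_dir="inputs/build"):
    """Assemble the build_inputs.py command line as an argv list.

    Hypothetical helper: mirrors the commands in this issue and assumes the
    single-sample values bundle is named test_single_sample_<sample_id>.
    """
    aliases = {
        "single_sample": f"test_single_sample_{sample_id}",
        "ref_panel": ref_panel,
    }
    return [
        f"{base_dir}/scripts/inputs/build_inputs.py",
        f"{base_dir}/{values_dir}",
        f"{base_dir}/inputs/templates/test/GATKSVPipelineSingleSample",
        f"{base_dir}/{build_dir}/{sample_id}",
        "-a",
        json.dumps(aliases),  # json.dumps guarantees well-formed JSON
    ]


argv = build_inputs_argv(
    base_dir="/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv",
    values_dir="inputs/values_20240405",
    sample_id="CCSS_1000278_A",
    ref_panel="ref_panel_1kg",
    build_dir="inputs/build_20240405",
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(argv))
```

The list can also be passed directly to `subprocess.run(argv, check=True)`, which avoids shell quoting altogether.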
Run pipeline and troubleshoot
First test run:
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405
swarm run_gatk-sv_single_sample_no_melt.swarm
Got error with GATK jar.
Added GATK to the modules list (in swarm command) and re-ran - still got the same error.
Commented out this line in all GATK-SV WDLs (under /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv):
export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override}
Solved the GATK jar error.
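Commenting the export line out of every WDL by hand is tedious and easy to get wrong. A rough sketch of how it could be scripted follows; this is not necessarily how the edit was made here. The function names are hypothetical, and it assumes that prefixing the line with `#` inside the WDL command block is an acceptable way to disable it. `patch_wdls` defaults to a dry run so the affected files can be reviewed first.

```python
from __future__ import annotations

from pathlib import Path

# Start of the line to disable, as quoted in this issue.
TARGET = "export GATK_LOCAL_JAR="


def comment_out_gatk_jar(text: str) -> tuple[str, int]:
    """Prefix each uncommented TARGET line with '# ', keeping indentation.

    Returns the patched text and the number of lines changed.
    """
    out = []
    changed = 0
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip()
        if stripped.startswith(TARGET):
            indent = line[: len(line) - len(stripped)]
            out.append(f"{indent}# {stripped}")
            changed += 1
        else:
            out.append(line)
    return "".join(out), changed


def patch_wdls(root: str, dry_run: bool = True) -> dict[str, int]:
    """Apply comment_out_gatk_jar to every *.wdl file under root.

    Returns {path: lines_changed} for files containing the target line.
    Files are only rewritten when dry_run is False.
    """
    results = {}
    for path in sorted(Path(root).rglob("*.wdl")):
        patched, n = comment_out_gatk_jar(path.read_text())
        if n:
            results[str(path)] = n
            if not dry_run:
                path.write_text(patched)
    return results
```

Running `patch_wdls(root)` first and inspecting the returned dictionary shows which WDLs would change; already-commented lines are skipped, so the script is safe to re-run.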
Got error with Google container registry (ubuntu image); tested again after initializing gcloud:
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405
sinteractive # Can't load google-cloud-sdk on login nodes
module load google-cloud-sdk
gcloud init # Follow prompts - login ([email protected]), select project (nih-nci-dceg-covnet-wgs)
swarm run_gatk-sv_single_sample_no_melt.swarm
Same error.
Update Docker inputs to avoid using GCR ubuntu container
Updated values file to avoid the problematic container:
/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv/inputs/values_20240405/dockers.json
Replaced:
"linux_docker": "marketplace.gcr.io/google/ubuntu1804",
With:
"linux_docker": "ubuntu:18.04",
Re-ran build_inputs.py and submitted a new test run; this run succeeded.
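The replacement above can also be made programmatically, which helps if the values directory is regenerated later. The sketch below assumes dockers.json is a flat JSON object mapping names to image strings, as the `"linux_docker"` line suggests; the function names are hypothetical.

```python
import json
from pathlib import Path


def replace_docker_image(values: dict, key: str, new_image: str) -> dict:
    """Return a copy of values with values[key] set to new_image.

    Raises KeyError if the key is absent, so a typo in the key name does not
    silently add a new, unused entry.
    """
    if key not in values:
        raise KeyError(f"{key!r} not found in docker values")
    updated = dict(values)
    updated[key] = new_image
    return updated


def update_dockers_json(path: str, key: str = "linux_docker",
                        new_image: str = "ubuntu:18.04") -> dict:
    """Rewrite the JSON file at path in place with one image swapped."""
    p = Path(path)
    values = json.loads(p.read_text())
    updated = replace_docker_image(values, key, new_image)
    p.write_text(json.dumps(updated, indent=2) + "\n")
    return updated
```

After updating the file, build_inputs.py still has to be re-run so that the generated workflow inputs pick up the new image.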
Results
Runtime
Runtime estimate from the example Terra workspace (I assume this is based on GCP preemptible instances):

Note that the 18 GB file size is based on a ~30x CRAM file; here we used a 36x BAM file (77 GB).
Runtime on Biowulf: 9 hours 13 minutes (based on the main swarm job runtime, and the timestamps on the output directory).
Output structure
An explanation of the output can be found on the Terra workspace for the GATK-SV single-sample pipeline:

Results from this run: /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/gatk-sv-results
QC checks
The GATK-SV single-sample pipeline has several built-in QC checks. This sample mostly passed the QC checks, which is a good sign (I've tried running this pipeline in the past, and my PCR+ samples failed the QC checks horribly). However, it was flagged for certain SV counts being outside the 'normal' ranges defined by the pipeline developers:

In particular, the count of deletions >100 kb is extremely high. I will investigate this further.