Single-sample pipeline testing on Biowulf #1
"Clean" test to check GitHub repo

I repeated the single-sample test to ensure that all the necessary changes are synced with GitHub, since some local changes hadn't been pushed yet. I used a new base directory. The example inputs require the Broad resource files stored in

Set up new test run:

BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
${BASE_DIR}/scripts/inputs/build_inputs.py \
${BASE_DIR}/inputs/values \
${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
${BASE_DIR}/inputs/build_20240514/CCSS_1000278_A_newtest \
-a '{ "single_sample" : "test_single_sample_CCSS_1000278_A", "ref_panel" : "ref_panel_1kg" }'
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023
swarm run_gatk-sv_single_sample_no_melt_20240514.swarm

UPDATE: This test failed due to cromwell-related errors. I tested with a different config file (biowulf-swarm.conf instead of biowulf-core.conf), which resolved the problem. The osteo-testing branch is up to date and succeeded for at least one test run, but other test runs failed, as described below.
Setting up inputs for multiple runs of the single-sample pipeline

This is an example of the process I followed to set up multiple runs of the single-sample pipeline for the 95 osteo WGS samples. We'd want to improve this process if we decide we'll be running the single-sample pipeline a lot (only applicable for small batches of samples; for batches of >100, we would use cohort mode).

Create a CSV file with sample names and BAM info:

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json
ls /data/DCEG_Pediatric_Sarcomas/GenCompass/pediatric_sarcoma_analysis_id/workflow_results/fq2bam/*/*.bam > bam_list.txt
echo "NAME,SAMPLE_ID,BAM_CRAM" > bam_info.csv
cat bam_list.txt | while read -r bampath; do
    samplename=$(basename "$bampath" .bam)
    echo "${samplename}" >> sample_list.txt   # Will use this sample list in later commands
    echo "${samplename},${samplename},${bampath}" >> bam_info.csv
done
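For illustration, here is what the loop above produces, run on a made-up BAM path (OSTE_EXAMPLE_A and its path are hypothetical; demo filenames are used so nothing real is overwritten):

```shell
# Illustration of the CSV format the loop above produces.
printf '%s\n' "/data/demo/fq2bam/OSTE_EXAMPLE_A/OSTE_EXAMPLE_A.bam" > bam_list_demo.txt
echo "NAME,SAMPLE_ID,BAM_CRAM" > bam_info_demo.csv
while read -r bampath; do
    samplename=$(basename "$bampath" .bam)   # strip directory and .bam suffix
    echo "${samplename},${samplename},${bampath}" >> bam_info_demo.csv
done < bam_list_demo.txt
cat bam_info_demo.csv
# NAME,SAMPLE_ID,BAM_CRAM
# OSTE_EXAMPLE_A,OSTE_EXAMPLE_A,/data/demo/fq2bam/OSTE_EXAMPLE_A/OSTE_EXAMPLE_A.bam
```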
rm bam_list.txt

Run Ben's Python script (slightly modified) to create input JSON files:

# Create a copy of the values directory to output the new json files
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs
cp -r values values_20240514_multisample_test
cd test_create_multisample_json
python gatk-sv_batch_input.py --input bam_info.csv --template single_sample_input_template.json --output_directory ../values_20240514_multisample_test/

Run build_inputs.py (one command per sample) to create build files:

# Example for one sample:
BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
${BASE_DIR}/scripts/inputs/build_inputs.py \
${BASE_DIR}/inputs/values_20240514_multisample_test \
${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
${BASE_DIR}/inputs/build_20240514_multisample_test/OSTE_OSETO0001579_A \
-a '{ "single_sample" : "OSTE_OSETO0001579_A_input", "ref_panel" : "ref_panel_1kg" }'
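For context on the values-generation step above, what gatk-sv_batch_input.py does amounts to filling a JSON template once per CSV row. This is a hypothetical sketch only: the `__SAMPLE__`/`__BAM__` placeholders, JSON keys, and demo filenames are made up for illustration and may not match the real template.

```shell
# Toy CSV standing in for the bam_info.csv built earlier.
echo "NAME,SAMPLE_ID,BAM_CRAM" > bam_info_demo.csv
echo "OSTE_EXAMPLE_A,OSTE_EXAMPLE_A,/data/demo/OSTE_EXAMPLE_A.bam" >> bam_info_demo.csv

# Hypothetical template with made-up placeholders.
cat > template_demo.json <<'EOF'
{ "sample_id": "__SAMPLE__", "bam_or_cram_file": "__BAM__" }
EOF

# Fill the template once per CSV row, writing one values file per sample.
mkdir -p values_demo
tail -n +2 bam_info_demo.csv | while IFS=, read -r name sample bam; do
    sed -e "s|__SAMPLE__|${sample}|" -e "s|__BAM__|${bam}|" template_demo.json \
        > "values_demo/${sample}_input.json"
done
cat values_demo/OSTE_EXAMPLE_A_input.json
```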
### Loop across multiple samples:
BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
sample_list=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json/sample_list.txt
cat $sample_list | while read -r samplename; do
alias_string='{ "single_sample" : "'${samplename}'_input", "ref_panel" : "ref_panel_1kg" }'
${BASE_DIR}/scripts/inputs/build_inputs.py \
${BASE_DIR}/inputs/values_20240514_multisample_test \
${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
${BASE_DIR}/inputs/build_20240514_multisample_test/${samplename} \
-a "$alias_string"
done

And, finally, write a swarm command for each sample:

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023
cat $sample_list | while read -r samplename; do
echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-core.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test.swarm
done
## UPDATE: use biowulf-swarm.conf instead of biowulf-core.conf (avoid cromwell errors caused by multiple jobs trying to write at the same time)
cat $sample_list | while read -r samplename; do
echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-swarm.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfig.swarm
done
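The two generation loops above differ only in the cromwell config file and the swarm filename, so if we keep regenerating swarm files they could be factored into one function. This is a sketch only (paths copied from the commands above; the demo call uses a made-up two-sample list):

```shell
BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv

# make_swarm CONFIG_BASENAME SAMPLE_LIST OUTPUT_SWARM
make_swarm() {
    local conf="$1" sample_list="$2" out="$3"
    : > "$out"   # start from an empty swarm file
    while read -r samplename; do
        echo "java -Dconfig.file=${BASE_DIR}/config/${conf} -jar \$CROMWELL_JAR run ${BASE_DIR}/wdl/GATKSVPipelineSingleSample.wdl -o ${BASE_DIR}/options.json -i ${BASE_DIR}/inputs/build_20240514_multisample_test/${samplename}/GATKSVPipelineSingleSample.no_melt.json" >> "$out"
    done < "$sample_list"
}

# Demo with a made-up two-sample list (real runs would pass sample_list.txt):
printf '%s\n' OSTE_EXAMPLE_A OSTE_EXAMPLE_B > toy_sample_list.txt
make_swarm biowulf-swarm.conf toy_sample_list.txt toy.swarm
```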
# Manually added header to swarm file
## Submit first 3 samples in a test swarm
swarm run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfig_first3.swarm
Results from testing multiple single-sample runs

Using the swarm file I created (see previous comment), I ran two sets of samples through the single-sample pipeline. (I submitted the first set, then the second set several hours later.)

Most of these samples failed (more details below). Of the 13 samples I submitted, only the first sample succeeded. Also, Biowulf notified me of a "short job" warning (I think I would have received this warning even if the jobs had succeeded):
First set (n=3 samples)

26632587_0 - Succeeded (11 hr 11 min). Note this was the same sample from my original test.

26632587_1 - Canceled - the swarm job got stuck at AnnotateVcf and continued printing repetitive errors, even though no subjobs were running. There were no subjob failures. This is the first error that appeared:
26632587_2 - This sample failed at the gCNV step:
Second set (n=10 samples)

All 10 samples failed with similar errors. All stopped at the CombineBatches step of MakeCohortVcf. None of the subjobs failed, but the workflow stopped running. Example error from
I reviewed the other swarm logs, which all had similar errors:
Test multiple runs again with unique cromwell database

Ben developed a new config that creates a unique cromwell database per swarm job. I copied this config to

Create a new swarm file:

sample_list=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json/sample_list.txt
cat $sample_list | while read -r samplename; do
echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-cromwelldb-slurm-id.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfigSlurmID.swarm
done

I also increased the threads per swarm job to 32 because the swarm jobs have been using more than the 8 allocated CPUs. New swarm header:

#SWARM --logdir swarm_logs
#SWARM --threads-per-process 8
#SWARM --gb-per-process 50
#SWARM --time 24:00:00
#SWARM --module cromwell,singularity,GATK
#SWARM --sbatch "--export SINGULARITY_CACHEDIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/singularity_cache"

Tested two samples; both stalled due to the checkalive issue.

Update config to use squeue to check if job is alive

Switched to a new config file to avoid the job-hanging issue. It uses squeue instead of dashboard_cli to check job status.

cat $sample_list | while read -r samplename; do
echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-cromwelldb-slurm-id-checkalive.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfigSlurmID_checkAlive.swarm
done

Test runs and errors

Submitted 3 samples, which all succeeded. Then submitted 10 additional samples: 6 succeeded, but 4 failed during GenotypeBatch because the Docker request limit was exceeded. Example (swarm_27104311_5.o):
Additionally, I again received "Short job" warnings from Biowulf:
Runtime

Of the 9 samples that succeeded, total runtimes were ~1-2 days (much longer than previous tests). Runtime seems to vary with how busy Biowulf is, because cromwell spins up many short jobs that have to wait in the queue. I recommend setting a walltime limit of at least 72 hours for the swarm.

QC checks

Will post results of the QC review later.
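Applying the 72-hour walltime recommendation above to the swarm header shown earlier would change only the time line, e.g.:

```shell
#SWARM --time 72:00:00
```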
Independent user test of GATK-SV single-sample pipeline on Biowulf
Ben previously configured the GATK-SV single-sample pipeline for Biowulf and tested using a COVNET WGS sample. I ran an independent test using one of the osteosarcoma WGS samples, CCSS_1000278_A.
The single-sample pipeline can process a single test sample jointly with a reference panel. It reduces computational time somewhat by using certain precomputed inputs, but this mode is still much less computationally efficient than the cohort/batch mode (best used for 100+ samples). Here we used the reference panel of 156 samples from 1000 Genomes, which is provided by GATK-SV (the same panel is used in the example Terra workspace).
Note that the single-sample pipeline generally will only work for PCR-free WGS samples, in my experience.
Working directory:
/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/
Build inputs
Help message for build_inputs.py:
Example from GATK-SV documentation:
Ben's example:
Setup for my test run:
Run pipeline and troubleshoot
First test run:
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405
swarm run_gatk-sv_single_sample_no_melt.swarm
Got error with GATK jar.
Added GATK to the modules list (in swarm command) and re-ran - still got the same error.
Commented this line out of all GATK-SV WDLs (here: /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv):

export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override}

This solved the GATK jar error.
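The WDL edit above was done by hand; if it needs redoing after a repo update, the same change could be scripted. A sketch using GNU sed on a toy WDL fragment (the real edit would target the WDLs under the gatk-sv directory above; the demo path is made up):

```shell
# Toy WDL fragment standing in for the real GATK-SV WDLs.
mkdir -p wdl_demo
cat > wdl_demo/example.wdl <<'EOF'
        export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override}
EOF

# Comment the export out in place, preserving indentation (GNU sed -i).
sed -i 's|^\(\s*\)\(export GATK_LOCAL_JAR=\)|\1# \2|' wdl_demo/example.wdl
cat wdl_demo/example.wdl
```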
Got error with Google container registry (ubuntu image); tested again after initializing gcloud:
Same error.
Update Docker inputs to avoid using GCR ubuntu container
Updated values file to avoid the problematic container:
/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv/inputs/values_20240405/dockers.json
Replaced:
"linux_docker": "marketplace.gcr.io/google/ubuntu1804",
With:
"linux_docker": "ubuntu:18.04",
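The same swap can be scripted rather than hand-edited, which helps if the values files get regenerated. A sketch with sed against a toy dockers.json (the real file lives at the path above; the demo filename is made up):

```shell
# Toy dockers.json standing in for inputs/values_20240405/dockers.json.
cat > dockers_demo.json <<'EOF'
{
  "linux_docker": "marketplace.gcr.io/google/ubuntu1804"
}
EOF

# Swap the GCR ubuntu image for the equivalent Docker Hub tag.
sed -i 's|marketplace.gcr.io/google/ubuntu1804|ubuntu:18.04|' dockers_demo.json
cat dockers_demo.json
```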
Re-ran build_inputs.py and submitted a test run; the latest run succeeded.
Results
Runtime
Runtime estimate from the example Terra workspace (I assume this uses GCP preemptible instances):
Note the 18 GB file size is based on a ~30x CRAM file; here we used a 36x BAM file (77 GB).
Runtime on Biowulf: 9 hours 13 minutes (based on the main swarm job runtime, and the timestamps on the output directory).
Output structure
An explanation of the output can be found on the Terra workspace for the GATK-SV single-sample pipeline:
Results from this run:
/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/gatk-sv-results
QC checks
The GATK-SV single-sample pipeline has several built-in QC checks. This sample mostly passed the QC checks, which is a good sign (I've tried running this pipeline in the past, and my PCR+ samples failed the QC checks horribly). However, it was flagged for certain SV counts being outside the 'normal' ranges defined by the pipeline developers:
In particular, the count of deletions >100 kb is extremely high. I will investigate this further.