First version #1

Open
wants to merge 26 commits into
base: master

gavin-peng
Collaborator

No description provided.

@gavin-peng
Collaborator Author

see ticket https://jira.oicr.on.ca/browse/GRD-833

@pruzanov pruzanov left a comment

Since this is a first implementation, I would strongly recommend avoiding separate wdl files. They will add maintenance headaches, especially given that the files are almost identical. In this case we could introduce a mode variable which would route the execution to one of two tasks: one for PERL mode and another for R mode. We had a similar approach implemented in the bclconvert wf, which allows choosing between hpc and dragen modes.

@lheisler
lheisler commented Jan 9, 2025

I'm also inclined to choose and implement just one of the versions, Perl or Java. I'm working with CHUM now to get their vardict output, and we can run on a test sample to see what each version generates. I'll also investigate further whether a specific version is clearly being used in the genpipes pipelines that CHUM is running.

@gavin-peng
Collaborator Author

Yes, we are in the process of choosing between the Perl and Java versions. For small inputs they generate identical output; the problem is that for large inputs both run out of memory. I have tested the Java version multiple times with increasing memory. The last attempt used 512G of job memory with > 300G of inputs from CHUM and still failed because of memory after 60h 32m.
The vardict GitHub page claims the Java version is 10X faster, but so far I haven't observed a large difference in speed.

@gavin-peng
Collaborator Author

> Since this is a first implementation, I would strongly recommend avoiding separate wdl files. They will add maintenance headaches, especially given that the files are almost identical. In this case we could introduce a mode variable which would route the execution to one of two tasks: one for PERL mode and another for R mode. We had a similar approach implemented in the bclconvert wf, which allows choosing between hpc and dragen modes.

The Perl version has now been removed.

@gavin-peng gavin-peng closed this Jan 20, 2025
@gavin-peng gavin-peng reopened this Jan 20, 2025
@gavin-peng gavin-peng requested a review from pruzanov January 20, 2025 18:16
@pruzanov pruzanov left a comment

I approved, but I recommend modifying the calculate.sh test script.

tests/calculate.sh (outdated, resolved)
vardict.wdl (outdated, resolved)
@lheisler

Include the parameter
-th 8
which lets the software use multiple slots on the node.

I tested this on one small dataset and it reduced the processing time from 166m to 33m.

@lheisler
lheisler commented Feb 10, 2025

Even with threading, this runs for a very long time.

For vardict, and for the neoantigen pipeline in particular, we should be okay to exclude intergenic regions.

I tested this with a bed file from UCSC knownGene, split by chromosome, and a simplified command.
Jobs completed in 3 m to 1258 m (about 21 h).

The test command was:
varDict -G REF -N MoHQ-CM-1-180.DT -b 'Tbam|Nbam' -c 1 -S 2 -E 3 -g 4 -th 8 -D -y knownGene.chr.bed
(The above is extracted from the current workflow command, which pipes to additional filtering commands. I removed the pipes for testing purposes.)

My tests were with knownGene, but I think we can use the same bed file that is being used for TMB assessment:

https://bitbucket.oicr.on.ca/projects/GSI/repos/interval-files/browse/accredited/MANE_Select_v1.3.bed
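
For anyone wanting to reproduce the per-chromosome test, here is a minimal Python sketch of the split-and-run idea, not the workflow's actual implementation. The function names `split_bed_by_chromosome` and `vardict_command` and the output file naming are hypothetical. The flags are copied from the test command as posted and are not otherwise checked here.

```python
from collections import defaultdict
from pathlib import Path


def split_bed_by_chromosome(bed_path, out_dir):
    """Write one BED file per chromosome and return {chromosome: path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_by_chrom = defaultdict(list)
    with open(bed_path) as handle:
        for line in handle:
            # Skip blank lines and header lines that BED exports may carry.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom = line.split(None, 1)[0]  # first column is the chromosome
            lines_by_chrom[chrom].append(line)
    paths = {}
    for chrom, lines in lines_by_chrom.items():
        # Hypothetical naming scheme: <input stem>.<chromosome>.bed
        path = out_dir / f"{Path(bed_path).stem}.{chrom}.bed"
        path.write_text("".join(lines))
        paths[chrom] = path
    return paths


def vardict_command(ref, sample, tumor_bam, normal_bam, bed, threads=8):
    """Build an argument list mirroring the test command quoted above."""
    return [
        "varDict", "-G", str(ref), "-N", sample,
        "-b", f"{tumor_bam}|{normal_bam}",
        "-c", "1", "-S", "2", "-E", "3", "-g", "4",
        "-th", str(threads), "-D", "-y", str(bed),
    ]
```

Each returned list could be passed to `subprocess.run` or submitted as a separate cluster job. Because the arguments are passed as a list rather than through a shell, the `Tbam|Nbam` value needs no extra quoting.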

@gavin-peng
Copy link
Collaborator Author

I am now running vardict with -th 8 and the UCSC knownGene bed file, and it is much faster: chrY completed in 28 minutes, versus 33h 22m last time. The run is still in progress, though.
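
To put the timings in this thread on one scale, the following rough Python sketch converts the reported durations to minutes and computes the ratios. The helper names `to_minutes` and `speedup` are made up for illustration. The chrY ratio reflects both the threading and the restriction to the bed file, so it should not be read as a threading speedup alone.

```python
import re


def to_minutes(text):
    """Convert strings such as '33h 22m', '166m' or '28 minutes' to minutes."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(h|m)", text):
        total += int(value) * (60 if unit == "h" else 1)
    return total


def speedup(before, after):
    """Ratio of the earlier run time to the later run time."""
    return to_minutes(before) / to_minutes(after)


print(round(speedup("166m", "33m"), 1))        # small dataset with -th 8: 5.0
print(round(speedup("33h 22m", "28 minutes"), 1))  # chrY, threads plus bed file: 71.5
```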
