Contamination issue #7
Hi Sebastien, very informative comment. I think this can be easily fixed.
The current pipeline uses samtools v1.9, so we might need to figure out a compatible pysam version. I can take a stab at fixing it, and I might need your help to test. Chenhao
Pushed to my own fork here. Could you help to test?
Hey, I'm having some difficulties running the script on my AWS instance; it looks like the pipeline is not functional with the latest Nextflow versions. I've asked someone who used the older version on NSCC to test it for me, so hopefully this should work. If you want to quickly try on your side, you can use this sample. The previous version should remove more than 60% of the reads, while the new one should remove less than 1%. I've been using the latest T2T reference genome (GCF_009914755.1), but there should not be a lot of variation if you keep using hg19. Best,
Hi Chenhao, it looks like your pipeline can't be used with the latest version of Nextflow, and we are having issues installing the latest pysam version on NSCC. Would you be able to update your Nextflow script to make it work with recent Nextflow versions? Regards,
Hi, I need to know the exact issue, because I have no problem with my Nextflow version.
Hey, I'm using the latest Nextflow version (22.10.1). First I had to change the tuple declarations:
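For reference, the kind of declaration change recent Nextflow versions require looks roughly like this (a sketch with hypothetical variable and channel names, not the repo's actual code; recent releases default to DSL2, where every tuple element must be qualified):

```nextflow
// Old style, accepted by older Nextflow releases (DSL1):
//     input:
//     tuple sample_id, file(reads) from reads_ch
// New style required by recent versions (DSL2): qualify each tuple element
input:
    tuple val(sample_id), path(reads)
```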
But even with this I hit an issue:
I could try to downgrade my version to yours, but I'm not sure that's the best option. Thanks,
Never mind, I'm able to run the pipeline now. I will update you once I have results.
OK, I confirm it's working, but I still had to modify the tuple information for the latest Nextflow version (it's a minor fix though). If someone needs to quickly check again:
Using this command:
Values may vary a bit depending on your reference sequence. Thank you Chenhao.
Thank you for the feedback Sebastien. I will work out an updated pipeline for better compatibility. |
Hi Sebastien, I recently encountered a huge loss of reads using the same approach here compared to KneadData. I wonder what insert size you got for your sample. The ones I tested have insert sizes of 200-350. Chenhao
Hey, I tested multiple samples with different insert sizes. Usually the insert size was ~300 bp, but sometimes lower. With 150 bp paired-end reads, this means a very low or even negative inner distance. The negative inner distance (meaning overlap between the two mates) is not the primary source of the error; it just makes it worse. The main issue is that with default BWA parameters, a 19 bp seed is not hard to find against a human genome. JS
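The 19 bp seed point can be checked with a quick back-of-envelope calculation (my own sketch, assuming a uniform random genome model, which if anything underestimates hits against a real, repeat-rich genome):

```python
# Expected number of random exact k-mer matches between one 150 bp read
# and the human genome (~3.1 Gb, both strands), under an iid base model.
def expected_seed_hits(read_len, k, genome_size=3.1e9):
    kmers_per_read = read_len - k + 1
    # Probability-weighted count of genome positions matching a given k-mer:
    hits_per_kmer = 2 * genome_size / 4**k
    return kmers_per_read * hits_per_kmer

print(round(expected_seed_hits(150, 19), 1))  # → 3.0 (default BWA-MEM seed: ~3 random hits per read)
print(expected_seed_hits(150, 31) < 1e-6)     # → True (a longer seed makes random hits negligible)
```

So at the default seed length, essentially every bacterial read is expected to find a few spurious exact seeds in the human genome, which BWA-MEM can then try to extend.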
I found this paper: Do you think we should just use Bowtie2 for this purpose?
I ran a quick experiment:
These 4 rows correspond to using KneadData (customized Bowtie2), fastp+Bowtie2+samtools+KneadData database, fastp+Bowtie2+samtools+GRCm38 database, and fastp+BWA+samtools+GRCm38. This dataset is a low-biomass sample with a lot of host DNA.
With Niranjan, we ended up agreeing on using BWA with more stringent filtering applied directly on the BAM file. I don't think one mapper is very different from the other in this case. While they do have different approaches, the main problem I've seen is the difference in alignment encoding, with BWA being more lenient in accepting a pair as properly paired, while Bowtie2 and minimap2 look more stringent (and correct). Rather than which mapper you're using, what matters more is which options you're using and how you post-process your alignment. We sometimes forget that mappers were originally designed for isolate and (human) population genomics, not for decontamination or multi-mapping. But maybe we are overthinking this a bit too much. Just increasing the BWA seed length was enough to fix most of the issue: It seems that JS
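Concretely, the seed-length fix amounts to something like the following (a sketch only; file names, thread count, and the exact `-k` value are placeholders, not the repo's actual command):

```shell
# Raise BWA-MEM's minimum seed length (-k, default 19) so that short,
# spurious seeds against the host genome are no longer found and extended.
bwa mem -t 8 -k 31 hg19.fa sample_R1.fq.gz sample_R2.fq.gz > host_aln.sam
```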
Hey Chenhao, somehow I found out that the new version is not behaving correctly with some samples, leaking some host reads through the decontamination process. Until I find the reason why, I think it would be safer to revert to the previous version.
Hi Chenhao,
we discovered an issue with the contamination pipeline from this repo. In short, BWA's default seed length (k=19) allows 'random' alignments of bacterial reads against the human genome. This issue can be substantial when the insert size is low and read overlap is high. I have some samples with more than 50% of raw reads misclassified as human because of this. I can send you an email with slides for more info if you want.
Since some people in the lab keep using your pipeline for their analyses, I think it would be a nice idea to fix the issue in your original repo and update the CSB5 one once done. The issue is that I don't know much about Nextflow; I'm only using Snakemake.
To fix this, I made a very simple Python script to filter reads considered as human based on (subjective) 70% coverage and 70% identity thresholds. Based on my analysis, this removes most of the potential false positives.
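The script itself isn't attached to this thread, but the core of such a filter can be sketched in pure Python (my own sketch, not the actual script, assuming the per-read CIGAR string and NM tag are available, e.g. via pysam; thresholds as described above):

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def coverage_identity(cigar, nm):
    """Query coverage and alignment identity from a CIGAR string and the NM tag."""
    lens = dict.fromkeys("MIDNSHP=X", 0)
    for n, op in _CIGAR_OP.findall(cigar):
        lens[op] += int(n)
    aligned = lens["M"] + lens["I"] + lens["="] + lens["X"]  # query bases in the alignment
    read_len = aligned + lens["S"]                           # soft-clipped bases still belong to the read
    columns = lens["M"] + lens["D"] + lens["I"] + lens["="] + lens["X"]  # alignment columns
    cov = aligned / read_len if read_len else 0.0
    ident = (columns - nm) / columns if columns else 0.0
    return cov, ident

def looks_human(cigar, nm, min_cov=0.7, min_ident=0.7):
    """Keep the 'human' call only if both coverage and identity pass the thresholds."""
    cov, ident = coverage_identity(cigar, nm)
    return cov >= min_cov and ident >= min_ident

print(looks_human("150M", 3))      # → True  (full-length alignment at 98% identity)
print(looks_human("30M120S", 0))   # → False (only 20% of the read aligned)
```

A short, heavily-clipped or low-identity hit against the human genome then stays in the microbial fraction instead of being discarded.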
For speed purposes, I did not implement a second check of whether the mate is unmapped or not, but this can be handled with samtools fastq using the -s option. Here is my snakemake shell line: Do you think you could quickly update your repo to reflect those changes? Let me know if you're too busy and I will try to see how to do it on my own.
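For the mate handling mentioned above, the samtools fastq step could look roughly like this (a sketch with placeholder file names, not the actual Snakemake rule; `-s` writes singleton reads, i.e. reads whose mate was filtered out, to a separate file):

```shell
# Name-sort so samtools fastq can pair mates, then split proper pairs from
# singletons whose mate was removed by the filter; file names are placeholders.
samtools sort -n -o filtered.nsorted.bam filtered.bam
samtools fastq -1 clean_R1.fq.gz -2 clean_R2.fq.gz -s singletons.fq.gz filtered.nsorted.bam
```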
Regards,
JS