Replacing impute2 with Beagle; Support for hg19 and hg38 reference genomes #20

Open
wants to merge 78 commits into base: master
Conversation

@NagaComBio (Member) commented on May 26, 2021

Replacing Impute2 with Beagle

  • Changed the phasing routine: "Beagle" replaces the program "impute2", and files and tools were renamed accordingly.
  • Renamed all tools and files from "imputeGenotype" to "phaseGenotype" (and so on), as the subroutine performs phasing rather than imputation.

Support for hg19 and hg38 reference genomes

  • The hg38 reference genome GRCh38_decoy_ebv_phiX_alt_hla_chr.fa is now supported alongside the older hg19.
  • For now, liftovers are done for the homozygous-deletion artefact and no-control 1 kb window files. Once we have analyzed more samples with hg38, we will replace them.

Summary by CodeRabbit

  • New Features
    • Major version update to 6.0.0 with significant changes including the replacement of the "impute2" program with "Beagle", renaming of tools and files, and addition of new mappability options.
    • Added support for the hg38 genome.
    • Introduced a new CNA type 'AMP'.
  • Bug Fixes
    • Fixed minor bugs in versions 5.1.0 and 5.0.1.
  • Documentation
    • Updated instructions for downloading and processing reference genome files for GRCh38.
    • Updated method of phasing genotypes of SNP positions in the documentation.
  • Refactor
    • Updated paths and added new variables related to Beagle imputation in the ACEseq configuration.
  • Tests
    • Updated dependency version from COWorkflowsBasePlugin 1.2.1 to 1.4.2.
  • Style
    • Updated shebang from #!/usr/bin/python to #!/usr/bin/env python in several scripts for better portability.
  • Chores
    • Added the file "hg19_GRCh37_1000genomes" to the list of ignored patterns.

GWarsow and others added 30 commits November 14, 2018 15:23
…ute2 (including the impute2 binary). Also renamed tools and folders from 'impute' to 'phase' as we are not imputing but using Beagle to phase the haplotypes.
…ls such that there is no mention of the word 'impute' anywhere.
…nces.sh file for the Beagle references. Updated Documentation. Cleaned and renamed other files with respect to impute.
…nce files in the /icgc/ngs_share/assemblies folder
Conflicts:
	resources/configurationFiles/analysisCopyNumberEstimation.xml
@NagaComBio NagaComBio requested a review from tlkaufmann May 26, 2021 09:16
Conflicts:
	installation/downloadReferences.sh
Conflicts:
	ACEseqWorkflow_5.0.1.iml
	ACEseqWorkflow_5.1.0.iml
	ACEseqWorkflow_6.0.0.iml
	README.ACEseqWorkflow.txt
	resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh
	resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh
	resources/configurationFiles/analysisCopyNumberEstimation.xml
@NagaComBio (Member, Author) left a comment

This is mature now and we could start merging this to master/main.

Resolved (outdated) review threads:
  • installation/GRCh38_related_files/calc_gc_content.py (4 threads)
  • installation/GRCh38_related_files/beagle_vcf_to_bref.sh (1 thread)
coderabbitai bot commented Nov 10, 2023

Walkthrough

The changes primarily involve a major version update from 5.0.0 to 6.0.0 for the ACEseq workflow. The update includes replacing the "impute2" program with "Beagle" for phasing genotypes, renaming tools and files, updating dependencies, adding support for the GRCh38 reference genome, and introducing new features and bug fixes. The changes affect various aspects of the workflow, including file handling, data processing, documentation, and scripts for downloading and processing reference genome files.

Changes

File(s) and summary of changes:
  • .gitignore: Added "hg19_GRCh37_1000genomes" to the list of ignored patterns.
  • README.ACEseqWorkflow.txt: Updated the ACEseq workflow to version 6.0.0, replacing "impute2" with "Beagle", renaming tools and files, updating dependencies, adding GRCh38 support, and introducing new features and bug fixes.
  • buildinfo.txt, buildversion.txt: Updated the dependency version and the ACEseq workflow version.
  • documentation/source/methods.rst, documentation/source/parameters.rst: Updated the method of phasing genotypes and the ACEseq configuration.
  • installation/GRCh38_related_files/*: Added scripts and instructions for downloading and processing reference genome files for GRCh38.
  • installation/downloadReferences.sh: Fixed indentation and spacing.
  • resources/analysisTools/copyNumberEstimationWorkflow/*: Updated scripts for the ACEseq workflow, including changes to file handling, data processing, and command executions.
  • src/de/dkfz/b080/co/*: Updated the ACEseq workflow version in the Java and Groovy source code.

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 28

Configuration used: CodeRabbit UI

Commits: Files that changed from the base of the PR and between 9065164 and ab85dc9.
Files ignored due to filter (3)
  • ACEseqWorkflow.jar
  • resources/configurationFiles/analysisCopyNumberEstimation.xml
  • resources/configurationFiles/analysisCopyNumberEstimation_GRCh38.xml
Files selected for processing (57)
  • .gitignore (1 hunks)
  • README.ACEseqWorkflow.txt (1 hunks)
  • buildinfo.txt (1 hunks)
  • buildversion.txt (1 hunks)
  • documentation/source/methods.rst (1 hunks)
  • documentation/source/parameters.rst (4 hunks)
  • installation/GRCh38_related_files/README_downloadReferences_GRCh38.md (1 hunks)
  • installation/GRCh38_related_files/beagle_vcf_to_bref.sh (1 hunks)
  • installation/GRCh38_related_files/calc_gc_content_bed.py (1 hunks)
  • installation/GRCh38_related_files/create_mappability.sh (1 hunks)
  • installation/downloadReferences.sh (3 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/HRD_estimation.R (5 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_CRESTpoints.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_sv_points.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/addMappability.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/add_haplotypes.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/artifact.homoDels.potentialArtifacts.hg38.txt (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/bashLib.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/beagle_create_fake_samples.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/beagle_embed_haplotypes_vcf.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/centromers.hg38.txt (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/clusteredPrunedNormal.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/convertTabToJson.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/createUnphasedFiles.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/datatablePSCBSgaps.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/embed_haplotypes_vcf.py (2 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/environments/tbi-cluster.sh (2 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/estimateHRDScore.sh (3 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/extract_genotype_vcf.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/getFinalPurityPloidy.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/group_genotypes.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/homozygDel.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/manual_pruning.R (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/mergeArtifacts.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_cnv.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_snp.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/parseJson.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/parseVcf.pl (2 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/phasing.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/phasing_X.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/plots.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/pscbs_all.R (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots.R (7 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots_functions.R (3 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/purity_ploidy_estimation_final.R (5 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/removeBreakpoints.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data_homodel.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/smoothData.py (2 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/smoothDataMergeSimilar.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv_all.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/snvMergeFilter.sh (1 hunks)
  • src/de/dkfz/b080/co/CopyNumberEstimationWorkflowPlugin.java (1 hunks)
  • src/de/dkfz/b080/co/aceseq/ACESeqMethods.groovy (2 hunks)
Files not processed due to max files limit (3)
  • src/de/dkfz/b080/co/aceseq/ACESeqWorkflow.java
  • src/de/dkfz/b080/co/aceseq/ACEseqConstants.java
  • src/de/dkfz/b080/co/files/PhaseGenotypeByChromosome.java
Files skipped from review due to trivial changes (27)
  • .gitignore
  • buildversion.txt
  • installation/GRCh38_related_files/README_downloadReferences_GRCh38.md
  • installation/downloadReferences.sh
  • resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_CRESTpoints.py
  • resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_sv_points.py
  • resources/analysisTools/copyNumberEstimationWorkflow/addMappability.py
  • resources/analysisTools/copyNumberEstimationWorkflow/add_haplotypes.py
  • resources/analysisTools/copyNumberEstimationWorkflow/centromers.hg38.txt
  • resources/analysisTools/copyNumberEstimationWorkflow/convertTabToJson.py
  • resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh
  • resources/analysisTools/copyNumberEstimationWorkflow/extract_genotype_vcf.py
  • resources/analysisTools/copyNumberEstimationWorkflow/getFinalPurityPloidy.py
  • resources/analysisTools/copyNumberEstimationWorkflow/group_genotypes.py
  • resources/analysisTools/copyNumberEstimationWorkflow/mergeArtifacts.py
  • resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_cnv.py
  • resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_snp.py
  • resources/analysisTools/copyNumberEstimationWorkflow/parseJson.py
  • resources/analysisTools/copyNumberEstimationWorkflow/plots.sh
  • resources/analysisTools/copyNumberEstimationWorkflow/pscbs_all.R
  • resources/analysisTools/copyNumberEstimationWorkflow/removeBreakpoints.py
  • resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data.py
  • resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data_homodel.py
  • resources/analysisTools/copyNumberEstimationWorkflow/smoothDataMergeSimilar.py
  • resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv.py
  • resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv_all.py
  • src/de/dkfz/b080/co/CopyNumberEstimationWorkflowPlugin.java
Additional comments: 83
resources/analysisTools/copyNumberEstimationWorkflow/createUnphasedFiles.sh (1)
  • 12-15: Ensure that the Perl script $TOOL_PARSE_VCF can handle the new argument $CHR_PREFIX and that all invocations of this script have been updated accordingly.
resources/analysisTools/copyNumberEstimationWorkflow/bashLib.sh (1)
  • 1-11: The dieWith function is a good addition: it takes an error message and an optional exit code, prints the message to stderr, and exits with that code. Such a helper is good practice for error handling in bash scripts; a sketch follows below.
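A minimal sketch of such a helper, assuming the message-plus-optional-exit-code interface described above (the actual bashLib.sh implementation may differ):

# Sketch only, not copied from bashLib.sh
dieWith() {
    local message="${1:?dieWith requires an error message}"
    local exitCode="${2:-1}"    # optional exit code, defaults to 1
    echo "$message" >&2         # report to stderr, not stdout
    exit "$exitCode"
}

# typical use in a workflow step:
some_command || dieWith "some_command failed" 2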
resources/analysisTools/copyNumberEstimationWorkflow/artifact.homoDels.potentialArtifacts.hg38.txt (1)
  • 1-8: The new data table looks fine. Ensure that the chromosome identifiers, start and end positions, and frequency values are correct and consistent with the rest of your data.
README.ACEseqWorkflow.txt (1)
  • 15-40: The changelog is well-documented and provides a clear understanding of the changes made in each version. Ensure that all the changes mentioned are reflected in the code and the documentation.
documentation/source/methods.rst (1)
  • 2-8: The documentation has been updated to reflect the change in the method of phasing genotypes of SNP positions, switching from Impute2 to Beagle. This change affects the assignment of A and B alleles and impacts the estimation of B-allele frequencies (BAFs) for SNP positions in tumor and control samples.
resources/analysisTools/copyNumberEstimationWorkflow/environments/tbi-cluster.sh (4)
  • 10-13: Ensure that the "beagle" module is available and the version is correctly specified in the environment variable $BEAGLE_VERSION.

  • 15-15: The path to the virtual environment has been changed. Make sure the new path is correct and the virtual environment is properly set up at this location.

  • 38-39: Ensure that the "bcftools" module is available and the version is correctly specified in the environment variable $BCFTOOLS_VERSION.

  • 41-42: Ensure that the "java" module is available and the version is correctly specified in the environment variable $JAVA_VERSION.

buildinfo.txt (1)
  • 1-1: Ensure that the updated version of COWorkflowsBasePlugin (1.4.2) is compatible with the current versions of RoddyAPIVersion (3.2), GroovyVersion (2.4), and JDKVersion (1.8). Also, verify that all the dependencies of COWorkflowsBasePlugin 1.4.2 are met in the current environment.
resources/analysisTools/copyNumberEstimationWorkflow/snvMergeFilter.sh (1)
  • 28-28: The --comment chr option is added to the ${TABIX_BINARY} command. Ensure that this change is compatible with the expected input and output of the ${TABIX_BINARY} command in the context of the overall workflow.
- ${TABIX_BINARY} -s 1 -b 2 -e 2 $tmp_snp_filename
+ ${TABIX_BINARY} -s 1 -b 2 -e 2 --comment chr $tmp_snp_filename
resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh (1)
  • 6-20: The changes to the script properly ensure variable expansion and command execution: curly braces for variable interpolation and quotes around all variable references are shell-scripting best practices that prevent word splitting and pathname expansion. A small illustration follows below.
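For illustration only (not code from the PR), the difference quoting makes when a value contains whitespace or glob characters:

windowFile="all cnv windows.tab"    # hypothetical path containing spaces
wc -l $windowFile                   # unquoted: word-splits into three arguments and fails
wc -l "${windowFile}"               # quoted with braces: passed as a single argument, as intended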
src/de/dkfz/b080/co/aceseq/ACESeqMethods.groovy (4)
  • 27-31: The getGlobalJobSpecificParameters function now returns a LinkedHashMap with only one key-value pair. Ensure that this change does not affect other parts of the code that use this function.

  • 33-33: The generateCNVSNPs function has its return type changed to CnvSnpGeneratorResultByType. Make sure that all calls to this function are updated to handle the new return type.

  • 55-64: The imputeGenotypes function is renamed to phaseGenotypes, and its return type is changed to PhaseGenotypeByChromosome. Ensure that all calls to this function are updated to match the new function name and handle the new return type.

  • 80-86: The imputeGenotypeX function is renamed to phaseGenotypeX. Ensure that all calls to this function are updated to match the new function name.

resources/analysisTools/copyNumberEstimationWorkflow/smoothData.py (1)
  • 1-1: The shebang has been updated to use #!/usr/bin/env python for better portability across different environments. This is a good practice as it allows the script to run in environments where Python might not be installed in the standard location.
resources/analysisTools/copyNumberEstimationWorkflow/estimateHRDScore.sh (4)
  • 19-25: The variable mostImportantFile is assigned but never used. If it's not needed, consider removing it to improve code readability.

  • 90-111: Ensure that the new file paths and variables are correctly set and accessible. Also, verify that the new arguments for the script are correctly passed and handled in the script.

  • 129-132: Ensure that the temporary files being moved were created successfully in the previous steps; if their creation fails, the mv command will also fail (see the defensive sketch after this list).

  • 133-133: Ensure that ${combProFile}.tmp is no longer needed before removing it. If it's needed later in the code, this line will cause issues.
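A defensive pattern for the mv/rm concerns in this list (a sketch with hypothetical tmpFile/targetFile names, not the script's current code):

[[ -s "${tmpFile}" ]] || dieWith "expected temporary file ${tmpFile} is missing or empty"
mv "${tmpFile}" "${targetFile}"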

resources/analysisTools/copyNumberEstimationWorkflow/embed_haplotypes_vcf.py (3)
  • 1-1: The shebang line is correctly formatted without leading spaces.

  • 8-10: The function is_comment_line is a good abstraction that improves readability.

  • 26-29: The check for comment lines using the new function improves readability and maintainability.

resources/analysisTools/copyNumberEstimationWorkflow/manual_pruning.R (1)
  • 368-368: Ensure that the Mclust function can handle the verbose argument being a boolean. If it expects a string, you may need to convert beVerbose to a string before passing it to the function.
resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots.R (5)
  • 19-24: The new parameter annotatePlotsWithGenes is added to the specification matrix for the getopt function. Ensure that this parameter is being passed correctly wherever this script is being called.

  • 66-70: The annotatePlotsWithGenes parameter is being converted to a boolean value. This is a good practice for clarity and type safety.

  • 107-110: The annotatePlotsWithGenes parameter is added to the plotChromosomes function. Ensure that all calls to this function throughout the codebase have been updated to match the new signature.

  • 129-132: The annotatePlotsWithGenes parameter is being used in the plotTCN function call within the plotChromosomes function. Ensure that the plotTCN function has been updated to handle this parameter.

  • 284-298: The annotatePlotsWithGenes parameter is being used in the plotChromosomes function call within the loop. This is a good use of the parameter.

installation/GRCh38_related_files/beagle_vcf_to_bref.sh (1)
  • 1-14: This script assumes that the required Java version and the bref3.18May20.d20.jar file are available in the current environment. Please ensure that these dependencies are properly installed and accessible.
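As a hedged sketch of the kind of conversion the script performs, assuming bref3 reads a VCF on stdin and writes bref3 to stdout (check java -jar bref3.18May20.d20.jar help for the exact usage); the reference panel file names below are placeholders:

# Sketch only; the interface and file names are assumptions, not taken from the script
for chr in {1..22} X; do
    zcat "reference_panel.chr${chr}.vcf.gz" \
        | java -Xmx8g -jar bref3.18May20.d20.jar \
        > "reference_panel.chr${chr}.bref3"
done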
installation/GRCh38_related_files/create_mappability.sh (1)
  • 2-2: The set -ue -o pipefail command is a good practice in bash scripts. It ensures that the script exits if any command fails (set -e), if any variable is not set (set -u), and if any part of a pipe fails (set -o pipefail).
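For illustration only (not from the PR), how each flag changes behavior:

set -ue -o pipefail
echo "$NOT_DEFINED"     # -u: expanding an unset variable aborts the script here
false | cat             # -o pipefail: the pipeline's exit status is non-zero, so -e aborts
grep foo missing.txt    # -e: any failing simple command aborts the script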
installation/GRCh38_related_files/calc_gc_content_bed.py (10)
  • 1-5: The script starts with a shebang line and imports necessary modules. It's good practice to include a shebang line at the top of the script, which allows the script to be run as a standalone executable.

  • 6-6: The script checks for Python 3, which is a good practice to ensure compatibility.

  • 8-14: The script uses argparse to handle command-line arguments, which is a standard and recommended way to do so in Python.

  • 16-17: The script converts command-line arguments to appropriate data types. This is a good practice to ensure that the data types are as expected.

  • 19-20: The script initializes an output data list with a header line. This is a good practice for creating output files that are easy to understand.

  • 21-24: The script reads the input file and splits it into lines. This is a standard way to read a file in Python.

  • 25-33: The script calculates chromosome stats. It's good to see that the script is modular and each part of the code is responsible for a specific task.

  • 34-38: The script uses a regular expression to filter chromosome names. This is a good practice to ensure that only valid chromosome names are processed.

  • 39-54: The script calculates GC content for each region and appends the results to the output data list. This is the main functionality of the script and it seems to be implemented correctly.

  • 56-59: The script writes the output data to a file. This is a standard way to write data to a file in Python.

documentation/source/parameters.rst (4)
  • 11-17: The imputeOutputDirectory has been renamed to phasingOutputDirectory. Ensure that all references to this directory in the codebase have been updated accordingly.

  • 42-48: The haplogroupFilePath now uses the phasingOutputDirectory instead of the imputeOutputDirectory. Ensure that the file generation and usage are updated accordingly.

  • 73-81: The FILE_UNPHASED_PRE and FILE_PHASED_GENOTYPE paths now use the phasingOutputDirectory instead of the imputeOutputDirectory. Ensure that the file generation and usage are updated accordingly.

  • 96-108: New variables BEAGLE_REFERENCE_FILE, BEAGLE_REFERENCE_FILE_X, BEAGLE_GENETIC_MAP, and BEAGLE_GENETIC_MAP_X have been introduced. Ensure that these are used correctly in the codebase and that the paths are valid.

resources/analysisTools/copyNumberEstimationWorkflow/phasing.sh (7)
  • 2-2: The script is set to fail on any error or unset variable, which is a good practice for robustness and error handling.

  • 16-33: The script checks if it's running in a no-control workflow and if not, it uses bcftools to generate an unphased VCF file from the control BAM file. Error handling is done properly with a custom function dieWith.

  • 39-43: The script creates a fake sample using a Python script. Error handling is done properly with a custom function dieWith.

  • 44-53: The script sets Java options for Beagle and then runs Beagle for phasing (a hedged sketch of such a call follows this list). Error handling is done properly with the custom dieWith function.

  • 55-59: The script embeds haplotypes into the VCF file using a Python script. Error handling is done properly with a custom function dieWith.

  • 61-65: The script groups haplotypes using a Python script. Error handling is done properly with a custom function dieWith.

  • 67-68: The script moves temporary files to their final destinations. It's a good practice to use temporary files to avoid partial writes in case of errors.
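For the Beagle step noted above, a rough sketch of what such a call can look like, using the variable names from this review and standard Beagle 5.x arguments; this is an assumption, not the exact command in phasing.sh, which may wrap the jar differently:

# Hedged sketch; assumes BEAGLE_BINARY points at the Beagle jar
# Beagle appends .vcf.gz to the out= prefix
java ${JAVA_OPTIONS} -jar "${BEAGLE_BINARY}" \
    gt="${UNPHASED_TWOSAMPLES}" \
    ref="${BEAGLE_REFERENCE_FILE}" \
    map="${BEAGLE_GENETIC_MAP}" \
    out="${PHASED_TWOSAMPLES%.vcf.gz}" \
    nthreads="${BEAGLE_NUM_THREAD}" \
    impute=false \
    || dieWith "Beagle phasing failed"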

resources/analysisTools/copyNumberEstimationWorkflow/HRD_estimation.R (5)
  • 17-30: The new argument cutoff is optional and defaults to 0.7 if not provided. Ensure that all calls to this script throughout the codebase have been updated to match the new signature if necessary.

  • 97-103: The script now excludes chromosomes that have not changed from the merged data frame. It also writes the indices of segments that meet the condition of having a length greater than 15,000,000 and a CNA type of LOH to the contributingSegmentsFile_HRD_File.

  • 144-157: The script now writes the reduced merged data frame to merged_reduced_outputFile. It also calculates the number of segments that meet the condition of having a length greater than 15,000,000 and a CNA type of LOH or DEL in both the original and reduced merged data frames. These indices are written to contributingSegmentsFile_HRD_File and a new file with the suffix .CentromerReduced.txt.

  • 212-218: A new variable LST_SEGMENTS_INDICES is introduced to store indices of segments in the reduced merged data frame that meet a specific condition.

  • 224-233: The script now writes the indices stored in LST_SEGMENTS_INDICES to contributingSegments_LST_File if LSTReduced is greater than 0.

resources/analysisTools/copyNumberEstimationWorkflow/clusteredPrunedNormal.sh (1)
  • 43-45: The script is removing lines containing "NULL" from the file. This could potentially remove valid data if "NULL" appears in the data itself. Ensure that "NULL" only appears in invalid lines.
resources/analysisTools/copyNumberEstimationWorkflow/parseVcf.pl (1)
  • 24-30: The code modifies the first element of @fields by prepending $ARGV[3] if $filehandle{$fields[0]} exists. Ensure that this change doesn't break any dependencies on the original value of $fields[0] elsewhere in the code.
resources/analysisTools/copyNumberEstimationWorkflow/beagle_create_fake_samples.py (5)
  • 1-1: Ensure that the Python interpreter is available in the environment where this script will be run.

  • 3-3: Good use of the argparse module for command-line argument parsing.

  • 5-7: The function is_comment_line is correctly implemented to check if a line is a comment.

  • 8-11: The command-line arguments are correctly parsed and assigned to variables.

  • 16-29: The script correctly modifies the input VCF file by adding a new column and duplicating the 10th column's value to the end of each line.

resources/analysisTools/copyNumberEstimationWorkflow/beagle_embed_haplotypes_vcf.py (2)
  • 33-34: The script replaces 'chr' and 'X' in the chromosome field of the VCF file. This may not be applicable to all VCF files and could lead to incorrect results if the VCF file does not follow this naming convention. Ensure that this operation is necessary and correct for all input files.

  • 46-54: The script modifies the genotype field of the VCF file based on the haplotype data. This operation assumes that the format fields of the VCF and haplotype files are in the same order, which may not always be the case. Consider adding checks to ensure that the format fields match.

resources/analysisTools/copyNumberEstimationWorkflow/purity_ploidy_estimation_final.R (6)
  • 313-317: The new condition "quantile" for dh_Stop is added correctly. Ensure that the new condition is handled wherever dh_Stop is used in the codebase.

  • 540-540: The assignment of limitDHOrig to limitDH is correct. Ensure that limitDHOrig is used correctly in the rest of the codebase.

  • 634-637: The error message and termination if all data points are equal to 6 is a good addition for error handling. Ensure that the error message is clear and informative for the user.

  • 764-772: The addition of new variables plotPurities_DH and plotPurities_NTCN is correct. Ensure that these variables are used correctly in the rest of the codebase.

  • 777-777: The modification of the df.lines data frame to include purity_DH and purity_NTCN is correct. Ensure that these new columns are used correctly in the rest of the codebase.

  • 816-818: The modification of the ggplot function to include new points for purity_DH and purity_NTCN is correct. Ensure that these new points are displayed correctly in the plot.

resources/analysisTools/copyNumberEstimationWorkflow/phasing_X.sh (9)
  • 8-11: Ensure that the isNoControlWorkflow variable is set correctly before this script is run. Also, verify that the FILE_CONTROL_BAM file exists and is accessible.

  • 24-31: The script exits if the patient is not female or klinefelter. Ensure that this behavior is expected and acceptable in all use cases.

  • 33-45: This block of code is executed only if isNoControlWorkflow is false. Ensure that the BCFTOOLS_BINARY, CNV_MPILEUP_OPTS, REFERENCE_GENOME, CHR_NR, FILE_CONTROL_BAM, and BCFTOOLS_OPTS variables are set correctly before this script is run. Also, verify that the BCFTOOLS_BINARY tool is installed and accessible.

  • 47-49: These lines are creating empty files. Ensure that the script has write permissions to the directory where these files are being created.

  • 51-54: Ensure that the PYTHON_BINARY, TOOL_BEAGLE_CREATE_FAKE_SAMPLES, and UNPHASED variables are set correctly before this script is run. Also, verify that the PYTHON_BINARY tool is installed and accessible.

  • 61-70: Ensure that the JAVA_OPTIONS, BEAGLE_BINARY, UNPHASED_TWOSAMPLES, BEAGLE_REFERENCE_FILE_X, PHASED_TWOSAMPLES, BEAGLE_GENETIC_MAP_X, and BEAGLE_NUM_THREAD variables are set correctly before this script is run. Also, verify that the BEAGLE_BINARY tool is installed and accessible.

  • 73-77: Ensure that the PYTHON_BINARY, TOOL_BEAGLE_EMBED_HAPLOTYPES_VCF, PHASED_TWOSAMPLES, UNPHASED, and tmpPhased variables are set correctly before this script is run. Also, verify that the PYTHON_BINARY tool is installed and accessible.

  • 80-84: Ensure that the PYTHON_BINARY, TOOL_GROUP_HAPLOTYPES, tmpPhased, tmpHaploblocks, and minHT variables are set correctly before this script is run. Also, verify that the PYTHON_BINARY tool is installed and accessible.

  • 87-88: These lines are moving temporary files to their final locations. Ensure that the script has write permissions to the directory where these files are being moved.
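Many of these items are "ensure variable X is set" checks; one compact way to enforce them up front (a sketch using the variable names from this list, not code from phasing_X.sh):

# Fail fast if any required variable is unset or empty
for v in BCFTOOLS_BINARY REFERENCE_GENOME CHR_NR FILE_CONTROL_BAM PYTHON_BINARY \
         BEAGLE_BINARY BEAGLE_REFERENCE_FILE_X BEAGLE_GENETIC_MAP_X BEAGLE_NUM_THREAD; do
    [[ -n "${!v:-}" ]] || dieWith "required variable $v is not set"
done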

resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots_functions.R (1)
  • 113-115: The code adds gene labels to the plot. Ensure that the labels do not overlap or clutter the plot, which could make it difficult to read.

Comment on lines 6 to 23

tmp_corrected_windowfile="$FILENAME_GC_CORRECTED_WINDOWS.tmp"

$RSCRIPT_BINARY "$TOOL_CORRECT_GC_BIAS_R \
--windowFile "$cnvSnpOutputDirectory/$PID".all.cnv.chr*.tab \
--timefile "$REPLICATION_TIME_FILE" \
--chrLengthFile $CHROMOSOME_LENGTH_FILE" \
--pid "$PID" \
--email "$EMAIL" \
--outfile "$tmp_corrected_windowfile" \
--corPlot "$FILENAME_GC_CORRECT_PLOT" \
--gcFile "$GC_CONTENT_FILE" \
--outDir "$aceseqOutputDirectory" \
--lowess_f "$LOWESS_F" \
--scaleFactor "$SCALE_FACTOR"
${RSCRIPT_BINARY} "${TOOL_CORRECT_GC_BIAS_R}" \
--windowFile "${cnvSnpOutputDirectory}/${PID}.all.cnv.chr*.tab" \
--timefile "${REPLICATION_TIME_FILE}" \
--chrLengthFile "${CHROMOSOME_LENGTH_FILE}" \
--pid "${PID}" \
--email "${EMAIL}" \
--outfile "${tmp_corrected_windowfile}" \
--corPlot "${FILENAME_GC_CORRECT_PLOT}" \
--gcFile "${GC_CONTENT_FILE}" \
--outDir "${aceseqOutputDirectory}" \
--lowess_f "${LOWESS_F}" \
--scaleFactor "${SCALE_FACTOR}"




Note: This review was outside of the patch, but no patch was found that overlapped with it. Original lines [24-28]

The error handling is good, but it could be improved by providing more specific error messages. For example, you could capture the error output of the command and include it in the error message. This would help with debugging if something goes wrong.

- if [[ $? != 0 ]]
+ if ! output=$(${RSCRIPT_BINARY} "${TOOL_CORRECT_GC_BIAS_R}" ... 2>&1); then
-    echo "Something went wrong during GC correction. Program had non-zero exit status, exiting pipeline...\n\n"
+    echo "Error during GC correction: $output"
     exit 2
  fi

Comment on lines 44 to +47

mv ${tmpSegsHomDel} ${FILENAME_SEGMENTS_W_HOMDEL}

$TABIX_BINARY -f -s 1 -b 2 -e 3 "${FILENAME_SEGMENTS_W_HOMDEL}"
$TABIX_BINARY -f -s 1 -b 2 -e 3 --comment chromosome "${FILENAME_SEGMENTS_W_HOMDEL}"

The --comment chromosome option in the $TABIX_BINARY command is used to specify that lines starting with 'chromosome' are treated as comments and ignored. Ensure that this is the intended behavior and that it doesn't affect the processing of the input file.

- $TABIX_BINARY -f -s 1 -b 2 -e 3 --comment chromosome "${FILENAME_SEGMENTS_W_HOMDEL}"
+ $TABIX_BINARY -f -s 1 -b 2 -e 3 "${FILENAME_SEGMENTS_W_HOMDEL}"


@@ -235,6 +235,6 @@ def merge_first_segment(prior_line, newline):
newline=None

#print last line(s)
out.write( "\t".join( str(prior_line[key]) for key in infile.header ) + "\n" )
out.write( "\t".join( str(prior_line[0][key]) for key in infile.header ) + "\n" )

The prior_line variable is being indexed with 0 which might cause a TypeError if prior_line is not a list. Ensure that prior_line is always a list with at least one element when this line is executed.

- out.write( "\t".join( str(prior_line[0][key]) for key in infile.header ) + "\n" ) 
+ out.write( "\t".join( str(prior_line[key]) for key in infile.header ) + "\n" ) 


Comment on lines +363 to +367
if (runInDebugMode == "true") {
beVerbose = T
} else {
beVerbose = F
}

The conditional assignment to beVerbose can be simplified to a single line. This reduces complexity and improves readability.

-        if (runInDebugMode == "true") {
-          beVerbose = T
-        } else {
-          beVerbose = F
-        }
+        beVerbose = runInDebugMode == "true"


Comment on lines +14 to +17
if args.hap_file.split('.')[-1] == 'gz':
hap_file = gzip.open( args.hap_file, "r" )
else:
hap_file = open( args.hap_file, "r" )

The script opens files but does not close them. This can lead to resource leaks. Consider using the with statement to ensure files are properly closed after they are no longer needed.

open_fn = gzip.open if args.hap_file.endswith(".gz") else open
with open_fn(args.hap_file, "rt") as hap_file:
    # rest of the code, with the .gz case handled above

Comment on lines +38 to +43
while hap_line and int(hap_line[1]) < int(vcf_line[1]):
hap_line = hap_file.readline()

if hap_line:
hap_line = hap_line.rstrip().split()


The script reads lines from the haplotype file in a loop until it finds a line with a position that matches the current VCF line. If the haplotype file is large, this could be inefficient. Consider using a more efficient data structure, such as a dictionary or a binary search tree, to store the haplotype data.

if is_comment_line(vcf_line):
    if vcf_line.startswith("#CHROM"):
        vcf_line = vcf_line.rstrip().split("\t")
        vcf_line.append('sample0')

The "fake" sample name 'sample0' is hardcoded. Consider making this configurable through a command-line argument for flexibility.

- parser.add_argument('--out_file', help='Out .vcf file')
+ parser.add_argument('--out_file', help='Out .vcf file')
+ parser.add_argument('--sample_name', default='sample0', help='Sample name to be added')
...
- vcf_line.append('sample0')
+ vcf_line.append(args.sample_name)


Comment on lines +83 to +117
if (annotatePlotsWithGenes & is.data.frame(geneAnnotations)) {
  genesSubset = geneAnnotations[geneAnnotations$chr==chromosome,]
  genesSubset = genesSubset[order(genesSubset$start),]

  if (nrow(genesSubset)>0) {
    genesSubset$yOffset = 0
    # determine yPos offset
    if (nrow(genesSubset)>1) {
      genesSubset$distance = c(NA,sapply(2:nrow(genesSubset), function(j) {
        genesSubset[(j),"start"] - genesSubset[(j-1),"start"]
      }))
      genesSubset$yHasOffset = genesSubset$distance < 13e6
      genesSubset$yHasOffset[is.na(genesSubset$yHasOffset)] = F
      for (j in 2:nrow(genesSubset)) {
        if (genesSubset[j,"yHasOffset"]) {
          genesSubset[j,"yOffset"] = genesSubset[j-1,"yOffset"]+1
        }
      }
    }

    genesSubset$position = (genesSubset$start+(genesSubset$end - genesSubset$start)/2)/10/xtotal

    if (!unconditionalGeneLabeling) {
      genesSubset.gr = makeGRangesFromDataFrame(genesSubset, keep.extra.columns = T)
      merged = mergeByOverlaps(genesSubset.gr, seg.gr)
      eventsPerGene.list = aggregate(merged$CNA.type, by=list(merged$gene), FUN=paste)$x
      index.affectedSegments = grep(pattern = "DEL|LOH|DUP|HomoDel", x = eventsPerGene.list, perl = T)
      genesSubset = as.data.frame(genesSubset[index.affectedSegments,])
    }
    if (nrow(genesSubset)>0) {
      p <- p + geom_vline( xintercept=genesSubset$position, col = "grey80", lty=5, lwd =0.2 )
      p <- p + geom_text(data=genesSubset, aes(x=position, y=ymaxcov+0.2-0.08*ymaxcov*yOffset, label = gene), cex = 4, col = "red")
    }
  }
}

The code block for gene annotation is quite large and complex. Consider breaking it down into smaller, more manageable functions for better readability and maintainability.

Comment on lines +106 to +111
genesSubset.gr = makeGRangesFromDataFrame(genesSubset, keep.extra.columns = T)
merged = mergeByOverlaps(genesSubset.gr, seg.gr)
eventsPerGene.list = aggregate(merged$CNA.type, by=list(merged$gene), FUN=paste)$x
index.affectedSegments = grep(pattern = "DEL|LOH|DUP|HomoDel", x = eventsPerGene.list, perl = T)
genesSubset = as.data.frame(genesSubset[index.affectedSegments,])
}

The code checks for unconditionalGeneLabeling and then performs operations if it's not set. This could be confusing. Consider reversing the logic for better readability.

- if (!unconditionalGeneLabeling) {
+ if (unconditionalGeneLabeling) {
+   # Skip the operations
+ } else {


Member

Generally, can you replace the ARGV access by something more semantic?

my ($phasingOutDir, $unphasedGenotypesFilePrefix, $unphasedGenotypesFileSuffix, $chrPrefix) = @ARGV;

That would also address CodeRabbit's (correct) remark about the error.

Labels: none yet. Projects: none yet. 4 participants.