Replacing impute2 with Beagle; Support for hg19 and hg38 reference genomes #20

Open
wants to merge 78 commits into base: master
Conversation

@NagaComBio (Member) commented on May 26, 2021

Replacing Impute2 with Beagle

  • Changed the phasing routine: "Beagle" replaces the program "impute2", and files and tools were renamed accordingly.
  • Renamed all tools and files from "imputeGenotype" to "phaseGenotype" (and so on), as the subroutine performs phasing rather than imputation.

Support for hg19 and hg38 reference genomes

  • The hg38 reference genome GRCh38_decoy_ebv_phiX_alt_hla_chr.fa is now supported alongside the older hg19.
  • For now, liftovers are done for the homozygous-deletion artefact and no-control 1 kb window files. Once we have analyzed more samples with hg38, we will replace them.

Summary by CodeRabbit

  • New Features
    • Major version update to 6.0.0 with significant changes including the replacement of the "impute2" program with "Beagle", renaming of tools and files, and addition of new mappability options.
    • Added support for the hg38 genome.
    • Introduced a new CNA type 'AMP'.
  • Bug Fixes
    • Fixed minor bugs in versions 5.1.0 and 5.0.1.
  • Documentation
    • Updated instructions for downloading and processing reference genome files for GRCh38.
    • Updated method of phasing genotypes of SNP positions in the documentation.
  • Refactor
    • Updated paths and added new variables related to Beagle imputation in the ACEseq configuration.
  • Tests
    • Updated dependency version from COWorkflowsBasePlugin 1.2.1 to 1.4.2.
  • Style
    • Updated shebang from #!/usr/bin/python to #!/usr/bin/env python in several scripts for better portability.
  • Chores
    • Added the file "hg19_GRCh37_1000genomes" to the list of ignored patterns.

GWarsow and others added 30 commits November 14, 2018 15:23
…ute2 (including the impute2 binary). Also renamed tools and folders from 'impute' to 'phase' as we are not imputing but using Beagle to phase the haplotypes.
…ls such that there is no mention of the word 'impute' anywhere.
…nces.sh file for the Beagle references. Updated Documentation. Cleaned and renamed other files with respect to impute.
…nce files in the /icgc/ngs_share/assemblies folder
Conflicts:
	resources/configurationFiles/analysisCopyNumberEstimation.xml
@NagaComBio NagaComBio requested a review from tlkaufmann May 26, 2021 09:16
Conflicts:
	installation/downloadReferences.sh
Conflicts:
	ACEseqWorkflow_5.0.1.iml
	ACEseqWorkflow_5.1.0.iml
	ACEseqWorkflow_6.0.0.iml
	README.ACEseqWorkflow.txt
	resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh
	resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh
	resources/configurationFiles/analysisCopyNumberEstimation.xml
@NagaComBio (Member, Author) left a comment

This is mature now and we could start merging this to master/main.

Resolved (outdated) review threads:
  • installation/GRCh38_related_files/calc_gc_content.py (4 threads)
  • installation/GRCh38_related_files/beagle_vcf_to_bref.sh (1 thread)
coderabbitai bot commented Nov 10, 2023

Walkthrough

The changes primarily involve a major version update from 5.0.0 to 6.0.0 for the ACEseq workflow. The update includes replacing the "impute2" program with "Beagle" for phasing genotypes, renaming tools and files, updating dependencies, adding support for the GRCh38 reference genome, and introducing new features and bug fixes. The changes affect various aspects of the workflow, including file handling, data processing, documentation, and scripts for downloading and processing reference genome files.

Changes

File(s) and summary of changes:
  • .gitignore: Added "hg19_GRCh37_1000genomes" to the list of ignored patterns.
  • README.ACEseqWorkflow.txt: Updated the ACEseq workflow to version 6.0.0, replacing "impute2" with "Beagle", renaming tools and files, updating dependencies, adding GRCh38 support, and introducing new features and bug fixes.
  • buildinfo.txt, buildversion.txt: Updated the dependency version and the ACEseq workflow version.
  • documentation/source/methods.rst, documentation/source/parameters.rst: Updated the method of phasing genotypes and the ACEseq configuration.
  • installation/GRCh38_related_files/*: Added scripts and instructions for downloading and processing reference genome files for GRCh38.
  • installation/downloadReferences.sh: Fixed indentation and spacing.
  • resources/analysisTools/copyNumberEstimationWorkflow/*: Updated scripts for the ACEseq workflow, including changes to file handling, data processing, and command executions.
  • src/de/dkfz/b080/co/*: Updated the ACEseq workflow version in the Java and Groovy source code.

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 28

Configuration used: CodeRabbit UI

Commits: Files that changed from the base of the PR and between 9065164 and ab85dc9.
Files ignored due to filter (3)
  • ACEseqWorkflow.jar
  • resources/configurationFiles/analysisCopyNumberEstimation.xml
  • resources/configurationFiles/analysisCopyNumberEstimation_GRCh38.xml
Files selected for processing (57)
  • .gitignore (1 hunks)
  • README.ACEseqWorkflow.txt (1 hunks)
  • buildinfo.txt (1 hunks)
  • buildversion.txt (1 hunks)
  • documentation/source/methods.rst (1 hunks)
  • documentation/source/parameters.rst (4 hunks)
  • installation/GRCh38_related_files/README_downloadReferences_GRCh38.md (1 hunks)
  • installation/GRCh38_related_files/beagle_vcf_to_bref.sh (1 hunks)
  • installation/GRCh38_related_files/calc_gc_content_bed.py (1 hunks)
  • installation/GRCh38_related_files/create_mappability.sh (1 hunks)
  • installation/downloadReferences.sh (3 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/HRD_estimation.R (5 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_CRESTpoints.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_sv_points.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/addMappability.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/add_haplotypes.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/artifact.homoDels.potentialArtifacts.hg38.txt (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/bashLib.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/beagle_create_fake_samples.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/beagle_embed_haplotypes_vcf.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/centromers.hg38.txt (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/clusteredPrunedNormal.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/convertTabToJson.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/createUnphasedFiles.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/datatablePSCBSgaps.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/embed_haplotypes_vcf.py (2 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/environments/tbi-cluster.sh (2 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/estimateHRDScore.sh (3 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/extract_genotype_vcf.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/getFinalPurityPloidy.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/group_genotypes.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/homozygDel.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/manual_pruning.R (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/mergeArtifacts.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_cnv.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_snp.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/parseJson.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/parseVcf.pl (2 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/phasing.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/phasing_X.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/plots.sh (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/pscbs_all.R (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots.R (7 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots_functions.R (3 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/purity_ploidy_estimation_final.R (5 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/removeBreakpoints.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data_homodel.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/smoothData.py (2 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/smoothDataMergeSimilar.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv_all.py (1 hunks)
  • resources/analysisTools/copyNumberEstimationWorkflow/snvMergeFilter.sh (1 hunks)
  • src/de/dkfz/b080/co/CopyNumberEstimationWorkflowPlugin.java (1 hunks)
  • src/de/dkfz/b080/co/aceseq/ACESeqMethods.groovy (2 hunks)
Files not processed due to max files limit (3)
  • src/de/dkfz/b080/co/aceseq/ACESeqWorkflow.java
  • src/de/dkfz/b080/co/aceseq/ACEseqConstants.java
  • src/de/dkfz/b080/co/files/PhaseGenotypeByChromosome.java
Files skipped from review due to trivial changes (27)
  • .gitignore
  • buildversion.txt
  • installation/GRCh38_related_files/README_downloadReferences_GRCh38.md
  • installation/downloadReferences.sh
  • resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_CRESTpoints.py
  • resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_sv_points.py
  • resources/analysisTools/copyNumberEstimationWorkflow/addMappability.py
  • resources/analysisTools/copyNumberEstimationWorkflow/add_haplotypes.py
  • resources/analysisTools/copyNumberEstimationWorkflow/centromers.hg38.txt
  • resources/analysisTools/copyNumberEstimationWorkflow/convertTabToJson.py
  • resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh
  • resources/analysisTools/copyNumberEstimationWorkflow/extract_genotype_vcf.py
  • resources/analysisTools/copyNumberEstimationWorkflow/getFinalPurityPloidy.py
  • resources/analysisTools/copyNumberEstimationWorkflow/group_genotypes.py
  • resources/analysisTools/copyNumberEstimationWorkflow/mergeArtifacts.py
  • resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_cnv.py
  • resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_snp.py
  • resources/analysisTools/copyNumberEstimationWorkflow/parseJson.py
  • resources/analysisTools/copyNumberEstimationWorkflow/plots.sh
  • resources/analysisTools/copyNumberEstimationWorkflow/pscbs_all.R
  • resources/analysisTools/copyNumberEstimationWorkflow/removeBreakpoints.py
  • resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data.py
  • resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data_homodel.py
  • resources/analysisTools/copyNumberEstimationWorkflow/smoothDataMergeSimilar.py
  • resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv.py
  • resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv_all.py
  • src/de/dkfz/b080/co/CopyNumberEstimationWorkflowPlugin.java
Additional comments: 83
resources/analysisTools/copyNumberEstimationWorkflow/createUnphasedFiles.sh (1)
  • 12-15: Ensure that the Perl script $TOOL_PARSE_VCF can handle the new argument $CHR_PREFIX and that all invocations of this script have been updated accordingly.
resources/analysisTools/copyNumberEstimationWorkflow/bashLib.sh (1)
  • 1-11: The dieWith function is a good addition: it takes an error message and an optional exit code, prints the message to stderr, and exits with that code. Such a helper is good practice for error handling in bash scripts; a sketch follows below.
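A minimal sketch of such a helper, assuming the message-plus-optional-exit-code interface described above (the actual bashLib.sh implementation may differ):

# Sketch only, not copied from bashLib.sh
dieWith() {
    local message="${1:?dieWith requires an error message}"
    local exitCode="${2:-1}"    # optional exit code, defaults to 1
    echo "$message" >&2         # report to stderr, not stdout
    exit "$exitCode"
}

# typical use in a workflow step:
some_command || dieWith "some_command failed" 2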
resources/analysisTools/copyNumberEstimationWorkflow/artifact.homoDels.potentialArtifacts.hg38.txt (1)
  • 1-8: The new data table looks fine. Ensure that the chromosome identifiers, start and end positions, and frequency values are correct and consistent with the rest of your data.
README.ACEseqWorkflow.txt (1)
  • 15-40: The changelog is well-documented and provides a clear understanding of the changes made in each version. Ensure that all the changes mentioned are reflected in the code and the documentation.
documentation/source/methods.rst (1)
  • 2-8: The documentation has been updated to reflect the change in the method of phasing genotypes of SNP positions, switching from Impute2 to Beagle. This change affects the assignment of A and B alleles and impacts the estimation of B-allele frequencies (BAFs) for SNP positions in tumor and control samples.
resources/analysisTools/copyNumberEstimationWorkflow/environments/tbi-cluster.sh (4)
  • 10-13: Ensure that the "beagle" module is available and the version is correctly specified in the environment variable $BEAGLE_VERSION.

  • 15-15: The path to the virtual environment has been changed. Make sure the new path is correct and the virtual environment is properly set up at this location.

  • 38-39: Ensure that the "bcftools" module is available and the version is correctly specified in the environment variable $BCFTOOLS_VERSION.

  • 41-42: Ensure that the "java" module is available and the version is correctly specified in the environment variable $JAVA_VERSION.

buildinfo.txt (1)
  • 1-1: Ensure that the updated version of COWorkflowsBasePlugin (1.4.2) is compatible with the current versions of RoddyAPIVersion (3.2), GroovyVersion (2.4), and JDKVersion (1.8). Also, verify that all the dependencies of COWorkflowsBasePlugin 1.4.2 are met in the current environment.
resources/analysisTools/copyNumberEstimationWorkflow/snvMergeFilter.sh (1)
  • 28-28: The --comment chr option is added to the ${TABIX_BINARY} command. Ensure that this change is compatible with the expected input and output of the ${TABIX_BINARY} command in the context of the overall workflow.
- ${TABIX_BINARY} -s 1 -b 2 -e 2 $tmp_snp_filename
+ ${TABIX_BINARY} -s 1 -b 2 -e 2 --comment chr $tmp_snp_filename
resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh (1)
  • 6-20: The changes to the script properly ensure variable expansion and command execution: curly braces for variable interpolation and quotes around all variable references are shell-scripting best practices that prevent word splitting and pathname expansion. A small illustration follows below.
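For illustration only (not code from the PR), the difference quoting makes when a value contains whitespace or glob characters:

windowFile="all cnv windows.tab"    # hypothetical path containing spaces
wc -l $windowFile                   # unquoted: word-splits into three arguments and fails
wc -l "${windowFile}"               # quoted with braces: passed as a single argument, as intended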
src/de/dkfz/b080/co/aceseq/ACESeqMethods.groovy (4)
  • 27-31: The getGlobalJobSpecificParameters function now returns a LinkedHashMap with only one key-value pair. Ensure that this change does not affect other parts of the code that use this function.

  • 33-33: The generateCNVSNPs function has its return type changed to CnvSnpGeneratorResultByType. Make sure that all calls to this function are updated to handle the new return type.

  • 55-64: The imputeGenotypes function is renamed to phaseGenotypes, and its return type is changed to PhaseGenotypeByChromosome. Ensure that all calls to this function are updated to match the new function name and handle the new return type.

  • 80-86: The imputeGenotypeX function is renamed to phaseGenotypeX. Ensure that all calls to this function are updated to match the new function name.

resources/analysisTools/copyNumberEstimationWorkflow/smoothData.py (1)
  • 1-1: The shebang has been updated to use #!/usr/bin/env python for better portability across different environments. This is a good practice as it allows the script to run in environments where Python might not be installed in the standard location.
resources/analysisTools/copyNumberEstimationWorkflow/estimateHRDScore.sh (4)
  • 19-25: The variable mostImportantFile is assigned but never used. If it's not needed, consider removing it to improve code readability.

  • 90-111: Ensure that the new file paths and variables are correctly set and accessible. Also, verify that the new arguments for the script are correctly passed and handled in the script.

  • 129-132: Ensure that the temporary files being moved were created successfully in the previous steps; if their creation fails, the mv command will also fail (see the defensive sketch after this list).

  • 133-133: Ensure that ${combProFile}.tmp is no longer needed before removing it. If it's needed later in the code, this line will cause issues.
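A defensive pattern for the mv/rm concerns in this list (a sketch with hypothetical tmpFile/targetFile names, not the script's current code):

[[ -s "${tmpFile}" ]] || dieWith "expected temporary file ${tmpFile} is missing or empty"
mv "${tmpFile}" "${targetFile}"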

resources/analysisTools/copyNumberEstimationWorkflow/embed_haplotypes_vcf.py (3)
  • 1-1: The shebang line is correctly formatted without leading spaces.

  • 8-10: The function is_comment_line is a good abstraction that improves readability.

  • 26-29: The check for comment lines using the new function improves readability and maintainability.

resources/analysisTools/copyNumberEstimationWorkflow/manual_pruning.R (1)
  • 368-368: Ensure that the Mclust function can handle the verbose argument being a boolean. If it expects a string, you may need to convert beVerbose to a string before passing it to the function.
resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots.R (5)
  • 19-24: The new parameter annotatePlotsWithGenes is added to the specification matrix for the getopt function. Ensure that this parameter is being passed correctly wherever this script is being called.

  • 66-70: The annotatePlotsWithGenes parameter is being converted to a boolean value. This is a good practice for clarity and type safety.

  • 107-110: The annotatePlotsWithGenes parameter is added to the plotChromosomes function. Ensure that all calls to this function throughout the codebase have been updated to match the new signature.

  • 129-132: The annotatePlotsWithGenes parameter is being used in the plotTCN function call within the plotChromosomes function. Ensure that the plotTCN function has been updated to handle this parameter.

  • 284-298: The annotatePlotsWithGenes parameter is being used in the plotChromosomes function call within the loop. This is a good use of the parameter.

installation/GRCh38_related_files/beagle_vcf_to_bref.sh (1)
  • 1-14: This script assumes that the required Java version and the bref3.18May20.d20.jar file are available in the current environment. Please ensure that these dependencies are properly installed and accessible.
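As a hedged sketch of the kind of conversion the script performs, assuming bref3 reads a VCF on stdin and writes bref3 to stdout (check java -jar bref3.18May20.d20.jar help for the exact usage); the reference panel file names below are placeholders:

# Sketch only; the interface and file names are assumptions, not taken from the script
for chr in {1..22} X; do
    zcat "reference_panel.chr${chr}.vcf.gz" \
        | java -Xmx8g -jar bref3.18May20.d20.jar \
        > "reference_panel.chr${chr}.bref3"
done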
installation/GRCh38_related_files/create_mappability.sh (1)
  • 2-2: The set -ue -o pipefail command is a good practice in bash scripts. It ensures that the script exits if any command fails (set -e), if any variable is not set (set -u), and if any part of a pipe fails (set -o pipefail).
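For illustration only (not from the PR), how each flag changes behavior:

set -ue -o pipefail
echo "$NOT_DEFINED"     # -u: expanding an unset variable aborts the script here
false | cat             # -o pipefail: the pipeline's exit status is non-zero, so -e aborts
grep foo missing.txt    # -e: any failing simple command aborts the script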
installation/GRCh38_related_files/calc_gc_content_bed.py (10)
  • 1-5: The script starts with a shebang line and imports necessary modules. It's good practice to include a shebang line at the top of the script, which allows the script to be run as a standalone executable.

  • 6-6: The script checks for Python 3, which is a good practice to ensure compatibility.

  • 8-14: The script uses argparse to handle command-line arguments, which is a standard and recommended way to do so in Python.

  • 16-17: The script converts command-line arguments to appropriate data types. This is a good practice to ensure that the data types are as expected.

  • 19-20: The script initializes an output data list with a header line. This is a good practice for creating output files that are easy to understand.

  • 21-24: The script reads the input file and splits it into lines. This is a standard way to read a file in Python.

  • 25-33: The script calculates chromosome stats. It's good to see that the script is modular and each part of the code is responsible for a specific task.

  • 34-38: The script uses a regular expression to filter chromosome names. This is a good practice to ensure that only valid chromosome names are processed.

  • 39-54: The script calculates GC content for each region and appends the results to the output data list. This is the main functionality of the script and it seems to be implemented correctly.

  • 56-59: The script writes the output data to a file. This is a standard way to write data to a file in Python.

documentation/source/parameters.rst (4)
  • 11-17: The imputeOutputDirectory has been renamed to phasingOutputDirectory. Ensure that all references to this directory in the codebase have been updated accordingly.

  • 42-48: The haplogroupFilePath now uses the phasingOutputDirectory instead of the imputeOutputDirectory. Ensure that the file generation and usage are updated accordingly.

  • 73-81: The FILE_UNPHASED_PRE and FILE_PHASED_GENOTYPE paths now use the phasingOutputDirectory instead of the imputeOutputDirectory. Ensure that the file generation and usage are updated accordingly.

  • 96-108: New variables BEAGLE_REFERENCE_FILE, BEAGLE_REFERENCE_FILE_X, BEAGLE_GENETIC_MAP, and BEAGLE_GENETIC_MAP_X have been introduced. Ensure that these are used correctly in the codebase and that the paths are valid.

resources/analysisTools/copyNumberEstimationWorkflow/phasing.sh (7)
  • 2-2: The script is set to fail on any error or unset variable, which is a good practice for robustness and error handling.

  • 16-33: The script checks if it's running in a no-control workflow and if not, it uses bcftools to generate an unphased VCF file from the control BAM file. Error handling is done properly with a custom function dieWith.

  • 39-43: The script creates a fake sample using a Python script. Error handling is done properly with a custom function dieWith.

  • 44-53: The script sets Java options for Beagle and then runs Beagle for phasing (a hedged sketch of such a call follows this list). Error handling is done properly with the custom dieWith function.

  • 55-59: The script embeds haplotypes into the VCF file using a Python script. Error handling is done properly with a custom function dieWith.

  • 61-65: The script groups haplotypes using a Python script. Error handling is done properly with a custom function dieWith.

  • 67-68: The script moves temporary files to their final destinations. It's a good practice to use temporary files to avoid partial writes in case of errors.
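For the Beagle step noted above, a rough sketch of what such a call can look like, using the variable names from this review and standard Beagle 5.x arguments; this is an assumption, not the exact command in phasing.sh, which may wrap the jar differently:

# Hedged sketch; assumes BEAGLE_BINARY points at the Beagle jar
# Beagle appends .vcf.gz to the out= prefix
java ${JAVA_OPTIONS} -jar "${BEAGLE_BINARY}" \
    gt="${UNPHASED_TWOSAMPLES}" \
    ref="${BEAGLE_REFERENCE_FILE}" \
    map="${BEAGLE_GENETIC_MAP}" \
    out="${PHASED_TWOSAMPLES%.vcf.gz}" \
    nthreads="${BEAGLE_NUM_THREAD}" \
    impute=false \
    || dieWith "Beagle phasing failed"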

resources/analysisTools/copyNumberEstimationWorkflow/HRD_estimation.R (5)
  • 17-30: The new argument cutoff is optional and defaults to 0.7 if not provided. Ensure that all calls to this script throughout the codebase have been updated to match the new signature if necessary.

  • 97-103: The script now excludes chromosomes that have not changed from the merged data frame. It also writes the indices of segments that meet the condition of having a length greater than 15,000,000 and a CNA type of LOH to the contributingSegmentsFile_HRD_File.

  • 144-157: The script now writes the reduced merged data frame to merged_reduced_outputFile. It also calculates the number of segments that meet the condition of having a length greater than 15,000,000 and a CNA type of LOH or DEL in both the original and reduced merged data frames. These indices are written to contributingSegmentsFile_HRD_File and a new file with the suffix .CentromerReduced.txt.

  • 212-218: A new variable LST_SEGMENTS_INDICES is introduced to store indices of segments in the reduced merged data frame that meet a specific condition.

  • 224-233: The script now writes the indices stored in LST_SEGMENTS_INDICES to contributingSegments_LST_File if LSTReduced is greater than 0.

resources/analysisTools/copyNumberEstimationWorkflow/clusteredPrunedNormal.sh (1)
  • 43-45: The script is removing lines containing "NULL" from the file. This could potentially remove valid data if "NULL" appears in the data itself. Ensure that "NULL" only appears in invalid lines.
resources/analysisTools/copyNumberEstimationWorkflow/parseVcf.pl (1)
  • 24-30: The code modifies the first element of @fields by prepending $ARGV[3] if $filehandle{$fields[0]} exists. Ensure that this change doesn't break any dependencies on the original value of $fields[0] elsewhere in the code.
resources/analysisTools/copyNumberEstimationWorkflow/beagle_create_fake_samples.py (5)
  • 1-1: Ensure that the Python interpreter is available in the environment where this script will be run.

  • 3-3: Good use of the argparse module for command-line argument parsing.

  • 5-7: The function is_comment_line is correctly implemented to check if a line is a comment.

  • 8-11: The command-line arguments are correctly parsed and assigned to variables.

  • 16-29: The script correctly modifies the input VCF file by adding a new column and duplicating the 10th column's value to the end of each line.

resources/analysisTools/copyNumberEstimationWorkflow/beagle_embed_haplotypes_vcf.py (2)
  • 33-34: The script replaces 'chr' and 'X' in the chromosome field of the VCF file. This may not be applicable to all VCF files and could lead to incorrect results if the VCF file does not follow this naming convention. Ensure that this operation is necessary and correct for all input files.

  • 46-54: The script modifies the genotype field of the VCF file based on the haplotype data. This operation assumes that the format fields of the VCF and haplotype files are in the same order, which may not always be the case. Consider adding checks to ensure that the format fields match.

resources/analysisTools/copyNumberEstimationWorkflow/purity_ploidy_estimation_final.R (6)
  • 313-317: The new condition "quantile" for dh_Stop is added correctly. Ensure that the new condition is handled wherever dh_Stop is used in the codebase.

  • 540-540: The assignment of limitDHOrig to limitDH is correct. Ensure that limitDHOrig is used correctly in the rest of the codebase.

  • 634-637: The error message and termination if all data points are equal to 6 is a good addition for error handling. Ensure that the error message is clear and informative for the user.

  • 764-772: The addition of new variables plotPurities_DH and plotPurities_NTCN is correct. Ensure that these variables are used correctly in the rest of the codebase.

  • 777-777: The modification of the df.lines data frame to include purity_DH and purity_NTCN is correct. Ensure that these new columns are used correctly in the rest of the codebase.

  • 816-818: The modification of the ggplot function to include new points for purity_DH and purity_NTCN is correct. Ensure that these new points are displayed correctly in the plot.

resources/analysisTools/copyNumberEstimationWorkflow/phasing_X.sh (9)
  • 8-11: Ensure that the isNoControlWorkflow variable is set correctly before this script is run. Also, verify that the FILE_CONTROL_BAM file exists and is accessible.

  • 24-31: The script exits if the patient is not female or klinefelter. Ensure that this behavior is expected and acceptable in all use cases.

  • 33-45: This block of code is executed only if isNoControlWorkflow is false. Ensure that the BCFTOOLS_BINARY, CNV_MPILEUP_OPTS, REFERENCE_GENOME, CHR_NR, FILE_CONTROL_BAM, and BCFTOOLS_OPTS variables are set correctly before this script is run. Also, verify that the BCFTOOLS_BINARY tool is installed and accessible.

  • 47-49: These lines are creating empty files. Ensure that the script has write permissions to the directory where these files are being created.

  • 51-54: Ensure that the PYTHON_BINARY, TOOL_BEAGLE_CREATE_FAKE_SAMPLES, and UNPHASED variables are set correctly before this script is run. Also, verify that the PYTHON_BINARY tool is installed and accessible.

  • 61-70: Ensure that the JAVA_OPTIONS, BEAGLE_BINARY, UNPHASED_TWOSAMPLES, BEAGLE_REFERENCE_FILE_X, PHASED_TWOSAMPLES, BEAGLE_GENETIC_MAP_X, and BEAGLE_NUM_THREAD variables are set correctly before this script is run. Also, verify that the BEAGLE_BINARY tool is installed and accessible.

  • 73-77: Ensure that the PYTHON_BINARY, TOOL_BEAGLE_EMBED_HAPLOTYPES_VCF, PHASED_TWOSAMPLES, UNPHASED, and tmpPhased variables are set correctly before this script is run. Also, verify that the PYTHON_BINARY tool is installed and accessible.

  • 80-84: Ensure that the PYTHON_BINARY, TOOL_GROUP_HAPLOTYPES, tmpPhased, tmpHaploblocks, and minHT variables are set correctly before this script is run. Also, verify that the PYTHON_BINARY tool is installed and accessible.

  • 87-88: These lines are moving temporary files to their final locations. Ensure that the script has write permissions to the directory where these files are being moved.
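Many of these items are "ensure variable X is set" checks; one compact way to enforce them up front (a sketch using the variable names from this list, not code from phasing_X.sh):

# Fail fast if any required variable is unset or empty
for v in BCFTOOLS_BINARY REFERENCE_GENOME CHR_NR FILE_CONTROL_BAM PYTHON_BINARY \
         BEAGLE_BINARY BEAGLE_REFERENCE_FILE_X BEAGLE_GENETIC_MAP_X BEAGLE_NUM_THREAD; do
    [[ -n "${!v:-}" ]] || dieWith "required variable $v is not set"
done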

resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots_functions.R (1)
  • 113-115: The code adds gene labels to the plot. Ensure that the labels do not overlap or clutter the plot, which could make it difficult to read.

Comment on lines 6 to 23

tmp_corrected_windowfile="$FILENAME_GC_CORRECTED_WINDOWS.tmp"

$RSCRIPT_BINARY "$TOOL_CORRECT_GC_BIAS_R \
--windowFile "$cnvSnpOutputDirectory/$PID".all.cnv.chr*.tab \
--timefile "$REPLICATION_TIME_FILE" \
--chrLengthFile $CHROMOSOME_LENGTH_FILE" \
--pid "$PID" \
--email "$EMAIL" \
--outfile "$tmp_corrected_windowfile" \
--corPlot "$FILENAME_GC_CORRECT_PLOT" \
--gcFile "$GC_CONTENT_FILE" \
--outDir "$aceseqOutputDirectory" \
--lowess_f "$LOWESS_F" \
--scaleFactor "$SCALE_FACTOR"
${RSCRIPT_BINARY} "${TOOL_CORRECT_GC_BIAS_R}" \
--windowFile "${cnvSnpOutputDirectory}/${PID}.all.cnv.chr*.tab" \
--timefile "${REPLICATION_TIME_FILE}" \
--chrLengthFile "${CHROMOSOME_LENGTH_FILE}" \
--pid "${PID}" \
--email "${EMAIL}" \
--outfile "${tmp_corrected_windowfile}" \
--corPlot "${FILENAME_GC_CORRECT_PLOT}" \
--gcFile "${GC_CONTENT_FILE}" \
--outDir "${aceseqOutputDirectory}" \
--lowess_f "${LOWESS_F}" \
--scaleFactor "${SCALE_FACTOR}"




Note: This review was outside of the patch, but no patch was found that overlapped with it. Original lines [24-28]

The error handling is good, but it could be improved by providing more specific error messages. For example, you could capture the error output of the command and include it in the error message. This would help with debugging if something goes wrong.

- if [[ $? != 0 ]]
+ if ! output=$(${RSCRIPT_BINARY} "${TOOL_CORRECT_GC_BIAS_R}" ... 2>&1); then
-    echo "Something went wrong during GC correction. Program had non-zero exit status, exiting pipeline...\n\n"
+    echo "Error during GC correction: $output"
     exit 2
  fi

Comment on lines 44 to +47

mv ${tmpSegsHomDel} ${FILENAME_SEGMENTS_W_HOMDEL}

$TABIX_BINARY -f -s 1 -b 2 -e 3 "${FILENAME_SEGMENTS_W_HOMDEL}"
$TABIX_BINARY -f -s 1 -b 2 -e 3 --comment chromosome "${FILENAME_SEGMENTS_W_HOMDEL}"

The --comment chromosome option in the $TABIX_BINARY command is used to specify that lines starting with 'chromosome' are treated as comments and ignored. Ensure that this is the intended behavior and that it doesn't affect the processing of the input file.

- $TABIX_BINARY -f -s 1 -b 2 -e 3 --comment chromosome "${FILENAME_SEGMENTS_W_HOMDEL}"
+ $TABIX_BINARY -f -s 1 -b 2 -e 3 "${FILENAME_SEGMENTS_W_HOMDEL}"


@@ -235,6 +235,6 @@ def merge_first_segment(prior_line, newline):
newline=None

#print last line(s)
out.write( "\t".join( str(prior_line[key]) for key in infile.header ) + "\n" )
out.write( "\t".join( str(prior_line[0][key]) for key in infile.header ) + "\n" )

The prior_line variable is being indexed with 0 which might cause a TypeError if prior_line is not a list. Ensure that prior_line is always a list with at least one element when this line is executed.

- out.write( "\t".join( str(prior_line[0][key]) for key in infile.header ) + "\n" ) 
+ out.write( "\t".join( str(prior_line[key]) for key in infile.header ) + "\n" ) 


Comment on lines +363 to +367
if (runInDebugMode == "true") {
beVerbose = T
} else {
beVerbose = F
}

The conditional assignment to beVerbose can be simplified to a single line. This reduces complexity and improves readability.

-        if (runInDebugMode == "true") {
-          beVerbose = T
-        } else {
-          beVerbose = F
-        }
+        beVerbose = runInDebugMode == "true"


Comment on lines +14 to +17
if args.hap_file.split('.')[-1] == 'gz':
hap_file = gzip.open( args.hap_file, "r" )
else:
hap_file = open( args.hap_file, "r" )

The script opens files but does not close them. This can lead to resource leaks. Consider using the with statement to ensure files are properly closed after they are no longer needed.

open_fn = gzip.open if args.hap_file.endswith(".gz") else open
with open_fn(args.hap_file, "rt") as hap_file:
    # rest of the code, with the .gz case handled above

Comment on lines +38 to +43
while hap_line and int(hap_line[1]) < int(vcf_line[1]):
hap_line = hap_file.readline()

if hap_line:
hap_line = hap_line.rstrip().split()


The script reads lines from the haplotype file in a loop until it finds a line with a position that matches the current VCF line. If the haplotype file is large, this could be inefficient. Consider using a more efficient data structure, such as a dictionary or a binary search tree, to store the haplotype data.

if is_comment_line(vcf_line):
    if vcf_line.startswith("#CHROM"):
        vcf_line = vcf_line.rstrip().split("\t")
        vcf_line.append('sample0')

The "fake" sample name 'sample0' is hardcoded. Consider making this configurable through a command-line argument for flexibility.

- parser.add_argument('--out_file', help='Out .vcf file')
+ parser.add_argument('--out_file', help='Out .vcf file')
+ parser.add_argument('--sample_name', default='sample0', help='Sample name to be added')
...
- vcf_line.append('sample0')
+ vcf_line.append(args.sample_name)


Comment on lines +83 to +117
if (annotatePlotsWithGenes & is.data.frame(geneAnnotations)) {
  genesSubset = geneAnnotations[geneAnnotations$chr==chromosome,]
  genesSubset = genesSubset[order(genesSubset$start),]

  if (nrow(genesSubset)>0) {
    genesSubset$yOffset = 0
    # determine yPos offset
    if (nrow(genesSubset)>1) {
      genesSubset$distance = c(NA,sapply(2:nrow(genesSubset), function(j) {
        genesSubset[(j),"start"] - genesSubset[(j-1),"start"]
      }))
      genesSubset$yHasOffset = genesSubset$distance < 13e6
      genesSubset$yHasOffset[is.na(genesSubset$yHasOffset)] = F
      for (j in 2:nrow(genesSubset)) {
        if (genesSubset[j,"yHasOffset"]) {
          genesSubset[j,"yOffset"] = genesSubset[j-1,"yOffset"]+1
        }
      }
    }

    genesSubset$position = (genesSubset$start+(genesSubset$end - genesSubset$start)/2)/10/xtotal

    if (!unconditionalGeneLabeling) {
      genesSubset.gr = makeGRangesFromDataFrame(genesSubset, keep.extra.columns = T)
      merged = mergeByOverlaps(genesSubset.gr, seg.gr)
      eventsPerGene.list = aggregate(merged$CNA.type, by=list(merged$gene), FUN=paste)$x
      index.affectedSegments = grep(pattern = "DEL|LOH|DUP|HomoDel", x = eventsPerGene.list, perl = T)
      genesSubset = as.data.frame(genesSubset[index.affectedSegments,])
    }
    if (nrow(genesSubset)>0) {
      p <- p + geom_vline( xintercept=genesSubset$position, col = "grey80", lty=5, lwd =0.2 )
      p <- p + geom_text(data=genesSubset, aes(x=position, y=ymaxcov+0.2-0.08*ymaxcov*yOffset, label = gene), cex = 4, col = "red")
    }
  }
}

The code block for gene annotation is quite large and complex. Consider breaking it down into smaller, more manageable functions for better readability and maintainability.

Comment on lines +106 to +111
genesSubset.gr = makeGRangesFromDataFrame(genesSubset, keep.extra.columns = T)
merged = mergeByOverlaps(genesSubset.gr, seg.gr)
eventsPerGene.list = aggregate(merged$CNA.type, by=list(merged$gene), FUN=paste)$x
index.affectedSegments = grep(pattern = "DEL|LOH|DUP|HomoDel", x = eventsPerGene.list, perl = T)
genesSubset = as.data.frame(genesSubset[index.affectedSegments,])
}

The code checks for unconditionalGeneLabeling and then performs operations if it's not set. This could be confusing. Consider reversing the logic for better readability.

- if (!unconditionalGeneLabeling) {
+ if (unconditionalGeneLabeling) {
+   # Skip the operations
+ } else {


Member

Generally, can you replace the ARGV access by something more semantic?

my ($phasingOutDir, $unphasedGenotypesFilePrefix, $unphasedGenotypesFileSuffix, $chrPrefix) = @ARGV;

That would also address CodeRabbit's (correct) remark about the error.

Labels: none yet. Projects: none yet. 4 participants.