Replacing impute2 with Beagle; Support for hg19 and hg38 reference genomes #20
base: master
Conversation
…ute2 (including the impute2 binary). Also renamed tools and folders from 'impute' to 'phase' as we are not imputing but using Beagle to phase the haplotypes.
…ls such that there is no mentioning of the word 'impute' anywhere.
…nces.sh file for the Beagle references. Updated Documentation. Cleaned and renamed other files with respect to impute.
…nce files in the /icgc/ngs_share/assemblies folder
Conflicts: resources/configurationFiles/analysisCopyNumberEstimation.xml
…ob is interrupted
Resolved review threads:
- resources/analysisTools/copyNumberEstimationWorkflow/bashLib.sh (outdated)
- resources/analysisTools/copyNumberEstimationWorkflow/beagle_create_fake_samples.py
- resources/analysisTools/copyNumberEstimationWorkflow/phasing_X.sh (outdated)
Conflicts: installation/downloadReferences.sh
Conflicts:
- ACEseqWorkflow_5.0.1.iml
- ACEseqWorkflow_5.1.0.iml
- ACEseqWorkflow_6.0.0.iml
- README.ACEseqWorkflow.txt
- resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh
- resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh
- resources/configurationFiles/analysisCopyNumberEstimation.xml
This is mature now and we could start merging this to master/main.
Walkthrough
The changes primarily involve a major version update from 5.0.0 to 6.0.0 for the ACEseq workflow. The update replaces the "impute2" program with "Beagle" for phasing genotypes, renames tools and files, updates dependencies, adds support for the GRCh38 reference genome, and introduces new features and bug fixes. The changes affect various aspects of the workflow, including file handling, data processing, documentation, and scripts for downloading and processing reference genome files.
Review Status
Actionable comments generated: 28
Configuration used: CodeRabbit UI
Files ignored due to filter (3)
- ACEseqWorkflow.jar
- resources/configurationFiles/analysisCopyNumberEstimation.xml
- resources/configurationFiles/analysisCopyNumberEstimation_GRCh38.xml
Files selected for processing (57)
- .gitignore (1 hunks)
- README.ACEseqWorkflow.txt (1 hunks)
- buildinfo.txt (1 hunks)
- buildversion.txt (1 hunks)
- documentation/source/methods.rst (1 hunks)
- documentation/source/parameters.rst (4 hunks)
- installation/GRCh38_related_files/README_downloadReferences_GRCh38.md (1 hunks)
- installation/GRCh38_related_files/beagle_vcf_to_bref.sh (1 hunks)
- installation/GRCh38_related_files/calc_gc_content_bed.py (1 hunks)
- installation/GRCh38_related_files/create_mappability.sh (1 hunks)
- installation/downloadReferences.sh (3 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/HRD_estimation.R (5 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_CRESTpoints.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_sv_points.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/addMappability.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/add_haplotypes.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/artifact.homoDels.potentialArtifacts.hg38.txt (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/bashLib.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/beagle_create_fake_samples.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/beagle_embed_haplotypes_vcf.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/centromers.hg38.txt (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/clusteredPrunedNormal.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/convertTabToJson.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/createUnphasedFiles.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/datatablePSCBSgaps.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/embed_haplotypes_vcf.py (2 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/environments/tbi-cluster.sh (2 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/estimateHRDScore.sh (3 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/extract_genotype_vcf.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/getFinalPurityPloidy.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/group_genotypes.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/homozygDel.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/manual_pruning.R (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/mergeArtifacts.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_cnv.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_snp.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/parseJson.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/parseVcf.pl (2 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/phasing.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/phasing_X.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/plots.sh (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/pscbs_all.R (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots.R (7 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots_functions.R (3 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/purity_ploidy_estimation_final.R (5 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/removeBreakpoints.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data_homodel.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/smoothData.py (2 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/smoothDataMergeSimilar.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv_all.py (1 hunks)
- resources/analysisTools/copyNumberEstimationWorkflow/snvMergeFilter.sh (1 hunks)
- src/de/dkfz/b080/co/CopyNumberEstimationWorkflowPlugin.java (1 hunks)
- src/de/dkfz/b080/co/aceseq/ACESeqMethods.groovy (2 hunks)
Files not processed due to max files limit (3)
- src/de/dkfz/b080/co/aceseq/ACESeqWorkflow.java
- src/de/dkfz/b080/co/aceseq/ACEseqConstants.java
- src/de/dkfz/b080/co/files/PhaseGenotypeByChromosome.java
Files skipped from review due to trivial changes (27)
- .gitignore
- buildversion.txt
- installation/GRCh38_related_files/README_downloadReferences_GRCh38.md
- installation/downloadReferences.sh
- resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_CRESTpoints.py
- resources/analysisTools/copyNumberEstimationWorkflow/PSCBSgabs_plus_sv_points.py
- resources/analysisTools/copyNumberEstimationWorkflow/addMappability.py
- resources/analysisTools/copyNumberEstimationWorkflow/add_haplotypes.py
- resources/analysisTools/copyNumberEstimationWorkflow/centromers.hg38.txt
- resources/analysisTools/copyNumberEstimationWorkflow/convertTabToJson.py
- resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias.sh
- resources/analysisTools/copyNumberEstimationWorkflow/extract_genotype_vcf.py
- resources/analysisTools/copyNumberEstimationWorkflow/getFinalPurityPloidy.py
- resources/analysisTools/copyNumberEstimationWorkflow/group_genotypes.py
- resources/analysisTools/copyNumberEstimationWorkflow/mergeArtifacts.py
- resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_cnv.py
- resources/analysisTools/copyNumberEstimationWorkflow/merge_and_filter_snp.py
- resources/analysisTools/copyNumberEstimationWorkflow/parseJson.py
- resources/analysisTools/copyNumberEstimationWorkflow/plots.sh
- resources/analysisTools/copyNumberEstimationWorkflow/pscbs_all.R
- resources/analysisTools/copyNumberEstimationWorkflow/removeBreakpoints.py
- resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data.py
- resources/analysisTools/copyNumberEstimationWorkflow/segments_to_data_homodel.py
- resources/analysisTools/copyNumberEstimationWorkflow/smoothDataMergeSimilar.py
- resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv.py
- resources/analysisTools/copyNumberEstimationWorkflow/snp_cnv_all.py
- src/de/dkfz/b080/co/CopyNumberEstimationWorkflowPlugin.java
Additional comments: 83
resources/analysisTools/copyNumberEstimationWorkflow/createUnphasedFiles.sh (1)
- 12-15: Ensure that the Perl script `$TOOL_PARSE_VCF` can handle the new argument `$CHR_PREFIX` and that all invocations of this script have been updated accordingly.

resources/analysisTools/copyNumberEstimationWorkflow/bashLib.sh (1)
- 1-11: The `dieWith` function is a good addition for error handling in bash scripts. It takes an error message and an optional exit code, prints the error message to stderr, and then exits with the provided exit code.

resources/analysisTools/copyNumberEstimationWorkflow/artifact.homoDels.potentialArtifacts.hg38.txt (1)
- 1-8: The new data table looks fine. Ensure that the chromosome identifiers, start and end positions, and frequency values are correct and consistent with the rest of your data.
README.ACEseqWorkflow.txt (1)
- 15-40: The changelog is well-documented and provides a clear understanding of the changes made in each version. Ensure that all the changes mentioned are reflected in the code and the documentation.
documentation/source/methods.rst (1)
- 2-8: The documentation has been updated to reflect the change in the method of phasing genotypes of SNP positions, switching from Impute2 to Beagle. This change affects the assignment of A and B alleles and impacts the estimation of B-allele frequencies (BAFs) for SNP positions in tumor and control samples.
resources/analysisTools/copyNumberEstimationWorkflow/environments/tbi-cluster.sh (4)
- 10-13: Ensure that the "beagle" module is available and the version is correctly specified in the environment variable `$BEAGLE_VERSION`.
- 15-15: The path to the virtual environment has been changed. Make sure the new path is correct and the virtual environment is properly set up at this location.
- 38-39: Ensure that the "bcftools" module is available and the version is correctly specified in the environment variable `$BCFTOOLS_VERSION`.
- 41-42: Ensure that the "java" module is available and the version is correctly specified in the environment variable `$JAVA_VERSION`.

buildinfo.txt (1)
- 1-1: Ensure that the updated version of COWorkflowsBasePlugin (1.4.2) is compatible with the current versions of RoddyAPIVersion (3.2), GroovyVersion (2.4), and JDKVersion (1.8). Also, verify that all the dependencies of COWorkflowsBasePlugin 1.4.2 are met in the current environment.
resources/analysisTools/copyNumberEstimationWorkflow/snvMergeFilter.sh (1)
- 28-28: The `--comment chr` option is added to the `${TABIX_BINARY}` command. Ensure that this change is compatible with the expected input and output of the `${TABIX_BINARY}` command in the context of the overall workflow.

- ${TABIX_BINARY} -s 1 -b 2 -e 2 $tmp_snp_filename
+ ${TABIX_BINARY} -s 1 -b 2 -e 2 --comment chr $tmp_snp_filename

resources/analysisTools/copyNumberEstimationWorkflow/correct_gc_bias_1kb.sh (1)
- 6-20: The changes made to the script are good for ensuring proper variable expansion and command execution. The use of curly braces for variable interpolation and quotes for all variable references are best practices in shell scripting. This helps to prevent word splitting and pathname expansion.
src/de/dkfz/b080/co/aceseq/ACESeqMethods.groovy (4)
- 27-31: The `getGlobalJobSpecificParameters` function now returns a `LinkedHashMap` with only one key-value pair. Ensure that this change does not affect other parts of the code that use this function.
- 33-33: The `generateCNVSNPs` function has its return type changed to `CnvSnpGeneratorResultByType`. Make sure that all calls to this function are updated to handle the new return type.
- 55-64: The `imputeGenotypes` function is renamed to `phaseGenotypes`, and its return type is changed to `PhaseGenotypeByChromosome`. Ensure that all calls to this function are updated to match the new function name and handle the new return type.
- 80-86: The `imputeGenotypeX` function is renamed to `phaseGenotypeX`. Ensure that all calls to this function are updated to match the new function name.

resources/analysisTools/copyNumberEstimationWorkflow/smoothData.py (1)
- 1-1: The shebang has been updated to use `#!/usr/bin/env python` for better portability across different environments. This is a good practice as it allows the script to run in environments where Python might not be installed in the standard location.
- 19-25: The variable `mostImportantFile` is assigned but never used. If it's not needed, consider removing it to improve code readability.
- 90-111: Ensure that the new file paths and variables are correctly set and accessible. Also, verify that the new arguments for the script are correctly passed and handled in the script.
- 129-132: Ensure that the temporary files being moved have been created successfully in the previous steps. If the creation of these files fails, the `mv` command will also fail.
- 133-133: Ensure that `${combProFile}.tmp` is no longer needed before removing it. If it's needed later in the code, this line will cause issues.

resources/analysisTools/copyNumberEstimationWorkflow/embed_haplotypes_vcf.py (3)
- 1-1: The shebang line is correctly formatted without leading spaces.
- 8-10: The function `is_comment_line` is a good abstraction that improves readability.
- 26-29: The check for comment lines using the new function improves readability and maintainability.
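The `is_comment_line` helper praised above is not shown in this excerpt; the following is a minimal sketch of how such a helper typically looks for VCF files (the function name comes from the review, the body is an assumption):

```python
def is_comment_line(line):
    """Return True for VCF header/comment lines, which start with '#'."""
    return line.startswith("#")

# Filtering data lines from a VCF-like stream:
lines = ["##fileformat=VCFv4.2", "#CHROM\tPOS", "1\t12345\t.\tA\tG"]
data_lines = [l for l in lines if not is_comment_line(l)]
```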
resources/analysisTools/copyNumberEstimationWorkflow/manual_pruning.R (1)
- 368-368: Ensure that the `Mclust` function can handle the `verbose` argument being a boolean. If it expects a string, you may need to convert `beVerbose` to a string before passing it to the function.

resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots.R (5)
- 19-24: The new parameter `annotatePlotsWithGenes` is added to the specification matrix for the getopt function. Ensure that this parameter is being passed correctly wherever this script is being called.
- 66-70: The `annotatePlotsWithGenes` parameter is being converted to a boolean value. This is a good practice for clarity and type safety.
- 107-110: The `annotatePlotsWithGenes` parameter is added to the `plotChromosomes` function. Ensure that all calls to this function throughout the codebase have been updated to match the new signature.
- 129-132: The `annotatePlotsWithGenes` parameter is being used in the `plotTCN` function call within the `plotChromosomes` function. Ensure that the `plotTCN` function has been updated to handle this parameter.
- 284-298: The `annotatePlotsWithGenes` parameter is being used in the `plotChromosomes` function call within the loop. This is a good use of the parameter.

installation/GRCh38_related_files/beagle_vcf_to_bref.sh (1)
- 1-14: This script assumes that the required Java version and the `bref3.18May20.d20.jar` file are available in the current environment. Please ensure that these dependencies are properly installed and accessible.

installation/GRCh38_related_files/create_mappability.sh (1)
- 2-2: The `set -ue -o pipefail` command is a good practice in bash scripts. It ensures that the script exits if any command fails (set -e), if any variable is not set (set -u), and if any part of a pipe fails (set -o pipefail).

installation/GRCh38_related_files/calc_gc_content_bed.py (10)
1-5: The script starts with a shebang line and imports necessary modules. It's good practice to include a shebang line at the top of the script, which allows the script to be run as a standalone executable.
6-6: The script checks for Python 3, which is a good practice to ensure compatibility.
8-14: The script uses argparse to handle command-line arguments, which is a standard and recommended way to do so in Python.
16-17: The script converts command-line arguments to appropriate data types. This is a good practice to ensure that the data types are as expected.
19-20: The script initializes an output data list with a header line. This is a good practice for creating output files that are easy to understand.
21-24: The script reads the input file and splits it into lines. This is a standard way to read a file in Python.
25-33: The script calculates chromosome stats. It's good to see that the script is modular and each part of the code is responsible for a specific task.
34-38: The script uses a regular expression to filter chromosome names. This is a good practice to ensure that only valid chromosome names are processed.
39-54: The script calculates GC content for each region and appends the results to the output data list. This is the main functionality of the script and it seems to be implemented correctly.
56-59: The script writes the output data to a file. This is a standard way to write data to a file in Python.
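The per-region GC calculation described in the 39-54 comment can be sketched as follows (function and variable names are illustrative, not taken from `calc_gc_content_bed.py`):

```python
def gc_fraction(seq):
    # Fraction of G/C bases in a sequence window; any other bases
    # (including N) still count toward the denominator.
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```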
documentation/source/parameters.rst (4)
- 11-17: The `imputeOutputDirectory` has been renamed to `phasingOutputDirectory`. Ensure that all references to this directory in the codebase have been updated accordingly.
- 42-48: The `haplogroupFilePath` now uses the `phasingOutputDirectory` instead of the `imputeOutputDirectory`. Ensure that the file generation and usage are updated accordingly.
- 73-81: The `FILE_UNPHASED_PRE` and `FILE_PHASED_GENOTYPE` paths now use the `phasingOutputDirectory` instead of the `imputeOutputDirectory`. Ensure that the file generation and usage are updated accordingly.
- 96-108: New variables `BEAGLE_REFERENCE_FILE`, `BEAGLE_REFERENCE_FILE_X`, `BEAGLE_GENETIC_MAP`, and `BEAGLE_GENETIC_MAP_X` have been introduced. Ensure that these are used correctly in the codebase and that the paths are valid.

resources/analysisTools/copyNumberEstimationWorkflow/phasing.sh (7)
- 2-2: The script is set to fail on any error or unset variable, which is a good practice for robustness and error handling.
- 16-33: The script checks if it's running in a no-control workflow and, if not, it uses bcftools to generate an unphased VCF file from the control BAM file. Error handling is done properly with the custom function `dieWith`.
- 39-43: The script creates a fake sample using a Python script. Error handling is done properly with the custom function `dieWith`.
- 44-53: The script sets Java options for Beagle and then runs Beagle for phasing. Error handling is done properly with the custom function `dieWith`.
- 55-59: The script embeds haplotypes into the VCF file using a Python script. Error handling is done properly with the custom function `dieWith`.
- 61-65: The script groups haplotypes using a Python script. Error handling is done properly with the custom function `dieWith`.
- 67-68: The script moves temporary files to their final destinations. It's a good practice to use temporary files to avoid partial writes in case of errors.
resources/analysisTools/copyNumberEstimationWorkflow/HRD_estimation.R (5)
- 17-30: The new argument `cutoff` is optional and defaults to 0.7 if not provided. Ensure that all calls to this script throughout the codebase have been updated to match the new signature if necessary.
- 97-103: The script now excludes chromosomes that have not changed from the merged data frame. It also writes the indices of segments that meet the condition of having a length greater than 15,000,000 and a CNA type of LOH to the `contributingSegmentsFile_HRD_File`.
- 144-157: The script now writes the reduced merged data frame to `merged_reduced_outputFile`. It also calculates the number of segments that meet the condition of having a length greater than 15,000,000 and a CNA type of LOH or DEL in both the original and reduced merged data frames. These indices are written to `contributingSegmentsFile_HRD_File` and a new file with the suffix `.CentromerReduced.txt`.
- 212-218: A new variable `LST_SEGMENTS_INDICES` is introduced to store indices of segments in the reduced merged data frame that meet a specific condition.
- 224-233: The script now writes the indices stored in `LST_SEGMENTS_INDICES` to `contributingSegments_LST_File` if `LSTReduced` is greater than 0.

resources/analysisTools/copyNumberEstimationWorkflow/clusteredPrunedNormal.sh (1)
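The segment-selection condition referenced in these comments (length greater than 15,000,000 and CNA type LOH, optionally DEL) can be sketched as follows; the data layout is an assumption for illustration, not taken from `HRD_estimation.R`:

```python
def hrd_contributing_indices(segments, min_length=15_000_000, types=("LOH",)):
    # Indices of segments long enough and of the right CNA type
    # to count toward the HRD score.
    return [i for i, seg in enumerate(segments)
            if seg["length"] > min_length and seg["type"] in types]

segments = [
    {"length": 20_000_000, "type": "LOH"},
    {"length": 5_000_000, "type": "LOH"},
    {"length": 30_000_000, "type": "TCN"},
]
```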
- 43-45: The script is removing lines containing "NULL" from the file. This could potentially remove valid data if "NULL" appears in the data itself. Ensure that "NULL" only appears in invalid lines.
resources/analysisTools/copyNumberEstimationWorkflow/parseVcf.pl (1)
- 24-30: The code modifies the first element of `@fields` by prepending `$ARGV[3]` if `$filehandle{$fields[0]}` exists. Ensure that this change doesn't break any dependencies on the original value of `$fields[0]` elsewhere in the code.

resources/analysisTools/copyNumberEstimationWorkflow/beagle_create_fake_samples.py (5)
- 1-1: Ensure that the Python interpreter is available in the environment where this script will be run.
- 3-3: Good use of the argparse module for command-line argument parsing.
- 5-7: The function `is_comment_line` is correctly implemented to check if a line is a comment.
- 8-11: The command-line arguments are correctly parsed and assigned to variables.
- 16-29: The script correctly modifies the input VCF file by adding a new column and duplicating the 10th column's value to the end of each line.
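The column duplication in the 16-29 comment amounts to appending a copy of the first sample column (the 10th VCF column) to each data line; a minimal sketch under that assumption (names are illustrative):

```python
def add_fake_sample(line):
    # Append the 10th column (first sample, index 9) again,
    # producing a fake second sample with identical genotype data.
    fields = line.rstrip("\n").split("\t")
    fields.append(fields[9])
    return "\t".join(fields)
```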
resources/analysisTools/copyNumberEstimationWorkflow/beagle_embed_haplotypes_vcf.py (2)
33-34: The script replaces 'chr' and 'X' in the chromosome field of the VCF file. This may not be applicable to all VCF files and could lead to incorrect results if the VCF file does not follow this naming convention. Ensure that this operation is necessary and correct for all input files.
46-54: The script modifies the genotype field of the VCF file based on the haplotype data. This operation assumes that the format fields of the VCF and haplotype files are in the same order, which may not always be the case. Consider adding checks to ensure that the format fields match.
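One way to add the field-order check suggested in the 46-54 comment is to pair FORMAT keys with sample values by position, so any mismatch in field counts is caught explicitly (a sketch; function and variable names are illustrative):

```python
def sample_fields(format_col, sample_col):
    # Map FORMAT keys (e.g. "GT:DP") to the sample's values (e.g. "0/1:35"),
    # failing loudly if the two columns disagree on field count.
    keys = format_col.split(":")
    vals = sample_col.split(":")
    if len(keys) != len(vals):
        raise ValueError("FORMAT and sample column field counts differ")
    return dict(zip(keys, vals))
```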
resources/analysisTools/copyNumberEstimationWorkflow/purity_ploidy_estimation_final.R (6)
- 313-317: The new condition "quantile" for `dh_Stop` is added correctly. Ensure that the new condition is handled wherever `dh_Stop` is used in the codebase.
- 540-540: The assignment of `limitDHOrig` to `limitDH` is correct. Ensure that `limitDHOrig` is used correctly in the rest of the codebase.
- 634-637: The error message and termination if all data points are equal to 6 is a good addition for error handling. Ensure that the error message is clear and informative for the user.
- 764-772: The addition of new variables `plotPurities_DH` and `plotPurities_NTCN` is correct. Ensure that these variables are used correctly in the rest of the codebase.
- 777-777: The modification of the `df.lines` data frame to include `purity_DH` and `purity_NTCN` is correct. Ensure that these new columns are used correctly in the rest of the codebase.
- 816-818: The modification of the `ggplot` function to include new points for `purity_DH` and `purity_NTCN` is correct. Ensure that these new points are displayed correctly in the plot.

resources/analysisTools/copyNumberEstimationWorkflow/phasing_X.sh (9)
- 8-11: Ensure that the `isNoControlWorkflow` variable is set correctly before this script is run. Also, verify that the `FILE_CONTROL_BAM` file exists and is accessible.
- 24-31: The script exits if the patient is not female or klinefelter. Ensure that this behavior is expected and acceptable in all use cases.
- 33-45: This block of code is executed only if `isNoControlWorkflow` is false. Ensure that the `BCFTOOLS_BINARY`, `CNV_MPILEUP_OPTS`, `REFERENCE_GENOME`, `CHR_NR`, `FILE_CONTROL_BAM`, and `BCFTOOLS_OPTS` variables are set correctly before this script is run. Also, verify that the `BCFTOOLS_BINARY` tool is installed and accessible.
- 47-49: These lines are creating empty files. Ensure that the script has write permissions to the directory where these files are being created.
- 51-54: Ensure that the `PYTHON_BINARY`, `TOOL_BEAGLE_CREATE_FAKE_SAMPLES`, and `UNPHASED` variables are set correctly before this script is run. Also, verify that the `PYTHON_BINARY` tool is installed and accessible.
- 61-70: Ensure that the `JAVA_OPTIONS`, `BEAGLE_BINARY`, `UNPHASED_TWOSAMPLES`, `BEAGLE_REFERENCE_FILE_X`, `PHASED_TWOSAMPLES`, `BEAGLE_GENETIC_MAP_X`, and `BEAGLE_NUM_THREAD` variables are set correctly before this script is run. Also, verify that the `BEAGLE_BINARY` tool is installed and accessible.
- 73-77: Ensure that the `PYTHON_BINARY`, `TOOL_BEAGLE_EMBED_HAPLOTYPES_VCF`, `PHASED_TWOSAMPLES`, `UNPHASED`, and `tmpPhased` variables are set correctly before this script is run. Also, verify that the `PYTHON_BINARY` tool is installed and accessible.
- 80-84: Ensure that the `PYTHON_BINARY`, `TOOL_GROUP_HAPLOTYPES`, `tmpPhased`, `tmpHaploblocks`, and `minHT` variables are set correctly before this script is run. Also, verify that the `PYTHON_BINARY` tool is installed and accessible.
- 87-88: These lines are moving temporary files to their final locations. Ensure that the script has write permissions to the directory where these files are being moved.
resources/analysisTools/copyNumberEstimationWorkflow/pscbs_plots_functions.R (1)
- 113-115: The code adds gene labels to the plot. Ensure that the labels do not overlap or clutter the plot, which could make it difficult to read.
Resolved review thread: resources/analysisTools/copyNumberEstimationWorkflow/datatablePSCBSgaps.sh
tmp_corrected_windowfile="$FILENAME_GC_CORRECTED_WINDOWS.tmp"

- $RSCRIPT_BINARY "$TOOL_CORRECT_GC_BIAS_R \
- --windowFile "$cnvSnpOutputDirectory/$PID".all.cnv.chr*.tab \
- --timefile "$REPLICATION_TIME_FILE" \
- --chrLengthFile $CHROMOSOME_LENGTH_FILE" \
- --pid "$PID" \
- --email "$EMAIL" \
- --outfile "$tmp_corrected_windowfile" \
- --corPlot "$FILENAME_GC_CORRECT_PLOT" \
- --gcFile "$GC_CONTENT_FILE" \
- --outDir "$aceseqOutputDirectory" \
- --lowess_f "$LOWESS_F" \
- --scaleFactor "$SCALE_FACTOR"
+ ${RSCRIPT_BINARY} "${TOOL_CORRECT_GC_BIAS_R}" \
+ --windowFile "${cnvSnpOutputDirectory}/${PID}.all.cnv.chr*.tab" \
+ --timefile "${REPLICATION_TIME_FILE}" \
+ --chrLengthFile "${CHROMOSOME_LENGTH_FILE}" \
+ --pid "${PID}" \
+ --email "${EMAIL}" \
+ --outfile "${tmp_corrected_windowfile}" \
+ --corPlot "${FILENAME_GC_CORRECT_PLOT}" \
+ --gcFile "${GC_CONTENT_FILE}" \
+ --outDir "${aceseqOutputDirectory}" \
+ --lowess_f "${LOWESS_F}" \
+ --scaleFactor "${SCALE_FACTOR}"
Note: This review was outside of the patch, but no patch was found that overlapped with it. Original lines [24-28]
The error handling is good, but it could be improved by providing more specific error messages. For example, you could capture the error output of the command and include it in the error message. This would help with debugging if something goes wrong.
- if [[ $? != 0 ]]
+ if ! output=$(${RSCRIPT_BINARY} "${TOOL_CORRECT_GC_BIAS_R}" ... 2>&1); then
- echo "Something went wrong during GC correction. Program had non-zero exit status, exiting pipeline...\n\n"
+ echo "Error during GC correction: $output"
exit 2
fi
mv ${tmpSegsHomDel} ${FILENAME_SEGMENTS_W_HOMDEL}
- $TABIX_BINARY -f -s 1 -b 2 -e 3 "${FILENAME_SEGMENTS_W_HOMDEL}"
+ $TABIX_BINARY -f -s 1 -b 2 -e 3 --comment chromosome "${FILENAME_SEGMENTS_W_HOMDEL}"
The `--comment chromosome` option in the `$TABIX_BINARY` command is used to specify that lines starting with 'chromosome' are treated as comments and ignored. Ensure that this is the intended behavior and that it doesn't affect the processing of the input file.
- $TABIX_BINARY -f -s 1 -b 2 -e 3 --comment chromosome "${FILENAME_SEGMENTS_W_HOMDEL}"
+ $TABIX_BINARY -f -s 1 -b 2 -e 3 "${FILENAME_SEGMENTS_W_HOMDEL}"
@@ -235,6 +235,6 @@ def merge_first_segment(prior_line, newline):
     newline=None

     #print last line(s)
-    out.write( "\t".join( str(prior_line[key]) for key in infile.header ) + "\n" )
+    out.write( "\t".join( str(prior_line[0][key]) for key in infile.header ) + "\n" )
The `prior_line` variable is being indexed with `0`, which might cause a TypeError if `prior_line` is not a list. Ensure that `prior_line` is always a list with at least one element when this line is executed.
- out.write( "\t".join( str(prior_line[0][key]) for key in infile.header ) + "\n" )
+ out.write( "\t".join( str(prior_line[key]) for key in infile.header ) + "\n" )
if (runInDebugMode == "true") {
  beVerbose = T
} else {
  beVerbose = F
}
The conditional assignment to `beVerbose` can be simplified to a single line. This reduces complexity and improves readability.
- if (runInDebugMode == "true") {
- beVerbose = T
- } else {
- beVerbose = F
- }
+ beVerbose = runInDebugMode == "true"
if args.hap_file.split('.')[-1] == 'gz':
    hap_file = gzip.open( args.hap_file, "r" )
else:
    hap_file = open( args.hap_file, "r" )
The script opens files but does not close them. This can lead to resource leaks. Consider using the `with` statement to ensure files are properly closed after they are no longer needed.
with open(args.hap_file, "r") as hap_file:
# rest of the code
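As a sketch of that suggestion, a small helper can pick the right opener for plain or gzip-compressed files while still guaranteeing cleanup (`smart_open` is a hypothetical name, not part of the workflow; the `.gz` check mirrors the snippet under review):

```python
import gzip
from contextlib import contextmanager

@contextmanager
def smart_open(path, mode="rt"):
    """Open a plain or gzip-compressed text file; always close it on exit."""
    opener = gzip.open if path.endswith(".gz") else open
    f = opener(path, mode)
    try:
        yield f
    finally:
        f.close()

# Usage: the file is closed when the block exits, even on error.
# with smart_open(args.hap_file) as hap_file:
#     for line in hap_file:
#         ...
```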
```python
while hap_line and int(hap_line[1]) < int(vcf_line[1]):
    hap_line = hap_file.readline()

if hap_line:
    hap_line = hap_line.rstrip().split()
```
The script reads lines from the haplotype file in a loop until it finds a line with a position that matches the current VCF line. If the haplotype file is large, this could be inefficient. Consider using a more efficient data structure, such as a dictionary or a binary search tree, to store the haplotype data.
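To illustrate that remark, the haplotype positions could be read once into a dict for O(1) lookups instead of re-scanning the file (a sketch with made-up lines; real haplotype records have more columns):

```python
# Sketch: index haplotype lines by position for constant-time lookup.
# Column index 1 holds the position, as in the while-loop above.
hap_lines = [
    "1 1000 A G 0 1",
    "1 2500 C T 1 1",
    "1 4000 G A 0 0",
]

hap_by_pos = {int(line.split()[1]): line.split() for line in hap_lines}

# Matching a VCF position is now a single dict lookup:
vcf_pos = 2500
hap_line = hap_by_pos.get(vcf_pos)   # None if the position is absent
print(hap_line)  # ['1', '2500', 'C', 'T', '1', '1']
```

The trade-off: the dict holds the whole haplotype file in memory, whereas the current two-pointer scan uses constant memory as long as both files are sorted by position.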
```python
if is_comment_line(vcf_line):
    if vcf_line.startswith("#CHROM"):
        vcf_line = vcf_line.rstrip().split("\t")
        vcf_line.append('sample0')
```
The "fake" sample name 'sample0' is hardcoded. Consider making this configurable through a command-line argument for flexibility.
```diff
  parser.add_argument('--out_file', help='Out .vcf file')
+ parser.add_argument('--sample_name', default='sample0', help='Sample name to be added')
  ...
- vcf_line.append('sample0')
+ vcf_line.append(args.sample_name)
```
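The suggested flag can be sketched as a self-contained argparse fragment (the argument names follow the diff above; everything else is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--out_file', help='Out .vcf file')
parser.add_argument('--sample_name', default='sample0',
                    help='Sample name to be added to the #CHROM header line')

# The default preserves the current hardcoded behaviour; callers may override it.
defaults = parser.parse_args(['--out_file', 'out.vcf'])
print(defaults.sample_name)  # sample0

args = parser.parse_args(['--out_file', 'out.vcf', '--sample_name', 'tumor01'])
print(args.sample_name)  # tumor01
```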
```r
if (annotatePlotsWithGenes & is.data.frame(geneAnnotations)) {
  genesSubset = geneAnnotations[geneAnnotations$chr==chromosome,]
  genesSubset = genesSubset[order(genesSubset$start),]

  if (nrow(genesSubset)>0) {
    genesSubset$yOffset = 0
    # determine yPos offset
    if (nrow(genesSubset)>1) {
      genesSubset$distance = c(NA,sapply(2:nrow(genesSubset), function(j) {
        genesSubset[(j),"start"] - genesSubset[(j-1),"start"]
      }))
      genesSubset$yHasOffset = genesSubset$distance < 13e6
      genesSubset$yHasOffset[is.na(genesSubset$yHasOffset)] = F
      for (j in 2:nrow(genesSubset)) {
        if (genesSubset[j,"yHasOffset"]) {
          genesSubset[j,"yOffset"] = genesSubset[j-1,"yOffset"]+1
        }
      }
    }

    genesSubset$position = (genesSubset$start+(genesSubset$end - genesSubset$start)/2)/10/xtotal

    if (!unconditionalGeneLabeling) {
      genesSubset.gr = makeGRangesFromDataFrame(genesSubset, keep.extra.columns = T)
      merged = mergeByOverlaps(genesSubset.gr, seg.gr)
      eventsPerGene.list = aggregate(merged$CNA.type, by=list(merged$gene), FUN=paste)$x
      index.affectedSegments = grep(pattern = "DEL|LOH|DUP|HomoDel", x = eventsPerGene.list, perl = T)
      genesSubset = as.data.frame(genesSubset[index.affectedSegments,])
    }
    if (nrow(genesSubset)>0) {
      p <- p + geom_vline( xintercept=genesSubset$position, col = "grey80", lty=5, lwd =0.2 )
      p <- p + geom_text(data=genesSubset, aes(x=position, y=ymaxcov+0.2-0.08*ymaxcov*yOffset, label = gene), cex = 4, col = "red")
    }
  }
}
```
The code block for gene annotation is quite large and complex. Consider breaking it down into smaller, more manageable functions for better readability and maintainability.
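As one example of such a decomposition, the label-stacking rule could live in its own small pure function. Here is a Python sketch of the same y-offset logic (a label closer than 13 Mb to its predecessor is stacked one level higher), assuming the start coordinates are already sorted:

```python
def label_offsets(starts, min_gap=13e6):
    """Vertical stacking level per gene label: 0 if far enough from the
    previous label, otherwise previous level + 1 (mirrors the R loop)."""
    offsets = []
    for i, start in enumerate(starts):
        if i > 0 and start - starts[i - 1] < min_gap:
            offsets.append(offsets[-1] + 1)
        else:
            offsets.append(0)
    return offsets

print(label_offsets([1e6, 5e6, 30e6, 31e6]))  # [0, 1, 0, 1]
```

A pure function like this is trivially unit-testable, which is much harder for the inlined loop inside the plotting block.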
```r
genesSubset.gr = makeGRangesFromDataFrame(genesSubset, keep.extra.columns = T)
merged = mergeByOverlaps(genesSubset.gr, seg.gr)
eventsPerGene.list = aggregate(merged$CNA.type, by=list(merged$gene), FUN=paste)$x
index.affectedSegments = grep(pattern = "DEL|LOH|DUP|HomoDel", x = eventsPerGene.list, perl = T)
genesSubset = as.data.frame(genesSubset[index.affectedSegments,])
}
```
The code checks for unconditionalGeneLabeling
and then performs operations if it's not set. This could be confusing. Consider reversing the logic for better readability.
- if (!unconditionalGeneLabeling) {
+ if (unconditionalGeneLabeling) {
+ # Skip the operations
+ } else {
Commitable suggestion
[!IMPORTANT]
Carefully review the code before committing. Make sure it correctly replaces the highlighted code, has no missing lines and indentaion issues.
genesSubset.gr = makeGRangesFromDataFrame(genesSubset, keep.extra.columns = T) | |
merged = mergeByOverlaps(genesSubset.gr, seg.gr) | |
eventsPerGene.list = aggregate(merged$CNA.type, by=list(merged$gene), FUN=paste)$x | |
index.affectedSegments = grep(pattern = "DEL|LOH|DUP|HomoDel", x = eventsPerGene.list, perl = T) | |
genesSubset = as.data.frame(genesSubset[index.affectedSegments,]) | |
} | |
if (unconditionalGeneLabeling) { | |
# Skip the operations | |
} else { | |
genesSubset.gr = makeGRangesFromDataFrame(genesSubset, keep.extra.columns = T) | |
merged = mergeByOverlaps(genesSubset.gr, seg.gr) | |
eventsPerGene.list = aggregate(merged$CNA.type, by=list(merged$gene), FUN=paste)$x | |
index.affectedSegments = grep(pattern = "DEL|LOH|DUP|HomoDel", x = eventsPerGene.list, perl = T) | |
genesSubset = as.data.frame(genesSubset[index.affectedSegments,]) | |
} |
Generally, can you replace the `@ARGV` access by something more semantic?

```perl
my ($phasingOutDir, $unphasedGenotypesFilePrefix, $unphasedGenotypesFileSuffix, $chrPrefix) = @ARGV;
```

Then also CodeRabbit's correct error remark will disappear.
Replacing Impute2 with Beagle
Support for hg19 and hg38 reference genomes

`GRCh38_decoy_ebv_phiX_alt_hla_chr.fa` is supported now along with the older hg19.

Summary by CodeRabbit
- Changed the shebang from `#!/usr/bin/python` to `#!/usr/bin/env python` in several scripts for better portability.