Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacing impute2 with Beagle; Support for hg19 and hg38 reference genomes #20

Open
wants to merge 78 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 47 commits
Commits
Show all changes
78 commits
Select commit Hold shift + click to select a range
ea53bd3
star plot: green (NCTN) and yellow (DH>1) line instead of common red …
GWarsow Nov 14, 2018
3836328
added PID to covBaf plot filename
GWarsow Nov 28, 2018
bba909d
removed inappropriate cvalues from config xml file
GWarsow Nov 28, 2018
e0d4a50
integrated overseen code (dh_Stop method 'quanile', exit due to 'high…
GWarsow Nov 28, 2018
9586a6a
Merge branch 'master' into develop
GWarsow Nov 28, 2018
3109026
corrected indentation
GWarsow Nov 28, 2018
5e86bfa
added contributing segments file for LST
GWarsow Dec 12, 2018
3b3d7fb
implemented gene annotation in TCN chromosome plots [turned off by de…
GWarsow Jan 4, 2019
68f24bc
added missing parameter (secondChoicePloidyFilnameAddition)
GWarsow Jan 17, 2019
4ba122f
added debugging lines
GWarsow Jan 17, 2019
1dd2cf7
use Mclust with verbose=TRUE when cvalue runInDebugMode is set to "true"
GWarsow Feb 22, 2019
530ba17
quoted email parameter value
Apr 23, 2019
78e1612
Changed the phasing routine by exchanging impute2 with Beagle
Dec 4, 2019
1888521
renamed phasing scripts
Dec 4, 2019
368a3c5
Cleaned phasing procedure. Deleted all files and paths related to imp…
tlkaufmann Dec 4, 2019
1a8069a
Fixed typo
tlkaufmann Dec 4, 2019
de2b5a0
Finished off cleaning of phasing procedure. Renaming of files and too…
tlkaufmann Dec 4, 2019
a96db03
updated parameter.rst table with beagle parameters
tlkaufmann Dec 4, 2019
eff8785
changed download file to include Bealge references
tlkaufmann Dec 5, 2019
e49047a
Additional changes to the phasing routine: adapted the downloadRefere…
tlkaufmann Dec 5, 2019
9600f4b
Adapted configuration file to match the location of the Beagle refere…
tlkaufmann Dec 10, 2019
27451e6
Compiled plugin and changed version number to 6.0.0
tlkaufmann Dec 13, 2019
c8a953f
Fix crash: Included the parameter 'sharedFilesBaseDirectory' in the c…
tlkaufmann Dec 13, 2019
0a6de5a
Changed the location of the Beagle reference files in the configuration
tlkaufmann Dec 13, 2019
fd79e51
Merge branch 'renamed'
tlkaufmann Dec 16, 2019
c9aa6a0
Minor change in phasing.sh to ensure correct handling by Roddy when j…
tlkaufmann Dec 16, 2019
2160ea7
changed documentation
tlkaufmann Dec 16, 2019
f880d0a
Changes to phasing_x.sh for better roddy compatibility
tlkaufmann Dec 16, 2019
f11ed17
Changed tabs to spaces
tlkaufmann Dec 18, 2019
9a41797
better way of checking if line is comment
tlkaufmann Dec 18, 2019
1061eeb
better handling of errors in phasing.sh and phasing_X.sh
tlkaufmann Dec 18, 2019
13bdfe2
cleaned config file of hardcoded paths
tlkaufmann Dec 18, 2019
cfec6c9
Made Beagle helper scripts more readable and included assertion state…
tlkaufmann Dec 18, 2019
944304d
removed references to local file structure from config file
tlkaufmann Dec 19, 2019
72f796b
Improved error handling in phasing.sh and phasing_x.sh with helper fu…
tlkaufmann Dec 19, 2019
232cbd0
improved bash variables in phasing.sh and phasing_X.sh
tlkaufmann Dec 19, 2019
94b4eb4
Changes in the phasing routine to fix crashes of the workflow. Includ…
tlkaufmann Dec 23, 2019
081e236
Merge pull request #16 from tomkau93/master
Dec 29, 2019
524037d
Changed config file to accomodate hg38 reference data. Also changed a…
tlkaufmann Dec 29, 2019
14954cd
Merge pull request #17 from tomkau93/hg38
Dec 29, 2019
a2e30c1
homodel artifact liftover
NagaComBio Feb 15, 2021
df9e662
change in env path
NagaComBio Feb 15, 2021
778079b
removing quotes around OPTs
NagaComBio Feb 15, 2021
118f4c6
change in chr length file
NagaComBio Feb 15, 2021
87f6439
config update
NagaComBio Feb 15, 2021
209a232
updating job mem request
NagaComBio Feb 15, 2021
5e7fdbc
CRAM support with samtools/bcftools update
NagaComBio Feb 23, 2021
e8a7fc7
potential homodel artifacts
NagaComBio Apr 30, 2021
5004ae1
Variant calling from bcftools
NagaComBio Apr 30, 2021
a8b2f55
fixing typo
NagaComBio Apr 30, 2021
19b3100
Increasing beagle's memory & core allocation
NagaComBio Apr 30, 2021
5c4ad09
Adapting XML for hg19 config
NagaComBio Apr 30, 2021
3d16416
New hg38 config XML
NagaComBio Apr 30, 2021
346e68a
parameterizing Beagle resources
NagaComBio May 19, 2021
b139c9a
Merge pull request #18 from NagaComBio/hg38
NagaComBio May 26, 2021
9b86c36
README & scripts to prepare hg38 reference files
NagaComBio Jun 14, 2021
52c6c5a
Based on code-review suggestions
NagaComBio Jun 14, 2021
7b0d1ff
Merge branch 'hg38' of https://github.com/DKFZ-ODCF/ACEseqWorkflow in…
NagaComBio Jun 14, 2021
f52f086
fixing typo in GRCh38_related_files/README
NagaComBio Jun 14, 2021
e6a5845
Add '--comment' to tabix scripts
NagaComBio Jun 28, 2022
30555e1
Update LSF memory requirements
NagaComBio Jun 28, 2022
09bd58f
Merge branch 'master' into hg38
NagaComBio Jul 1, 2022
fe1b82b
Add temp hg38 centromer file
NagaComBio Aug 3, 2022
d5706f9
Upgrade COWorkflowsBasePlugin to 1.3.0
NagaComBio Aug 3, 2022
232a81d
Merge branch 'develop' into hg38
NagaComBio Aug 3, 2022
e3fc99f
update the hg38 centromer file
NagaComBio Oct 6, 2022
1229a24
Remove the NULL line in the pruned file
NagaComBio Oct 6, 2022
eb26938
Add druggable gene list
NagaComBio Oct 6, 2022
e783d7e
Update GRCh38 installation files
NagaComBio Oct 6, 2022
6549fe0
Update hg38 mappability files to M2E2 with ALT & HLA contigs
NagaComBio Oct 6, 2022
779d5b0
update homodel artifact list
NagaComBio Oct 6, 2022
fc26eb6
update the shebang lines
NagaComBio Oct 6, 2022
8655540
Update README
NagaComBio Oct 6, 2022
8a29539
Remove beagle jar & update beagle to 5.2
NagaComBio Oct 31, 2022
8ffd230
Update COWorkflowsBasePlugin to 1.4.2
NagaComBio Oct 31, 2022
2d0ec65
Update ref file location
NagaComBio Mar 23, 2023
8eda5e9
Fix tmpHaploblocks unbound variable issue
NagaComBio Mar 23, 2023
ab85dc9
Fix for NoControl workflow
NagaComBio Nov 10, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,4 @@ build/
*.swp
*.DS_Store
#.+
hg19_GRCh37_1000genomes
Binary file modified ACEseqWorkflow.jar
100755 → 100644
Binary file not shown.
File renamed without changes.
4 changes: 4 additions & 0 deletions README.ACEseqWorkflow.txt
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,10 @@ isNoControlWorkflow false Run analysis with matching control and estim

== Changelist

* Version update 6.0.0
- Changed the phasing routine. The program "impute2" was replaced by "Beagle". Files and tools were renamed accordingly.
- Generally renamed all tools and files from "imputeGenotype" to "phaseGenotype" (and so on) as the subroutine does not actually perform imputation but rather phasing.

* Version update to 5.0.1
- fixed density(NA) bug and index bug for frequencies (as.character) in clustering step

Expand Down
2 changes: 1 addition & 1 deletion buildversion.txt
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
5.0
6.0
0
2 changes: 1 addition & 1 deletion documentation/source/methods.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ Methods - Theory
================


| ACEseq can be used to estimate copy-numbers from WGS data using a tumor vs. control approach. Thus a pre-requesite is WGS data from healthy tissue and tumor tissue of the same patient with at least 30x coverage. Samtools [] mpileup is used to determine the coverage for tumor and control sample - position specific for each single nucleotide polymorphism (SNP) position recorded in dbSNP and per 1 kb window. To get chromosome specific allele frequencies, the genotypes of SNP positions are phased with Impute2 [] and A and B allele are assigned accordingly. Haploblocks are defined as regions with consecutively phased SNPs. Subsequently, B-allele frequencies (BAFs) are estimated for all SNP positions in tumor and control with sufficient coverage in the control:
| ACEseq can be used to estimate copy-numbers from WGS data using a tumor vs. control approach. Thus a pre-requesite is WGS data from healthy tissue and tumor tissue of the same patient with at least 30x coverage. Samtools [] mpileup is used to determine the coverage for tumor and control sample - position specific for each single nucleotide polymorphism (SNP) position recorded in dbSNP and per 1 kb window. To get chromosome specific allele frequencies, the genotypes of SNP positions are phased with Impute2 (up to v5) or Beagle (since v6) [] and A and B allele are assigned accordingly. Haploblocks are defined as regions with consecutively phased SNPs. Subsequently, B-allele frequencies (BAFs) are estimated for all SNP positions in tumor and control with sufficient coverage in the control:

.. math::
\label{eq:BAF}
Expand Down
30 changes: 9 additions & 21 deletions documentation/source/parameters.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ Multiple parameters can be set with ACEseq though not all are necessary to chang
svOutputDirectory,${outputAnalysisBaseDirectory}/SV_calls,path,
crestOutputDirectory,${outputAnalysisBaseDirectory}/crest,path,
cnvSnpOutputDirectory,${aceseqOutputDirectory}/cnv_snp,path,
imputeOutputDirectory,${aceseqOutputDirectory}/phasing,path,
phasingOutputDirectory,${aceseqOutputDirectory}/phasing,path,
plotOutputDirectory,${aceseqOutputDirectory}/plots,path,
runWithoutControl,false,boolean,use control for analysis (false|true)
minHT,5,integer,minimum number of consecutive SNPs to be considered for haploblocks
Expand Down Expand Up @@ -41,7 +41,7 @@ Multiple parameters can be set with ACEseq though not all are necessary to chang
min_distance,0.05,float,min_distance
haplogroupFilePrefix,haploblocks_chr,string,prefix for file with haplogroups per chromosome
haplogroupFileSuffix,txt,string,suffix for file with haplogroups per chromosome
haplogroupFilePath,${imputeOutputDirectory}/${haplogroupFilePrefix},path,
haplogroupFilePath,${phasingOutputDirectory}/${haplogroupFilePrefix},path,
min_length_purity,1000000,integer,minimum length of segments to be considered for tumor cell content and ploidy estimation
min_hetSNPs_purity,500,integer,minimum number of control heterozygous SNPs in segments to be considered for tumor cell content and ploidy estimation
dh_stop,max,string,
Expand Down Expand Up @@ -72,18 +72,9 @@ Multiple parameters can be set with ACEseq though not all are necessary to chang
PATIENTSEX,male,string,patient sex used in case of no control workflow (male|female|klinefelter)
CNV_ANNO_SUFFIX,cnv.anno.tab.gz,string,suffix for mappability annotated chromosome-wise 1kb coverage files
CNV_SUFFIX,cnv.tab.gz,string,suffix chromosome-wise 1kb coverage files
FILE_UNPHASED_PRE,${imputeOutputDirectory}/${unphasedGenotypesFilePrefix},path,
FILE_UNPHASED_GENOTYPE,${imputeOutputDirectory}/unphased_genotype_chr,path,
FILE_PHASED_PRE,${imputeOutputDirectory}/${phasedGenotypesFilePrefix},path,
FILE_PHASED_GENOTYPE,${imputeOutputDirectory}/phased_genotype_chr,path,
FILE_INFO,info,string,
FILE_INFO_SAMPLE,info_by_sample,string,
FILE_HAPS,haps,string,
FILE_HAPS_CONF,haps_confidence,string,
FILE_SUMMARY,summary,string,
FILE_WARNINGS,warnings,string,
FILE_PART,part,string,
FILE_SAMPLE_G,${imputeOutputDirectory}/sample_g.txt,path,sample_g file used by imputation on X chromosome for females
FILE_UNPHASED_PRE,${phasingOutputDirectory}/${unphasedGenotypesFilePrefix},path,
FILE_PHASED_GENOTYPE,${phasingOutputDirectory}/phased_genotype_chr,path,
FILE_SAMPLE_G,${phasingOutputDirectory}/sample_g.txt,path,sample_g file used by imputation on X chromosome for females
vinjana marked this conversation as resolved.
Show resolved Hide resolved
MALE_FAKE_CONTROL_PRE,${pathToACEseqResults}/cnv_snp/${pid}.chr,path,path and prefix to chromosome-wise 1kb coverage file used for fake control workflow for male patients
FEMALE_FAKE_CONTROL_PRE,${pathToACEseqResults}/cnv_snp/${pid}.chr,path,path and prefix to chromosome-wise 1kb coverage file used for fake control workflow for female patients
PLOT_PRE,${aceseqOutputDirectory}/${pid}_plot,path,
Expand All @@ -104,16 +95,13 @@ Multiple parameters can be set with ACEseq though not all are necessary to chang
CHROMOSOME_LENGTH_FILE,${path}/chrlengths.txt,path,
REPLICATION_TIME_FILE,${path}/ReplicationTime_10cellines_mean_10KB.Rda,path,"replication timing file"
GC_CONTENT_FILE,${path}/hg19_GRch37_100genomes_gc_content_10kb.txt,path,
GENETIC_MAP_FILE,${path}/genetic_map_chr${CHR_NAME}_combined_b37.txt,path,"impute files"
KNOWN_HAPLOTYPES_FILE,${path}/ALL.chr${CHR_NAME}.integrated_phase1_v3. 20101123.snps_indels_svs.genotypes.nomono.haplotypes.gz,path,"impute files"
KNOWN_HAPLOTYPES_LEGEND_FILE,${path}ALL.chr${CHR_NAME}.integrated_phase1_v3. 20101123.snps_indels_svs.genotypes.nomono.legend.gz,path,"impute files"
GENETIC_MAP_FILE_X,${path}/genetic_map_chrX_nonPAR_combined_b37.txt,path,"impute files"
KNOWN_HAPLOTYPES_FILE_X,${path}/ALL_1000G_phase1integrated_v3_chrX_nonPAR_impute.hap.gz,path,"impute files"
KNOWN_HAPLOTYPES_LEGEND_FILE_X,${path}/ALL_1000G_phase1integrated_v3_chrX_nonPAR_impute.legend.gz,path,"impute files"
outputExecutionDirectory,${path}/exec_${executionTimeString},,"path to log files"
imputeBaseDirectory,${path}/,path,"directory for impute files"
mergedBamSuffix,merged.mdup.bam,string,"A list of all known suffixes for merged bam files. I.e. merged.dupmark.bam, merged.mdup.bam..."
mergedBamSuffixList,${mergedBamSuffix},string,"A list of all known suffixes for merged bam files. I.e. merged.dupmark.bam, merged.mdup.bam..."
defaultMergedBamSuffix,${mergedBamSuffix},string,The default suffix for merged bam files when they are created by Roddy.
libloc_PSCBS,,string,path to PSCBS library in R
libloc_flexclust,,string,path to felxclust library in R
BEAGLE_REFERENCE_FILE,${baseDirectoryReference}/tools_data/Beagle/chr${CHR_NAME}.1kg.phase3.v5a.b37.bref3,path
BEAGLE_REFERENCE_FILE_X,${baseDirectoryReference}/tools_data/Beagle/chrX.1kg.phase3.v5a.b37.bref3,path
BEAGLE_GENETIC_MAP,${baseDirectoryReference}/tools_data/genetic_maps/plink.chr${CHR_NAME}.GRCh37.map,path
BEAGLE_GENETIC_MAP_X,${baseDirectoryReference}/tools_data/genetic_maps/plink.chrX.GRCh37.map,path
105 changes: 105 additions & 0 deletions installation/GRCh38_related_files/README_downloadReferences_GRCh38.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
### Reference files for GRCh38
vinjana marked this conversation as resolved.
Show resolved Hide resolved

Information on downloading and parsing files needed for GRCh38 reference genome


#### Reference genome

We are using the GRCh38 version hosted at `http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa`,

The above reference genome (GRCh38_decoy_ebv_phiX_alt_hla) can be downloaded and processed using the repositiry at `https://github.com/DKFZ-ODCF/setup-reference-data`.

List of files needed from the above process
- GRCh38_decoy_ebv_phiX_alt_hla_chr.fa
- GRCh38_decoy_ebv_phiX_alt_hla_chr.fa.chrLenOnlyACGT_realChromosomes.tsv
- GRCh38_decoy_ebv_phiX_alt_hla_chr.fa.chrLength.tsv


#### dbSNP version 135

Downloaded from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz

Post-processing for usage in ACEseq:
```
DBSNP_VERSION=135
zcat 00-All.vcf.gz |
awk '/^#/{print} /VC=SNV/{ v=$8; sub(/.*dbSNPBuildID=/, "", v); sub(/;.*/, "", v); if (v~/^[0-9]+$/ && int(v)<='$DBSNP_VERSION') print }' |
bgzip > 00-All.SNV.vcf.gz
tabix -p vcf 00-All.SNV.vcf.gz
```

#### Generating mappability track
The mappability file (GRCh38_Mappability_Align_100mer.bedGraph.gz) is created using the `create_mappability.sh` bash script.

The tools used have to be in the `tools` folder. The tools `gem-2-wig`, `gem-indexer`, `gem-mappability` are from the package *gem-tools* from the Vlaams Instituut voor Biotechnologie (VIB).
Information about the tools can be found here: https://wiki.bits.vib.be/index.php/Create_a_mappability_track#Install_and_run_the_GEM_library_tools. The tools were downloaded from https://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%202/. The corresponding paper should be this one: https://www.researchgate.net/publication/221776385_Fast_Computation_and_Applications_of_Genome_Mappability

The two tools `wigToBigWig` and `BigWigToBedGraph` are downloaded from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

Finally, `tabix` from htslib is used (see http://www.htslib.org/doc/tabix.html)

Run the script `sh create_mappability.sh`


#### Replication time
Replication time from individual cell lines from GRCh37 ENCODE data were lifted over to GRCh38, and averages were re-calculated.
The new R-object is uploaded to the `$repo_root/installation/GRCh38_related_files/time_mean_10KB.Rda`


#### Gaps and centromeres
- `gap.txt.gz` downloaded from `http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/gap.txt.gz`
- `centromeres.txt.gz` downloaded from `http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/centromeres.txt.gz`
- `centromeres_merged.txt.gz` created with the following command
```
zcat centromeres.txt.gz | sort -k 2 -V | awk 'BEGIN {printf "#bin\tchrom\tchromStart\tchromEnd\n"; chr=""} $2!=chr { if (end != ""){printf $1"\t" chr "\t" start "\t" end "\n"}; chr=$2; start=$3} {end=$4} END {printf $1"\t" chr "\t" start "\t" end "\n"}' | gzip > centromeres_merged.txt.gz
```
- `gap_with_centromeres.txt.gz` file created with the following command
```
zcat centromeres_merged.txt.gz | tail -n +2 | awk ' BEGIN {OFS="\t"} {print $0, "1", "N", $4-$3, "centromere", "no"}' | gzip > gap_with_centromeres.txt.gz
```


#### GC content
The GC-content (`gc_content_hg38.txt`) is calculated directly from the reference genome with the script `calc_gc_content.py`.
```
python3 calc_gc_content -v -i ${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa -o gc_content_hg38.txt
```
The -v flag increases verbosity. Note that you have to use python3!


#### Beagle reference files
The variants were downloaded for all the chromosomes mapped to hg38 reference genome
Info: https://www.internationalgenome.org/announcements/Variant-calls-from-1000-Genomes-Project-data-on-the-GRCh38-reference-assemlby/
FTP site: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/

Added 'chr' prefix and converted the VCF files to BREF format using Beagle. Refer to the script `beagle_vcf_to_bref.sh`

This step will generate the following files,
- `ALL.chr${CHR_NAME}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.CHR.bref3`
- `ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.CHR.bref3`


#### Beagle genetic map files
Downloaded from `http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/plink.GRCh38.map.zip`

Add `chr` prefix

```
for chr in `seq 1 22` X ; do cat plink.chr${chr}.GRCh38.map | sed 's/^/chr/' > plink.chr${chr}.GRCh38.CHR.map; done
```


#### Local controls for no-control workflow (optional)
These are lift-over files from the hg19 workflow. New hg38 native files will be generated for the next versions.


#### Exclusion list or blacklist files
These are lift-over files from the hg19 workflow. New hg38 native files will be generated for the next versions.
`ACEseqWorkflow/resources/analysisTools/copyNumberEstimationWorkflow/artifact.homoDels.potentialArtifacts.hg38_liftover.txt`


#### Hg38 cytoband
Cytoband file was copied from ANNOVAR database files. Only the coordinates from Chr1-22, X, and Y were kept.
```
cat ANNOVAR/annovar_April2018/humandb/hg38_cytoBand.txt | grep -v "_" | grep -v "^chrM" > hg38_cytoBand.txt
```
9 changes: 9 additions & 0 deletions installation/GRCh38_related_files/beagle_vcf_to_bref.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
#!/bin/bash

module load java/1.8.0_131
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script uses a specific version of Java (1.8.0_131). If the script is intended to be used in different environments, consider checking for the Java version dynamically or allowing the user to specify the Java version.


for chr in `seq 1 22` X Y
do
echo $chr
(zcat ALL.chr${chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz | head -n 300 | grep "#" ; zcat ALL.chr${chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz | grep -v "#" | sed 's/^/chr/') | java -jar bref3.18May20.d20.jar > ALL.chr${chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.CHR.bref
vinjana marked this conversation as resolved.
Show resolved Hide resolved
done
58 changes: 58 additions & 0 deletions installation/GRCh38_related_files/calc_gc_content.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
import sys
vinjana marked this conversation as resolved.
Show resolved Hide resolved
import argparse
import re

assert sys.version_info.major >= 3, "Python 3 is required for string operations"
vinjana marked this conversation as resolved.
Show resolved Hide resolved

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input", help="Input sequence - fasta file format", required=True)
parser.add_argument("-o", "--output", help="Output file name", default='gc_content.txt')
parser.add_argument("-t", "--threshold", help="Threshold", default=0.0)
parser.add_argument("--stepsize", help="Threshold (regions with gc lower than threshold are discarded)", default=10000)
parser.add_argument("-v", "--verbose", action = 'store_true', help="Add verbosity")
args = parser.parse_args()

threshold = float(args.threshold)
stepsize = int(args.stepsize)

output_data = ['\t'.join(['chromosome', 'start', 'end', 'gc_content'])]

if args.verbose: print('Load data')
with open(args.input, "r") as f:
data = f.read().splitlines()

if args.verbose: print('Calculate chromosome stats')
chr_names, chr_borders = [], []

for i in range(len(data)):
if '>' in data[i]:
chr_names.append(data[i].split('>')[1].split(' ')[0])
chr_borders.append(i)
chr_borders = chr_borders + [len(data)]
vinjana marked this conversation as resolved.
Show resolved Hide resolved

for i in range(len(chr_names)):

if not re.match("^(chr)?[0-9,X,Y]{1,2}$", chr_names[i]):
continue

if args.verbose: print('Calculating Chromosome {} - length = {} aa'.format(chr_names[i], 61*(chr_borders[i+1] - chr_borders[i])))
vinjana marked this conversation as resolved.
Show resolved Hide resolved
current_chr = data[chr_borders[i]:chr_borders[i+1]][1:]

current_chr = ''.join(current_chr)

for j in range(0, len(current_chr)//stepsize):
current_data = current_chr[(stepsize*j):(stepsize*(j+1))]

gc = len(re.findall('[gcGC]', current_data))
gcta = len(re.findall('[gctaGCTA]', current_data))
if gcta == 0: gcta = 1.
gc_content = gc/gcta

if gc_content > threshold:
output_data.append('\t'.join([chr_names[i], str(stepsize*j), str(stepsize*(j+1)), '{:1.6f}'.format(gc_content)]))
if args.verbose: print('Finished with calculation. Writing to file {}'.format(args.output))

with open(args.output, 'w') as f:
for line in output_data:
f.write("%s\n" % line)
if args.verbose: print('GC-content calculated succesfully')
29 changes: 29 additions & 0 deletions installation/GRCh38_related_files/create_mappability.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
#!/bin/bash

vinjana marked this conversation as resolved.
Show resolved Hide resolved
nr_cores=4
kmer=100
reference_genome="${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference_genome variable is used before it is set. You should define reference_path before this line.

- reference_genome="${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa"
+ reference_path="/path/to/reference"
+ reference_genome="${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa"

Commitable suggestion

[!IMPORTANT]
Carefully review the code before committing. Make sure it correctly replaces the highlighted code, has no missing lines and indentaion issues.

Suggested change
reference_genome="${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa"
reference_path="/path/to/reference"
reference_genome="${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa"


mkdir -p ref out
# copy the reference but only chr1-22, X and Y
echo "copy reference genome"
awk '/^>chrM/ {exit} /^>/ {print $1} !/^>/ {print}' ${reference_genome} > ref/reference.fa
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line assumes that the reference genome file is formatted in a specific way. If the format changes, this line may not work as expected. Consider adding a comment to explain the expected format of the reference genome file.

echo "gem-indexer"
./tools/gem-indexer -T ${nr_cores} -c dna -i ref/reference.fa -o out/index
echo "gem-mappability"
./tools/gem-mappability -T ${nr_cores} -I out/index.gem -l ${kmer} -o out/outfiles
echo "gem-2-wig"
./tools/gem-2-wig -I out/index.gem -i out/outfiles.mappability -o out/outfiles
echo "wigToBigWig"
./tools/wigToBigWig out/outfiles.wig out/outfiles.sizes out/outfiles.bigWig
echo "bigWigToBedGraph"
./tools/bigWigToBedGraph out/outfiles.bigWig out/outfiles.bedGraph

echo "Filter lines and compress"
awk '$4 > 0.0 {print $0}' out/outfiles.bedGraph | ./tools/bgzip > GRCh38_Mappability_Align_100mer.bedGraph.gz

echo "Create Index with Tabix"
./tools/tabix -p bed GRCh38_Mappability_Align_100mer.bedGraph.gz

echo "cleaning"
rm -r ref out
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script removes the ref and out directories at the end. If these directories existed before the script was run, their contents will be lost. Consider creating unique temporary directories instead.

- mkdir -p ref out
+ ref_dir=$(mktemp -d -t ref-XXXXXXXXXX)
+ out_dir=$(mktemp -d -t out-XXXXXXXXXX)

And replace all ref and out with ${ref_dir} and ${out_dir} respectively. At the end, replace rm -r ref out with rm -r ${ref_dir} ${out_dir}.


Commitable suggestion

[!IMPORTANT]
Carefully review the code before committing. Make sure it correctly replaces the highlighted code, has no missing lines and indentaion issues.

Suggested change
rm -r ref out
ref_dir=$(mktemp -d -t ref-XXXXXXXXXX)
out_dir=$(mktemp -d -t out-XXXXXXXXXX)
# Replace all 'ref' and 'out' in your script with '${ref_dir}' and '${out_dir}' respectively
rm -r ${ref_dir} ${out_dir}

Binary file not shown.
Loading