CODECsuite

Setup

Build the docker image

From the project root directory run:

docker build -t sysmed/codec:1.1.4 .

Install prerequisites

Install pixi
Install the dependencies:

pixi global install -c conda-forage -c bioconda \
   python=3.7 \
   snakemake=7.3.8 \
   pandas=1.3.3

If you prefer, you can also use env.yml to install these dependencies in a virtual environment, or use an alternative package manager (conda, mamba).

Prepare the reference

Obtain the desired FASTA file (see this blog post for the suggested human reference to use).

For example, for GRCh38:

$ ref_folder="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids"
$ ref_prefix = "GCA_000001405.15_GRCh38_no_alt_analysis_set"
$ wget -O - ${ref_folder}/${ref_prefix}.fna.gz | gunzip > ${ref_prefix}.fna
$ wget -O - ${ref_prefix}.fna.bwa_index.tar.gz | tar -xzvf -

Create .fai and .dict files:

docker run -v $(pwd):/data it sysmed/codec:1.1.4

$ prefix = "GCA_000001405.15_GRCh38_no_alt_analysis_set"
$ cd /data \
   && samtools faidx ${prefix}.fna \
   && samtools dict -o ${prefix}.dict ${prefix}.fna

Additional files

High-confident regions: ideally this would be sample-specific, but you can also use a generic file for your reference. For GRCh38, a suitable can be obtained from Genome in a Bottle.

This file also needs to be converted into interval list format

docker run -v $(pwd):/data it sysmed/codec:1.1.4
$ picard BedToIntervalList \
   I=/data/GRCh38_notinalldifficultregions.bed \
   O=/data/GRCh38_notinalldifficultregions.interval_list \
   SD=/data/GCA_000001405.15_GRCh38_no_alt_analysis_set.dict

VCF with common variants to ignore. For GRCh38, a suitable file can be obtained from dbSNP.

Run the pipeline

Prepare the config.yml, input.tsv, and sample sheet files as described in [snakemake/README.md].

Note: due to a coding error, batch names must contain at least one non-numeric character.

Tne ref parameter should specify the path to the FASTA file from the previous step, with the assumption that the .dict, .fai, and BWA-related files all have the same prefix and are in the same directory.

snakemake \
   --snakefile snakemake/AdapV2/wgs/Snakefile \
   --configfile config.yaml

Legacy

UPDATES

(01/08/25) Version 1.1.4 introduces a new script codec filter designed to filter consensus BAM files. It retains only the reads and bases relevant for variant calling. Fragments (read-pairs) that do not pass fragment-level filtering are excluded from the output BAM. Bases that fail the filters are assigned a minimum base quality score (Q2), ensuring they are ignored by most coverage analysis and variant calling tools. It can be run as the following:

codec filter -b mol_consensus.sorbybyname.bam -o duplex_only.bam -r reference.fa -q 30 -m 60 -Q 0.7 -B 0.5 -N 0.05 ...

Note

The input BAM file must be sorted by read name, and the output BAM will also be query-name sorted. For consistent filtering results, it is recommended to use the same parameters as those in codec call.

CODECsuite

The CODEC analysis pipeline, CODECsuite, comprises five key steps: demultiplexing, adapter trimming, alignment, duplicate collapsing, and single-fragment mutation calling. Duplicate collapsing and alignments are performed using the third-party tools Fgbio and BWA, respectively. After removing byproducts and applying fragment-level filtering, mutations were identified exclusively from duplexes in the overlapped regions, where bases from each read align and match. Bases within these regions underwent stringent filtering based on criteria such as base quality, proximity to fragment ends, overlap with germline mutations, and other factors. Notably, a single read pair is sufficient to form a duplex, as each read represents one strand. Refer to the paper for more details.

Installation

Tested on Red Hat 7 and Ubuntu 18.04

prerequisite for C++ based programs. For snakemake workflow check out here

git
tested with gcc 5.2 and 7.3 with c++14 support
cmake 3.18.3 or above

First, recursive clone the repo and create a build directory which will holds the installion files and final executables.

git clone --recursive [email protected]:broadinstitute/CODECsuite.git && cd CODECsuite && mkdir build

Next, build the program with cmake.

cd build && cmake .. && make

After this, you should be able to see an executable named codec in the build folder you just created.

Demultiplexing

CODECsuite is expected to work with raw lane-level fastq.gz. This can be obtained from illumina bcl2fastq. The first step is demultiplexing and it requires a sample sheet in csv format for each lane which looks the following. Currently, we have used 12 barcodes. For good cluster generation, we recommend to have at least 4 sample barcodes per sequencing lane.

SampleName	IndexBarcode1	IndexBarcode2
Sample01	CTTGAACGGACTGTCCAC	CACCGAGCGTTAGACTAC
Sample02	GAGCCTACTCAGTCAACG	GTGTCGAACACTTGACGG
Sample03	AGCTTGTAAGGCAGGTTA	ACTGATCTTCAGCTGACT
Sample04	TCAAGCGTCTTACATGGT	TGAATCTGAGGCACTGTA
Sample05	CTGGTCCAAGAACGTCTG	CTCTGAACGATCGAGCTC
Sample06	GATCCAGTTCTGTCGAGC	GAGGTGCATGCACCTTAG
Sample07	ACCTATAGGTGCAACGAA	ACTAACTTCCATTGCACT
Sample08	TGAAGGTCCACTGTATCT	TGACCTGGATGGATAGGA
Sample09	CACTGCTTCGAGACGAAG	CTCCAGTTACTGAGACGG
Sample10	GTGATACCTCGATGCTCC	GAGGTCCAGTCTCTGTCC
Sample11	ACTCAGAGAACTCATGGA	ACTACAGGTGGATCCAAT
Sample12	TGAGCTGAGTTCGTACTT	TGATGTACCAACGATGTA

codex demux -1 reads.r1.fastq.gz -2 reads.r2.fastq.gz -p sample_sheet.csv -o demux_outprefix

Given the toy sample_sheet.csv and code this command will generate

demux_outprefix.sample_A.1.fastq.gz, demux_outprefix.sample_A.2.fastq.gz
demux_outprefix.sample_B.1.fastq.gz, demux_outprefix.sample_B.2.fastq.gz

Adapter trimming

After demultiplexing CODEC reads still contain in-situ sample barcode and adapter sequences. The next step is to trim these out since they could interfere alignment

codec trim -1 demux_outprefix.sample_A.1.fastq.gz -2 demux_outprefix.sample_A.2.fastq.gz -o trim_outprefix -u 3 -U 3 -f 2 -t 2 -s sample_A

This tells the CODECsuite that first 3bp of a read is the UMI and to trim off the next two 2bp. The output files of the adapter trimming step looks like

trim_outprefix.sample_A.trim.bam
trim_outprefix.sample_A.trim.log

By default, single-end byproducts are also output to the trim.bam. To split the output use -S/--split_bam_output.

The bam file is standard uBam (unmapped bam) with additional tags

RX: UMI sequence from R1 and R2, concatenated by a hyphen
QX: UMI quality scores
bc: Index barcode sequence
s5: 5' adapter sequence (same as Index barcode)
q5: 5' adapter quality scores
s3: 3' adapter sequence (same as Index barcode of the mate)
q3: 3' adapter quality scores
sl: the rest of 3' adapter  sequence
ql: the rest of 3' adapter quality scores

After adapter trimming. The codec reads can be mapped by standard NGS tools such as BWA. For our end-to-end pipeline please see snakemake.

Single fragment caller (SFC) and mutation rate computation

After GATK best practice (alignment, markduplicate, indel realignment) for example. Of note, BQSR should NOT be run for CODEC data since CODEC has a different quality score distribution. I do not recommend BQSR in general since the modern Illumina sequencers' quality scores having been improved and BQSR almost doubles the bam size.

Now, we can run SFC to call mutations. SFC is designed to call somatic mutations. For the best results, we need to have a bed file which contains the high confident regions (e.g., this) in the reference genome and a germline bam for masking the germline variants. If there is no germline bam, it is recommend to have a germline vcf file. The population based vcf (e.g., dbsnp) is almost surely recommended since it can account for contamination and low sequencing depth of germline bam. There is a set of fragment level and base level filters to improve the precision of the mutation calls at the cost of data loss and potential loss of real mutations. Depends on the applications, we have presets of parameters: -p/--preset

stringent: setting where high precision calling is needed. Situations like calling background mutation rate in white
blood cells

lenient: setting where certain sensitivity is prefer. situations like calling cancer mutations in tumor biopsy or high
tumor fraction ctDNA samples.

null: as little as filtering possible. For advanced users who want to filtering on there own ends.

It is highly recommended that a user try and play with different parameters and figure out the ones that are best for themselves. Trimming the end -d 12 and use Q30 cutoff -q 30 is always recommended. The others are ad hoc. However, the most effective parameter is probably -Q/--min_passQ_frac: the fraction of Q30+Q30 bases in the overlap region. The fraction essentially measures the cluster quality, which is important for single fragment calling. Some examples of running SFC

codec call -b input.mark_duplicated.bam -L highconfidentregions.bed  -r hg19.fa -n germline.bam -p lenient -o output

The output of the SFC are

output.mutation_metrics.txt: includes SNV_rate, INDEL_rate and etc.
output.variatns_called.txt: mutations from single fragments
output.context_count.txt: trinucleotide context and dinucleotide context counts

Other notes

For CODEC-MSI please refer to msi. And by default CMAKE will not build CODEC-MSI. Please uncomment the last two lines if you indeed want to build CODEC-MSI
The Snakemake is hard-coded to de-multiplex 4 lanes simultaneously (e.g. for NovaSeq 6000). If you need to de-multiplex less #lanes (e.g. for NovaSeq SP), comment out entire rules for DemuxL3 and DemuxL4. If you have more than 4 lanes (e.g. HiSeq X) either do 4 lane at a times or add more rules yourself.
The Snakemake pipeline setup file qsub_wrapper.py is specific to UGE. You may need to change settings for your computing environment.

Name		Name	Last commit message	Last commit date
Latest commit History 133 Commits
bbcpputil		bbcpputil
include		include
msi		msi
obsolete		obsolete
snakemake		snakemake
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CODECsuite_TPC_Notice_Final.txt		CODECsuite_TPC_Notice_Final.txt
Dockerfile		Dockerfile
LICENSE.txt		LICENSE.txt
README.md		README.md
accuracy.cpp		accuracy.cpp
codec.cpp		codec.cpp
demux.cpp		demux.cpp
duplex_filter.cpp		duplex_filter.cpp
env.yml		env.yml
trimadapter.cpp		trimadapter.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Setup

Build the docker image

Install prerequisites

Prepare the reference

Additional files

Run the pipeline

Legacy

CODECsuite

Installation

Demultiplexing

Adapter trimming

Single fragment caller (SFC) and mutation rate computation

Other notes

About

Releases

Packages

Languages

License

fulcrumgenomics/CODECsuite

Folders and files

Latest commit

History

Repository files navigation

Setup

Build the docker image

Install prerequisites

Prepare the reference

Additional files

Run the pipeline

Legacy

CODECsuite

Installation

Demultiplexing

Adapter trimming

Single fragment caller (SFC) and mutation rate computation

Other notes

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages