Preprocessing Workflow:

Convert maf files into mutation matrices

A Mutation Annotation Format (MAF) file lists the somatic mutations (e.g. indels, SNVs) found in tumor samples. The MAF files used here were downloaded from https://confluence.broadinstitute.org/display/GDAC/MAF+Dashboard.

In the conversion, mutated genes can either be represented by their plain names or annotated with the suffix "loss" (short for loss-of-function). With the "loss" suffix, an SNV mutation in a gene is treated the same as a copy-number loss of that gene when running COFFDROP.

To convert a sample MAF file to a mutation matrix, run:

    export MAFFILE=$COFFDROPFOLDER/data/ACC_broad/ACCBroadMAF.txt
    export OUTPUTFILE=$COFFDROPFOLDER/data/ACC_broad/ACC_broad-som.m2
    cd $COFFDROPFOLDER/preprocessing
    python maf2matrix.py -i $MAFFILE -o $OUTPUTFILE

This will create the mutation matrix "ACC_broad-som.m2" in data/ACC_broad. "som" is short for somatic.

To convert while annotating each gene as loss, run:

    export MAFFILE=$COFFDROPFOLDER/data/ACC_broad/ACCBroadMAF.txt
    export OUTPUTFILE=$COFFDROPFOLDER/data/ACC_broad/ACC_broad-som-l.m2
    cd $COFFDROPFOLDER/preprocessing
    python maf2matrix.py -i $MAFFILE -o $OUTPUTFILE -al 1

The "som-l" in the file name indicates that somatic mutations are annotated as (l)oss. This naming is purely for convenience.
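As a rough illustration of what this conversion involves, the sketch below groups MAF rows by sample and optionally appends the "loss" suffix. It is not the code in maf2matrix.py: the function name maf_to_sample_genes is hypothetical, the sample barcodes are made up, and because the .m2 layout is not described here the result is kept as a plain dictionary rather than written to a file. It assumes the standard MAF columns Hugo_Symbol and Tumor_Sample_Barcode.

```python
import csv
import io
from collections import defaultdict


def maf_to_sample_genes(maf_handle, annotate_loss=False):
    """Map each tumor sample to the set of genes mutated in it (hypothetical helper)."""
    sample_genes = defaultdict(set)
    # MAF files are tab-delimited; lines starting with '#' are comments.
    rows = (line for line in maf_handle if not line.startswith("#"))
    for record in csv.DictReader(rows, delimiter="\t"):
        gene = record["Hugo_Symbol"]
        if annotate_loss:
            gene += "loss"  # mirrors the "loss" suffix described above
        sample_genes[record["Tumor_Sample_Barcode"]].add(gene)
    return dict(sample_genes)


# Tiny in-memory MAF with made-up sample barcodes.
example_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Type\n"
    "TP53\tSAMPLE-A\tSNP\n"
    "CTNNB1\tSAMPLE-A\tDEL\n"
    "TP53\tSAMPLE-B\tSNP\n"
)
plain = maf_to_sample_genes(example_maf)
example_maf.seek(0)
annotated = maf_to_sample_genes(example_maf, annotate_loss=True)
print(plain)
print(annotated)
```

With the suffix enabled, TP53 in SAMPLE-B is recorded as "TP53loss", the same label a copy-number loss of TP53 would receive.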

Convert GISTIC files into mutation matrices

GISTIC files represent the copy-number changes of genes in the tumor samples. We use in particular the file "all_thresholded.by_genes.txt" generated by GISTIC 2.0. This thresholded file can be downloaded (along with other GISTIC files) from http://gdac.broadinstitute.org/runs/analyses__latest/. Download only the folder with “CopyNumber_Gistic2.Level_4” in its name.

Genes with copy-number gains are annotated with the suffix "gain" (e.g. "TP53gain"), while those with losses are annotated with the suffix "loss" (e.g. "TP53loss").

To convert GISTIC files to mutation matrices, run:

    GISTICFILE=$COFFDROPFOLDER/data/ACC_broad/ACCGISTIC.txt
    OUTPUTGAINLOSSFILE=$COFFDROPFOLDER/data/ACC_broad/ACC_broad-cna-gl.m2
    cd $COFFDROPFOLDER/preprocessing
    python cna2matrix.py -i $GISTICFILE -o $OUTPUTGAINLOSSFILE -gl 1

This will append "gain" to genes that were gained and "loss" to genes that were lost. "cna" stands for Copy-Number-Alteration.
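The gain/loss annotation can be pictured with the following sketch. It is an illustration only, not the logic of cna2matrix.py. It assumes the thresholded file has three annotation columns (Gene Symbol, Locus ID, Cytoband) followed by one column per sample holding integer levels from -2 to 2, and that any positive level counts as a gain and any negative level as a loss. The function name and the min_level parameter are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed non-sample columns of all_thresholded.by_genes.txt.
ANNOTATION_COLUMNS = ("Gene Symbol", "Locus ID", "Cytoband")


def gistic_to_sample_genes(gistic_handle, min_level=1):
    """Map each sample to its genes, suffixed with 'gain' or 'loss' (hypothetical helper)."""
    sample_genes = defaultdict(set)
    for record in csv.DictReader(gistic_handle, delimiter="\t"):
        gene = record["Gene Symbol"]
        for column, value in record.items():
            if column in ANNOTATION_COLUMNS:
                continue
            level = int(value)
            if level >= min_level:
                sample_genes[column].add(gene + "gain")
            elif level <= -min_level:
                sample_genes[column].add(gene + "loss")
    return dict(sample_genes)


# Tiny in-memory thresholded table with made-up sample names.
example_gistic = io.StringIO(
    "Gene Symbol\tLocus ID\tCytoband\tSAMPLE-A\tSAMPLE-B\n"
    "TP53\t7157\t17p13.1\t-2\t0\n"
    "MYC\t4609\t8q24.21\t1\t2\n"
)
cna = gistic_to_sample_genes(example_gistic)
example_gistic.seek(0)
cna_deep_only = gistic_to_sample_genes(example_gistic, min_level=2)
print(cna)
print(cna_deep_only)
```

Raising min_level to 2 in this sketch keeps only the high-level events, which is one way a stricter definition of gain and loss could be expressed.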

Convert GISTIC files into Segmented Mutation Matrices

Copy number in adjacent genes is not independent. To prevent the results from being dominated by pairs of adjacent genes that are altered together, one should first group adjacent genes into segments. Each segment is a string of genes joined by '_'.

This set of commands will convert a GISTIC file into a segmented matrix, ACC_broad-seg-gl.m2. It will also write a gene2seg file to SampleMatrix-seg.m2.gene2seg, mapping each gene to its segment.

    GISTICFILE=$COFFDROPFOLDER/data/ACC_broad/ACCGISTIC.txt
    OUTPUTSEGMENTFILE=$COFFDROPFOLDER/data/ACC_broad/ACC_broad-seg-gl.m2
    CONCORDANCE_THRESH=0.995
    DISTANCE_THRESH=1000000
    cd $COFFDROPFOLDER/preprocessing
    python cna2matrix.py -i $GISTICFILE -s $OUTPUTSEGMENTFILE -ct $CONCORDANCE_THRESH -dt $DISTANCE_THRESH

Segments are generated by moving down the chromosome and joining genes together based on DISTANCE_THRESH and CONCORDANCE_THRESH. Each gene in a segment must be concordant with its neighbors to at least CONCORDANCE_THRESH. For example, if CONCORDANCE_THRESH=0.995, then the two genes in question must have the same type of alteration in at least 99.5% of the samples. Each gene in the segment must also be within DISTANCE_THRESH of its immediate neighbors.
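The joining rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in cna2matrix.py: genes are given as hypothetical (name, position, calls) tuples for a single chromosome, already sorted by position; each gene is compared only with the gene immediately before it; and "same type of alteration" is taken to mean the same sign (gain, loss, or neither) of the thresholded level.

```python
def alteration_type(level):
    """Collapse a thresholded copy-number level to -1 (loss), 0 (none), or 1 (gain)."""
    return (level > 0) - (level < 0)


def concordance(calls_a, calls_b):
    """Fraction of samples in which two genes have the same alteration type."""
    matches = sum(
        alteration_type(a) == alteration_type(b) for a, b in zip(calls_a, calls_b)
    )
    return matches / len(calls_a)


def segment_genes(genes, concordance_thresh=0.995, distance_thresh=1_000_000):
    """Join adjacent genes into '_'-separated segments.

    genes: list of (name, position, calls) sorted by position on one chromosome,
    where calls holds one thresholded level per sample.
    """
    segments = []
    previous = None
    for name, position, calls in genes:
        joins_previous = (
            previous is not None
            and position - previous[1] <= distance_thresh
            and concordance(previous[2], calls) >= concordance_thresh
        )
        if joins_previous:
            segments[-1].append(name)
        else:
            segments.append([name])
        previous = (name, position, calls)
    return ["_".join(segment) for segment in segments]


# Made-up genes, positions, and calls for four samples.
genes = [
    ("GENEA", 100_000, [1, 2, 0, -1]),
    ("GENEB", 400_000, [2, 1, 0, -1]),    # same alteration types as GENEA, 300 kb away
    ("GENEC", 900_000, [0, 1, 0, -1]),    # differs from GENEB in one of four samples
    ("GENED", 5_000_000, [0, 1, 0, -1]),  # concordant with GENEC but 4.1 Mb away
]
strict = segment_genes(genes)
relaxed = segment_genes(genes, concordance_thresh=0.75)
print(strict)
print(relaxed)
```

In this toy input, GENEC starts a new segment under the default threshold because it agrees with GENEB in only 75% of samples, and GENED is always kept separate because it lies farther than DISTANCE_THRESH from GENEC.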

Combine mutation matrices

To combine two mutation matrices, run:

    MATRIXA=$COFFDROPFOLDER/data/ACC_broad/ACC_broad-som.m2
    MATRIXB=$COFFDROPFOLDER/data/ACC_broad/ACC_broad-cna-gl.m2
    COMBINEDMATRIX=$COFFDROPFOLDER/data/ACC_broad/ACC_broad-som-cna-gl.m2
    cd $COFFDROPFOLDER/preprocessing
    python matrixreader.py -i $MATRIXA $MATRIXB -o $COMBINEDMATRIX
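Conceptually, combining two matrices amounts to merging the alterations recorded for each sample. The sketch below shows one way to do that on in-memory dictionaries. It is an assumption-laden illustration, not the behavior of matrixreader.py: the function name is hypothetical, the sample names are made up, and it takes the union of samples, whereas a real tool might instead keep only samples present in every input.

```python
def combine_matrices(*matrices):
    """Union the per-sample gene sets of several matrices (hypothetical helper)."""
    combined = {}
    for matrix in matrices:
        for sample, genes in matrix.items():
            combined.setdefault(sample, set()).update(genes)
    return combined


# Somatic matrix converted with the "loss" suffix, and a gain/loss CNA matrix.
som_l = {"SAMPLE-A": {"TP53loss"}, "SAMPLE-B": {"CTNNB1loss"}}
cna_gl = {"SAMPLE-A": {"TP53loss", "MYCgain"}, "SAMPLE-C": {"MYCgain"}}

combined = combine_matrices(som_l, cna_gl)
print(combined)
```

Because both inputs label the TP53 event in SAMPLE-A as "TP53loss", it appears only once in the combined result, which is the effect of treating a somatic mutation the same as a copy-number loss.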

COFFDROP Workflow:

cd /path/to/COFFDROP
Run jupyter notebook.
Launch COFFDROP_Runs.ipynb.

Postprocessing Workflow: (deprecated, do not use)

Combine output pairs and triplets from different files.

This will take the p-values for each pair from each of the files and list them out. It will also annotate which pairs are unique to each file.

    export FILE1=$COFFDROPFOLDER/output/comet-run-Mpairs.tsv
    export FILE2=$COFFDROPFOLDER/output/COFFDROP-run-Mpairs.tsv
    export COMBINEDFILE=$COFFDROPFOLDER/output/COFFDROP-comet-combined-Mpairs.tsv
    export GENECOLUMNNAMES="Gene0 Gene1"

Each pair has a Gene0 and a Gene1. If integrating triplets, use GENECOLUMNNAMES="Gene0 Gene1 Gene2".

    export FILENICKNAMES="comet COFFDROP"

These are optional shorter nicknames used to annotate which p-values are from which file.

    cd $COFFDROPFOLDER/postprocessing
    python edgeannotator.py -i $FILE1 $FILE2 -o $COMBINEDFILE -gcn $GENECOLUMNNAMES -n $FILENICKNAMES

The combined file and the pairs unique to each file should be written to the output directory.

Filter pairs by a certain column

    export FILE=$COFFDROPFOLDER/output/comet-run-Mpairs.tsv
    export PVALUETHRESH=0.01
    export COLUMN=Probability
    export GENECOLUMNNAMES="Gene0 Gene1"
    export LIMITEDFILE=$COFFDROPFOLDER/output/comet-run-Mpairs-limited.tsv
    cd $COFFDROPFOLDER/postprocessing
    python limitedges.py -i $FILE -t $PVALUETHRESH -gcn $GENECOLUMNNAMES -o $LIMITEDFILE

comet-run-Mpairs-limited.tsv should now contain only pairs with a probability less than 0.01.
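For readers who only need the filtering step itself, the idea can be sketched in a few lines. This is not limitedges.py: the function name limit_pairs is hypothetical, the rows are made up, and the sketch assumes a tab-separated file with Gene0, Gene1, and Probability columns and a strict less-than comparison.

```python
import csv
import io


def limit_pairs(tsv_handle, column, threshold):
    """Keep only rows whose value in `column` is strictly below `threshold`."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row for row in reader if float(row[column]) < threshold]


# Tiny in-memory pairs table with made-up probabilities.
example_pairs = io.StringIO(
    "Gene0\tGene1\tProbability\n"
    "TP53\tMYCgain\t0.002\n"
    "CTNNB1\tTP53loss\t0.2\n"
    "MYCgain\tTP53loss\t0.01\n"
)
kept = limit_pairs(example_pairs, column="Probability", threshold=0.01)
kept_pairs = [(row["Gene0"], row["Gene1"]) for row in kept]
print(kept_pairs)
```

A pair whose probability equals the threshold exactly is dropped in this sketch, matching the "less than 0.01" wording above.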
