Plots_BRCA.Rmd

---
title: "Paper_BRCA"
author: "Evelin Gonzalez"
output: rmarkdown::github_document
date: "`r Sys.Date()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(dplyr)
#library(tidyverse)
library(ggplot2)
library(scales)
library(FactoMineR)
library(factoextra)
library(cowplot)
library(maftools)
library(readr)
library(tidyr)
setwd("/mnt/beegfs/home/efeliu/work2024/080524_nextflow_BRCA/HRR_RunBRCA/plots_paper/BRCA12_ms")

```

```{r setup, include=FALSE}

deepvariant.single.pass<-read.csv("input/DeepVariant_single_variants_PASS_exonic.tsv",sep = "\t", header = TRUE)
deepvariant.single.pass <- deepvariant.single.pass %>% separate(Otherinfo13, c("GT", "GQ", "DP", "AD", "VAF", "PL"), "[:]", extra = "drop")
deepvariant.single.pass <- deepvariant.single.pass %>% mutate(DP = as.numeric(DP)) %>% filter(DP > 30)
ln.deepvariant.single.pass = read.maf(maf = deepvariant.single.pass) 

deepvariant.pool.pass<-read.csv("input/DeepVariant_pool_variants_PASS_exonic.tsv",sep = "\t", header = TRUE)
deepvariant.pool.pass <- deepvariant.pool.pass %>% mutate(DP = as.numeric(DP)) %>% filter(DP > 30)
ln.deepvariant.pool.pass = read.maf(maf = deepvariant.pool.pass)

strelka.single.pass<-read.csv("input/Strelka_single_variants_PASS_exonic.tsv",sep = "\t", header = TRUE)
strelka.single.pass <- strelka.single.pass %>% separate(Otherinfo13, c("GT", "GQ", "GQX", "DP", "DPF", "AD"), "[:]", extra = "drop")
strelka.single.pass <- strelka.single.pass %>% mutate(DP = as.numeric(DP)) %>% filter(DP > 30)
ln.strelka.single.pass = read.maf(maf = strelka.single.pass)

strelka.pool.pass<-read.csv("input/Strelka_pool_variants_PASS_exonic.tsv",sep = "\t", header = TRUE)
strelka.pool.pass <- strelka.pool.pass %>% mutate(DP = as.numeric(DP)) %>% filter(DP > 30)
ln.strelka.pool.pass = read.maf(maf = strelka.pool.pass)

```

## Figure 2

```{r plotSummary, echo=FALSE, include=FALSE}
vcNames <-c("Frame_Shift_Del","Frame_Shift_Ins","Missense_Mutation","Nonsense_Mutation","Silent")
summary = read.maf(maf = strelka.single.pass, vc_nonSyn = vcNames)
```

```{r plotSummary2, echo=FALSE}
png(filename = "Plots_BRCA_files/V2/fig2.png", height = 6, width = 8, res = 1000, units = "in")
plotmafSummary(maf=summary, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE, showBarcodes = FALSE, textSize = 0.6)
dev.off()
```

**Figure 2: Summary of germline mutations found in BC patients.** This figure presents a comprehensive summary of the mutations identified in BC patients. It includes a bar plot categorizing the variants by classification (silent, missense, and frameshift insertion) and type (SNP and INS). The figure also shows the number of variants per sample, variant type, and the nature of the genetic change. The number of variants in each sample is d|isplayed asx a stacked bar plot, while the variant types are summarized in a box plot according to their classification. Finally, we show the percentage of samples with mutations in each gene.

## Figure 3

```{r varcallers, echo=FALSE}

## Strelka vs Deepvariant - BRCA1
png(filename = "Plots_BRCA_files/V2/fig3A.png", height = 3, width = 8, res = 1000, units = "in")
lollipopPlot2(m1 = ln.strelka.single.pass, m2 = ln.deepvariant.single.pass, gene = "BRCA1", AACol1 = "aaChange", AACol2 = "aaChange", m1_name = "strelka", m2_name = "deepvariant",showDomainLabel = FALSE, refSeqID = 'NM_007294')
dev.off()

### Strelka vs Deepvariant - BRCA2
png(filename = "Plots_BRCA_files/V2/fig3B.png", height = 3, width = 8, res = 1000, units = "in")
lollipopPlot2(m1 =ln.strelka.single.pass, m2 = ln.deepvariant.single.pass,, gene = "BRCA2", AACol1 = "aaChange", AACol2 = "aaChange", m1_name = "strelka", m2_name = "deepvariant",showDomainLabel = FALSE)
dev.off()

```

**Figure 3: Lollipop Plots of variants in BRCA1 and BRCA2 detected by Strelka and DeepVariant varcallers.** Graphical representation of the BRCA1 and BRCA2 genes, comparing the mutations obtained by Strelka and Deepvariant. The Y-axis shows the number of patients with missense mutations (green) and frameshift insertions (purple). The top section represents the results from Strelka, and the bottom section represents the results from DeepVariant of each gene. The domains are in the labels. The results correspond to the variants obtained by both varcallers in single mode.

## Figure 4 - deleted

```{r heatmap, echo=FALSE, warning=FALSE}

## Import rename file
rename<-read.csv("../../annovar/rename.csv", header = FALSE)

## Import Deepvariant pool
deepvariant.pool.filter<-deepvariant.pool[deepvariant.pool$Variant_Classification!="Silent",]
## Import Deepvariant single
deepvariant.single.filter<-deepvariant.single[deepvariant.single$Variant_Classification!="Silent",]
## Import Strelka pool
strelka.single.filter<-strelka.single[strelka.single$Variant_Classification!="Silent",]
## Import Strelka single
strelka.pool.filter<-strelka.pool[strelka.pool$Variant_Classification!="Silent",]

## Preprocesing DEEPVARIANT
deepvariant.pool$REVEL<-as.numeric(deepvariant.pool$REVEL)
deepvariant.single$REVEL<-as.numeric(deepvariant.single$REVEL)

deepvariant.pool$AAChange<-deepvariant.pool$aaChange
deepvariant.pool$aaChange<-"MS"
deepvariant.pool$mode<-"P"
deepvariant.single$AAChange<-deepvariant.single$aaChange
deepvariant.single$aaChange<-"S"
deepvariant.single$mode<-"S"

common_cols <- intersect(colnames(deepvariant.single), colnames(deepvariant.pool))
deepvariant <- rbind(
  subset(deepvariant.single, select = common_cols), 
  subset(deepvariant.pool, select = common_cols)
)
deepvariant$varcall<-"DeepVariant"

## Preprocesing STRELKA
strelka.pool$REVEL<-as.numeric(strelka.pool$REVEL)
strelka.single$REVEL<-as.numeric(strelka.single$REVEL)

strelka.pool$AAChange<-strelka.pool$aaChange
strelka.pool$aaChange<-"MS"
strelka.pool$mode<-"P"
strelka.single$AAChange<-strelka.single$aaChange
strelka.single$aaChange<-"S"
strelka.single$mode<-"S"

common_cols <- intersect(colnames(strelka.single), colnames(strelka.pool))
strelka <- rbind(
  subset(strelka.single, select = common_cols), 
  subset(strelka.pool, select = common_cols)
)
strelka$varcall<-"Strelka"

test<-strelka.pool[strelka.pool$AAChange=="p.T1301Afs*32",]

### MERGE DEEPVARIANT (single-pool) - STRELKA (single-pool)
strelka.select<-strelka[c("Tumor_Sample_Barcode","aaChange","mode","varcall", "REVEL","AAChange", "Hugo_Symbol")]
deepvariant.select<-deepvariant[c("Tumor_Sample_Barcode","aaChange","mode","varcall", "REVEL","AAChange", "Hugo_Symbol")]

All.info<-rbind(strelka.select,deepvariant.select)
##All.info$aaChange<-reorder(All.info$aaChange, All.info$Start_Position)
All.info$Hugo_Symbol <- factor(All.info$Hugo_Symbol, levels = c("BRCA1", "BRCA2"))

# Order by gen and position of AAChange
#order_AAChange <- c("p.S104G", "p.P824L", "p.D646N", "p.L655Ffs*10", "p.E991G", "p.S993N", "p.K1136R", "p.E1390G",
#                    "p.N289H", "p.N372H", "p.K797Q", "p.N991D", "p.T1915M", "p.R2108H", "p.V2466A", "p.A2951T","p.I2490T")

#All.info <- All.info %>%
#  mutate(Ordered_AAChange = factor(AAChange, levels = order_AAChange))
All.info$Tumor_Sample_Barcode<-rename$V2[match(All.info$Tumor_Sample_Barcode, rename$V1)]


heatmap <- ggplot(data = All.info, mapping = aes(x = varcall, y = aaChange, fill = REVEL)) +
  geom_tile(color = "white", size = 0.1) +  # Definir el color y el tamaño de los bordes de las celdas
  scale_fill_gradient(low = "lightyellow", high = "red", limits = c(0, 0.7), na.value = "grey") +
  xlab("Sample") +
  ylab("Amino Acid Change") +
  facet_grid(AAChange ~ Tumor_Sample_Barcode, scales = "free", space = "free") + ##AAChange
  theme_minimal(base_size = 10) +
  theme(
    panel.spacing = unit(0, "lines"),
    panel.border = element_rect(color = "black", fill = NA, size = 0.1),
    panel.grid.major = element_blank(),  # Eliminar líneas de la cuadrícula mayor
    axis.text.x = element_text(color = "black", size = 8, angle = 90, hjust = 1, vjust = 0.5),  # Ajustar tamaño y ángulo del texto del eje
    axis.text.y = element_text(color = "black", size = 8),  # Ajustar tamaño del texto del eje y
    strip.text.y = element_text(size = 8, angle = 0, colour = "black", margin = margin(r = 100)), 
    strip.text.x = element_text(size = 8, angle = 90, colour = "black"), 

    legend.position = "top",  # Mover la leyenda a la derecha
    legend.title = element_text(size = 8),
    legend.text = element_text(size = 8),
    plot.title = element_text(hjust = 0.5, size = 12, face = "bold")  #Ajustar tamaño y alineación del título
  ) +
  ggtitle("BRCA1/2 Mutations")

## save plot
png(filename = "Plots_BRCA_files/V2/fig4.png", height = 12, width = 20, res = 800, units = "in")
#pdf(file = "Plots_BRCA_files/V2/fig4.pdf",  height = 6, width = 8)  # Tamaño en pulgadas
heatmap
dev.off()
```

**Figure 4: Heatmap of BRCA1/2 mutations reported in 16 breast cancer patients using DeepVariant and Strelka, both in single and multisample mode.** The Y-axis shows the amino acid changes (right) in single and pooled modes (left), and the X-axis shows the 16 patients (top) across the two varcallers used (bottom). The color indicates the potential pathogenicity as given by the REVEL score. Variants without REVEL score are in gray.  Single-sample mode (S), Multisample mode (MS). 

## Figure 5

```{r samplesCHI, echo=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
varCHI<-read.csv("input/all.strelka.BRCA.annovar.hg38_multianno.txt",sep = "\t")
varCHI<-varCHI[varCHI$Func.refGene=="exonic",] 
varCHI$genomic_vcf38<-paste0(varCHI$Chr,":g.",varCHI$Otherinfo5,":",varCHI$Otherinfo7,">",varCHI$Otherinfo8)
### Calcular la frecuencia
varCHI$Tumor_Sample_Barcode<-NULL
varCHI_transposed <- varCHI %>%
  pivot_longer(cols = 58:120, names_to = "Tumor_Sample_Barcode", values_to = "Column_Info")

varCHI_transposed_filtered <- varCHI_transposed %>%
  filter(!(startsWith(Column_Info, "0/0") | startsWith(Column_Info, ".:."))) %>%
  separate(Column_Info, into = c("GT", "GQ", "GQX", "DP", "DPF", "AD", "ADF", "ADR", "SB", "FT", "PL"), sep = ":")

varCHI.pass<-varCHI_transposed_filtered[varCHI_transposed_filtered$Otherinfo10=="PASS",]

varCHI.pass <- varCHI.pass %>% separate(GT, c("A1","A2"), "[/||]")
varCHI.pass$sum<- as.numeric(varCHI.pass$A1) + as.numeric(varCHI.pass$A2)

varCHI.pass_suma <- varCHI.pass %>%
  group_by(genomic_vcf38) %>%
  summarise(total_sum = sum(sum, na.rm = TRUE), .groups = 'drop')

varCHI.pass_suma$sum<-NULL
varCHI.pass_suma$chi_cnt<-varCHI.pass_suma$total_sum/126
varCHI.pass_suma$total_sum<-NULL
varCHI.pass_suma$chi_cnt<-as.numeric(varCHI.pass_suma$chi_cnt)
```


```{r PCA1, echo=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
library(factoextra)
library(FactoMineR)

variantes<-read.csv("input/Strelka_single_variants_PASS_exonic.tsv",sep = "\t")
variantes<-variantes[variantes$Variant_Classification!="Silent",]

variantes$genomic_vcf38<-paste0(variantes$Chr,":g.",variantes$Otherinfo5,":",variantes$Otherinfo7,">",variantes$Otherinfo8)

variantes <- variantes %>% separate(Otherinfo13, c("GT"), "[:]")
variantes <- variantes %>% separate(GT, c("A1","A2"), "[/||]")
variantes$sum<- as.numeric(variantes$A1) + as.numeric(variantes$A2)
variantes2<-variantes

variantes.aux<-variantes[variantes$gnomad40_genome_AF!=".",]

variantes_suma <- variantes.aux %>%
  group_by(genomic_vcf38,
           gnomad40_genome_AF_afr,
           gnomad40_genome_AF_amr,
           gnomad40_genome_AF_ami,
           gnomad40_genome_AF_asj,
           gnomad40_genome_AF_eas,
           gnomad40_genome_AF_sas,
           gnomad40_genome_AF_fin,
           gnomad40_genome_AF_mid,
           gnomad40_genome_AF_nfe) %>%
  summarise(total_sum = sum(sum, na.rm = TRUE), .groups = 'drop')

variantes_suma.aux<-variantes_suma

variantes_suma$sum<-NULL
variantes_suma$AF<-variantes_suma$total_sum/166
variantes_suma$total_sum<-NULL
variantes_suma$AF<-as.numeric(variantes_suma$AF)
colnames(variantes_suma)<-c("genomic_vcf38","afr","amr","ami","asj","eas","sas","fin","mid","nfe","chi_brca")
variantes_suma <- variantes_suma %>% mutate_at(c("afr","amr","ami","asj","eas","sas","fin","mid","nfe","chi_brca"), as.numeric)

data.pca<-merge(variantes_suma,varCHI.pass_suma, all.x = T)

data.pca$chi_cnt<- ifelse(is.na(data.pca$chi_cnt), 0,data.pca$chi_cnt)

# Convert tibble to data frame
data_pca_df<- t(as.data.frame(data.pca[,-1]))
# Set the row names
colnames(data_pca_df) <- data.pca[[1]]

## PCA
res.pca<-prcomp(data_pca_df)
res2.pca <- PCA(data_pca_df, ncp = 2, graph = FALSE,scale.unit = TRUE)
var <- get_pca_var(res2.pca)

# Compute PCA with ncp = 2
res.hcpc <- HCPC(res2.pca, graph = FALSE)

dend<-fviz_dend(res.hcpc,
          cex = 0.7, 
          palette = "jco", 
          rect = TRUE, rect_fill = TRUE, 
          rect_border = "jco", 
          labels_track_height = 0.8,
          main="")

contrib<-fviz_contrib(res2.pca, choice = "var", axes = 1,title="", top = 10)

### Plot 1 - PCA - biplot
pca_biplot<-fviz_pca_ind(res2.pca, repel = TRUE, col.ind = "#696969",title="")

### Plot 2 - Percentahe of explained variances
scree_plot<-fviz_eig(res2.pca,addlabels = TRUE,title="",hjust = -0.5)

cluster<-fviz_cluster(res.hcpc,
             repel = TRUE,
             show.clust.cent = TRUE, 
             palette = "jco",
             ggtheme = theme_minimal(),
             main = "")


combined_plot<-plot_grid(cluster, contrib, nrow = 1, ncol = 2)
# save plot
png(file="Plots_BRCA_files/V2/Fig5.png", width=8, height=3, res=400, unit = "in")
combined_plot
dev.off()

combined_plot2 <- plot_grid(pca_biplot, scree_plot, dend, rel_heights = 2, 
                           nrow = 1, ncol = 3)
# save plot
png(file="Plots_BRCA_files/V2/Sup_fig7.png", width=12, height=4, res=400, unit = "in")
combined_plot2
dev.off()
combined_plot2
```

**Figure 5: Hierarchical clustering and variable contribution on principal components.** A) Clustering of PCA dimension B) Contribution of mutations to dimensions 1 and 2. A dashed line shown corresponds to the expected value if the contribution were uniform. Subpopulations: Africans (AFR), Admixed Americans (AMR), Amish (AMI), Ashkenazi Jewish (ASJ), East Asians (EAS), South Asians (SAS), Finns (FIN), Middle Easterners (MID), Non-Finnish Europeans (NFE), and women (XX).

## Figure 6

```{r sophia, echo=FALSE, warning=FALSE, prompt=FALSE}
library(tidyr)
library(dplyr)
library(cowplot)
library(ggVennDiagram)
library(ggplot2)

# Cargar y procesar archivo SOPHiA
load_sophia <- function(path) {
  read.csv(path, header = TRUE) %>%
    mutate(ID = paste0(SampleId, "|", hg38.genome_position, "|", hg38.ref, "|", hg38.alt)) %>%
    filter(codingConsequence %in% c("missense", "frameshift", "nonsense")) %>%
    select(ID, CLNSIGLAB, c.DNA, protein)
}

# Cargar y procesar archivo Strelka o DeepVariant
load_variant_data <- function(path, caller, single) {
  df <- read.csv(path, header = TRUE, sep ="\t")

    if(single){
        if(caller=="strelka") {
            df <- df %>% separate(Otherinfo13, c("GT", "GQ", "GQX", "DP", "DPF", "AD"), "[:]", extra = "drop")
        }else {
      df <- df %>% separate(Otherinfo13, c("GT", "GQ", "DP", "AD", "VAF", "PL"), "[:]", extra = "drop")
      }
    }
  
  df <- df %>%
    separate(AD, c("DP_REF", "DP_ALT"), "[,]") %>%
    mutate(ID = paste0(Tumor_Sample_Barcode, "|", Otherinfo5, "|", Otherinfo7, "|", Otherinfo8),
           DP = as.numeric(DP)) %>%
    filter(DP > 30, Tumor_Sample_Barcode != "61") %>%
    filter(Variant_Classification %in% c("Missense_Mutation", "Frame_Shift_Ins", 
                                         "Frame_Shift_Del", "Nonsense_Mutation")) %>%
    select(ID, Hugo_Symbol, Variant_Type, aaChange, DP_ALT, DP_REF, CLNSIG, DP) %>%
    mutate(ID = gsub("48\\|43093043\\|TAA\\|T", "48|43093043|TAAA|TA", ID),
           ID = gsub("43\\|43093043\\|TAA\\|T", "43|43093043|TAAA|TA", ID),
           ID = gsub("39\\|43093425\\|T\\|TA", "39|43093425|TAA|TAAA", ID),
           ID = gsub("77\\|43093425\\|T\\|TA", "77|43093425|TAA|TAAA", ID),
           ID = gsub("LV\\|43093425\\|T\\|TA", "LV|43093425|TAA|TAAA", ID))
}

# Crear Venn plot
create_venn <- function(list_input, title, colors, names) {
  ggVennDiagram(
    list_input, set_size = 3.5,
    category.names = names) +
    scale_x_continuous(expand = expansion(mult = .2)) +
    ggtitle(title) +
    ggplot2::scale_fill_gradient(low = colors[1], high = colors[2]) +
    theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
}

### MAIN PIPELINE ###
sophia <- load_sophia("input/all_variants_sophia.csv")

# Single
strelka_single <- load_variant_data("input/Strelka_single_variants_PASS_exonic.tsv", "strelka",TRUE)
deep_single <- load_variant_data("input/DeepVariant_single_variants_PASS_exonic.tsv", "deep",TRUE)

venn_single_ns <- create_venn(
  list(SOPHiA = sophia$ID, Strelka_single = strelka_single$ID, deepvariant_single = deep_single$ID),
  "Non-synonymous", c("#F6AC50", "#B54923"),
  c("SOPHiA\nGENETICS", "STRELKA\nSINGLE", "\nDEEPVARIANT SINGLE"))

venn_single_pat <- create_venn(
  list(SOPHiA = filter(sophia, CLNSIGLAB == "Pathogenic")$ID,
       Strelka_single = filter(strelka_single, CLNSIG == "Pathogenic")$ID,
       Deepvariant_single = filter(deep_single, CLNSIG == "Pathogenic")$ID),
  "Pathogenic", c("#87B6D9", "#497AA7"),
  c("SOPHiA\nGENETICS", "STRELKA\nSINGLE", "\nDEEPVARIANT SINGLE")
)
#save_venn_plots(venn_single_ns, venn_single_pat,"Plots_BRCA_files/V2/fig6.png")
png(file = "Plots_BRCA_files/V2/fig6.png", width = 10, height = 5, res = 400, unit = "in")
plot_grid(venn_single_ns, venn_single_pat, labels = c("A", "B"), ncol = 2)
dev.off()

# Multisample
strelka_multi <- load_variant_data("input/Strelka_pool_variants_PASS_exonic.tsv", "strelka",FALSE)
deep_multi <- load_variant_data("input/DeepVariant_pool_variants_PASS_exonic.tsv", "deep",FALSE)

venn_multi_ns <- create_venn(
  list(SOPHiA = sophia$ID, Strelka_multisample = strelka_multi$ID, deepvariant_multisample = deep_multi$ID),
  "Non-synonymous", c("#F6AC50", "#B54923"),
  c("SOPHiA\nGENETICS", "STRELKA\nMULTISAMPLE", "\nDEEPVARIANT MULTISAMPLE"))

venn_multi_pat <- create_venn(
  list(SOPHiA = filter(sophia, CLNSIGLAB == "Pathogenic")$ID,
       Strelka_multisample = filter(strelka_multi, CLNSIG == "Pathogenic")$ID,
       Deepvariant_multisample = filter(deep_multi, CLNSIG == "Pathogenic")$ID),
  "Pathogenic", c("#87B6D9", "#497AA7"),
  c("SOPHiA\nGENETICS", "STRELKA\nMULTISAMPLE", "\nDEEPVARIANT MULTISAMPLE")
)

png(file ="Plots_BRCA_files/V2/Sup_fig6.png", width = 10, height = 5, res = 400, unit = "in")
plot_grid(venn_multi_ns, venn_multi_pat, labels = c("A", "B"), ncol = 2)
dev.off()
```
**Figure 6: Overlap of BRCA1/BRCA2 variants identified by different variant calling approaches.** Venn diagram comparing BRCA1 and BRCA2 variants reported by SOPHiA Genetics in breast cancer patients with those identified using a workflow based on the Strelka and DeepVariant variant callers in single mode. A) Non-synonymous variants detected across the three approaches. B) Subset of non-synonymous variants classified as pathogenic in ClinVar database
## Supplementary Figure 1

```{r timetask, echo=FALSE, warning=FALSE, prompt=FALSE}

trace_data <- read_tsv("input/combined_trace_10RUN.txt",show_col_types = FALSE)

# split 'name' -> 'name'  'extra'
trace_data <- trace_data %>%
separate(name, into = c("name", "extra"), sep = "\\(", remove = TRUE) %>%
mutate(extra = gsub("\\)", "", extra))  # Eliminar el paréntesis de cierre

trace_data$time_cp<-trace_data$realtime

trace_data <- trace_data %>%
separate(realtime, into = c("M", "S"), sep = " ", remove = TRUE, fill = "left")
trace_data$S<- as.numeric(gsub("s","",trace_data$S))
trace_data$M<- 60*as.numeric(gsub("m", "", trace_data$M))

trace_data$S<-ifelse(is.na(trace_data$S),0,trace_data$S)
trace_data$M<-ifelse(is.na(trace_data$M),0,trace_data$M)
trace_data$SUM<-(trace_data$M + trace_data$S)
trace_data$minutes<-trace_data$SUM/60

# box plot
boxplot<-ggplot(trace_data, aes(x = name, y = minutes)) +
  geom_boxplot(fill = "lightblue", color = "darkblue", outlier.color = "darkblue", outlier.shape = 16) +
  labs(title = "",
       x = "Process",
       y = "Execution time (minutes)") +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#png(file="boxplot_run_metrics.png", width=10, height=5, res=300, unit = "in")
#boxplot
#dev.off()
boxplot

```

**Figure 1. Execution times of processes in the BRCA Nextflow pipeline.**. This visualization shows the distribution of computation times for each process, based on 10 independent executions using 16 sequenced breast cancer samples. The specific processes are as follows.
-ANNOVAR_DP: Annovar annotation of DeepVariant output in multisample mode
-ANNOVAR_DS: Annovar annotation of DeepVariant output in single mode
- ANNOVAR_SP: Annovar annotation of Strelka output in multisample mode
- ANNOVAR_SS: Annovar annotation of Strelka output in single mode
- B2C: BAM to CRAM conversion
- BCFTOOLS_FILTER and BCF: VCF filtering
- STRELKA_POOL: Variant calling by Strelka in multisample mode
- STRELKA_ONESAMPLE: Variant calling by Strelka in single mode
- DEEPVARIANT_ONESAMPLE: Variant calling by DeepVariant in single mode
- GLNEXUS_DEEPVARIANT: Variant calling by DeepVariant in multisample mode.


## Supplementary Figure 2

```{r fun_PlotQC, echo=FALSE, warning=FALSE}

plotQC <- function(data,title,color) {
  
plot <- ggplot(data, aes(x=variable, y = value)) +
  geom_violin(trim = FALSE, fill =color, color =color, alpha = 0.1) + 
  geom_jitter(width = 0.2, size = 1.5, color =color) + # Puntos transparentes
  stat_summary(
    fun = median, 
    geom = "crossbar", 
    width = 2, 
    linetype = "dashed", 
    color = "black", 
    alpha = 0.9,
    size = 0.2 
  ) + 
  coord_flip() + 
  theme_minimal(base_size = 10) + 
  theme(
    panel.grid.major = element_line(color = "gray90"),
    plot.title = element_text(hjust = 0.5),
    axis.line.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank()
  ) +
  labs(x ="", y ="", title = title) +
  theme(
    plot.title = element_text(face = "bold")
  )
return(plot)
}
```

```{r data, echo=FALSE, warning=FALSE,message=FALSE}
library(reshape)
## Load outputs of multiQC obtained for short reads
general_qc<-read.csv("input/multiqc_general_stats.txt", sep = '\t')
melt_data<-melt(general_qc)
```

```{r MGIplots, echo=FALSE, warning=FALSE}
### Plots QC metrics short reads
## Plot N1
total_reads<-melt_data %>% filter(variable=="FastQC_mqc_generalstats_fastqc_total_sequences" & ! is.na(value))
#total_reads$value<-as.numeric(total_reads$value/1000000)
mean(pct_aligned$value)
summary_total_reads <- summary(total_reads)

plot.total_reads<-plotQC(total_reads,"Total Reads (M)","#1B9E77")
## Plot N2
pct_aligned<-melt_data %>% filter(variable=="QualiMap_mqc_generalstats_qualimap_percentage_aligned" & ! is.na(value))
mean(pct_aligned$value)
plot.pct_aligned<-plotQC(pct_aligned,"Aligned Reads (%)","#E7298A")
## Plot N3
percentage_50X<-melt_data %>% filter(variable=="QualiMap_mqc_generalstats_qualimap_mqc_50_x_pc"  & ! is.na(value))
mean(percentage_50X$value)
plot.percentage_50X<-plotQC(percentage_50X,"Reads with ≥50X Coverage (%)","#E6AB02")
## plot N4
mean_coverage<-melt_data %>% filter(variable=="QualiMap_mqc_generalstats_qualimap_mean_coverage"  & ! is.na(value))
plot.mean_coverage<-plotQC(mean_coverage,"Mean Coverage","#7570B3")
## plot N5
median_insert_size<-melt_data %>% filter(variable=="QualiMap_mqc_generalstats_qualimap_median_insert_size"  & ! is.na(value))
plot.median_insert_size<-plotQC(median_insert_size,"Median Insert Size (bp)","#66A61E")
## plot N6
gc<-melt_data %>% filter(variable=="QualiMap_mqc_generalstats_qualimap_avg_gc"  & ! is.na(value))
plot.gc<-plotQC(gc,"GC Content (%)","#D95F02")
```

```{r grid2, echo=FALSE, warning=FALSE}
combined_plot1 <- plot_grid(plot.total_reads, plot.pct_aligned,
                           plot.mean_coverage,plot.percentage_50X,
                           plot.median_insert_size,plot.gc, ncol = 2,nrow = 3,
                           labels = c("A", "B","C","D","E","F"))

combined_plot1
```


## Supplementary Figure 3

```{r coverage, echo=FALSE}

cov<-read.csv("input/coverage_samtools_n16.txt",sep = "\t", header = F)
colnames(cov)<-c("sample", "chr","coordinate","coverage")

cov$log_cov<-log(cov$coverage)

brca1<-cov[cov$chr=="13",]
brca1$chr<-"Chromosome 13"
brca1_plot <- ggplot(brca1, aes(x = coordinate, y = log(coverage), color = sample)) +
  geom_point() +
  facet_wrap(~chr) +
  theme_minimal() +
 ylim(0, 10) +
  labs(
    title = "",
    x = "",
    y = "Log(coverage)",
    color = "Sample"
  ) +
  theme(legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    axis.title.x = element_text(face = "bold", size = 12),
    axis.title.y = element_text(face = "bold", size = 12),
    legend.title = element_blank(),
    legend.text = element_text(size = 10)
  ) +
  viridis::scale_color_viridis(discrete = TRUE, begin = 0, end = 0.8)

brca2<-cov[cov$chr=="17",]
brca2$chr<-"Chromosome 17"
brca2_plot <- ggplot(brca2, aes(x = coordinate, y = log(coverage), color = sample)) +
  geom_point() +
  facet_wrap(~chr) +
  theme_minimal() +
  ylim(0, 10) +
  labs(
    title = "",
    x = "Genomic Coordinate",
    y = "Log(coverage)",
    color = "Sample"
  ) +
  theme(legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    axis.title.x = element_text(face = "bold", size = 12),
    axis.title.y = element_text(face = "bold", size = 12)
  ) +
  viridis::scale_color_viridis(discrete = TRUE, begin = 0, end = 0.8)


plot_coverage <- plot_grid(brca1_plot, brca2_plot,
                           nrow = 2, ncol = 1)

#png(file="SF3_coverage.png", width=15, height=8, res=500, unit = "in")
#plot_coverage
#dev.off()
plot_coverage
```

**Figure 3: Coverage by position on target regions of BRCA1/2 genes.** The x-axis represents the genomic coordinates, and the y-axis represents the coverage in logarithmic scale at each on-target position. Each line represents the coverage obtained for patients sequenced. (A) Coverage by position in the BRCA2 gene. (B) Coverage by position in the BRCA1 gene.


## Supplementary Figure 4

```{r lollipopPlot2_strelka, echo=FALSE, warning=FALSE}

### strelka single vs multisample - BRCA1
png(filename = "Plots_BRCA_files/V2/Sup_fig4A.png", height = 3, width = 8, res = 1000, units = "in")
lollipopPlot2(m1 = ln.strelka.single.pass, m2 = ln.strelka.pool.pass, gene = 'BRCA1', AACol1 = 'aaChange', AACol2 = 'aaChange', m1_name = "strelka-single", m2_name = "strelka-pool", refSeqID = 'NM_007294',showDomainLabel = FALSE)
dev.off()

### strelka single vs multisample - BRCA2
png(filename = "Plots_BRCA_files/V2/Sup_fig4B.png", height = 3, width = 8, res = 1000, units = "in")
lollipopPlot2(m1 = ln.strelka.single.pass, m2 = ln.strelka.pool.pass, gene = 'BRCA2', AACol1 = 'aaChange', AACol2 = 'aaChange',m1_name = "strelka-single", m2_name = "strelka-pool",showDomainLabel = FALSE)
dev.off()

```

**Figure 4: Comparison of the nonsynonymous variants reported by the Strelka varcaller in single and multisample modes.** SNVs and INDELs reported for the 16 patients in the BRCA1 and BRCA2 genes. The Y-axis shows the number of patients with the mutation, and the X-axis shows the amino acids and domains of each gene.


## Supplementary Figure 5

```{r lollipopPlot2_Deepvariant, echo=FALSE, warning=FALSE}

### DeepVariant single vs multisample - BRCA1
png(filename = "Plots_BRCA_files/V2/Sup_fig5A.png", height = 3, width = 8, res = 1000, units = "in")
lollipopPlot2(m1 = ln.deepvariant.single.pass, m2 = ln.deepvariant.pool.pass, gene = 'BRCA1', AACol1 = 'aaChange', AACol2 = 'aaChange', m1_name = "deepvariant-single", m2_name = "deepvariant-pool", refSeqID = 'NM_007294',showDomainLabel = FALSE)
dev.off()

### DeepVariant single vs multisample - BRCA2
png(filename = "Plots_BRCA_files/V2/Sup_fig5B.png", height = 3, width = 8, res = 1000, units = "in")
lollipopPlot2(m1 = ln.deepvariant.single.pass, m2 = ln.deepvariant.pool.pass, gene = 'BRCA2', AACol1 = 'aaChange', AACol2 = 'aaChange',m1_name = "deepvariant-single", m2_name = "deepvariant-pool",showDomainLabel = FALSE)
dev.off()

```

**Figure 5: Comparison of the nonsynonymous variants reported by the DeepVariant varcaller in single and multisample modes.** SNVs and INDELs reported for the 16 patients in the BRCA1 and BRCA2 genes. The Y-axis shows the number of patients with the mutation, and the X-axis shows the amino acids and domains of each gen.

## Supplementary Figure 7

```{r PCA2, echo=FALSE, warning=FALSE}
## Ejecute chunk PCA1 before to run current chunk

### Plot 1 - PCA - biplot
pca_biplot<-fviz_pca_ind(res2.pca, repel = TRUE, col.ind = "#696969",title="")

### Plot 2 - Percentahe of explained variances
scree_plot<-fviz_eig(res2.pca,addlabels = TRUE,title="",hjust = -0.5)


cluster<-fviz_cluster(res.hcpc,
             repel = TRUE, # Avoid label overlapping
             show.clust.cent = TRUE, # Show cluster centers
             palette = "jco", # Color palette see ?ggpubr::ggpar
             ggtheme = theme_minimal(),
             main = "")

combined_plot2 <- plot_grid(pca_biplot, scree_plot, cluster, rel_heights = 2,
                           nrow = 1, ncol = 3)

# save plot 
png(file="Plots_BRCA_files/V2/Sup_fig7.png", width=12, height=4, res=400, unit = "in")
combined_plot2
dev.off()
combined_plot2
```

**Figure 7: Principal Component Analysis (PCA) of allele frequency variants in breast cancer patients.** a) PCA plot showing dimensions 1 and 2. b) Percentage of variance explained by each dimension. c) Clustering of populations. 


## Table 1

```{r table1, echo=FALSE, warning=FALSE}
## run chunck PCA1 (Figure 5) before chunck table 1. 

variantes.aux <-variantes.aux %>% separate(AAChange.refGene, c("A","B"), sep="NM_007294|NM_000059")
variantes.aux <-variantes.aux %>% separate(B,c("B","exon","txChange","aaChange"), sep = ":")

variantes.aux$tx[variantes.aux$Hugo_Symbol=="BRCA1"]<-"NM_007294.4"
variantes.aux$tx[variantes.aux$Hugo_Symbol=="BRCA2"]<-"NM_000059.4"

variantes.aux$sum<-NULL
variantes.aux$AF<-variantes.aux$total_sum/166
variantes.aux$total_sum<-NULL
variantes.aux$AF<-as.numeric(variantes.aux$AF)

#uniq<-variantes.aux %>% distinct(txChange,Gene.refGene, gnomad40_genome_AF)

### Change tx for BRCA1 by NM_007294

### Agregar BRCA-Exchage annotation - ENIGMA


```