Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential bug found in merge.R #3

Open
bentjordan opened this issue Jun 8, 2023 · 1 comment
Open

Potential bug found in merge.R #3

bentjordan opened this issue Jun 8, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@bentjordan
Copy link
Collaborator

bentjordan commented Jun 8, 2023

Original merge.R had potential bug where sample names of merged matrix were determined by the file sample_names.txt. This file is provided by the user and is directly used to name the samples. If the order of samples in sample_names.txt does not match the order files are read into the script the read counts will not match up with the actual samples.

Original merge.py script

files =list.files(path="star_align",recursive=T,pattern="ReadsPerGene.out.tab",full.names=T)

for (file in files) {
    if (!exists("rc")) {
        rc=read.table(file,header=F,row.names=1,colClasses=c(NA,"NULL",NA,"NULL"),col.names=c("gene_id","NULL",file,"NULL"))
    } else if (exists("rc")) {
        temp_rc=read.table(file,header=F,row.names=1,colClasses=c(NA,"NULL",NA,"NULL"),col.names=c("gene_id","NULL",file,"NULL"))
        rc=cbind(rc,temp_rc)
        rm(temp_rc)
    }
}
head(rc)
ncol(rc)
sample_names=read.table("sample_names.txt")
sample_names
colnames(rc)=sample_names[, 1]
head(rc)
write.csv(rc,"reads_count/reads_count.csv")
@bentjordan
Copy link
Collaborator Author

Updated merge.py script

The updated merge.py script extracts the sample name from the input file. This ensures each sample is properly represented in the read counts matrix

## library is sense stranded
library(stringr)
files <- list.files(
        path = "star_align",
        recursive = TRUE,
        pattern = "ReadsPerGene.out.tab",
        full.names = TRUE)
sample_names <- list()
for (file in files) {
        basename <- basename(file)
        sample <- str_replace(basename, "ReadsPerGene.out.tab", "")
        sample_names <- append(sample_names, sample)
        if (!exists("rc")) {
                rc <- read.table(
                        file, header = FALSE, row.names = 1,
                        colClasses = c(NA,"NULL",NA,"NULL"),
                        col.names=c("gene_id", "NULL", file, "NULL"))
        } else if (exists("rc")) {
                temp_rc=read.table(
                        file,header = FALSE, row.names = 1,
                        colClasses = c(NA, "NULL", NA, "NULL"),
                        col.names=c("gene_id", "NULL", file, "NULL"))
                rc <- cbind(rc, temp_rc)
                rm(temp_rc)
        }
}

colnames(rc) <- sample_names
dir.create(file.path("reads_count"), showWarnings = FALSE)
write.csv(rc, "reads_count/reads_count.csv")

@bentjordan bentjordan added the bug Something isn't working label Jun 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant