I have 448 .tsv files containing gene expression (RNAseq) data that were downloaded from The Cancer Genome Atlas via the Genomic Data Commons (GDC) Portal.
Each file has 60666 rows and 9 columns. There is also a metadata file that has case_id in the first column and file_name in the second column.
My merge_rna function takes as input the metadata file and the directory containing the 448 .tsv files. For each file whose basename matches a file_name in the metadata, it keeps the first three columns ("gene_id", "gene_name", "gene_type") and the eighth column ("fpkm_unstranded"), and merges the files on those gene annotation columns.
The merged matrix should therefore have 60661 rows and 451 columns, with gene_id as the rows and case_id as the columns (see attached images).
However, I have run into memory issues because the read.csv function reads the entire .tsv file into memory.
NOTE: I am using Posit Cloud (formerly RStudio Cloud), so I have limited computing power (only 16 GB RAM and 4 CPUs).
# Define the function to merge all RNAseq quantification files into one dataframe
merge_rna <- function(metadata, fdir){
  filelist <- list.files(fdir, pattern = "\\.tsv$",
                         recursive = TRUE, full.names = TRUE)
  for (i in seq_along(filelist)){
    iname <- basename(filelist[i])
    isamplename <- metadata[metadata$file_name == iname, "sample"]
    # skip the leading comment line and read the header
    idf <- read.csv(filelist[i], sep = "\t", skip = 1, header = TRUE)
    # remove the first 4 rows (the N_* summary rows) and keep only the needed columns
    idf_subset <- idf[-(1:4), c("gene_id", "gene_name", "gene_type", "fpkm_unstranded")]
    rm(idf)
    names(idf_subset)[4] <- isamplename
    #print(dim(idf_subset))
    if (i == 1){
      combined_df <- idf_subset
    } else {
      # merge on all three annotation columns so gene_name/gene_type are not duplicated
      combined_df <- merge(combined_df, idf_subset,
                           by = c("gene_id", "gene_name", "gene_type"), all = TRUE)
    }
    rm(idf_subset)
  }
  # use gene_id as row names and remove the gene_id column
  rownames(combined_df) <- combined_df$gene_id
  combined_df <- combined_df[, -which(names(combined_df) %in% c("gene_id"))]
  return(combined_df)
}

rnaCounts <- merge_rna(GDC_metadata_file, "/home/r1816512/TSV_subset")
The code I have provided seems to work but it uses a lot of memory. Please tell me what can be done to use less memory. Perhaps we don't have to read the entire file into memory?
I tried ChatGPT, but despite my best efforts at prompting it to rewrite my code into a version that uses less memory, all the code it gave me produced errors. The meme that ChatGPT produces code in less than 5 minutes but debugging that code takes 24 hours is indeed true.
ChatGPT recommended that one way to reduce memory usage is to read in the TSV files in chunks rather than all at once. This can be done using the readr package, specifically the read_tsv_chunked() function.
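For reference, a minimal sketch of that read_tsv_chunked() pattern, assuming (as in my code above) that each file has one leading comment line to skip, a header row, and four N_* summary rows to drop; the helper name read_gdc_file is just for illustration:

library(readr)

# Sketch: read one GDC quantification file in chunks, keeping only the
# four columns that are needed. skip = 1 skips the leading comment line,
# and the N_* summary rows are dropped inside the callback, so only a
# small slice of the file is held in memory at any one time.
read_gdc_file <- function(path) {
  keep <- c("gene_id", "gene_name", "gene_type", "fpkm_unstranded")
  cb <- DataFrameCallback$new(function(chunk, pos) {
    chunk[!startsWith(chunk$gene_id, "N_"), keep]
  })
  read_tsv_chunked(path, cb, chunk_size = 10000,
                   skip = 1, col_types = cols(), progress = FALSE)
}

Since each individual file is only about 60000 rows, the chunking itself saves relatively little; most of the saving comes from keeping only 4 of the 9 columns.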
ChatGPT also recommended, as an alternative, using the data.table package to process the data in chunks instead of reading each entire file into memory (see the data.table sketch below).
However, none of the code it gave me using either the readr or the data.table functions worked.
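The data.table route can avoid chunking altogether by asking fread() to parse only the columns that are needed. A minimal sketch, assuming the header line contains "gene_id" so that skip = "gene_id" starts reading there (the helper name read_one_dt is illustrative):

library(data.table)

# Sketch: read a single quantification file, parsing only the four needed
# columns. skip = "gene_id" starts reading at the header line, so the
# leading comment line is ignored; the N_* summary rows are dropped after.
read_one_dt <- function(path) {
  dt <- fread(path, sep = "\t", skip = "gene_id",
              select = c("gene_id", "gene_name", "gene_type", "fpkm_unstranded"))
  dt[!startsWith(gene_id, "N_")]
}

With select =, the other five columns are never parsed at all, which keeps the per-file footprint small.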
Here are images of the metadata file and of what the merged expression matrix should look like, to help with answering the question.
NOTE: There was a tool called the GDC RNASeq Tool that performed this exact task: it downloaded individual RNASeq files from the GDC Data Portal using a manifest file and merged them into a single matrix identified by TCGA barcode/Case ID. However, it was DEPRECATED when the transcriptome profiling/gene expression quantification bioinformatic workflow changed from HTSeq to STAR. https://github.com/cpreid2/gdc-rnaseq-tool https://gdc.cancer.gov/content/gdc-rnaseq-tool
It would be nice if someone knowledgeable enough could develop a new tool for this task to replace the gdc-rnaseq-tool.
Use data.table::fread() to pull in each tsv, selecting only the columns you need, use rbindlist(), merge once back onto the metaMatrix to get the sample id, and then (optionally) dcast().

Output:

Input (metaMatrix)
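A minimal sketch of how the steps described in this answer could be put together (a sketch, not the answer's exact code; the metadata column names file_name and case_id are assumptions based on the metadata file described in the question):

library(data.table)

# Sketch of the approach described above: fread() each file with only the
# needed columns, stack everything into one long table with rbindlist(),
# merge once against the metadata to attach the case_id, then dcast() to
# the wide gene-by-sample matrix.
merge_rna_dt <- function(metaMatrix, fdir) {
  filelist <- list.files(fdir, pattern = "\\.tsv$",
                         recursive = TRUE, full.names = TRUE)

  long <- rbindlist(lapply(filelist, function(f) {
    dt <- fread(f, sep = "\t", skip = "gene_id",
                select = c("gene_id", "gene_name", "gene_type",
                           "fpkm_unstranded"))
    dt <- dt[!startsWith(gene_id, "N_")]   # drop the summary rows
    dt[, file_name := basename(f)]
    dt
  }))

  # one merge to map file_name -> case_id, instead of one merge per file
  long <- merge(long, as.data.table(metaMatrix)[, .(file_name, case_id)],
                by = "file_name")

  # wide format: one row per gene, one column per case_id
  dcast(long, gene_id + gene_name + gene_type ~ case_id,
        value.var = "fpkm_unstranded")
}

Building one long table and casting it once avoids the repeated merge() calls, which is where most of the memory in the original loop goes.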