I have incoming data that I want to store on disk in a database or something. The data looks something like this
incoming_data <- function(ncol = 5) {
  # simulate one incoming batch: a small data frame with randomly named columns
  dat <- sample(1:10, 100, replace = TRUE) |> matrix(ncol = ncol) |> as.data.frame()
  random_names <- sapply(seq_len(ncol(dat)), \(x) paste0(sample(letters, 1), sample(1:100, 1)))
  colnames(dat) <- random_names
  dat
}
incoming_data()
This incoming_data is just an example.
In reality, one incoming_data set will have about 5k rows and around 50k columns, and the entire final file will be about 200-400 gigabytes.
My question is: how do I add new data as columns to the database without loading the file into RAM?
# your way
path <- "D:\\R_scripts\\new\\duckdb\\data\\DB.duckdb"
library(duckdb)
con <- dbConnect(duckdb(), dbdir = path, read_only = FALSE)
# write one batch of data into the DB
dbWriteTable(con, "my_dat", incoming_data())
#### how can I do something like this? ####
my_dat <- cbind("my_dat", incoming_data())
Assuming that the number of rows remains the same across incoming batches of data, you can use DuckDB's positional join to achieve what you want. For each new incoming batch of data you can run a CREATE OR REPLACE statement to bind the new columns to the existing data; you can also adapt it to work with R objects, as sketched below.
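The exact statement isn't reproduced here, so what follows is a minimal sketch of the idea with an R object: the new batch is exposed to DuckDB via duckdb_register() (no copy written to disk), and a CREATE OR REPLACE rebuilds my_dat with the extra columns attached by a POSITIONAL JOIN. The name new_batch is just an illustrative choice, and the table is assumed to be my_dat as in the question.
# register the incoming batch as a virtual table DuckDB can query directly
new_batch <- incoming_data()
duckdb_register(con, "new_batch", new_batch)

# rebuild my_dat with the new columns attached row by row (same row order assumed)
dbExecute(con, "
  CREATE OR REPLACE TABLE my_dat AS
  SELECT *
  FROM my_dat
  POSITIONAL JOIN new_batch
")

# the registration is only a view on the R object, so drop it afterwards
duckdb_unregister(con, "new_batch")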
Regarding your question about doing this without loading the file into memory: in my experience, loading the files directly into DuckDB without pulling them into R is the best practice here, and in principle it avoids the problem entirely.
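As a sketch of what loading files directly could look like, assuming the batches arrive as Parquet files (the file name below is a placeholder), the same CREATE OR REPLACE pattern works with DuckDB's read_parquet() so the batch never touches R's memory:
# attach the columns of a batch file without ever reading it into R;
# 'batch_0001.parquet' is a hypothetical file name
dbExecute(con, "
  CREATE OR REPLACE TABLE my_dat AS
  SELECT *
  FROM my_dat
  POSITIONAL JOIN read_parquet('batch_0001.parquet')
")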
You might need to open and shut down a connection per loaded file to avoid crashing the R session, but that might have been a weird issue I had locally and might not translate into a problem here.
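If you do run into that, a connection-per-file loop could look like the sketch below (the file vector is purely illustrative):
files <- c("batch_0001.parquet", "batch_0002.parquet")  # hypothetical batch files
for (f in files) {
  con <- dbConnect(duckdb(), dbdir = path, read_only = FALSE)
  dbExecute(con, sprintf(
    "CREATE OR REPLACE TABLE my_dat AS
     SELECT * FROM my_dat POSITIONAL JOIN read_parquet('%s')", f
  ))
  dbDisconnect(con, shutdown = TRUE)   # fully shut down the embedded database
}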
I hope it finally helps :)