In a university class, I need to work with a pretty big longitudinal data set: the .rds file is around 300 MB, in total 380,000 observations of 5,160 variables. The data set goes back to 1984, but I only need the years from 2012 on. So, to make things easier and more manageable, I want to load the whole data set once, use filter() to get rid of all the years before 2012, then discard all the variables I don't need with select(), and save the whole thing into a new, much smaller .rds file.
This is my code so far:
library(dplyr)                      # filter() and %>% come from dplyr, not tidyr
setwd("F:/data")
pl <- readRDS("pl.rds")             # load the full data set
pl <- pl %>% filter(syear >= 2012)  # keep 2012 and later
saveRDS(pl, file = "pl_2012.rds")   # save the trimmed version
Loading the data set pl does actually work on my desktop PC (on my laptop, I can't even do that), but when I try to use filter I get: "Error: cannot allocate vector of size 14.5 Gb".
I know this means that there's not enough memory for the operation. However, I don't understand why it happens here. filter() should trim the object down and get rid of all the years I don't need, so the object in the workspace should get significantly smaller. I purposely assigned the result back to pl itself, to reduce its size rather than create an additional object that takes up more memory. So why do I still get this error, and more importantly, what can I do to fix it? Of course, I already closed every other non-essential task and application in the background to free up as much RAM as possible. Is there anything else I can do? I already have 16 GB of RAM, other people in my class have 16 GB as well, and for them the same method works just fine... so there must be a way.
For working with large datasets, the arrow package might provide a solution; see its documentation for some examples. In the case of your code, you could use something along the following lines.
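A minimal sketch, with one caveat: arrow cannot read .rds files directly, so this assumes a one-time conversion to Parquet, which still needs to load pl.rds into memory once (e.g. on the desktop PC where that works). The file name pl.parquet and the selected columns are illustrative:

library(arrow)
library(dplyr)

# One-time step: convert the .rds file to Parquet.
# This still loads pl.rds fully into memory once.
pl <- readRDS("pl.rds")
write_parquet(pl, "pl.parquet")
rm(pl)  # free the memory again

# From then on, query lazily: filter() and select() are pushed
# down to arrow, so only the matching rows and columns are ever
# read into RAM.
pl_2012 <- open_dataset("pl.parquet") %>%
  filter(syear >= 2012) %>%
  select(pid, syear) %>%  # hypothetical column names; keep what you need
  collect()               # materialise the much smaller result

saveRDS(pl_2012, file = "pl_2012.rds")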
And you can use this not only to filter, but to do all kinds of operations without needing the full dataset in memory.
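For example, a grouped summary can be computed on the dataset before collect(), so only the aggregated result ever lands in memory (counting observations per survey year here is just an illustration):

library(arrow)
library(dplyr)

# Count observations per survey year without loading the full table
open_dataset("pl.parquet") %>%
  group_by(syear) %>%
  summarise(n = n()) %>%
  collect()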