In a university class, I need to work with a pretty big longitudinal data set: the .rds file is around 300 MB, in total 380,000 observations of 5,160 variables. The data set goes back to 1984, but I only need the years from 2012 on. So, to make things easier and more manageable, I want to load the whole data set once, use filter() to get rid of all the years before 2012, then discard all the variables I don't need with select(), and save the whole thing into a new, much smaller .rds file.
This is my code so far:
library(dplyr)                      # filter() and %>% come from dplyr, not tidyr
setwd("F:/data")
pl <- readRDS("pl.rds")             # load the full data set
pl <- pl %>% filter(syear >= 2012)  # keep 2012 and later
saveRDS(pl, file = "pl_2012.rds")   # save the trimmed version
Loading the data set pl does actually work on my desktop PC (on my laptop, I can't even do that), but when I try to use filter I get: "Error: cannot allocate vector of size 14.5 Gb".
I know this means that there's not enough memory for the operation. However, I don't understand why it happens here. filter() should trim the object down and get rid of all the years I don't need, so the object in the workspace should get significantly smaller. I purposely assigned the result back to pl itself, to reduce its size rather than create an additional object that takes up more memory. So why do I still get this error, and more importantly, what can I do to fix it? Of course, I already closed every other non-essential task and application in the background to free up as much RAM as possible. Is there anything else I can do? I already have 16 GB of RAM, other people in my class have 16 GB as well, and for them the same method works just fine... so there must be a way.
For working with large datasets, the arrow package might provide a solution; see its documentation for some examples. In the case of your code, you could use something along the following lines.
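A minimal sketch, with one caveat: arrow cannot read .rds files directly, so this assumes a one-time conversion to Parquet, which still needs to load pl.rds into memory once (e.g. on the desktop PC where that works). The file name pl.parquet and the selected columns are illustrative:

library(arrow)
library(dplyr)

# One-time step: convert the .rds file to Parquet.
# This still loads pl.rds fully into memory once.
pl <- readRDS("pl.rds")
write_parquet(pl, "pl.parquet")
rm(pl)  # free the memory again

# From then on, query lazily: filter() and select() are pushed
# down to arrow, so only the matching rows and columns are ever
# read into RAM.
pl_2012 <- open_dataset("pl.parquet") %>%
  filter(syear >= 2012) %>%
  select(pid, syear) %>%  # hypothetical column names; keep what you need
  collect()               # materialise the much smaller result

saveRDS(pl_2012, file = "pl_2012.rds")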
And you can use this not only to filter, but to do all kinds of operations without needing the full dataset in memory.
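For example, a grouped summary can be computed on the dataset before collect(), so only the aggregated result ever lands in memory (counting observations per survey year here is just an illustration):

library(arrow)
library(dplyr)

# Count observations per survey year without loading the full table
open_dataset("pl.parquet") %>%
  group_by(syear) %>%
  summarise(n = n()) %>%
  collect()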