Constructing model.matrix in R cannot fit in memory (tried all memory-mapping packages)


I am trying to fit an lm() model in R on a large sales dataset. The data itself is not too large for R to handle: about 250MB in memory. The problem is that when lm() is called with all variables and their cross-terms, construction of the model.matrix() fails with an out-of-memory error: R cannot allocate a vector of about 47GB. That is understandable, since I don't have that much RAM.

I have tried the ff, bigmemory, and filehash packages, all of which work fine for keeping the existing data out of memory (I particularly like the database functions of filehash). But I cannot, for the life of me, get the model.matrix to be created at all. I think the issue is that, even though I map the output to the database I created, R still tries to build it in RAM first, and can't.

Is there a way to avoid this using these packages, or am I doing something wrong? (Also, using biglm and other functions to do things chunk-wise does not even let me process one chunk at a time. Again, it seems R is trying to build the WHOLE model.matrix first, before chunking it.)

Any help would be greatly appreciated!

library(filehash)
library(ff)
library(ffbase)
library(bigmemory)
library(biganalytics)
library(dummies)
library(biglm)
library(dplyr)
library(lubridate)
library(data.table)



SID <- readRDS('C:\\JDA\\SID.rds')
SID <- as.data.frame(unclass(SID)) # to get characters as Factors

dbCreate('reg.db')
db <- dbInit('reg.db')
dbInsert(db, 'SID', SID)
rm(SID)
gc()

db$summary1 <-
  db$SID %>%
  group_by(District, Liable, TPN, mktYear, Month) %>%
  summarize(NV.sum = sum(NV))

start.time <- Sys.time()
# Here is where it throws the error:
db$fit <- lm(NV.sum ~ .^2, data = db$summary1)
Sys.time() - start.time
rm(start.time)
gc()

# The fit was stored in the filehash database as db$fit, not as a local `fit`
summary(db$fit)
anova(db$fit)

There are 1 best solutions below


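You can avoid the dense model.matrix() by building a sparse one with sparse.model.matrix() and solving the normal equations yourself. This is based on the example from ?`solve-methods` in the Matrix package: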

> library(Matrix)
> ?`solve-methods`
> n1 <- 7; n2 <- 3
> dd <- data.frame(a = gl(n1, n2), b = gl(n2, 1, n1*n2))  # balanced 2-way
> X <- sparse.model.matrix(~ -1 + a + b, dd)  # no intercept --> even sparser
> Y <- rnorm(nrow(X))
> # Form the normal equations manually and solve for beta-hat
> solve(crossprod(X), crossprod(X, Y))
9 x 1 Matrix of class "dgeMatrix"
            [,1]
 [1,]  1.2384385
 [2,]  1.3313779
 [3,]  0.7497135
 [4,]  0.7840841
 [5,]  0.9586135
 [6,]  0.4667769
 [7,]  1.6648260
 [8,] -1.6669776
 [9,] -1.1142240
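
To show how this might carry over to the formula in the question, here is a minimal sketch on synthetic data. The data frame `d`, its factor levels, and the row count are made up for illustration; only the column names and the `NV.sum ~ .^2` formula come from the question. It builds both a sparse and a dense design matrix so their sizes can be compared, then solves the normal equations on the sparse one. This assumes the design has full column rank. Solving the normal equations is also less numerically stable than the QR decomposition lm() uses, and it returns coefficients only, with no standard errors.

```r
library(Matrix)

set.seed(1)  # synthetic stand-in for db$summary1, not the asker's data
n <- 20000
d <- data.frame(
  District = factor(sample(letters[1:8],  n, replace = TRUE)),
  TPN      = factor(sample(LETTERS[1:10], n, replace = TRUE)),
  Month    = factor(sample(month.abb,     n, replace = TRUE))
)
d$NV.sum <- rnorm(n)

# Same formula shape as in the question: main effects plus all two-way interactions
Xs <- sparse.model.matrix(NV.sum ~ .^2, data = d)  # stores only the non-zero entries
Xd <- model.matrix(NV.sum ~ .^2, data = d)         # dense, built here only for comparison

dim(Xs)
object.size(Xd)  # dense storage: every cell takes 8 bytes
object.size(Xs)  # sparse storage: grows with the number of non-zeros

# Normal equations on the sparse matrix: (X'X) beta = X'y
beta <- solve(crossprod(Xs), crossprod(Xs, d$NV.sum))
head(beta)
```

On the real data you would skip the dense `Xd` entirely, since that is the object that does not fit in RAM. If some factor combinations never occur, `crossprod(Xs)` can be singular and `solve()` will fail, so empty interaction columns may need to be dropped first.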