I am working on a large (but not enormous) database of 1.1 million observations x 41 variables. The data are arranged as an unbalanced panel. Using these variables I specified three different models, and I ran each of them as 1) a fixed effects, 2) a random effects and 3) a pooled OLS regression, along the lines of the sketch below.
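Schematically, the nine regressions are the three formulas crossed with the three estimators; for one formula the calls look roughly like this (a sketch; regr stands for my panel data set and form for one of the pre-assembled formulas):

library(plm)
mod.fe  <- plm(form, data = regr, effect = "individual", model = "within", index = c("id", "year"))
mod.re  <- plm(form, data = regr, effect = "individual", model = "random", index = c("id", "year"))
mod.ols <- plm(form, data = regr, model = "pooling", index = c("id", "year"))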
The original .RData file containing only the database is about 15 Mb. The .RData file containing the database and the regression results (a total of 9 regressions) weighs about 650 Mb. I do realize (from the package documentation) that
An object of class c("plm", "panelmodel").
A "plm" object has the following elements:
- coefficients: the vector of coefficients,
- vcov: the covariance matrix of the coefficients,
- residuals: the vector of residuals,
- df.residual: degrees of freedom of the residuals,
- formula: an object of class 'pFormula' describing the model,
- model: a data.frame of class 'pdata.frame' containing the variables used for the estimation; the response is in first position and the two indexes in the last positions,
- ercomp: an object of class 'ercomp' providing the estimation of the components of the errors (for random effects models only),
- call: the call
Even so, I am not able to understand why those files should be so massive.
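One way to see which element dominates an object's size is to measure the components one by one (a sketch; mod.fe stands for one of the fitted models):

sort(sapply(mod.fe, object.size), decreasing = TRUE)

If the model element, i.e. the full pdata.frame of all variables used in the estimation, turns out to be the largest, that alone would go a long way towards explaining the file sizes.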
To avoid overloading the memory while working with the plm objects, I saved them in three different files (each of which now weighs around 200 Mb). I called summary() an hour ago to see the fixed-effects model results, but it hasn't shown me anything yet. My question now is pretty straightforward: do you find this normal behavior? Is there something I can do to reduce the size of the plm objects and speed up the retrieval of the results?
Here are some things you might want to know:
- The database I am using is in data.table format.
- The formulas in the regressions are pre-assembled and are included in the plm calls wrapped in as.formula(), as suggested here. Example:
form <- y ~ x1 + x2 + x3 + ... + xn
mod.fe <- plm(as.formula(form), data = regr, effect = "individual", model = "within", index = c("id", "year"))
Please let me know if there is any other info I can provide that you might need to answer the question.
EDIT
I managed to make up a small-scale database with characteristics similar to the one I am working on. Here it is:
structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L,
10L, 10L, 11L, 11L), year = structure(c(1L, 2L, 1L, 2L, 3L, 4L,
1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L), .Label = c("2000", "2001", "2002",
"2003"), class = "factor"), study = c(3.37354618925767, 4.18364332422208,
5.32950777181536, 4.17953161588198, 5.48742905242849, 5.73832470512922,
6.57578135165349, 5.69461161284364, 6.3787594194582, 4.7853001128225,
7.98380973690105, 8.9438362106853, 9.07456498336519, 7.01064830413663,
10.6198257478947, 9.943871260471, 9.84420449329467, 8.52924761610073,
3.52184994489138, 4.4179415601997, 5.35867955152904, 3.897212272657,
5.38767161155937, 4.9461949594171, 3.62294044317139, 4.58500543670032,
7.10002537198388, 6.76317574845754, 6.83547640374641, 6.74663831986349
), ethnic = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
1L, 1L, 2L, 2L), .Label = c("hispanic", "black", "chinese"), class = "factor"),
sport = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), health = structure(c(1L,
1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("none",
"drink", "both", "smoke"), class = "factor"), gradec = c(2.72806403942929,
3.10067738633308, 4.04728186632456, 2.19701362539883, 1.73115878111307,
5.35879931359977, 5.79613840739381, 5.07050219214859, 4.26224490644077,
3.53554192927934, 6.10515669475491, 7.18032957183198, 6.73191149590581,
6.49512764543435, 6.4783689354808, 6.19974636196512, 5.54014977312232,
6.72545652880344, 1.00223129492982, 1.08994269214495, 3.06702680106689,
1.70103126320561, 4.82973481729635, 3.14010240687364, 3.8068435242348,
5.01254268106181, 5.66497772013949, 4.16303452633342, 4.2751229553617,
3.05652055248093), event = c(1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,
0), evm3 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0), evm2 = c(0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 1, 0, 0, 0, 0), evm1 = c(0, 1, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1, 0, 0, 0, 0), evp1 = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1),
evp2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1), evp3 = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0), ndm3 = c(1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1), ndm2 = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1), ndm1 = c(1,
0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
0, 0, 1, 0, 0, 0, 1, 0, 1, 0), ndp1 = c(0, 1, 0, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0), ndp2 = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0),
ndp3 = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1)), .Names = c("id",
"year", "study", "ethnic", "sport", "health", "gradec", "event",
"evm3", "evm2", "evm1", "evp1", "evp2", "evp3", "ndm3", "ndm2",
"ndm1", "ndp1", "ndp2", "ndp3"), class = "data.frame", row.names = c(NA,
30L))
The formula and the plm call I used are:
form <- gradec ~ year + study + ethnic + sport + health + event + evm3 + evm2 + evm1 + evp1 + evp2 + evp3 + ndm3 + ndm2 + ndm1 + ndp1 + ndp2 + ndp3
plm.f <- plm(as.formula(form), data = data, effect = "individual", model = "within", index = c("id", "year"))
Using object.size(), as suggested by @BenBolker, I found out that the call generated a plm object weighing 64.5 Kb, while the original data frame has size 6.9 Kb, which means that the results are about 10 times larger than the input data. I then set the options suggested by @zx8754 below, but unfortunately they had no effect.
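For reference, the size check was along these lines (a sketch, using the example objects above):

print(object.size(plm.f), units = "Kb")  # about 64.5 Kb
print(object.size(data), units = "Kb")   # about 6.9 Kb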
When I finally called summary(plm.f), I got the error message:
Error in crossprod(t(X), beta) : non-conformable arguments
which I eventually got with my large database as well, but only after hours of computing.
Here it is suggested that the problem might be due to the coefficient matrix being singular. However, testing for singularity with is.singular.matrix() from the matrixcalc package, it turns out that this is not the case.
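The test was along these lines (a sketch of one way to run such a check on the small example; X is the untransformed design matrix):

library(matrixcalc)
X <- model.matrix(as.formula(form), data = data)  # untransformed design matrix
is.singular.matrix(t(X) %*% X)                    # FALSE in my case, i.e. not singular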
Another couple of things you might want to know:
- year, ethnic and health are factors.
- The variables in the formula are more or less self-explanatory, except for the last ones. event is a supposed traumatic event that happened at a certain time; it is coded 1 in case of an event in a certain year and 0 otherwise. The variable evm1 is equal to 1 if one of these events happened in the year before (minus 1) and 0 otherwise. Similarly, evp1 is 1 if the event happens in the following year (plus 1) and 0 otherwise. The ndm. and ndp. variables work in the same way, but they are coded 1 when that distance is not observable (because the time period for a certain individual is too short) and 0 otherwise (a construction sketch is given below). The presence of such deeply connected variables raises the suspicion of perfect collinearity. As noted above, however, a test revealed that the matrix is non-singular.
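For illustration, here is one way such lead/lag dummies could be built with data.table (a sketch only, not necessarily how I built them; only evm1 and evp1 are shown):

library(data.table)
dt <- as.data.table(data)
setorder(dt, id, year)
dt[, evm1 := shift(event, 1L, type = "lag", fill = 0), by = id]   # event in the previous year
dt[, evp1 := shift(event, 1L, type = "lead", fill = 0), by = id]  # event in the following year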
Let me say once again that I would be very thankful if someone could answer the question.
About the error message
Error in crossprod(t(X), beta) : non-conformable arguments
This is likely due to a singularity in the model matrix, just as suggested. Please keep in mind that the model matrix of a fixed effects model is the transformed data (the transformed data frame). Thus, you will need to check the transformed data for singularity: the fixed effects transformation can result in linear dependence (singularity) even if the original data are not linearly dependent! The plm package has quite good documentation about this issue in ?detect.lindep.
So you should run your function to detect linear dependence on the transformed data of the model, which you get via model.matrix(your_model). You can use the functions supplied by plm (detect.lindep, alias) or any function that works on a matrix. You could also look at your plm model object, your_model$aliased, to see if some variables have been dropped in the estimation.
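Putting this together for the small example above (a sketch; detect.lindep, the plm alias method, and the aliased element are all part of plm, as described):

library(plm)
X.trans <- model.matrix(plm.f)  # the within-transformed model matrix
detect.lindep(X.trans)          # reports linearly dependent columns, if any
alias(plm.f)                    # shows aliased (dropped) coefficients
plm.f$aliased                   # TRUE for coefficients dropped due to singularity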