I am working with Zillow ZTRAX data and am trying to use the MICE package for imputation. Unfortunately, I am running into issues, and since this is my first attempt at using MICE and at imputing ZTRAX data, I am having a difficult time troubleshooting.
First, here is a look at the structure of the data, which contains 4,392,023 observations with 13 variables:
head(imputation_data)
# A tibble: 6 x 13
  sale_date   sale_price  prop_latitude  prop_longitude  lot_sqft  property_land_use  year_built  total_bedrooms  total_baths  airconditioning_type  prop_fireplace  prop_sqft  building_age
  <date>      <dbl>       <dbl>          <dbl>           <dbl>     <chr>              <dbl>       <dbl>           <dbl>        <chr>                 <chr>           <dbl>      <dbl>
1 2015-01-01  798500      NA             NA              NA        NA                 NA          NA              NA           n                     n               NA         NA
2 2015-01-02  NA          42.7           -73.8           62726.    GV107              1967        NA              NA           n                     n               712        48
3 2015-01-02  NA          NA             NA              NA        NA                 NA          NA              NA           n                     n               NA         NA
4 2015-01-01  NA          42.8           -73.9           14810.    RR101              1950        3               1            n                     Y               1370       65
5 2015-01-05  NA          42.7           -73.8           5227.     RR101              1926        4               1.5          n                     Y               1770       89
6 2015-01-01  NA          NA             NA              NA        NA                 NA          NA              NA           n                     n               NA         NA
Here is also a quick look at the number of NA values for each variable:
# A tibble: 1 x 12
  sale_date.na  sale_price.na  prop_lat.na  prop_log.na  prop_sqft.na  land_use.na  year_built.na  bedrooms.na  baths.na  air.na  fire.na  age.na
  <int>         <int>          <int>        <int>        <int>         <int>        <int>          <int>        <int>     <int>   <int>    <int>
1 0             2836767        730297       730297       1046670       440787       1038065        1576547      1195667   2471    0        1038065
To start, I was attempting to use MICE to impute missing property_land_use values, wherein - for example - RR101 indicates a single-family residence and GV107, in the above example, refers to a "governmental emergency building" (which, according to the ZTRAX data dictionary, is likely a police station/fire house). Side note: I just realized, given the scope of this research, that I can likely filter down to just RR type land uses. Anyways ...
I created my own formula in MICE:
### creating LM for property land use
form1 <- list(property_land_use ~ sale_date + sale_price + prop_latitude + prop_longitude + lot_sqft)
form1 <- name.formulas(form1)
### running the model
imp1 <- mice(imputation_data, formulas = form1, print = TRUE, m = 1, seed = 12199)
Given that similar properties are likely grouped together, I believe latitude and longitude are good variables, as well as lot_sqft.
The imputation runs just fine with MICE as indicated by the output of print = TRUE:
iter imp variable
1 1 property_land_use
1 2 property_land_use
1 3 property_land_use
1 4 property_land_use
1 5 property_land_use
2 1 property_land_use
2 2 property_land_use
2 3 property_land_use
2 4 property_land_use
2 5 property_land_use
Unfortunately, it does not seem that any imputation took place:
imp1 <- complete(imp1)
imp1 %>%
summarize(land_use.na = sum(is.na(property_land_use)))
land_use.na
1 440787
As you can see, the number of NA values for property_land_use is the same as in the pre-imputation data.
Any help/advice/guidance would be greatly appreciated.
I assume I am missing something small within the MICE workflow that is causing this, but I am not familiar enough with the package to know exactly what it is.
I think the problem is that since you're not imputing the variables that predict property_land_use, when those other variables are missing, the imputed values will also be missing. Here's a small example:

Note, in the example above, that y has two missing values - the first and fourth observations. x and z are fully observed for the first, but not the fourth observation. When I impute using the formula and look at the completed dataset, I see that the first observation has an imputed value but the fourth does not. If I use all the information to impute all the variables, you can see that we get a full complete dataset at the end:

Created on 2022-11-20 by the reprex package (v2.0.1)
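The behaviour described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the original reprex: the data frame dat, its columns x, y and z, and the seeds are all assumptions chosen for the sketch. It uses only mice(), name.formulas() and complete() from the mice package, and the exact behaviour may differ between mice versions.

```r
library(mice)

# Hypothetical toy data (not from the original post): 20 rows, three numeric columns.
set.seed(42)
dat <- data.frame(x = rnorm(20), z = rnorm(20))
dat$y <- dat$x + dat$z + rnorm(20)

# y is missing in rows 1 and 4; its predictor x is also missing in row 4.
dat$y[c(1, 4)] <- NA
dat$x[4] <- NA

# Case 1: only y gets an imputation model, so x is never filled in.
form <- name.formulas(list(y ~ x + z))
imp_partial <- mice(dat, formulas = form, m = 1, seed = 12199, printFlag = FALSE)
partial <- complete(imp_partial)
partial[c(1, 4), ]  # inspect the two rows where y was missing

# Case 2: default call, every incomplete variable is imputed from the others.
imp_full <- mice(dat, m = 1, seed = 12199, printFlag = FALSE)
full <- complete(imp_full)
colSums(is.na(full))  # count remaining NAs per column
```

In the first case, row 1 should receive an imputed y while row 4 should stay NA, because its predictor x is itself missing and was never imputed. In the second case, x is imputed alongside y, so no NAs should remain. For the ZTRAX data this would mean letting mice impute the predictors (sale_price, prop_latitude, prop_longitude, lot_sqft) as well, rather than supplying a formula for property_land_use alone.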