How to impute missing "build_year" column in Sberbank Russian Housing Market dataset on Kaggle?

I am working on an academic project that involves predicting house prices based on the Sberbank Russian Housing Market dataset. However, I am stuck in the data cleaning process on one particular column, "build_year", which indicates when the property was built. I can't just impute the missing values by replacing them with the mean or median. I am looking for all the available ways to impute such data so that the imputed values are meaningful and not just random numbers. Also, the scope of the project only allows me to use linear regression models in R, so I cannot rely on models like XGBoost to take care of missing values automatically.

There are 1 best solutions below

Your question is very broad. There are actually multiple R packages that can help you here:

  • missForest
  • imputeR
  • mice
  • VIM
  • simputation

There are even more: CRAN has a whole official Task View (Missing Data) dedicated to listing packages for imputation in R. Look mostly at the single imputation packages, because these will be a good fit for your task.

I can't tell you which method performs best for your specific task. That depends on your data and on the linear regression model you use afterwards.

So you have to test which combination of feature engineering / preprocessing, imputation algorithm, and regression model achieves the best overall result.
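As a rough illustration of such a comparison, here is a minimal base-R sketch that scores two imputation strategies by the hold-out RMSE of the downstream linear model. The data is synthetic and the column names `full_sq`, `build_year` and `price`, as well as the helper functions, are assumptions made for the example; it is not a recommendation of either strategy.

```r
# Synthetic stand-in for the housing data; all column names are hypothetical.
set.seed(1)
n <- 500
full_sq    <- runif(n, 30, 120)
build_year <- round(2015 - full_sq / 3 + rnorm(n, sd = 5))
price      <- 50 * full_sq - 20 * (2015 - build_year) + rnorm(n, sd = 200)
houses     <- data.frame(full_sq, build_year, price)

# Knock out 30% of build_year to mimic the missing values.
houses$build_year[sample(n, 0.3 * n)] <- NA

# Simple 70/30 hold-out split.
train_idx <- sample(n, 0.7 * n)
train <- houses[train_idx, ]
test  <- houses[-train_idx, ]

# Strategy A: median imputation, with the median taken from the training rows only.
impute_median <- function(train, newdata) {
  miss <- is.na(newdata$build_year)
  newdata$build_year[miss] <- median(train$build_year, na.rm = TRUE)
  newdata
}

# Strategy B: regression imputation, predicting build_year from full_sq (never from price).
impute_regression <- function(train, newdata) {
  fit  <- lm(build_year ~ full_sq, data = train)  # lm() drops rows with NA by default
  miss <- is.na(newdata$build_year)
  newdata$build_year[miss] <- round(predict(fit, newdata[miss, , drop = FALSE]))
  newdata
}

# Impute, fit the linear price model on the training rows, return RMSE on the test rows.
holdout_rmse <- function(impute) {
  model <- lm(price ~ full_sq + build_year, data = impute(train, train))
  pred  <- predict(model, impute(train, test))
  sqrt(mean((test$price - pred)^2))
}

results <- c(median     = holdout_rmse(impute_median),
             regression = holdout_rmse(impute_regression))
print(results)
```

The same `holdout_rmse()` pattern can wrap any imputation function, so package-based methods can be compared on equal terms; for a more stable estimate, the single split could be replaced by k-fold cross-validation.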

Be careful about leakage in your testing (accidentally sharing information between the training and test datasets). Usually you can combine the train and test data and perform the imputation on the complete dataset. But it is important that the target variable is removed from the test dataset before you do so, because you would not have it for real data.
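A minimal sketch of that workflow, assuming the missForest package is installed. The data is synthetic and the column names are hypothetical; as a slightly stricter variant, the target is dropped from all rows before imputation, not only from the test rows.

```r
library(missForest)

# Synthetic stand-in for the housing data; all column names are hypothetical.
set.seed(1)
n <- 200
full_sq    <- runif(n, 30, 120)
floor_no   <- as.numeric(sample(1:20, n, replace = TRUE))
build_year <- round(2015 - full_sq / 3 + rnorm(n, sd = 5))
price      <- 50 * full_sq - 20 * (2015 - build_year) + rnorm(n, sd = 200)
houses     <- data.frame(full_sq, floor_no, build_year, price)

# Make 40 build_year values missing.
houses$build_year[sample(n, 40)] <- NA

# Treat the last 50 rows as the test set.
is_test <- seq_len(n) > 150

# Drop the target before imputing, so price cannot inform the imputed build_year.
predictors <- houses[, setdiff(names(houses), "price")]

# Impute train and test rows together; the completed data frame is in $ximp.
imputed <- missForest(predictors)$ximp

# Split back: price is re-attached for the training rows only.
train <- cbind(imputed[!is_test, ], price = houses$price[!is_test])
test  <- imputed[is_test, ]
```

Note that the imputed years are forest predictions and therefore generally not whole numbers; rounding them afterwards is a judgment call for your project.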

Most of the mentioned packages are quite easy to use. Here is an example for missForest:

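library("missForest")

# create example dataset with 10% of the values missing
set.seed(42)
missing_data_iris <- prodNA(iris, noNA = 0.1)

# impute the dataset; missForest() returns a list
imputed <- missForest(missing_data_iris)

# the completed data frame is stored in $ximp
head(imputed$ximp)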

The other packages are equally easy to use. For these single imputation packages it usually comes down to one function: you pass in your incomplete dataset and get the data back without NAs.
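For comparison, here is a hedged sketch of the same idea with mice, assuming the package is installed. It requests a single imputation (`m = 1`) with predictive mean matching; the choice of `"pmm"` and of the `Sepal.Length` column is only for illustration.

```r
library(mice)

# create example dataset with 15 missing values in one numeric column
set.seed(42)
missing_data_iris <- iris
missing_data_iris$Sepal.Length[sample(nrow(iris), 15)] <- NA

# run a single imputation (m = 1) using predictive mean matching
imp <- mice(missing_data_iris, m = 1, method = "pmm", seed = 42, printFlag = FALSE)

# extract the completed data frame
completed <- complete(imp, 1)
head(completed)
```

Predictive mean matching fills each gap with a value actually observed elsewhere in the column, which may suit a variable like build_year where only real years make sense.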