I am working on an academic project that involves predicting house prices based on the Sberbank Russian Housing Market dataset. However, I am stuck on cleaning one particular column, which indicates the year the property was built. I can't just impute the missing values by replacing them with the mean or median. I am looking for all the possible ways to impute such data that are meaningful and not just random numbers. Also, the scope of the project only allows me to use linear regression models in R, so I cannot rely on models like XGBoost to take care of the imputation automatically.
How to impute missing "build_year" column in Sberbank Russian Housing Market dataset on Kaggle?
Asked by Anurag Mandal (118 views)
Your question is very broad. There are actually multiple R packages that can help you here:
missForest
imputeR
mice
VIM
simputation
There are even more: CRAN has an official Task View (Missing Data) dedicated to listing R packages for imputation. Look mostly at the single imputation packages, because these will be a good fit for your task.
I can't tell you which method performs best for your specific task. This depends on your data and on the linear regression model you use afterwards.
So you have to test which combination of imputation algorithm and regression model gives you the best overall performance.
More generally, you are testing which combination of feature engineering/preprocessing, imputation algorithm, and regression model achieves the best result.
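One way such a comparison could look is sketched below in base R. It is only an illustration on simulated data: every column name except build_year, as well as the helper functions impute_median, impute_lm and holdout_rmse, are assumptions made for this example and not part of the dataset or of any package. It compares median imputation with regression imputation by the holdout RMSE of the downstream linear model.

```r
# Toy data standing in for the housing data. All column names except
# build_year are illustrative assumptions.
set.seed(1)
n <- 300
toy <- data.frame(
  full_sq   = runif(n, 30, 120),
  max_floor = as.numeric(sample(5:25, n, replace = TRUE))
)
toy$build_year <- round(1950 + 2.5 * toy$max_floor + rnorm(n, sd = 5))
toy$price <- 50000 * toy$full_sq + 20000 * (toy$build_year - 1950) +
  rnorm(n, sd = 2e5)
toy$build_year[sample(n, 60)] <- NA  # remove 20% of build_year

# Strategy A (hypothetical helper): replace NAs with the observed median.
impute_median <- function(d) {
  d$build_year[is.na(d$build_year)] <- median(d$build_year, na.rm = TRUE)
  d
}

# Strategy B (hypothetical helper): predict build_year from the other
# predictors. The target (price) is deliberately not used.
impute_lm <- function(d) {
  fit <- lm(build_year ~ full_sq + max_floor, data = d)  # NA rows are dropped
  miss <- is.na(d$build_year)
  d$build_year[miss] <- round(predict(fit, newdata = d[miss, ]))
  d
}

# Fit the downstream linear model on the training rows and return the
# RMSE on the held-out rows.
holdout_rmse <- function(d, test_idx) {
  fit <- lm(price ~ full_sq + max_floor + build_year, data = d[-test_idx, ])
  pred <- predict(fit, newdata = d[test_idx, ])
  sqrt(mean((d$price[test_idx] - pred)^2))
}

test_idx <- sample(n, 75)
results <- c(
  median = holdout_rmse(impute_median(toy), test_idx),
  lm     = holdout_rmse(impute_lm(toy), test_idx)
)
print(results)
```

On real data, a single holdout split is noisy, so repeating this over several splits (or using cross-validation) would give a more reliable comparison.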
Be careful about leakage in your testing (accidentally sharing information between the training and test datasets). Usually you can combine the train and test data and perform the imputation on the complete dataset. But it is important that the target variable is removed from the test dataset, because you would not have it for real data.
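A minimal base R sketch of this workflow, assuming small made-up train and test data frames: the predictors are stacked without the target, build_year is imputed on the combined data, and the rows are then split back. The column names full_sq and price and the simple lm-based imputation are assumptions for illustration only. Any single imputation method could take its place.

```r
# Hypothetical train/test data. price is the target and exists only in train.
train <- data.frame(
  full_sq    = c(40, 55, 70, 85, 60, 95),
  build_year = c(1965, NA, 1990, 2005, NA, 2012),
  price      = c(4.1, 5.0, 6.8, 8.9, 5.9, 10.2)
)
test <- data.frame(
  full_sq    = c(50, 80, 65),
  build_year = c(NA, 2001, 1985)
)

# Stack the predictors only, so the target never enters the imputation.
predictors <- setdiff(names(train), "price")
combined <- rbind(
  cbind(train[predictors], source = "train"),
  cbind(test[predictors],  source = "test")
)

# Regression imputation fitted on the combined predictors.
fit <- lm(build_year ~ full_sq, data = combined)  # rows with NA are dropped
miss <- is.na(combined$build_year)
combined$build_year[miss] <- round(predict(fit, newdata = combined[miss, ]))

# Split back. Row order is preserved, so price can be re-attached to train.
train_imp <- cbind(combined[combined$source == "train", predictors],
                   price = train$price)
test_imp <- combined[combined$source == "test", predictors]
```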
Most of the mentioned packages are quite easy to use. Here is an example for missForest:
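The original snippet is not preserved here, so the following is a minimal sketch of typical missForest usage rather than the answerer's exact code. The housing data frame and its columns other than build_year are made up for illustration. missForest() takes the incomplete data frame and returns a list whose ximp element is the imputed data.

```r
library(missForest)  # install.packages("missForest") if it is not installed

set.seed(123)
# Hypothetical stand-in for the housing predictors (numeric columns only).
housing <- data.frame(
  full_sq    = runif(200, 30, 120),
  max_floor  = as.numeric(sample(5:25, 200, replace = TRUE)),
  build_year = as.numeric(sample(1950:2015, 200, replace = TRUE))
)
housing$build_year[sample(200, 40)] <- NA  # introduce missing values

imp <- missForest(housing)         # random-forest based single imputation
housing_complete <- imp$ximp       # imputed data frame without NAs
housing_complete$build_year <- round(housing_complete$build_year)

imp$OOBerror                       # out-of-bag estimate of imputation error
```

Rounding build_year afterwards is a choice made for this sketch, since the imputed values are continuous while years are whole numbers.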
The other packages are equally easy to use. For all of these single imputation packages, it is usually just one function: you pass in your incomplete dataset and get the data back without NAs.