I am working on an academic project that involves predicting house prices from the Sberbank Russian Housing Market dataset. However, I am stuck on cleaning one particular column, build_year, which indicates when the property was built. I can't just impute the missing values by replacing them with the mean or median. I am looking for all the possible ways to impute such data that are meaningful and not just random numbers. Also, the scope of the project only allows me to use linear regression models in R, so I can't rely on models like XGBoost to take care of the imputation automatically.
How to impute missing "build_year" column in Sberbank Russian Housing Market dataset on Kaggle?
Asked by Anurag Mandal
Your question is very broad. There are actually multiple R packages that can help you here:
- missForest
- imputeR
- mice
- VIM
- simputation

There are even more; there is a whole official Task View dedicated to listing packages for imputation in R. Look mostly for single imputation packages, because these will be a good fit for your task.
I can't tell you which method performs best for your specific task. This depends on your data and on the linear regression model you use afterwards.
So you have to test which combination of imputation algorithm + regression model gives you the best overall performance.
More precisely, you are testing which combination of feature engineering / preprocessing + imputation algorithm + regression model achieves the best result.
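As a rough sketch of what such a comparison could look like, the base R code below scores two imputation strategies by the hold-out RMSE of the downstream `lm()` model. Everything except the column name build_year is an assumption made for illustration: the data are simulated, `full_sq` and `price` are stand-in columns, and the two imputers are simple placeholders for whichever package functions you actually want to compare. Here the imputers are fitted on the training rows only, which is one conservative way to keep the comparison honest.

```r
set.seed(42)

# Simulated stand-in for the housing data (column names other than
# build_year are hypothetical).
n <- 500
full_sq    <- runif(n, 30, 120)
build_year <- round(1950 + 0.5 * full_sq + rnorm(n, sd = 8))
price      <- 50 * full_sq + 20 * (build_year - 1950) + rnorm(n, sd = 300)
dat <- data.frame(full_sq, build_year, price)

# Remove 30% of build_year to mimic the missing values.
dat$build_year[sample(n, size = round(0.3 * n))] <- NA

# Simple train / hold-out split.
train_idx <- sample(n, size = round(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Candidate 1: median imputation, with the median taken from training rows.
impute_median <- function(train, test) {
  med <- median(train$build_year, na.rm = TRUE)
  train$build_year[is.na(train$build_year)] <- med
  test$build_year[is.na(test$build_year)]   <- med
  list(train = train, test = test)
}

# Candidate 2: regression imputation from full_sq, fitted on training rows.
impute_regression <- function(train, test) {
  fit <- lm(build_year ~ full_sq, data = train)  # rows with NA are dropped
  fill <- function(d) {
    miss <- is.na(d$build_year)
    d$build_year[miss] <- predict(fit, newdata = d[miss, ])
    d
  }
  list(train = fill(train), test = fill(test))
}

# Fit the downstream linear model and score it on the hold-out rows.
holdout_rmse <- function(imputer) {
  d <- imputer(train, test)
  model <- lm(price ~ full_sq + build_year, data = d$train)
  sqrt(mean((d$test$price - predict(model, newdata = d$test))^2))
}

results <- c(median     = holdout_rmse(impute_median),
             regression = holdout_rmse(impute_regression))
print(results)
```

The same `holdout_rmse()` wrapper can take any function with the `(train, test)` signature, so a package-based imputer can be dropped in next to these two.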
Be careful about leakage in your testing (accidentally sharing information between the test and training datasets). Usually you can combine the train and test data and perform the imputation on the complete dataset, but it is important that the target variable is removed from the test dataset before you do this, because you would not have it for real data.
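One way to arrange this, sketched in base R under a few assumptions: the data frames are simulated, `full_sq` and `price` are hypothetical column names, and the regression-based fill is only a placeholder for a real imputation function. In this sketch the target is left out of the combined data entirely, so it cannot influence the imputed values on either side.

```r
set.seed(1)

# Simulated stand-ins for the train and test files; only build_year is
# taken from the question, the other column names are hypothetical.
train <- data.frame(full_sq = runif(200, 30, 120))
train$build_year <- round(1950 + 0.5 * train$full_sq + rnorm(200, sd = 8))
train$price      <- 50 * train$full_sq + rnorm(200, sd = 300)  # target
test <- data.frame(full_sq = runif(100, 30, 120))
test$build_year  <- round(1950 + 0.5 * test$full_sq + rnorm(100, sd = 8))

train$build_year[sample(200, 50)] <- NA
test$build_year[sample(100, 25)]  <- NA

# Stack the predictors only; the target is deliberately excluded.
predictors <- c("full_sq", "build_year")
combined <- rbind(train[, predictors], test[, predictors])

# Placeholder imputer (regression on full_sq); a package call would go here.
fit  <- lm(build_year ~ full_sq, data = combined)
miss <- is.na(combined$build_year)
combined$build_year[miss] <- round(predict(fit, newdata = combined[miss, ]))

# Split back and re-attach the target to the training rows.
n_train   <- nrow(train)
train_imp <- cbind(combined[seq_len(n_train), ], price = train$price)
test_imp  <- combined[-seq_len(n_train), ]
```

After this step `train_imp` can be passed to `lm()` and `test_imp` to `predict()` without any NA handling.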
Most of the mentioned packages are quite easy to use; here is an example for missForest:
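A minimal sketch of typical missForest usage, assuming the package is installed. It runs on the built-in iris data with artificially removed values rather than on the housing data, so the object names are illustrative only.

```r
# Assumes the package is installed: install.packages("missForest")
library(missForest)

set.seed(81)

# Toy data with 10% of the values removed; prodNA() ships with missForest.
iris_mis <- prodNA(iris, noNA = 0.1)

# Run the imputation. The result is a list; the completed data frame
# is stored in the ximp element.
iris_imp <- missForest(iris_mis)
iris_complete <- iris_imp$ximp

# Out-of-bag estimate of the imputation error
# (NRMSE for numeric columns, PFC for factor columns).
print(iris_imp$OOBerror)

# Check that no missing values remain.
print(anyNA(iris_complete))
```

For the housing data you would pass your own data frame (numeric and factor columns only) in place of `iris_mis`.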
The other packages are equally easy to use. For all of these single imputation packages it is usually just one function: you pass in your incomplete dataset and get the data back without NAs.