I am working on an academic project that involves predicting house prices from the Sberbank Russian Housing Market dataset. However, I am stuck on cleaning one particular column, which gives the year the property was built. I can't simply impute the missing values with the mean or median. I am looking for ways to impute such data that are meaningful and not just random numbers. Also, the scope of the project allows me to use only linear regression models in R, so I can't rely on models like XGBoost to take care of the missing values automatically.
How to impute missing "build_year" column in Sberbank Russian Housing Market dataset on Kaggle?
Asked by Anurag Mandal (164 views)
Your question is very broad. There are actually multiple R packages that can help you here:
- missForest
- imputeR
- mice
- VIM
- simputation

There are even more; a whole official Task View is dedicated to listing packages for imputation in R. Look mostly for single imputation packages, because these will be a good fit for your task.
I can't tell you which method performs best for your specific task. This depends on your data and on the linear regression model you use afterwards.
So you have to test which combination of imputation algorithm and regression model gives you the best overall performance.
Overall, you are testing which combination of feature engineering / preprocessing, imputation algorithm, and regression model achieves the best result.
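One way to run such a comparison is sketched below in base R. The data is synthetic, the column names (`full_sq`, `build_year`, `price_doc`) only mimic the Kaggle dataset, and the two imputers and the `rmse_for` helper are hypothetical names chosen for illustration.

```r
set.seed(42)
n <- 300

# Synthetic stand-in data (an assumption, not the real Kaggle file).
full_sq    <- runif(n, 30, 120)
build_year <- round(1950 + 0.4 * full_sq + rnorm(n, sd = 10))
price_doc  <- 5e4 * full_sq + 2e4 * (build_year - 1950) + rnorm(n, sd = 2e5)
dat <- data.frame(full_sq, build_year, price_doc)
dat$build_year[sample(n, 60)] <- NA  # knock out 60 build years

train_idx <- sample(n, 200)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Candidate 1: fill with the median build year of the training data.
impute_median <- function(train, test) {
  med <- median(train$build_year, na.rm = TRUE)
  train$build_year[is.na(train$build_year)] <- med
  test$build_year[is.na(test$build_year)] <- med
  list(train = train, test = test)
}

# Candidate 2: predict build_year from another predictor (never from the target).
impute_regression <- function(train, test) {
  fit <- lm(build_year ~ full_sq, data = train)  # lm() drops NA rows by default
  fill <- function(d) {
    miss <- is.na(d$build_year)
    d$build_year[miss] <- round(predict(fit, newdata = d[miss, ]))
    d
  }
  list(train = fill(train), test = fill(test))
}

# Fit the price model on imputed training data, score it on imputed test data.
rmse_for <- function(impute) {
  parts <- impute(train, test)
  model <- lm(price_doc ~ full_sq + build_year, data = parts$train)
  pred  <- predict(model, newdata = parts$test)
  sqrt(mean((parts$test$price_doc - pred)^2))
}

results <- c(median = rmse_for(impute_median),
             regression = rmse_for(impute_regression))
print(results)
```

In this sketch each imputer is fitted on the training rows only, which is a stricter choice than imputing on the pooled data. Repeating the split several times (or using cross-validation) gives a more stable ranking than a single holdout.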
Be careful about leakage in your testing (accidentally sharing information between the test and training datasets). Usually you can combine the train and test data and perform the imputation on the complete dataset. It is important, however, that the target variable is removed from the test dataset first, because you would not have it for real, unseen data.
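A minimal sketch of that workflow, assuming the missForest package is installed. The `train` and `test` data frames are toy stand-ins with illustrative column names, and the target (`price_doc` here) is left out of the imputation entirely.

```r
library(missForest)

set.seed(1)
# Toy stand-ins for the train/test files; columns and values are made up.
train <- data.frame(
  full_sq    = runif(60, 30, 120),
  floor      = round(runif(60, 1, 20)),
  build_year = round(runif(60, 1950, 2015)),
  price_doc  = runif(60, 2e6, 2e7)
)
test <- data.frame(
  full_sq    = runif(20, 30, 120),
  floor      = round(runif(20, 1, 20)),
  build_year = round(runif(20, 1950, 2015))
)
train$build_year[sample(60, 10)] <- NA
test$build_year[sample(20, 4)]   <- NA

# Pool the predictors only, so the target never informs the imputation.
predictors <- setdiff(names(train), "price_doc")
combined   <- rbind(train[predictors], test[predictors])

imputed <- missForest(combined)$ximp
imputed$build_year <- round(imputed$build_year)  # years should be whole numbers

# Split back into the original parts and re-attach the target to train.
n_train   <- nrow(train)
train_imp <- cbind(imputed[seq_len(n_train), ], price_doc = train$price_doc)
test_imp  <- imputed[-seq_len(n_train), ]
```

Keeping `price_doc` out of `combined` means the imputed build years are based on the same information for training and test rows.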
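Most of the mentioned packages are quite easy to use. With missForest, for example, you call `missForest()` on the incomplete data frame and read the completed data from the `ximp` element of the result.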
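A minimal sketch of a missForest call, assuming the package is installed. It uses the built-in `iris` data with artificially removed values (via the package's `prodNA()` helper) rather than the housing data.

```r
library(missForest)

set.seed(81)
# Remove 10% of the values at random to get an incomplete dataset.
iris_mis <- prodNA(iris, noNA = 0.1)

# Impute; the result is a list, and the completed data is in $ximp.
iris_imp <- missForest(iris_mis)
completed <- iris_imp$ximp

# Out-of-bag estimate of the imputation error.
iris_imp$OOBerror
```

For the housing data you would pass your own data frame in place of `iris_mis`, after converting character columns to factors.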
The other packages are equally easy to use. Usually, each of these single imputation packages comes down to one function: you pass in your incomplete dataset and get the data back without NAs.
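As an illustration with another package from the list, here is a sketch using mice, assuming it is installed. Setting `m = 1` produces a single completed dataset; the toy data frame and its column names are made up for the example.

```r
library(mice)

set.seed(7)
# Toy data with gaps in build_year only; values are illustrative.
houses <- data.frame(
  full_sq    = runif(80, 30, 120),
  floor      = round(runif(80, 1, 20)),
  build_year = round(runif(80, 1950, 2015))
)
houses$build_year[sample(80, 15)] <- NA

# One imputed dataset via predictive mean matching.
imp <- mice(houses, m = 1, method = "pmm", seed = 123, printFlag = FALSE)
houses_complete <- mice::complete(imp, 1)
```

Predictive mean matching fills each gap with an observed value from a similar row, so the imputed build years are always years that actually occur in the data.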