I have missing data in one column (y) of a dataframe I am working with. I want to impute these missing values using all the available information in that dataframe (i.e., the observed part of column y and the complete columns x1-x9). I want to use a Random Forest model to impute the missing data in R.

First I used the basic randomForest package and trained a Random Forest regression model on 80% of the complete data (i.e., the rows where all columns are observed), so that I could test the model on the remaining 20%. I then used "predict" to apply the model to the test data and generate imputations for my variable y:

library(randomForest)

rf <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data = train)

test$imputation <- predict(rf, newdata = test)
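For completeness, here is one way the 80/20 split and RMSE check described above could be set up end to end. The synthetic dataframe, the variable names (idx, rmse), and the toy response are my own illustration, not from the original setup:

```r
library(randomForest)

set.seed(42)
# toy stand-in for the complete cases of the real dataframe
n  <- 500
df <- data.frame(matrix(rnorm(n * 9), ncol = 9,
                        dimnames = list(NULL, paste0("x", 1:9))))
df$y <- df$x1 + 2 * df$x2 + rnorm(n)   # made-up response

# 80/20 train/test split
idx   <- sample(n, size = 0.8 * n)
train <- df[idx, ]
test  <- df[-idx, ]

# y ~ . is equivalent to listing x1 + ... + x9 when these are the only columns
rf <- randomForest(y ~ ., data = train)
test$imputation <- predict(rf, newdata = test)

# accuracy on the held-out 20%
rmse <- sqrt(mean((test$y - test$imputation)^2))
```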

I like this approach, as "predict.all = TRUE" gives me access to the predictions of every individual tree, which helps me get an understanding of the variance in my results.

Now I recently came across the "missForest" package, which seems to be one of the go-to methods for imputation using Random Forests.

When I test both models on the same data and check their accuracy on my test data using RMSEs, the results are pretty much the same. However, I am confused and can't figure out how exactly the two methods differ in terms of how the packages/functions are configured.
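For reference, this is roughly how such a comparison with missForest could be run. Unlike the supervised approach above, missForest takes the whole dataframe (including the rows where y is NA) and imputes iteratively. The synthetic data, the masked indices, and the name rmse_mf are my own illustration:

```r
library(missForest)

set.seed(7)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 - d$x2 + rnorm(n)

truth <- d$y
miss  <- sample(n, 60)        # knock out 20% of y to mimic missingness
d$y[miss] <- NA

# missForest fits a Random Forest per variable containing NAs and
# iterates until the imputations stop improving (OOB-based criterion)
imp <- missForest(d)

rmse_mf <- sqrt(mean((truth[miss] - imp$ximp$y[miss])^2))
```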

Can someone help me here?
