I have a large dataset (about 11,000 rows) and I'm trying to fit a classification random forest that I intend to use for prediction. My data is very imbalanced: for the outcome variable I'm trying to predict, about 89% of the rows are marked "0" and the remainder are marked "1". The code I am using is as follows:
library(randomForest)
# Draw 500 rows from each outcome class for every tree
RFTry <- randomForest(as.factor(OutcomeVariable) ~ ., data = df, importance = TRUE,
                      ntree = 200, sampsize = c(500, 500))
I am unsure what sample sizes I should be using. Should I sample the same number of rows from each outcome class, or different numbers? And how many rows should I draw in total? The table below shows the number of rows in each class.
> table(df$OutcomeVariable)
    0     1
10228  1234
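To make the two options I'm weighing concrete, here is a minimal sketch in base R of the per-class sample sizes each would give, using the class counts from the table above. The names `class_counts`, `n_per_class`, and `total` are just illustrative, and the values 500 and 1000 are placeholders taken from my call, not recommendations.

```r
# Class counts copied from the table above (assumed order: "0", then "1").
class_counts <- c("0" = 10228, "1" = 1234)

# Option A: balanced, the same number of rows from each class per tree.
n_per_class <- 500  # placeholder value from my call
balanced <- rep(n_per_class, length(class_counts))
names(balanced) <- names(class_counts)

# Option B: proportional, keeping the original class ratio in each tree's sample.
total <- 1000  # placeholder total sample size per tree
proportional <- round(total * class_counts / sum(class_counts))

# The smaller class limits how many distinct rows a balanced draw can use per class.
max_balanced <- min(class_counts)

print(balanced)
print(proportional)
print(max_balanced)

# Either vector could then be passed as the per-class sample sizes, e.g.
# (commented out because it needs the randomForest package and my df):
# randomForest(as.factor(OutcomeVariable) ~ ., data = df, ntree = 200,
#              sampsize = unname(balanced))
```

With these placeholder numbers, the proportional option would draw far fewer "1" rows per tree than the balanced one, which is the trade-off I'm asking about.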
Thank you!