I would like a bit of help with my code. This is my first time posting, so please excuse the length.
Overview: I have conducted a CTREE analysis to identify certain intersections associated with a particular outcome. Within my CTREE, I only used a handful of predictor variables. This went well. Now, I would like to compare the distribution of other variables (not included in the CTREE, but included in my larger dataset) within each intersection.
- For example, let's say my predictor variables are age, gender and race. I would like to figure out the frequency of education levels within each terminal node/intersection, and eventually go on to compare them across groups.
Here is some of the code I've tried so far, and the closest I've gotten to subsetting the observations in each terminal node:
set.seed(418)
eddata_ctree2 <- ctree(eddata2$edavoidever ~ gender + age + rural + immigration_3cat + race + sexwork + transid + disid,
                       data = eddata2, control = ctree_control(minsplit = 30))
plot(eddata_ctree2)
terminal_nodes <- unique(predict(eddata_ctree2, type = "node"))
samples_by_node <- lapply(terminal_nodes, function(node_id) {
  df_node <- eddata2[predict(eddata_ctree2, newdata = eddata2, type = "node") == node_id, ]
  return(df_node)
})
names(samples_by_node) <- as.character(terminal_nodes)
node5 <- samples_by_node[["5"]]
node6 <- samples_by_node[["6"]]
node8 <- samples_by_node[["8"]]
node9 <- samples_by_node[["9"]]
node10 <- samples_by_node[["10"]]
node12 <- samples_by_node[["12"]]
node13 <- samples_by_node[["13"]]
However, the number of observations in each subsetted dataset does not equal the number of observations in the corresponding CTREE node. Every dataset has a few more or fewer observations than its node, and I'm not sure where these extra or missing observations come from. Note that some of the observations have missing values for the predictor variables, so maybe that's the issue?
Note: When I use data_party (as follows), I get the correct number of observations, but the result only includes the variables used in the CTREE and not the other variables in the larger dataset (eddata2):
ever5 <- data_party(eddata_ctree2, id = 5)
Please let me know if you have any insights or know a better way to accomplish/fix this.
Thanks so much!
I think you want to get the predicted "node" and turn that into a factor. This can then be used for subsequent investigations. For a reproducible illustration let's predict iris species by sepal length only:
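A minimal sketch of that step, assuming the partykit package is installed. The object name `ct` is chosen here for illustration only.

```r
# Sketch: fit a conditional inference tree on the built-in iris data,
# predicting Species from Sepal.Length only.
library("partykit")

ct <- ctree(Species ~ Sepal.Length, data = iris)

print(ct)  # text summary of the splits and terminal nodes
plot(ct)   # tree display with one panel per terminal node
```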
Then we can add the fitted node/group as a categorical factor variable into the data set:
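One way this could look, again as a sketch: the column name `Group` and the copy `iris2` are illustrative choices. Calling `predict()` without `newdata` returns the node each training observation was assigned to when the tree was fitted, so the group sizes should match the node sizes shown in the tree, provided no rows were dropped during fitting.

```r
library("partykit")

# Refit the illustrative tree so this snippet runs on its own.
ct <- ctree(Species ~ Sepal.Length, data = iris)

# Work on a copy so the built-in iris data stays untouched.
iris2 <- iris

# Fitted terminal node of every training observation, stored as a factor.
iris2$Group <- factor(predict(ct, type = "node"))

# Group sizes; compare these with the n reported in print(ct) or plot(ct).
table(iris2$Group)
```

Because `Group` sits in the full data frame, every other column of the data (not just the variables used in the tree) is available within each group.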
And this can be used like any other factor variable, e.g., for creating exploratory displays:
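For instance, a sketch along these lines (the names `iris2`, `Group`, and `tab` are illustrative) draws boxplots of a variable that was not used in the tree and tabulates a categorical variable within each node, which is the analogue of looking at education levels per terminal node:

```r
library("partykit")

# Refit the illustrative tree and attach the fitted node as a factor.
ct <- ctree(Species ~ Sepal.Length, data = iris)
iris2 <- iris
iris2$Group <- factor(predict(ct, type = "node"))

# Parallel boxplots of a variable not used in the tree, one per node.
plot(Petal.Length ~ Group, data = iris2)

# Frequencies of a categorical variable within each node ...
tab <- table(Node = iris2$Group, Species = iris2$Species)
tab

# ... and the same as row proportions (each node sums to 1).
prop.table(tab, margin = 1)
```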