Partykit CTREE question: How to subset observations within each terminal node (including variables not part of CTREE)


I would like a bit of help with my code. This is my first time posting, so please excuse the length.

Overview: I have conducted a CTREE analysis to identify certain intersections associated with a particular outcome. Within my CTREE, I only used a handful of predictor variables. This went well. Now, I would like to compare the distribution of other variables (not included in the CTREE, but included in my larger dataset) within each intersection.

  • For example, let's say my predictor variables are age, gender and race. I would like to figure out the frequency of education levels within each terminal node/intersection, and eventually go on to compare them across groups.

Here is some of the code I've tried so far, and the closest I've gotten to subsetting the observations in each terminal node:

library("partykit")

set.seed(418)

# Fit the tree; the formula refers to the columns of eddata2 by name
eddata_ctree2 <- ctree(edavoidever ~ gender + age + rural + immigration_3cat + race + sexwork + transid + disid,
                       data = eddata2, control = ctree_control(minsplit = 30))

plot(eddata_ctree2)

# IDs of the terminal nodes that occur in the fitted tree
terminal_nodes <- unique(predict(eddata_ctree2, type = "node"))

# One data frame per terminal node, keeping all columns of eddata2
samples_by_node <- lapply(terminal_nodes, function(node_id) {
  eddata2[predict(eddata_ctree2, newdata = eddata2, type = "node") == node_id, ]
})

names(samples_by_node) <- as.character(terminal_nodes)

node5 <- samples_by_node[["5"]]
node6 <- samples_by_node[["6"]]
node8 <- samples_by_node[["8"]]
node9 <- samples_by_node[["9"]]
node10 <- samples_by_node[["10"]]
node12 <- samples_by_node[["12"]]
node13 <- samples_by_node[["13"]]

However, the number of observations in each subsetted dataset does not match the number of observations in the corresponding CTREE node. Each dataset has a few more or fewer observations than its node, and I'm not sure where the extra or missing observations come from. Note that some observations have missing values for the predictor variables, so that may be the cause.

Note: When I use data_party (as follows), I get the correct number of observations, but the result only includes the variables used in the CTREE, not the other variables in the larger dataset (eddata2).

ever5 <- data_party(eddata_ctree2, id = 5)

Please let me know if you have any insights or know a better way to accomplish/fix this.

Thanks so much!

Answer by Achim Zeileis:

I think you want to get the predicted "node" and turn it into a factor. This factor can then be used for subsequent investigations. For a reproducible illustration, let's predict iris species by sepal length only:

library("partykit")
ct <- ctree(Species ~ Sepal.Length, data = iris)
plot(ct)

[Figure: CTree for the prediction of species by sepal length]

Then we can add the fitted node/group as a categorical factor variable into the data set:

iris$node <- factor(predict(ct, newdata = iris, type = "node"))
head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species node
## 1          5.1         3.5          1.4         0.2  setosa    2
## 2          4.9         3.0          1.4         0.2  setosa    2
## 3          4.7         3.2          1.3         0.2  setosa    2
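
To get one data set per terminal node that still carries every column, the node factor can be passed to split(). The following is only a minimal sketch on the iris illustration; iris_nodes and by_node are hypothetical names chosen here. It also tabulates the node ids stored at fitting time (predict() without newdata) for comparison. With complete data such as iris the two sets of counts should agree. My understanding is that with missing predictor values ctree may assign observations to a branch at random (see the majority argument of ctree_control()), so re-predicting on the same data need not reproduce the fitted assignment. Please treat that as an assumption to check against your own data.

```r
library("partykit")

# Fit the same illustrative tree as above.
ct <- ctree(Species ~ Sepal.Length, data = iris)

# Work on a copy (hypothetical name) so that iris itself stays untouched.
iris_nodes <- iris
iris_nodes$node <- factor(predict(ct, newdata = iris_nodes, type = "node"))

# One data frame per terminal node, each keeping all columns of the data.
by_node <- split(iris_nodes, iris_nodes$node)
sapply(by_node, nrow)

# Node ids stored when the tree was fitted (no newdata), for comparison.
fitted_node <- predict(ct, type = "node")
table(fitted_node)
```

A single node's data is then available as, e.g., by_node[["2"]], where the name is the terminal node id.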

And this can be used like any other factor variable, e.g., for creating exploratory displays:

plot(Petal.Length ~ node, data = iris)

[Figure: Boxplot of petal length by predicted node]
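
For a categorical variable that was not part of the tree (such as education level in the question), the same node factor can be cross-tabulated with it. The sketch below is only an illustration under assumptions: iris has no such variable, so petal_width_cat is a hypothetical stand-in created by binning Petal.Width with arbitrary breaks.

```r
library("partykit")

# Same illustrative tree and node factor as above.
ct <- ctree(Species ~ Sepal.Length, data = iris)
iris_nodes <- iris
iris_nodes$node <- factor(predict(ct, newdata = iris_nodes, type = "node"))

# Hypothetical stand-in for a categorical variable not used in the tree.
iris_nodes$petal_width_cat <- cut(iris_nodes$Petal.Width,
                                  breaks = c(0, 0.8, 1.6, Inf),
                                  labels = c("narrow", "medium", "wide"))

# Frequencies of the categories within each terminal node.
counts <- table(node = iris_nodes$node,
                petal_width = iris_nodes$petal_width_cat)
counts

# Row-wise proportions, i.e., the distribution within each node.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Each row of shares sums to 1, so the rows can be compared directly across nodes.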