I built an H2ORandomForestEstimator model using the train() method on my Spark dataframe, with a target column containing the values 0 or 1. I downloaded its MOJO file with model.download_mojo(MOJO_ZIP_PATH) and printed it with h2o.print_mojo(MOJO_ZIP_PATH, tree_index=tree_ind). Part of one such output tree is shown below.
As can be seen, leaf nodes have a field named predValue containing a value between 0 and 1. What is the meaning of this predValue field? Does it mean that, when predict() is called on a row whose input variables follow this root-to-leaf path, the target variable is likely to take the value stored in predValue?
Moreover, I want to post-process the output model of H2ORandomForestEstimator and keep only those rules (root-to-leaf paths) for which my model will predict 1. Is there a way to filter such rules by parsing the MOJO file, without actually running predict() on input variables? The predValue field in the MOJO output looked promising for this, but I could not figure out how it relates to the output variable. Can it be used to find the top-N rules?
'trees': [{
    'root': {
        'nodeNumber': 0,
        'weight': 18319.0,
        'colId': 169,
        'colName': 'pkg_items_gl_product_group_desc_1.gl_electronics',
        'leftward': True,
        'isCategorical': False,
        'inclusiveNa': True,
        'splitValue': 0.5,
        'rightChild': {
            'nodeNumber': 25,
            'weight': 462.0,
            'predValue': 0.9935065
        },
        'leftChild': {
            'nodeNumber': 1,
            'weight': 17857.0,
            'colId': 0,
            'colName': 'pkg_attr_total_pkg_price',
            'leftward': True,
            'isCategorical': False,
            'inclusiveNa': True,
            'splitValue': 186.52805,
            'rightChild': {
                'nodeNumber': 26,
                'weight': 201.0,
                'predValue': 0.9900498
            },
            'leftChild': {
                'nodeNumber': 3,
                'weight': 13184.0,
                'colId': 149,
                'colName': 'pkg_items_gl_product_group_desc_1.gl_automotive',
                'leftward': True,
                'isCategorical': False,
                'inclusiveNa': True,
                'splitValue': 0.5,
                'rightChild': {
                    'nodeNumber': 27,
                    'weight': 312.0,
                    'predValue': 0.99038464
                },
Thanks for the question.
Every leaf node contains a predValue field, which holds information regarding the final prediction made on that node. See the tree structure info for details.
To tell whether the resulting prediction is 0 or 1, you need the threshold (default_threshold) used for these decisions. You can find default_threshold in the model info.
To get the decision path for a specific node, see decision_paths; for the decision paths of the whole tree, see tree_decision_path.
If you are interested in leaf node assignment, see https://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html?#predicting-leaf-node-assignment.
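If you would rather work directly on the printed MOJO structure, here is a minimal sketch of how the root-to-leaf rules could be collected and filtered against a threshold. It is not an H2O API: iter_rules, rules_for_class_one and toy_tree are hypothetical names, and the toy tree values are made up. It assumes the dictionary layout shown in the question (colName, splitValue, leftChild, rightChild, predValue, weight), that a numeric split sends values below splitValue to the left, and that a row counts as 1 when the class-1 probability is at or above the threshold. Whether predValue refers to class 1 or to class 0 is left as a parameter, so check it against predict() on a few rows first. A random forest also averages over all of its trees, so a rule taken from a single tree is only an indication of the final prediction.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

# (conditions along the path, probability of class 1, leaf weight)
Rule = Tuple[List[str], float, float]


def iter_rules(node: Dict[str, Any],
               conditions: Tuple[str, ...] = ()) -> Iterator[Rule]:
    """Yield (conditions, predValue, weight) for every leaf below `node`."""
    if "predValue" in node:  # leaf node
        yield list(conditions), node["predValue"], node.get("weight", 0.0)
        return
    name, split = node["colName"], node["splitValue"]
    # Assumption: numeric split, values < splitValue go left, others go right.
    # NA routing (leftward / inclusiveNa) and categorical splits are ignored.
    branches = (("leftChild", f"{name} < {split}"),
                ("rightChild", f"{name} >= {split}"))
    for key, condition in branches:
        child = node.get(key)
        if child is not None:  # a truncated dump may lack a child
            yield from iter_rules(child, conditions + (condition,))


def rules_for_class_one(tree: Dict[str, Any],
                        threshold: float,
                        pred_is_class_one: bool,
                        top_n: Optional[int] = None) -> List[Rule]:
    """Keep rules whose class-1 probability reaches `threshold`, best first."""
    kept: List[Rule] = []
    for conds, pred, weight in iter_rules(tree["root"]):
        # Assumption to verify: which class the leaf value describes.
        p1 = pred if pred_is_class_one else 1.0 - pred
        if p1 >= threshold:
            kept.append((conds, p1, weight))
    # Rank by probability, then by how many training rows reached the leaf.
    kept.sort(key=lambda rule: (rule[1], rule[2]), reverse=True)
    return kept if top_n is None else kept[:top_n]


# Made-up toy tree in the same layout as the print_mojo output above.
toy_tree = {
    "root": {
        "colName": "a", "splitValue": 0.5,
        "rightChild": {"predValue": 0.9, "weight": 10.0},
        "leftChild": {
            "colName": "b", "splitValue": 2.0,
            "rightChild": {"predValue": 0.2, "weight": 5.0},
            "leftChild": {"predValue": 0.7, "weight": 20.0},
        },
    }
}

for conds, p1, weight in rules_for_class_one(toy_tree, 0.5, True, top_n=2):
    print(" AND ".join(conds), p1, weight)
```

In practice you would pass one element of the 'trees' list, the model's default_threshold as threshold, and repeat this for every tree in the forest.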
Let me know if you have another question.