Can you extract the scoring algorithm from a scikit-learn RandomForestClassifier and load the coefficients into Oracle?


I have run a RandomForestClassifier model in Python using the sklearn module. I saved the model in a pickle file. I then extract data from Oracle, save it as a .csv file, send this .csv file to a machine that can open the model's pickle file in Python, and score the data. Once the data is scored I send the results back to Oracle.

Is it possible to extract the scoring coefficients from the RandomForestClassifier's predict_proba function, so I can load that data into Oracle and score the data entirely inside Oracle?

After reading the documentation, it appears the scoring algorithm is too complex for this approach, given that each new record has to be pushed through every tree before arriving at a final scored probability. Is this correct?

I appreciate your help in advance.

Matt


3 Answers


AFAIK there is no ready-made tool to do so, but you can read the Cython source code of the base decision tree class, in particular the predict method, to understand how the prediction works from the fitted parameters of the decision tree model. The random forest prediction treats the individual tree predictions as binary probabilities (0 or 1), averages them, and normalizes them as written here.
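
For what it's worth, in recent scikit-learn versions the forest averages each tree's per-leaf class fractions rather than hard 0/1 votes. A quick sketch to confirm the averaging mechanics (the model and data here are just illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a small forest on toy data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Reproduce predict_proba by hand: average the per-tree probabilities
manual = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
assert np.allclose(manual, clf.predict_proba(X))
```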

Turning that into PL/SQL might not be trivial, though. Apparently Oracle Data Mining has some support for PMML import/export of decision tree models, among other models. Unfortunately, I am not aware of any implementation of a PMML exporter for scikit-learn decision trees either (although one might not be too hard to write, taking the source code of the graphviz tree exporter as an example).

Also note that under PostgreSQL, on the other hand, you could use scikit-learn directly in a DB function written in PL/Python.


Here is how you would do it using the SKompiler library, which compiles a fitted model's prediction function into an SQL expression:

from skompiler import skompile

expr = skompile(rf.predict_proba)
sql = expr.to('sqlalchemy/oracle')

It might not be the most efficient way of evaluating an RF classifier, of course: for large forests, the generated query can easily reach megabytes in size.

NB: If your forest has more than a hundred estimators, you may also need to increase the system recursion limit to compile it:

import sys
sys.setrecursionlimit(10000)

I was in a situation where I had to run a random forest model on an Oracle database. It's possible to generate a PL/SQL package that performs the same function as the Python scikit-learn RF model.

This is pretty trivial to do once you have something like Daniele's answer from this SO question.

First, create this file, rforest_to_plsql.py:

def t(n):
    # indent helper: n levels of 4 spaces
    return " " * 4 * n

def get_est_code(tree, feature_names):
    # Generate the PL/SQL if/else body for a single fitted decision tree
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    value = tree.tree_.value
    def recurse(left, right, threshold, features, node, depth, code):
        if threshold[node] != -2:  # -2 marks a leaf in sklearn's tree arrays
            code += t(depth) + "if ( " + features[node] + " <= " + str(threshold[node]) + " ) then\n"
            depth += 1
            if left[node] != -1:
                code = recurse(left, right, threshold, features, left[node], depth, code)
            code += t(depth - 1) + "else\n"
            if right[node] != -1:
                code = recurse(left, right, threshold, features, right[node], depth, code)
            code += t(depth - 1) + "end if;\n"
            depth -= 1
        else:
            # leaf: return the two per-class counts as a two_values object
            code += t(depth) + "return two_values(" + str(value[node][0][0]) + ", " + str(value[node][0][1]) + ");\n"
        return code
    return recurse(left, right, threshold, features, 0, 2, "")


def get_pkg_header_code(clf, feature_names):
    pkg_h_code = """create or replace package pkg_rforest_model as
    function predict_proba (\n"""
    for feat in feature_names:
        pkg_h_code += t(2) + feat + "   number,\n"
    pkg_h_code = pkg_h_code[:-2] + ")  return number;\n"
    pkg_h_code += "end pkg_rforest_model;"
    return pkg_h_code

def get_pkg_body_code(clf, feature_names):
    pkg_b_code = "create or replace package body pkg_rforest_model as\n"        
    #code for each estimator
    for index, estimator in enumerate(clf.estimators_):
        func_name = "f_est_" + str(index).zfill(3)
        pkg_b_code += t(1) + "function " + func_name + " (\n"
        for feat in feature_names:
            pkg_b_code += t(2) + feat + "   number,\n"
        pkg_b_code = pkg_b_code[:-2] + ") return two_values as\n    begin\n"
        pkg_b_code += get_est_code(clf.estimators_[index], feature_names)
        pkg_b_code += "    end " + func_name + ";\n"
    #this function calls each estimator function and returns the averaged probability
    pkg_b_code += "    function predict_proba (\n"
    for feat in feature_names:
        pkg_b_code += t(2) + feat + "   number,\n"
    pkg_b_code = pkg_b_code[:-2] + ")  return number as\n    v_prob    number;\n"    
    for index, estimator in enumerate(clf.estimators_):
        func_name = "f_est_" + str(index).zfill(3)
        pkg_b_code += t(2) + "v_" + func_name + "_a number;\n"
        pkg_b_code += t(2) + "v_" + func_name + "_b number;\n"
        pkg_b_code += t(2) + "pr_est_" + str(index).zfill(3) + " number;\n"

    pkg_b_code += t(1) + "begin\n"    
    for index, estimator in enumerate(clf.estimators_):
        func_name = "f_est_" + str(index).zfill(3)
        pkg_b_code += t(2) + "v_" + func_name + "_a := " + func_name+ "(" + ", ".join(feature_names) + ").a;\n"
        pkg_b_code += t(2) + "v_" + func_name + "_b := " + func_name+ "(" + ", ".join(feature_names) + ").b;\n"
        pkg_b_code += t(2) + "pr_est_" + str(index).zfill(3) + " := v_" + func_name + "_a / ( v_" + \
                      func_name + "_a + v_" + func_name + "_b);\n"
    pkg_b_code += t(2) + "return  ("
    for index, estimator in enumerate(clf.estimators_):
        pkg_b_code += "pr_est_" + str(index).zfill(3) + " + "
    pkg_b_code = pkg_b_code[:-2] + ") / " + str(len(clf.estimators_)) + ";\n"
    pkg_b_code += t(1) + "end predict_proba;\n"   
    pkg_b_code += "end pkg_rforest_model;"
    return pkg_b_code
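
The estimator functions above return a two_values object (with fields a and b, the two per-class leaf counts read via .a and .b in predict_proba), so the package assumes an Oracle object type along these lines has been created beforehand; a minimal definition might be:

```sql
create or replace type two_values as object (a number, b number);
```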

Then you train your model and get the PL/SQL code back from the file's functions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import rforest_to_plsql
n_features = 4
X, y = make_classification(n_samples=1000, n_features=n_features,
                            n_informative=2, n_redundant=0,
                            random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
features = ["f" + str(i) for i in range(n_features)]
pkg_h_code = rforest_to_plsql.get_pkg_header_code(clf, features)
pkg_b_code = rforest_to_plsql.get_pkg_body_code(clf, features)
print(pkg_h_code)
print(pkg_b_code)

Once you've created that package on the database you can do something like:

select pkg_rforest_model.predict_proba(0.513889 , 0.511111 , 0.491667 ,  0)
from   dual;

This is pure PL/SQL and should run very quickly. If you have a very big RF, you could compile the package natively for even more performance. Be warned, though: the generated package could be tens of thousands of lines of code.