I am trying to perform an ordinal logistic regression in Python by calling R's MASS::polr function through rpy2 (a Python interface to R). However, I run into trouble when some of my predictor columns are collinear or almost collinear: polr automatically discards some of those columns during fitting, which causes an error when I try to get predictions on the training data.
Here is a minimal example:
import pandas as pd
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr

pandas2ri.activate()
mass = importr("MASS")

# dataframe with two collinear predictors (x1 and x2)
df = pd.DataFrame(columns=['target', 'x1', 'x2', 'x3'],
                  data=[[0, 0, 0, 1],
                        [1, 1, 1, 0],
                        [2, 1, 1, 1]])

model = mass.polr('as.factor(target) ~ .', df, Hess=True)  # gives warning below
'''
Warning message:
In polr(as.factor(target) ~ ., data = df, Hess = TRUE) :
design appears to be rank-deficient, so dropping some coefs
'''
r.predict(model, df, type="class").__array__()  # gives error below
'''
Error in X %*% object$coefficients : non-conformable arguments
'''
The same error actually happens in R as well, but there I can at least see which columns have been discarded by looking at summary(model).
In Python, however, r.summary(model).rx2('coefficients') (which should carry the same information as summary(model) in R) does not show the coefficient names, only bare values:
array([[4.57292582e+01, 8.25605929e+02, 5.53887231e-02],
[2.11604944e+01, 2.85721885e+02, 7.40597606e-02],
[3.19476895e+01, 3.60605165e+02, 8.85946531e-02],
[5.66312792e+01, 8.93862000e+02, 6.33557296e-02]])
Does anyone know a way to retrieve the coefficient names in Python? Or is there any other workaround?
Even without pandas2ri.activate(), the FloatMatrix that gets returned from r.summary(model).rx2('coefficients') does not include the variable names. However, we are able to extract those names with R's dimnames function; full example below.
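The sketch below assumes the model fitted in the question and that the coefficient matrix is still an R FloatMatrix (i.e. no NumPy conversion has been applied to it):

coefs = r.summary(model).rx2('coefficients')  # R matrix with Value / Std. Error / t value per row
coef_names = list(r.dimnames(coefs)[0])       # first dimnames element holds the row (coefficient) names
print(coef_names)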
which returns

['x1', 'x3', '0|1', '1|2']

showing that x2 has been dropped.
Alternatively, you can print the full model output with
r.print(r.summary(model))
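Once you know which predictors survived, one possible workaround is to refit on just those columns so that predict no longer fails. A sketch, reusing df and coef_names from above:

kept = [n for n in coef_names if n in df.columns]  # real predictors only; drops threshold terms like '0|1'
model2 = mass.polr('as.factor(target) ~ .', df[['target'] + kept], Hess=True)
r.predict(model2, df, type="class").__array__()    # no longer raises the non-conformable error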