Ordinal Logistic Regression in Python with rpy2 (Python interface for R): issue with collinear predictors

541 Views Asked by At

I am trying to perform an Ordinal Logistic Regression in Python calling R's mass.polr function with rpy2 (Python interface for the R language). However, I run into trouble when there are some collinear or almost collinear columns in my predictors: mass.polr automatically discards some of those columns during fitting, which causes an error when I try to get predictions on the training data.

Here is a minimum example:

from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr

pandas2ri.activate()

mass = importr("MASS")

# dataframe with two collinear predictors (x1 and x2)
df = pd.DataFrame(columns = ['target', 'x1', 'x2', 'x3'],
                  data    = [[   0   ,  0  ,  0  ,  1  ],
                             [   1   ,  1  ,  1  ,  0  ],
                             [   2   ,  1  ,  1  ,  1  ]])

model = mass.polr('as.factor(target) ~ .', df, Hess = True) # gives warning below
'''
Warning message:
In polr(as.factor(target) ~ ., data = df, Hess = TRUE) :
  design appears to be rank-deficient, so dropping some coefs

'''

r.predict(model, df, type = "class").__array__() # gives error below
'''
Error in X %*% object$coefficients : non-conformable arguments
'''

The same error actually happens in R as well, but there I can at least see which columns have been discarded by looking at summary(model).

Instead, in Python, r.summary(model).rx2('coefficients') (which should display the same output as summary(model) in R) does not display coefficient names, but just bare values:

array([[4.57292582e+01, 8.25605929e+02, 5.53887231e-02],
       [2.11604944e+01, 2.85721885e+02, 7.40597606e-02],
       [3.19476895e+01, 3.60605165e+02, 8.85946531e-02],
       [5.66312792e+01, 8.93862000e+02, 6.33557296e-02]])

Does anyone know a way to retrieve the coefficient names in Python? Or is there any other workaround?

2

There are 2 best solutions below

0
On

Even without pandas2ri.activate(), the FloatMatrix that gets returned from r.summary(model).rx2('coefficients') does not include the variable names. However, we are able to extract those names with R's dimnames function. Full example below:

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr
from rpy2.robjects.conversion import localconverter
mass = importr("MASS")

df = pd.DataFrame(columns = ['target', 'x1', 'x2', 'x3'],
                  data    = [[   0   ,  0  ,  0  ,  1  ],
                             [   1   ,  1  ,  1  ,  0  ],
                             [   2   ,  1  ,  1  ,  1  ]])

with localconverter(ro.default_converter + pandas2ri.converter):
    df = ro.conversion.py2rpy(df)

model = mass.polr('as.factor(target) ~ .', df, Hess = True)

coefs = r.summary(model).rx2('coefficients')

[x for x in r('dimnames')(coefs)[0]]

returns ['x1', 'x3', '0|1', '1|2'], showing x2 has been dropped.

Alternatively, you can print the full model output with r.print(r.summary(model))

0
On

r.summary(model).rx2('coefficients') returns an object without name because you are requesting to convert R objects into pandas (and implicitly numpy) ones earlier in that script (line pandas2ri.activate()). Numpy arrays do not have named elements.

Using activate is no longer recommended. Consider using a local converter in a context instead (example with pandas in the doc: https://rpy2.github.io/doc/v3.3.x/html/generated_rst/pandas.html).