I'm trying to create a solution that will be somehow similar to the Mixed Effects Model (GLMM) that is not present in scikit-learn at the moment. Imagine a simple heart-disease dataset from https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset that looks like this:

looks like this

I conduct some standard procedures:

# numerical variables
num_cols = ['age',
            'trestbps',
            'chol',
            'thalach',
            'oldpeak'
            ]

# categorical variables
cat_cols = ['sex',
            'cp',
            'fbs',
            'restecg',
            'exang',
            'slope',
            'ca',
            'thal'
            ]

# changing format of the categorical variables
df[cat_cols] = df[cat_cols].apply(lambda x: x.astype('object'))

# target variable
y = df['target']

# features
X = df.drop(['target'], axis=1)

# data split:

# random seed
np.random.seed(42)

# splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y)

...and use the standard transformers to pre-process the variables.

# pipeline for numerical data
num_preprocessing = Pipeline([('num_imputer', SimpleImputer(strategy='mean')), # imputing with mean
                              ('minmaxscaler', MinMaxScaler())]) # scaling

# pipeline for categorical data
cat_preprocessing = Pipeline([('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')), # filling missing values
                              ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))]) # One Hot Encoding

# preprocessor - combining pipelines
preprocessor = ColumnTransformer([
    ('categorical', cat_preprocessing, cat_cols),
    ('numerical', num_preprocessing, num_cols)
])

so I can run a logistic regression:

# any model parameters
log_ini_params = {'penalty': 'l2', 
                  'tol': 0.0073559740277086005, 
                  'C': 1.1592424247511928, 
                  'fit_intercept': True, 
                  'solver': 'liblinear'
                  }

# model - Pipeline
log_clf = Pipeline([('preprocessor', preprocessor),
                    ('clf', LogisticRegression(**log_ini_params))]
                  )

# fitting the model
log_clf.fit(X_train, y_train)

And have a look at the named steps:

{'preprocessor': ColumnTransformer(transformers=[('categorical',
                                  Pipeline(steps=[('cat_imputer',
                                                   SimpleImputer(fill_value='missing',
                                                                 strategy='constant')),
                                                  ('onehot',
                                                   OneHotEncoder(drop='first',
                                                                 handle_unknown='ignore'))]),
                                  ['sex', 'cp', 'fbs', 'restecg', 'exang',
                                   'slope', 'ca', 'thal']),
                                 ('numerical',
                                  Pipeline(steps=[('num_imputer',
                                                   SimpleImputer()),
                                                  ('minmaxscaler',
                                                   MinMaxScaler())]),
                                  ['age', 'trestbps', 'chol', 'thalach',
                                   'oldpeak'])]),
 'clf': LogisticRegression(C=1.1592424247511928, solver='liblinear',
                    tol=0.0073559740277086005)}

...in order to extract feature importances:

# numeric labels
nums = log_clf.named_steps['preprocessor'].transformers_[1][2]

# categorical labels
cats = log_clf.named_steps['preprocessor'].transformers_[0][1]['onehot'].get_feature_names_out(cat_cols).tolist()

# features
feats = nums + cats

# adjusting features to columns
features_dict = dict(zip(feats, log_clf.named_steps['clf'].coef_.tolist()[0]))

#  chart
features_df = pd.DataFrame(features_dict, index=[0])
features_df.T.plot.bar(title='Feature Importance', legend=False, figsize=(20,10));

enter image description here

So far so good. However, imagine we deal with a multi-level problem, for example the data contains information about the hospital of each patient:

# seed
np.random.seed(42)

# adding 3 hospitals
df['hospital'] = np.random.randint(1, 3, len(df)).astype('object')

So our data looks like below:

enter image description here

The research problem is to investigate if the coefficients of the regression for some variables are different for observations within each hospital.

The most proper way to deal with this problem is GLMM, using R library I would simply type something like:

model = lmer(target ~ 1 + variable + (1 + variable|hospital), data = df)

However, in Python we're either limited to statsmodels or can conduct some tests like interactions between our variables and the grouping variable (hospital in our case)

The questions are:

  1. How is it possible to include the multiplication step between one-hot-encoded Hospital variable (it would probably be added to the cat_columns or have it's own Pipeline) and other PREPROCESSED variables (selected scaled nums for example age and chol) so new variables like hospital_2_age appear in the Pipeline/ColumnTransformer?

Obviously - there are some methodoligical issues on the way, for example if the scaling should be conducted globally or per grouping variable etc. (but it's not a programming question)

  1. It is not a purely programming question, but if 1. is an overkill - is there a smooth way to implement a GLMM in scikit-learn Pipeline in such workflow?
0

There are 0 best solutions below