I am trying to rewrite the source code of scikit-learn's permutation importance to achieve:
- Compatibility with Polars
- Compatibility with clusters of features
import polars as pl
import polars.selectors as cs
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=42,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train_polars = pl.DataFrame(X_train, schema=feature_names)
X_test_polars = pl.DataFrame(X_test, schema=feature_names)
# pl.Series takes a name and values; it has no `schema` argument
y_train_polars = pl.Series("target", y_train)
y_test_polars = pl.Series("target", y_test)
To get feature importances for a cluster of features, we need to permute all features in the cluster simultaneously, then pass the permuted data to the scorer and compare the result with the baseline score.
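As a rough sketch of that idea on plain NumPy arrays (not scikit-learn's actual implementation): the helper name `cluster_permutation_importance` and the `score_fn(X, y)` callable are hypothetical stand-ins for the real scorer, and the toy data exists only to make the example self-contained.

```python
import numpy as np


def cluster_permutation_importance(score_fn, X, y, cluster, n_repeats=5, seed=0):
    """Mean drop in score when all columns in `cluster` are shuffled together.

    `score_fn(X, y)` is a hypothetical callable returning a score where
    higher is better; `cluster` is a list of column indices.
    """
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # One row permutation is applied to every column in the cluster, so the
        # clustered features stay aligned with each other but not with y.
        row_order = rng.permutation(X.shape[0])
        X_perm[:, cluster] = X[row_order][:, cluster]
        drops.append(baseline - score_fn(X_perm, y))
    return float(np.mean(drops))


# Toy data: the target depends only on columns 0 and 1.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] + X_demo[:, 1]


def neg_mse(X, y):
    # Stand-in for a fitted model's scorer: "predicts" y as col0 + col1.
    return -np.mean((X[:, 0] + X[:, 1] - y) ** 2)


informative = cluster_permutation_importance(neg_mse, X_demo, y_demo, [0, 1])
noise = cluster_permutation_importance(neg_mse, X_demo, y_demo, [2, 3])
```

Here shuffling the cluster `[0, 1]` lowers the score, while shuffling `[2, 3]` leaves it unchanged because the stand-in scorer never reads those columns.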
However, I am struggling to replace multiple Polars DataFrame columns when examining clusters of features:
from sklearn.utils import check_random_state
random_state = check_random_state(42)
random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
X_train_permuted = X_train_polars.clone()
shuffle_arr = np.array(X_train_permuted[:, ["feature_0", "feature_1"]])
random_state.shuffle(shuffle_arr)
X_train_permuted.replace_column( # This operation is in place
0,
pl.Series(name="feature_0", values=shuffle_arr))
Normally shuffle_arr would have a shape of (n_samples,), which can easily replace the associated column in a Polars DataFrame using polars.DataFrame.replace_column(). In this case, shuffle_arr has a two-dimensional shape of (n_samples, n_features in the cluster). What would be an efficient way to replace the associated columns?
TL;DR
Let's work with a simple example.
X_train_permuted
Shuffle feature_0 and feature_1
Use a list to keep track of the features you are shuffling: features = ["feature_0", "feature_1"].
Replace associated columns in X_train_permuted with shuffle_arr values
Use pl.DataFrame.with_columns and pass a pl.DataFrame with schema=features.