Passing estimator from Scikit Learn Pipeline to Scikit Survival as_concordance_index_ipcw_scorer

I have a pipeline that runs preprocessing and then fits a Random Survival Forest from the scikit-survival package. I am trying to score it with scikit-survival's as_concordance_index_ipcw_scorer class.

My pipeline looks like the following:

    Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  Index(['IntVar1', 'IntVar2', 'IntVar3',
       'IntVar4'],
      dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  Index(['CharVar1', 'CharVar2', 'CharVar3'], dtype='object'))])),
                ('randomsurvivalforest',
                 RandomSurvivalForest(max_features='sqrt',
                                      min_samples_leaf=0.005,
                                      min_samples_split=0.01, n_estimators=150,
                                      n_jobs=-1, oob_score=True,
                                      random_state=200))])

This is the Python code that builds and fits the pipeline:

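from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import as_concordance_index_ipcw_scorer
from sksurv.util import Surv

## `df` is assumed to be loaded already; its last two columns hold the event status and time
print("Importing global DF")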
print("Creating X & Y set")
X = df.iloc[:,:-2].copy()
y = Surv.from_dataframe("AliveStatus","Target_Age",df.iloc[:,-2:].copy()) ## Creates structured array for Scikit Surv

print("Defining feature categories by data type")
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

print("Splitting dataset")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5) #SkLearn splitter

print("Defining preprocessing steps using SciKitLearn pipeline...")
## Pipeline Steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])


categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(sparse=False,handle_unknown='ignore'))]) ## Use "sparse=False" because Random Forests cannot take sparse matrices, only dense matrices.

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])

## Pipeline defining
print("Defining Random Survival Forest pipeline from SciKit Survival")
rsf = make_pipeline(
    preprocessor,
    RandomSurvivalForest(n_estimators=150, ## Default 100
                        min_samples_split=0.01, ## Default 6
                        min_samples_leaf=0.005, ## Default 3
                        max_features="sqrt", ## Defaults to none when not defined
                        n_jobs=-1, ## Use all available cores (library default is None)
                        oob_score = True,
                        random_state=200) ## Random State 2020
                        )


##Fitting & Scoring
print("Fitting dataframe to RSF Pipeline")
rsf.fit(X_train,y_train)
print("Fitting completed.")

After the fitting is completed I try to run the following:

as_concordance_index_ipcw_scorer(rsf).score(X_test,y_test)

I then get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-97-9a92b22d8026> in <module>
----> 1 as_concordance_index_ipcw_scorer(rsf).score(X_test,y_test)

C:\ProgramData\Anaconda3\lib\site-packages\sksurv\metrics.py in score(self, X, y)
    788         score : float
    789         """
--> 790         estimate = self._do_predict(X)
    791         score = self._score_func(
    792             survival_train=self._train_y,

C:\ProgramData\Anaconda3\lib\site-packages\sksurv\metrics.py in _do_predict(self, X)
    768 
    769     def _do_predict(self, X):
--> 770         predict_func = getattr(self.estimator_, self._predict_func)
    771         return predict_func(X)
    772 

AttributeError: 'as_concordance_index_ipcw_scorer' object has no attribute 'estimator_'

I also tried passing only the RSF step of the pipeline, without success:

as_concordance_index_ipcw_scorer(rsf[1]).score(X_test,y_test)

Any suggestions?

Apologies for the length or any missing information; I'm new to pipelines and scikit-survival and wanted to give as much detail as I could.

Thanks

There is 1 best solution below

The estimator instance from as_concordance_index_ipcw_scorer needs to be fitted; having fitted the underlying estimator doesn't help in this case.

According to the source code of the mixin class, fitting one of these wrappers fits the underlying estimator and stores it in the new attribute estimator_ (the attribute your error reports as missing); it also saves the training labels. So you might be able to create those attributes directly without adverse effects, but you would be bypassing the intended process.
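
To make the mechanism concrete, here is a minimal pure-Python sketch of the fit-then-score pattern described above. It is not scikit-survival's actual code: `MeanModel`, `toy_metric` and `ScorerWrapperSketch` are hypothetical stand-ins, and only the attribute names `estimator_` and `_train_y` are taken from the traceback in the question.

```python
class MeanModel:
    """Toy estimator (hypothetical): predicts the mean of the training labels."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def toy_metric(train_y, test_y, estimate):
    """Placeholder metric (negative mean absolute error).

    The real wrapper would call an IPCW concordance index here, which is
    why it needs the training labels as well as the test labels.
    """
    return -sum(abs(t - e) for t, e in zip(test_y, estimate)) / len(test_y)


class ScorerWrapperSketch:
    """Hypothetical stand-in for the scorer wrapper."""

    def __init__(self, estimator):
        # Only stores the estimator; nothing is marked as fitted here.
        self.estimator = estimator

    def fit(self, X, y):
        # Fitting the wrapper fits the wrapped estimator, keeps it as
        # `estimator_`, and remembers the training labels.
        self.estimator_ = self.estimator.fit(X, y)
        self._train_y = y
        return self

    def score(self, X, y):
        # Raises AttributeError if fit() was never called on the wrapper.
        estimate = self.estimator_.predict(X)
        return toy_metric(self._train_y, y, estimate)


X_train, y_train = [[0], [1]], [1.0, 3.0]
X_test, y_test = [[2]], [2.5]

# The model is already fitted, like `rsf` in the question.
model = MeanModel().fit(X_train, y_train)

# Scoring through an unfitted wrapper reproduces the error.
try:
    ScorerWrapperSketch(model).score(X_test, y_test)
    unfitted_error = None
except AttributeError as exc:
    unfitted_error = str(exc)

# Fitting the wrapper first makes score() work.
fitted_score = ScorerWrapperSketch(model).fit(X_train, y_train).score(X_test, y_test)
```

Applied to the question, this suggests creating the wrapper with `scorer = as_concordance_index_ipcw_scorer(rsf)`, calling `scorer.fit(X_train, y_train)`, and only then calling `scorer.score(X_test, y_test)`. Note that this fits the pipeline again, so the earlier `rsf.fit(X_train, y_train)` call becomes redundant.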