tl;dr: I have a pipeline defined in an .ipynb file that works fine, but when I tried to encapsulate it in a class it didn't work as expected. I am probably making some mistake with OneHotEncoder. The question is a little long because of the code, but it is probably very simple and straightforward.
EDIT: if I run housing = housing[housing['ocean_proximity'] != "ISLAND"] the code runs without errors. So the problem really is this category, which probably makes this a much simpler problem. I inspected the OneHotEncoder step again but I don't know what could be wrong with it.
EDIT2: running grid_search.fit(housing.iloc[:100], housing_labels.iloc[:100]) gives no error, and neither does ...housing.iloc[:1000]. But running on [:5000] gives ValueError: The feature names should match those that were passed during fit. Feature names seen at fit time, yet now missing: cat__ocean_proximity_ISLAND.
This category has very few examples in the dataset. Is the problem really in how I am one-hot encoding it?
The question:
I have a transformation pipeline from an exercise, written in a Jupyter notebook, that works with no errors. As an exercise for myself, I am trying to write the functions, pipelines, models etc. in separate Python .py files, in a way that is more professional. But when I write the pipeline as a Python class in a .py file and import it as a module in the main .py file, something does not work as it should.
The pipeline as it is written in the notebook:
class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self!

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # feature names out

def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler())

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())
cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                                     StandardScaler())
preprocessing = ColumnTransformer([
        ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                               "households", "median_income"]),
        ("geo", cluster_simil, ["latitude", "longitude"]),
        ("cat", cat_pipeline, make_column_selector(dtype_include=object)),
    ],
    remainder=default_num_pipeline)  # one column remaining: housing_median_age
And here is how I rewrote it as a class in a Python file:
from sklearn import set_config
set_config(transform_output='pandas')
class Preprocessor(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self._cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
        return self

    def transform(self, X):
        preprocessing = self._preprocessing()
        return preprocessing.fit_transform(X)

    def _column_ratio(self, X):
        ratio = X.iloc[:, 0] / X.iloc[:, 1]
        return np.reshape(ratio.to_numpy(), (-1, 1))

    def _ratio_name(self, function_transformer, feature_names_in):
        return ["ratio"]  # feature names out

    def _ratio_pipeline(self):
        return make_pipeline(
            SimpleImputer(strategy="median"),
            FunctionTransformer(self._column_ratio, feature_names_out=self._ratio_name),
            StandardScaler()
        )

    def _log_pipeline(self):
        return make_pipeline(
            SimpleImputer(strategy="median"),
            FunctionTransformer(np.log, feature_names_out="one-to-one"),
            StandardScaler()
        )

    def _cat_pipeline(self):
        return make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(handle_unknown="ignore", sparse_output=False)
        )

    def _default_num_pipeline(self):
        return make_pipeline(SimpleImputer(strategy="median"),
                             StandardScaler()
        )

    def _preprocessing(self):
        return ColumnTransformer([
                ("bedrooms", self._ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
                ("rooms_per_house", self._ratio_pipeline(), ["total_rooms", "households"]),
                ("people_per_house", self._ratio_pipeline(), ["population", "households"]),
                ("log", self._log_pipeline(), ["total_bedrooms", "total_rooms", "population",
                                               "households", "median_income"]),
                ("geo", self._cluster_simil, ["latitude", "longitude"]),
                ("cat", self._cat_pipeline(), make_column_selector(dtype_include=object)),
            ],
            remainder=self._default_num_pipeline())  # one column remaining: housing_median_age

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self!

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]
But it is not working properly. When I just call, as a test:
preprocessor = Preprocessor()
X_train = preprocessor.fit_transform(housing)
the output of X_train.info() is exactly what it should be. But when I try a grid search with:
svr_pipeline = Pipeline([("preprocessing", preprocessor), ("svr", SVR())])
grid_search = GridSearchCV(svr_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(housing.iloc[:5000], housing_labels.iloc[:5000])
it raises this error and warning:
ValueError: The feature names should match those that were passed during fit.
Feature names seen at fit time, yet now missing:
- cat__ocean_proximity_ISLAND
UserWarning: One or more of the test scores are non-finite:
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan]
The problem is probably here:
ValueError: X has 23 features, but SVR is expecting 24 features as input.
The shape of the correct X after passing through the pipeline is X.shape: (16521, 24).
That is, I have 24 features after the transformation pipeline. But somehow, when the Preprocessor class is used, SVR sees only 23 features, not all 24: the missing one is ocean_proximity_ISLAND, which has only a few values in the dataset. That is why running the grid search on only the first 100 or 1000 rows of the dataset causes no problems, but running on enough rows that ocean_proximity_ISLAND appears raises this error.
This warning repeats at every step of the GridSearch, and the column it points to is always the same cat__ocean_proximity_ISLAND, which comes from the def _cat_pipeline(self): part of the pipeline and is a result of using OneHotEncoder.
Note: the above code works correctly in my notebook. Indeed, it is a solution to an exercise from the book "Hands-On Machine Learning". The problem appears when I try to rewrite the code given by the author in another .py file as a class, so I can import the pipeline and use it. Thus, the problem is probably in some line of class Preprocessor(TransformerMixin):, but I really have no idea where. Since the error seems to come from OneHotEncoder, I think I am misusing it, but I didn't change the original code in that part. I have no idea how to fix it, or why it says that it sees the column at fit time but not at predict time.
How can I fix it, and how can I avoid such errors in the future?
First of all, I would update the fit method of your Preprocessor class. There you are only fitting the KMeans estimator, but not the ColumnTransformer and its associated objects, which also require fitting.
When GridSearchCV runs, fit/fit_transform is called on every piece of the pipeline, but when fit is called on an instance of your Preprocessor class, nothing fits the underlying objects (in particular the ColumnTransformer instance). Something along the following lines, taking advantage of the fit and transform methods of ColumnTransformer, should be a step in the right direction.
Update: Secondly, _preprocessing returns a new instance of ColumnTransformer every time it is called, and the fitted version is not kept anywhere. A simple fix is to build the instance once, fit it inside the fit method, and keep the fitted instance so that transform can reuse it.
If this doesn't fix the issue, I would need the dict of parameters you are trying to optimize (param_grid), since the "cat__ocean_proximity_ISLAND" notation suggests that "ocean_proximity_ISLAND" is being referenced somewhere in your pipeline.
Finally, allow me to give you a piece of advice from someone who has tried to extend classes from sklearn and tensorflow many times (based on countless hours of wrestling with these libraries). There is a reason pipelines exist to combine estimators (in tensorflow something similar happens with the Sequential model class). Extending these classes by inheritance/overriding is really, really hard, because you inherit a lot of baggage from complex classes which you are not accounting for and which is not apparent. The way to go, IMO, is to use pipelines, and if you want more complex structures, use dependency injection, where you create and handle instances of the objects you need inside a container object.
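In that spirit, the custom class can often be avoided entirely. One possible layout (hypothetical module and function names) is to expose a factory function that returns a plain ColumnTransformer and let Pipeline/GridSearchCV own all the fitting:

```python
# preprocessing.py (hypothetical module): a factory instead of a custom
# estimator class -- every pipeline step is a standard sklearn object,
# so GridSearchCV can clone, fit, and transform each one correctly.
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

def make_preprocessing():
    """Factory: returns a fresh, unfitted ColumnTransformer (simplified)."""
    cat = make_pipeline(SimpleImputer(strategy="most_frequent"),
                        OneHotEncoder(handle_unknown="ignore"))
    num = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    return ColumnTransformer(
        [("cat", cat, make_column_selector(dtype_include=object))],
        remainder=num)

# main.py: import make_preprocessing and compose, instead of subclassing.
svr_pipeline = Pipeline([("preprocessing", make_preprocessing()),
                         ("svr", SVR())])
```

This keeps all the fit-state management inside sklearn's own classes, which is exactly what they are designed for.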
In general, I would say that if you go to the sklearn docs for a class like ColumnTransformer, hit the blue [source] hyperlink that takes you to the GitHub implementation of any method, and feel confident you could implement something similar, then you are good to go to create a new estimator class that will interoperate with the rest of the library without breaking anything.
I am not judging your skills, but be cautious: when you inherit from BaseEstimator and override fit and transform, there may be many other things you also have to implement to make the class compatible with others such as GridSearchCV, which you might not know about. That is why I would always recommend sticking to pipelines and using dependency injection instead of inheritance in sklearn, as much as possible.