ColumnTransformer Issue


I am working on a regression model to predict housing sale prices. After splitting the data into X and y, I built pipelines to preprocess the features: one imputes and scales the numerical variables, and the other imputes and encodes the categorical variables. The pipelines work as expected when I transform the DataFrame with them directly, but something changes when I pass them to a ColumnTransformer: the dataset looks different, and the resulting DataFrame raises an error when passed to my mutual information function: ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required. The call works when I do not specify the discrete features, and every other preprocessed DataFrame lets me specify them without trouble. I need mutual_info_regression to recognize the discrete features so it produces good results.

Here is the code for the preprocessing that produces problems:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import category_encoders as ce

# Pipeline to impute missing values and scale numerical variables
numerical_processes = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0)),
                                      ('scaler', StandardScaler())])
# Pipeline to impute missing values and encode categorical variables
categorical_processes = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='None')),
                                        ('encoder', ce.TargetEncoder())])

#create a preprocessor that wraps up processes for both numerical and categorical variables
Preprocessor = ColumnTransformer(
    transformers = [('num', numerical_processes, numerical), 
                    ('categorical', categorical_processes, categorical)])

Below is code that does the exact same preprocessing but works (I understand I could just use this, but it's for a portfolio, so I want to use a preprocessor for neatness):

X_df = X.copy()
X_df[numerical] = numerical_processes.fit_transform(X[numerical])
X_df[categorical] = categorical_processes.fit_transform(X[categorical], y)
X_df.head()

Here is the code for the mutual information function and the output when I call it:

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def MI(X, y, categorical): 
    mi_scores = mutual_info_regression(X, y, discrete_features = X.columns.get_indexer(categorical),
                                      random_state = 4)
    mi_scores = pd.Series(mi_scores, name = 'Mutual Info', index = X.columns)
    mi_scores = mi_scores.sort_values(ascending = False)
    return mi_scores
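For comparison, here is a self-contained toy example (invented column names, random data) where the same MI function works as expected, because the names in `categorical` actually appear in the DataFrame's columns:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def MI(X, y, categorical):
    # Map the categorical column names to positional indices for discrete_features
    mi_scores = mutual_info_regression(X, y,
                                       discrete_features=X.columns.get_indexer(categorical),
                                       random_state=4)
    mi_scores = pd.Series(mi_scores, name='Mutual Info', index=X.columns)
    return mi_scores.sort_values(ascending=False)

rng = np.random.default_rng(0)
X = pd.DataFrame({'area': rng.normal(size=200),
                  'quality': rng.integers(0, 4, size=200)})  # discrete feature
y = 3 * X['area'] + X['quality'] + rng.normal(scale=0.1, size=200)

scores = MI(X, y, ['quality'])
print(scores)
```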

print(MI(X_pp, y, categorical))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 1
----> 1 print(MI(X_pp, y, categorical))
      2 #Mutual_Information.head()

Cell In[18], line 4, in MI(X, y, categorical)
      3 def MI(X, y, categorical): 
----> 4     mi_scores = mutual_info_regression(X, y, discrete_features = X.columns.get_indexer(categorical),
      5                                       random_state = 4)
      6     mi_scores = pd.Series(mi_scores, name = 'Mutual Info', index = X.columns)
      7     mi_scores = mi_scores.sort_values(ascending = False)

File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:388, in mutual_info_regression(X, y, discrete_features, n_neighbors, copy, random_state)
    312 def mutual_info_regression(
    313     X, y, *, discrete_features="auto", n_neighbors=3, copy=True, random_state=None
    314 ):
    315     """Estimate mutual information for a continuous target variable.
    316 
    317     Mutual information (MI) [1]_ between two random variables is a non-negative
   (...)
    386            of a Random Vector", Probl. Peredachi Inf., 23:2 (1987), 9-16
    387     """
--> 388     return _estimate_mi(X, y, discrete_features, False, n_neighbors, copy, random_state)

File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:304, in _estimate_mi(X, y, discrete_features, discrete_target, n_neighbors, copy, random_state)
    297     y = scale(y, with_mean=False)
    298     y += (
    299         1e-10
    300         * np.maximum(1, np.mean(np.abs(y)))
    301         * rng.standard_normal(size=n_samples)
    302     )
--> 304 mi = [
    305     _compute_mi(x, y, discrete_feature, discrete_target, n_neighbors)
    306     for x, discrete_feature in zip(_iterate_columns(X), discrete_mask)
    307 ]
    309 return np.array(mi)

File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:305, in <listcomp>(.0)
    297     y = scale(y, with_mean=False)
    298     y += (
    299         1e-10
    300         * np.maximum(1, np.mean(np.abs(y)))
    301         * rng.standard_normal(size=n_samples)
    302     )
    304 mi = [
--> 305     _compute_mi(x, y, discrete_feature, discrete_target, n_neighbors)
    306     for x, discrete_feature in zip(_iterate_columns(X), discrete_mask)
    307 ]
    309 return np.array(mi)

File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:161, in _compute_mi(x, y, x_discrete, y_discrete, n_neighbors)
    159     return mutual_info_score(x, y)
    160 elif x_discrete and not y_discrete:
--> 161     return _compute_mi_cd(y, x, n_neighbors)
    162 elif not x_discrete and y_discrete:
    163     return _compute_mi_cd(x, y, n_neighbors)

File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:138, in _compute_mi_cd(c, d, n_neighbors)
    135 c = c[mask]
    136 radius = radius[mask]
--> 138 kd = KDTree(c)
    139 m_all = kd.query_radius(c, radius, count_only=True, return_distance=False)
    140 m_all = np.array(m_all)

File sklearn/neighbors/_binary_tree.pxi:833, in sklearn.neighbors._kd_tree.BinaryTree.__init__()

File /opt/conda/lib/python3.10/site-packages/sklearn/utils/validation.py:931, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    929     n_samples = _num_samples(array)
    930     if n_samples < ensure_min_samples:
--> 931         raise ValueError(
    932             "Found array with %d sample(s) (shape=%s) while a"
    933             " minimum of %d is required%s."
    934             % (n_samples, array.shape, ensure_min_samples, context)
    935         )
    937 if ensure_min_features > 0 and array.ndim == 2:
    938     n_features = array.shape[1]

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.