Handling NaN / Inf in a numpy ndarray


I am working on a 4D numpy array (an array of arrays). Each nested array has shape (1, 100, 4):

trainset.shape
(159984, 1, 100, 4)

Some of the nested arrays contain nan values, which I would like to handle. For example, the first nested array in trainset looks like this:

trainset[0]
array([[[ 7.10669020e-02,  4.91383899e-03, -1.43700407e-02,
          1.52228864e-04],
        [ 7.59807410e-02, -9.45620170e-03,             nan,
          1.35892100e-04],
        [ 6.65245393e-02,             nan,             nan,
          8.98521456e-05],
        [            nan,             nan,             nan,
          1.41090006e-05],
        [            nan,             nan,             nan,
          6.68319391e-06],
        [            nan,             nan,             nan,
         -3.27272689e+01],
        [            nan,             nan,             nan,
         -1.09090911e+01],
        [            nan,             nan,             nan,
          8.25973981e+01],
        [            nan,             nan,             nan,
          1.12207785e+02],
        [            nan,             nan,             nan,
          1.65194797e+02],
        [            nan,             nan,             nan,
          2.25974015e+02],
        [            nan,             nan,             nan,
          2.78961026e+02],
        [ 3.87926649e-03,  1.81274134e-04, -1.08764481e-03,
          3.41298685e+02],
        ...
        [ 4.06054062e-03, -9.06370679e-04,  1.30517379e-03,
          3.10129855e+02]]])

How do I check all arrays in trainset for nan values and, where found, replace them with the column's median value?

EDIT

Using:

from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')

for data in trainset:
  trainsfrom_data = imp_mean.fit(trainset)

ValueError: Found array with dim 3. Estimator expected <= 2.

gives the error shown above.

BEST ANSWER

The simplest way is to use SimpleImputer with the median imputing strategy. It replaces nan column-wise (each column is treated as a feature) and accepts only 2D input, so you have to reshape your array to 2D before passing it through SimpleImputer(), and then reshape it back.
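
To see which axis the median is taken over, here is a small check on a 2D array. The values in X are made up for illustration; statistics_ is the fitted attribute that holds one fill value per column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy 2D array (illustrative values): rows are samples, columns are features
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X_filled = imputer.fit_transform(X)

# One median per column, computed from the non-nan entries of that column
print(imputer.statistics_)
print(X_filled)
```

Each nan is filled with the median of the remaining values in its own column, which is 3.0 for the first column and 20.0 for the second.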

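To your edit: the error occurs because SimpleImputer accepts at most 2D input. Reshape the array into 2D, preserving the column size (the last axis), call fit_transform once on the whole 2D array instead of calling fit inside a loop, and then reshape the result back to the original form. The reshape will be something like this: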

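import numpy as np
from sklearn.impute import SimpleImputer

A = np.random.rand(15, 1, 100, 4)
A[0, 0, 2, 1] = np.nan  # inject a missing value for the demo
print(A.shape)  # (15, 1, 100, 4)

init_shape = A.shape

# Collapse all leading axes into rows; keep the last axis as the 4 columns
B = A.reshape(-1, init_shape[-1])
print(B.shape)  # (1500, 4)

# Fill each nan with the median of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
B = imputer.fit_transform(B)

# Restore the original 4D shape
B = B.reshape(init_shape)
print(B.shape)  # (15, 1, 100, 4)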
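
The title also mentions Inf, which the reshape example above does not cover. As far as I know SimpleImputer rejects infinite values rather than imputing them, so here is a NumPy-only sketch of the same reshape idea that treats Inf as missing. Note that this swaps SimpleImputer for np.nanmedian; impute_column_median is a hypothetical helper name, not a NumPy or scikit-learn function, and the medians are pooled over all nested arrays, as in the reshape approach above.

```python
import numpy as np

def impute_column_median(arr):
    """Replace NaN/Inf with the median of the same last-axis column.

    Hypothetical helper for illustration. Medians are pooled over all
    leading axes, i.e. one median per column for the whole array.
    """
    init_shape = arr.shape
    # Collapse leading axes: (N, 1, 100, 4) -> (N * 100, 4); astype makes a copy
    flat = arr.reshape(-1, init_shape[-1]).astype(float)
    # Treat +Inf / -Inf as missing values too
    flat[~np.isfinite(flat)] = np.nan
    # Per-column median that ignores NaN (stays NaN if a column is all NaN)
    medians = np.nanmedian(flat, axis=0)
    # Fill every missing entry with the median of its column
    rows, cols = np.nonzero(np.isnan(flat))
    flat[rows, cols] = medians[cols]
    return flat.reshape(init_shape)

# Small demo array with one NaN and one Inf (illustrative values)
A = np.arange(24, dtype=float).reshape(2, 1, 3, 4)
A[0, 0, 0, 1] = np.nan
A[1, 0, 2, 3] = np.inf

filled = impute_column_median(A)
print(filled.shape)
print(np.isfinite(filled).all())
```

If you want the median of each nested array separately instead, np.nanmedian(arr, axis=2, keepdims=True) gives one median per column per nested array, which you can use as the fill values in the same way.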