I am working on a 4D NumPy array (an array of arrays). Each nested array has shape (1, 100, 4):
trainset.shape
(159984, 1, 100, 4)
But within the nested arrays there are some nan values which I would like to handle. For example, the first nested array in trainset contains several:
trainset[0]
array([[[ 7.10669020e-02, 4.91383899e-03, -1.43700407e-02,
1.52228864e-04],
[ 7.59807410e-02, -9.45620170e-03, nan,
1.35892100e-04],
[ 6.65245393e-02, nan, nan,
8.98521456e-05],
[ nan, nan, nan,
1.41090006e-05],
[ nan, nan, nan,
6.68319391e-06],
[ nan, nan, nan,
-3.27272689e+01],
[ nan, nan, nan,
-1.09090911e+01],
[ nan, nan, nan,
8.25973981e+01],
[ nan, nan, nan,
1.12207785e+02],
[ nan, nan, nan,
1.65194797e+02],
[ nan, nan, nan,
2.25974015e+02],
[ nan, nan, nan,
2.78961026e+02],
[ 3.87926649e-03, 1.81274134e-04, -1.08764481e-03,
3.41298685e+02],
...
[ 4.06054062e-03, -9.06370679e-04, 1.30517379e-03,
3.10129855e+02]]])
How do I check all arrays in trainset for nan values and, where found, replace them with the column's median value?
EDIT
Using:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')
for data in trainset:
    trainsfrom_data = imp_mean.fit(trainset)
ValueError: Found array with dim 3. Estimator expected <= 2.
gives the error shown above.
The simplest way would be to use SimpleImputer and select the median imputing strategy. I am not sure whether nan values are replaced column-wise or row-wise, so you may have to reshape your array before passing it through SimpleImputer() and then reshape it back.
To your edit: reshape the array into 2D, preserving the column size, and then reshape the result back to its original form. Also, use fit_transform, which handles every column, to get the result in one go. Reshape will be something like this:
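A minimal sketch of that reshape, assuming the median should be taken per column across the whole of trainset (SimpleImputer computes its statistic per column of the 2D input). The small random trainset and the names imputer, flat and filled are stand-ins introduced here for illustration, not part of the original post:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the real trainset of shape (159984, 1, 100, 4).
rng = np.random.default_rng(0)
trainset = rng.normal(size=(5, 1, 100, 4))
trainset[0, 0, 1:4, 2] = np.nan    # inject a few missing values
trainset[2, 0, 10:20, 0] = np.nan

imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# Collapse the leading axes into rows, keeping the last axis (4 columns).
flat = trainset.reshape(-1, trainset.shape[-1])

# Fit and transform in one call, then restore the original 4D shape.
filled = imputer.fit_transform(flat).reshape(trainset.shape)

print(filled.shape)
print(np.isnan(filled).any())
```

If the median should instead be computed within each nested array, one alternative that avoids scikit-learn is np.where(np.isnan(trainset), np.nanmedian(trainset, axis=2, keepdims=True), trainset), which broadcasts one median per nested array and column.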