I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means or the most frequent values is not an option either (hence imputation with pandas and/or scikit unfortunately doesn't do the trick).
I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it.
Here is what I do:
# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float])
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
However, df_filled is a single vector somehow, instead of the filled data frame. How do I get a hold of the data frame with imputations?
Update
I realized that fancyimpute needs a NumPy array. I hence converted df_numeric to an array using as_matrix().
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float]).as_matrix()
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
The output is a dataframe with the column labels gone missing. Any way to retrieve the labels?
The np.array that is returned by the .complete() method of the fancyimpute object (be it MICE or KNN) is fed as the content (argument data=) of a pandas dataframe whose columns and index are the same as the original data frame.
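A minimal sketch of that idea, so the labels survive the round trip through NumPy. To keep it runnable without fancyimpute installed, the imputer is replaced by a hypothetical stand-in function (`impute_stand_in`, not part of any library) that simply returns an unlabeled array; in the setup from the question, that array would come from `KNN(3).complete(...)` instead. The toy data and its labels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with named columns and a non-default index (illustrative data)
df_numeric = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]},
    index=["x", "y", "z"],
)


def impute_stand_in(values):
    """Hypothetical placeholder for the real imputer call.

    Like the imputer in the question, it takes a plain array and returns
    a plain array, so all labels are lost. It fills NaNs with column means
    only to produce some output; it is NOT the KNN method.
    """
    filled = values.copy()
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


# The imputer only sees (and returns) a bare NumPy array
filled_array = impute_stand_in(df_numeric.to_numpy())

# Re-attach the original labels when wrapping the array in a DataFrame
df_filled = pd.DataFrame(
    data=filled_array,
    columns=df_numeric.columns,
    index=df_numeric.index,
)

print(df_filled)
```

Passing `index=` as well as `columns=` matters when the original frame does not use the default 0..n-1 index, since otherwise the filled frame would no longer align with the original one.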