I created a CSV file with 250,800 rows and 75 columns. I am doing exploratory data analysis (EDA) to prepare the data for machine learning.
According to df.info(), all of the columns hold float or integer values. When I run:
df.dropna()
it removes the NaN values, but columns such as Protocol and DstPort are then left with a single unique value. That is not what I want, because I lose too much data. As suggested here, I tried this instead:
df = df.dropna(subset = ["Protocol","DstPort", "State"])
But the result is the same: the same NaN values are still there, and I cannot apply k-means clustering, for example.
I would like to ask for your suggestions. What should I do? Can I fill these values somehow, and if so, with what? Which machine learning model should I choose?
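To make the difference between the two calls concrete, here is a minimal sketch on made-up data. Only the column names Protocol, DstPort and State come from my data; the Bytes column and all of the values are hypothetical. It shows that dropna() with no arguments discards every row that has a NaN in any column, which can leave a column with a single unique value, while dropna(subset=...) only checks the listed columns and leaves NaN values in the others.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; "Bytes" is an invented column.
df = pd.DataFrame({
    "Protocol": [6, 17, 6, np.nan],
    "DstPort": [80, 53, np.nan, 443],
    "State": [1, np.nan, 2, 3],
    "Bytes": [100.0, np.nan, 250.0, 400.0],
})

# Number of missing values per column: a quick way to see where the NaNs are.
nan_counts = df.isna().sum()

# No arguments: a row is dropped if ANY of its columns is NaN.
dropped_any = df.dropna()

# With subset: only the listed columns are checked for NaN.
dropped_subset = df.dropna(subset=["Protocol", "DstPort"])

print(nan_counts)
print("rows left after dropna():", len(dropped_any))
print("unique Protocol values left:", dropped_any["Protocol"].nunique())
print("rows left after dropna(subset=...):", len(dropped_subset))
print("NaNs remaining after subset drop:", int(dropped_subset.isna().sum().sum()))
```

On this toy frame every row but one contains at least one NaN, so dropna() keeps a single row, whereas the subset version keeps more rows but still contains NaN values in State and Bytes.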
I found 3 common ways to fill NaN values.
df.fillna((df.mean()), inplace=True)
df['col'] = df['col'].fillna(df['col'].mode().iloc[0])
df.fillna((df.median()), inplace=True)
I am not sure whether any of these is the right approach for my data, since it is network traffic, but I wanted to share what I found.
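As a sketch of how the fill methods above could be combined per column rather than applied to the whole frame, the example below uses the mode for columns that hold codes (where an average such as port 261.5 would not be a meaningful value) and the median for a measured quantity. The split into code-like and measured columns is an assumption, and the Bytes column and all of the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; "Bytes" is an invented column.
df = pd.DataFrame({
    "Protocol": [6, 6, 17, np.nan],
    "DstPort": [80, 80, np.nan, 443],
    "Bytes": [100.0, np.nan, 300.0, 1000.0],
})

# Assumed grouping: numeric columns that are really codes vs. real measurements.
code_like = ["Protocol", "DstPort"]
measured = ["Bytes"]

filled = df.copy()

# Codes: fill with the most frequent value so the result is still a valid code.
for col in code_like:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

# Measurements: fill with the median, which is less sensitive to outliers than the mean.
for col in measured:
    filled[col] = filled[col].fillna(filled[col].median())

print(filled)
print("NaNs remaining:", int(filled.isna().sum().sum()))
```

Whether filling is appropriate at all, compared with dropping rows or keeping a separate "missing" indicator, depends on why the values are missing, which this sketch does not address.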