Ways to fill NaN Values for Intrusion Detection with ML, Unsupervised ML

I created a CSV file that contains 250800 rows and 75 columns. I am doing exploratory data analysis (EDA) to prepare the data for ML.

According to df.info(), all of the columns are float or integer values. When I do:

df.dropna()

It removes the NaN values, but columns like Protocol are left with a single unique value, and the same happens to DstPort. This is not what I want; I would rather not lose that data. As suggested here, I tried this:

df = df.dropna(subset = ["Protocol","DstPort", "State"])

But the result is the same: there are still NaN values, so I cannot apply K-means clustering, for example.
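To see why NaN values remain after the second call, it can help to count them per column. The sketch below uses a small made-up DataFrame, not the real CSV; the Duration column is a hypothetical stand-in for the other numeric columns. It shows that dropna(subset=...) only checks the listed columns, so NaN values in any other column are kept.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all values are made up for illustration.
# "Duration" is a hypothetical column representing the remaining features.
df = pd.DataFrame({
    "Protocol": [6, 17, np.nan, 6, 1],
    "DstPort": [80, 53, 443, np.nan, 0],
    "State": [1, 2, 1, 1, np.nan],
    "Duration": [0.5, np.nan, 1.2, 0.3, np.nan],
})

# How many values are missing in each column?
nan_counts = df.isna().sum()
print(nan_counts)

# dropna() with no arguments drops every row that has a NaN in ANY column.
all_dropped = df.dropna()

# subset= only looks at the listed columns; NaN in other columns survives.
subset_dropped = df.dropna(subset=["Protocol", "DstPort", "State"])
remaining = subset_dropped.isna().sum()
print(remaining)
```

If a handful of columns account for most of the missing values, that is likely why a plain dropna() removes so many rows that Protocol collapses to one value.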

I would like to ask for your suggestions. What should I do? Can I fill these values somehow, and if so, on what basis? Which machine learning model should I choose?

There is 1 answer below

I found 3 common ways to fill NaN values.

  • Average: df.fillna(df.mean(), inplace=True)
  • Most Frequent: df['col'] = df['col'].fillna(df['col'].mode().iloc[0])
  • Median: df.fillna(df.median(), inplace=True)

I am not sure whether this is the right approach for this data, since it is network traffic, but I wanted to share it.
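One way to combine these, sketched under assumptions: columns such as Protocol and DstPort are stored as numbers but act as codes, so an average of them may not correspond to any real protocol or port. The example below fills those columns with the most frequent value and fills a measurement column with the median. The data is made up, Duration is a hypothetical column name, and the split into code_cols and numeric_cols is my own guess at how the real columns might be grouped.

```python
import numpy as np
import pandas as pd

# Made-up sample; "Duration" is a hypothetical measurement column.
df = pd.DataFrame({
    "Protocol": [6, 6, np.nan, 17, 6],
    "DstPort": [80, 443, 80, np.nan, 80],
    "Duration": [0.5, np.nan, 1.5, 0.3, 10.0],
})

# Assumed grouping: code-like columns vs. real-valued measurements.
code_cols = ["Protocol", "DstPort"]
numeric_cols = ["Duration"]

filled = df.copy()

# Code-like columns: use the most frequent value so the result is a real code.
for col in code_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

# Measurement columns: the median is less sensitive to outliers than the mean.
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

print(filled)
```

Assigning the result back instead of using inplace=True on a single column avoids pandas' chained-assignment pitfalls. Whether any of these fills is appropriate still depends on why the values are missing in the traffic capture.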