I am currently working in a project for my MSc and I am having this issue with that dataset. I don't have previous experience in machine learning and this is my first exposure.
In my dataset I started doing my EDA (Exploratory Data Analysis) and I have a categorical feature with missing data which is Province_State. This column has 52360
missing values and as a percentage that is a 5.40%
. I guess that is not too bad and according to what I learnt, I should impute these missing values or delete the column if I have reasonable reasonings.
My logical reasoning is that, not every country has provinces. Therefore that is pretty normal that there are missing values. I clearly don't see a point in imputing these missing values with a random value because that is not logically and it will also lead inaccuracy within the model because we cannot come up with a value which does not practically exist for that particular country.
I think I should do one of the following:
- Impute all the missing values to a constant value such as -1 or "NotApplicable"
- Remove the feature from the dataset
Please help me with a solution and thank you very much in advance.
(This dataset can be accessed from this link)
There are many ways to handle missing data .Deleting the whole column is not a good idea in most cases as you will be discarding information, however if you still want to delete the feature perform univariate analysis on that feature and see if its useful and decide accordingly. Instead of removing the feature you can use any of the following ways: