I am working on a credit card fraud detection model and have labeled data containing orders for an online store. The columns I am working with are: Customer Full Name, Shipping Address and Billing Address (city, state, zip, street), Order Quantity, Total Cost, and an indicator of whether or not the order was discovered to be fraud.
The problem is that 98%+ of the transactions are not fraudulent, so the data set is highly imbalanced. I understand this is a classification problem, but I am unsure where to start given the columns I have and the imbalance of the data.
I would appreciate any suggestions of appropriate classification algorithms for this use case and how to deal with the imbalanced data. I found several articles when searching for how to solve this, but most work with a Kaggle dataset that has very different columns (for security reasons, the original information could not be made public).
Thanks!
I suggest reading these articles:
In my experience, XGBoost worked very well. But you need very good features so that it can build good trees.
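To make the point about features concrete, here is a minimal sketch of how the columns from the question could be turned into numeric features, together with a class weight for the imbalance. The `Order` layout, the feature names, and the toy rows are all made up for illustration; they are assumptions, not something taken from a real dataset, and only a few of the address fields are used.

```python
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical record layout loosely mirroring the columns in the question.
    customer_name: str
    shipping_zip: str
    billing_zip: str
    shipping_state: str
    billing_state: str
    quantity: int
    total_cost: float
    is_fraud: bool


def build_features(order: Order) -> dict:
    """Turn one raw order into numeric features a tree model can split on."""
    return {
        # Shipping somewhere other than the billing address is a common signal.
        "zip_mismatch": int(order.shipping_zip != order.billing_zip),
        "state_mismatch": int(order.shipping_state != order.billing_state),
        "quantity": order.quantity,
        "total_cost": order.total_cost,
        # Guard against a zero quantity to avoid division by zero.
        "cost_per_item": order.total_cost / max(order.quantity, 1),
    }


def positive_class_weight(labels: list) -> float:
    """Ratio of non-fraud to fraud rows; falls back to 1.0 with no fraud rows."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return negatives / positives if positives else 1.0


# Toy data, invented purely for this example.
orders = [
    Order("A. Example", "10001", "10001", "NY", "NY", 2, 40.0, False),
    Order("B. Example", "94105", "60601", "CA", "IL", 10, 2500.0, True),
    Order("C. Example", "73301", "73301", "TX", "TX", 1, 15.0, False),
    Order("D. Example", "30301", "30301", "GA", "GA", 3, 90.0, False),
]
features = [build_features(o) for o in orders]
labels = [o.is_fraud for o in orders]
weight = positive_class_weight(labels)
```

The resulting `weight` (negatives divided by positives) is the kind of value that can be passed as `scale_pos_weight` to `xgboost.XGBClassifier` so that the rare fraud rows count for more during training. With 98%+ non-fraud, accuracy is not very informative, so it is probably better to evaluate with precision, recall, or the area under the precision-recall curve.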