Python Fraud Detection Classification Algorithms


I am working on a credit card fraud detection model and have labeled data containing orders for an online store. The columns I am working with are: Customer Full Name, Shipping Address and Billing Address (city, state, zip, street), Order Quantity, Total Cost, and an indicator of whether the order was discovered to be fraudulent.

The problem is that 98%+ of the transactions are not fraudulent, so the dataset is highly imbalanced. I understand this is a classification problem, but I am unsure where to start given the columns I am working with and the imbalance in the data.

I would appreciate any suggestions for classification algorithms appropriate to this use case, and for how to deal with the imbalanced data. I found several articles while searching for a solution, but most work with a Kaggle dataset whose columns are very different (anonymized for security reasons, since the real information cannot be made public).

Thanks!

There are 2 answers below.

Answer 1

I suggest reading these articles:

  1. https://towardsdatascience.com/detecting-financial-fraud-using-machine-learning-three-ways-of-winning-the-war-against-imbalanced-a03f8815cce9
  2. https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18

Based on my experience, XGBoost was very good. But you need good features so that it can build good trees.

Answer 2

In my mind, there are two directions for dealing with an imbalanced dataset in anti-fraud cases:

  1. Supervised ML algorithms for fraud prediction: try to predict the class (fraud / not fraud) of each sample.
  2. Unsupervised ML algorithms for anomaly detection: try to detect unusual customer/merchant behavior or payment activity.

Supervised Learning (SL) approach

If you use supervised ML algorithms (e.g. logistic regression, random forest, gradient-boosted trees), then you need to apply one or more of these tricks:

  1. Before training ML model:

    • Oversampling - adding more samples of the minority class: the RandomOverSampler and SMOTE (generates synthetic samples) methods in the imblearn package
    • Undersampling - removing some observations of the majority class: the RandomUnderSampler method in the imblearn package
    • Combining oversampling and undersampling methods.
  2. While training ML model:

    • Pass a weights parameter to the model's training method (set higher weights for minority-class samples).
  3. After training ML model:

    • Do not use accuracy to evaluate the trained model.
    • Use recall, precision, F1 score, and/or AUC-PR (area under the precision-recall curve) for robust model evaluation.

Unsupervised Learning (UL) approach

Unsupervised algorithms don't require labels in the dataset, which is why the imbalanced-classes problem does not arise.

But unlike SL-based models, UL-based models don't output a fraud/not-fraud prediction directly. You need additional steps to interpret the output of UL-based models.

The following algorithms will most likely be useful:

  1. Anomaly detection methods:
    • One-class SVM
    • Isolation Forest (iForest)
    • Local Outlier Factor
  2. Neural Networks methods:
    • Autoencoder-based networks, e.g. AE, VAE
    • DBN (Deep Belief Network)
    • GAN (Generative Adversarial Network)
    • Self-Organizing Maps (SOM).
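As a minimal sketch of the anomaly-detection direction, here is Isolation Forest flagging unusual points without using any labels. The two clusters are synthetic; in a real pipeline the rows would be numeric features derived from the order columns, and `contamination` would be set near the expected fraud rate (an assumption, here ~2%).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
normal = rng.normal(0, 1, size=(980, 2))  # typical orders
odd = rng.normal(6, 1, size=(20, 2))      # unusual orders, far from the bulk
X = np.vstack([normal, odd])

# contamination ≈ expected share of anomalies; no labels are used in fitting
iso = IsolationForest(contamination=0.02, random_state=7).fit(X)
labels = iso.predict(X)  # +1 = inlier, -1 = anomaly
n_flagged = (labels == -1).sum()
```

The model only scores points as "unusual"; deciding whether a flagged order is actually fraud is the extra interpretation step mentioned above.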