Data preprocessing of click stream data in real time


I am working on a project to detect anomalies in web users' activity in real time. Any malicious activity by a user has to be detected as it happens. The input is users' clickstream data, which contains the user ID (a unique user identifier), the click URL (URL of the web page), the click text (the text/function on the website that the user clicked), and information (anything typed by the user). The project is similar to an intrusion detection system (IDS). I am using Python 3.6 and have the following questions:

  1. What is the best approach to data preprocessing, considering that all the attributes in the dataset are categorical?
  2. Encoding methods like one-hot encoding or label encoding could be applied, but the data has to be processed in real time, which makes them difficult to apply.
  3. As per the project requirements, three columns (click URL, click text, and typed information) are considered feature columns.

I am really confused about how to approach data preprocessing. Any insight or suggestions would be appreciated.


There is 1 best solution below


In some recent personal and professional projects, when faced with the challenge of applying ML to streaming data, I have had success with the Python library River: https://github.com/online-ml/river.

  1. Some online algorithms (like Hoeffding trees) can handle categorical (label) values directly, so depending on what you want to achieve, you may not need any preprocessing.

  2. If you do need preprocessing, label encoding and one-hot encoding can be applied incrementally. Below is some code to get you started. River also has a number of classes to help with feature extraction and feature selection, e.g. TF-IDF, bag of words, or frequency aggregations.

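# Maps each category value seen so far to an integer code.
online_label_enc = {}

for click in click_stream:
    value = click[click__feature_label_of_interest]
    try:
        label_enc = online_label_enc[value]
    except KeyError:
        # First time this value appears: assign the next free integer code.
        online_label_enc[value] = len(online_label_enc)
        label_enc = online_label_enc[value]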

  3. I am not sure what you are asking here, but if you are approaching the problem online/incrementally, extract the features you want and pass them to your online algorithm of choice, which should then update and learn at every data increment.
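
To make the incremental one-hot encoding and bag-of-words ideas above more concrete, here is a minimal sketch in plain Python. It is hand-rolled for illustration and is not River's API. The names `OnlineOneHotEncoder`, `bag_of_words`, `click_url` and `click_text`, and the sample clicks, are all assumptions made up for this example.

```python
from collections import Counter


class OnlineOneHotEncoder:
    """Illustrative incremental one-hot encoder (hypothetical, not River's API)."""

    def __init__(self):
        # Maps each feature name to the set of categories seen so far.
        self.seen = {}

    def learn_one(self, x):
        # Record any category that has not been observed before.
        for name, value in x.items():
            self.seen.setdefault(name, set()).add(value)

    def transform_one(self, x):
        # One 0/1 indicator per (feature, category) pair known at this point.
        return {
            f"{name}={cat}": int(x.get(name) == cat)
            for name, cats in self.seen.items()
            for cat in sorted(cats)
        }


def bag_of_words(text):
    # Lowercased whitespace tokens with their counts; needs no fitted state.
    return Counter(text.lower().split())


# Made-up clicks standing in for the real stream.
click_stream = [
    {"click_url": "/login", "click_text": "Sign in"},
    {"click_url": "/account", "click_text": "Change password"},
    {"click_url": "/login", "click_text": "Sign in"},
]

encoder = OnlineOneHotEncoder()
encoded = []
for click in click_stream:
    encoder.learn_one(click)                       # update the known categories
    encoded.append(encoder.transform_one(click))   # encode with what is known so far

typed_counts = bag_of_words("Reset my password reset")
```

Note that the encoded vector grows as new categories show up, so whatever online model you feed it to has to tolerate features appearing over time. Passing dictionaries rather than fixed-length arrays is one way to cope with that.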