We have a dataset in sparse representation with 25 features and 1 binary label. For example, one record of the dataset looks like this:
Label: 0
exid: 24924687
Features:
11:0 12:1 13:0 14:6 15:0 17:2 17:2 17:2 17:2 17:2 17:2
21:11 21:42 21:42 21:42 21:42 21:42
22:35 22:76 22:27 22:28 22:25 22:15 24:1888
25:9 33:322 33:452 33:452 33:452 33:452 33:452 35:14
So a feature can have multiple values, which may be identical or distinct, and the website says:
Some categorical features are multi-valued (order does not matter)
We don't know the semantics of the features or of the values assigned to them (they are hidden from the public for privacy reasons).
We only know:
- Label: whether the user has clicked on the recommended ad or not.
- Features: describe the product that has been recommended to the user.
- Task: predict the probability that the user clicks, given an ad for a product.
Any comment on the following problems is appreciated:
- What is the best way to import this kind of dataset into a Python data structure?
- How to deal with multi-valued features, especially when they have similar values repeated k times?
That is a very general question, but as far as I can tell, if you aim to use some ML methods it is sensible to transform the data into a tidy data format first.
As far as I can tell from the documentation that @RootTwo nicely references in his comment, you are actually dealing with two datasets: one example flat table and one product flat table. (You can later join the two to get one table if so desired.)
Let us first create some parsers that decode the different lines into somewhat informative data structures:
For lines with examples we may use:
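One possible sketch of such a parser (my assumptions: each example line is a `key: value` pair such as `Label: 0`, and the exact field names may differ in the real file):

```python
def parse_example_line(line):
    """Parse a 'key: value' example line into a one-item dict.

    Values are cast to int where possible; non-numeric values
    stay as strings. The line format is an assumption.
    """
    key, _, value = line.partition(":")
    key, value = key.strip(), value.strip()
    try:
        value = int(value)  # cast to a number where possible
    except ValueError:
        pass  # keep non-numeric values as strings
    return {key: value}
```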
This method is hacky but gets the job done: it parses the features and casts values to numbers where possible.
Next come the product lines. As you mentioned, the problem is the multiple occurrences of values. I think it is sensible to aggregate each unique feature-value pair by its frequency. No information is lost, and it helps us encode a tidy sample. That should address your second question.
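A sketch of that frequency aggregation, using collections.Counter (assuming the tokens are whitespace-separated feature:value pairs as in the sample above):

```python
from collections import Counter

def parse_product_line(line):
    """Count occurrences of each unique feature:value pair.

    '17:2 17:2 21:42' -> Counter({('17', '2'): 2, ('21', '42'): 1})
    """
    tokens = line.split()
    return Counter(tuple(token.split(":", 1)) for token in tokens)
```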
This basically extracts the label and the features for each example (illustrated with line 40 of the file).
So when you process your stream line by line, you can decide whether to map an example or a product:
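A self-contained sketch of that dispatch (the heuristic for telling the two line kinds apart is my assumption; values are kept as strings here for brevity):

```python
from collections import Counter

def parse_lines(lines):
    """Yield ('example', dict) or ('product', Counter) for each line.

    Heuristic sketch: lines containing ': ' are treated as example
    metadata such as 'Label: 0'; bare section headers like 'Features:'
    are skipped; everything else is a product feature line.
    """
    for line in lines:
        line = line.strip()
        if not line or line.endswith(":"):   # skip blanks and section headers
            continue
        if ": " in line:                     # example metadata line
            key, _, value = line.partition(":")
            yield "example", {key.strip(): value.strip()}
        else:                                # product feature line
            yield "product", Counter(tuple(t.split(":", 1)) for t in line.split())
```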
I've decided to use a generator here because it benefits processing the data in a functional way if you decide not to use pandas; otherwise a list comprehension will be your friend.

Now for the fun part: we read the lines from a given (example) URL one by one and assign them to their corresponding dataset (example or product). I will use reduce here, because it is fun :-). I'll not go into detail about what map/reduce actually does (that's up to you); you can always use a simple for loop instead.

From here you can cast your datasets into tidy dataframes that you can use to apply machine learning. Beware of NaN/missing values, distributions, etc. You can join the two datasets with merge to get one big flat table of samples × features. Then you will be more or less able to use different methods from e.g. scikit-learn.
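The reduce step can be sketched like this, run on a small hypothetical pre-parsed stream instead of a URL (the record shapes and the `accumulate` name are my assumptions):

```python
from collections import Counter
from functools import reduce

def accumulate(datasets, item):
    """Reducer: route a parsed (kind, record) pair into its bucket."""
    kind, record = item
    datasets[kind].append(record)
    return datasets

# Hypothetical pre-parsed stream, shaped as the line parsers would yield it.
parsed = [
    ("example", {"Label": 0}),
    ("example", {"exid": 24924687}),
    ("product", Counter({("17", "2"): 6, ("21", "42"): 5})),
]

datasets = reduce(accumulate, parsed, {"example": [], "product": []})
```

From `datasets["example"]` and `datasets["product"]` you can then build two pandas DataFrames and merge them on the example id.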
Examples dataset
Product dataset (product_dataset.sample(10))

Be mindful of the product_dataset: you can 'pivot' the features in rows into columns (see the pandas reshaping docs).
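That pivot can be sketched with pandas pivot_table (the long-table column names exid, feature, and count are my assumptions):

```python
import pandas as pd

# Hypothetical long-format product table; column names are assumptions.
long_df = pd.DataFrame({
    "exid":    [1, 1, 2, 2],
    "feature": ["17", "21", "17", "25"],
    "count":   [6, 5, 2, 1],
})

# Pivot to one row per exid, one column per feature; missing pairs become 0.
wide_df = long_df.pivot_table(index="exid", columns="feature",
                              values="count", fill_value=0)
```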