Context
Goal
Given a user's previous Reactions to Posts (Posts are the Documents and Reactions are the Classes: FavLike, Fav, Like, Dislike, None), I want to sort a batch of new Posts by the probability of each Reaction and, in the future, maybe also by a combined score built from those probabilities.
FavLike is its own class to keep the classes mutually exclusive; FavDislike is possible, but, for obvious reasons, I don't consider it.
For that, I'm currently using Multinomial Naive Bayes and Logistic Regression.
Data & Evaluation
Firstly, there's the RawData:
const RawData = {
  // <Category>: { <Post>: [<Tag>] }
  None: {
    1234: ['tag', 'beach', 'photography', 'etc'],
    // [...]
  },
  // [...]
};
It is then processed into Counts:
const Counts = {
  // Totals = number of Documents in each category, not number of Tags
  Totals: [671, 587, 1_310, 3_353, 66_994], // Actual data sample
  Tags: {
    'tag': [0, 0, 0, 4, 41],
    'photography': [601, 507, 1_080, 2_552, 56_711],
    'etc': [0, 0, 0, 0, 1],
    // [...]
  },
};
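A minimal sketch of how RawData could be folded into Counts. The class order, the helper name `buildCounts`, and the extra Post ids in the sample are assumptions made for illustration; they are not taken from my actual code.

```javascript
// Assumed class order, taken from the list in the Goal section.
const CLASSES = ['FavLike', 'Fav', 'Like', 'Dislike', 'None'];

// Hypothetical helper: { category: { postId: [tags] } } -> { Totals, Tags }.
function buildCounts(rawData) {
  const counts = {
    Totals: CLASSES.map(() => 0),
    Tags: Object.create(null), // no prototype, so a tag named 'constructor' is safe
  };
  CLASSES.forEach((category, classIndex) => {
    const posts = rawData[category] || {};
    for (const tags of Object.values(posts)) {
      counts.Totals[classIndex] += 1; // one per Post, not per Tag
      // A Post is a set of Tags: count each Tag at most once per Post.
      for (const tag of new Set(tags)) {
        if (!(tag in counts.Tags)) counts.Tags[tag] = CLASSES.map(() => 0);
        counts.Tags[tag][classIndex] += 1;
      }
    }
  });
  return counts;
}

// Made-up sample: Post ids 1 and 1235 are illustrative only.
const sampleRawData = {
  Like: { 1: ['beach', 'photography'] },
  None: { 1234: ['tag', 'beach', 'photography', 'etc'], 1235: ['photography'] },
};
const sampleCounts = buildCounts(sampleRawData);
console.log(sampleCounts.Totals);
console.log(sampleCounts.Tags.photography);
```

Each Tag row has one slot per class, so the per-Tag memory cost stays fixed no matter how many Posts are processed.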
After that, it is preprocessed into Bayes (using log-probabilities to save computation during evaluation, with Laplace smoothing):
const Bayes = {
  Priors: [-1.59905, -1.61782, -1.48773, -1.20228, -0.06781],
  Tags: {
    'tag': [-1.69897, -1.69897, -1.69897, -1, -0.075720],
    // [...]
  },
};
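The sample row for 'tag' is consistent with add-one (Laplace) smoothing applied across the five classes, followed by a base-10 logarithm: (0 + 1) / (45 + 5) = 0.02, and log10(0.02) is about -1.69897. The sketch below reproduces that row; the function name `smoothedLog10` and the `alpha` parameter are illustrative, and it is an assumption that every other row is computed the same way.

```javascript
// Add-one smoothing over the classes of a single Tag row, then log10.
// Note: this normalises each Tag's counts across classes.
function smoothedLog10(tagCounts, alpha = 1) {
  const total = tagCounts.reduce((sum, count) => sum + count, 0);
  const denominator = total + alpha * tagCounts.length;
  return tagCounts.map((count) => Math.log10((count + alpha) / denominator));
}

const tagRow = smoothedLog10([0, 0, 0, 4, 41]);
console.log(tagRow.map((value) => value.toFixed(5)).join(', '));
// -1.69897, -1.69897, -1.69897, -1.00000, -0.07572
```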
Then, after evaluating a Post, the log-probabilities go through a Standard Scaler and Logistic Regressor to get some actual probabilities.
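A rough sketch of the evaluation step, under these assumptions: a Post's score per class is the prior plus the sum of its Tags' log-probabilities, Tags missing from the table are skipped, and the scaler is a plain per-class z-score. The names `scorePost` and `standardScale` are hypothetical, and the Logistic Regression step is left out because its weights are not shown here.

```javascript
const bayesModel = {
  Priors: [-1.59905, -1.61782, -1.48773, -1.20228, -0.06781],
  Tags: {
    tag: [-1.69897, -1.69897, -1.69897, -1, -0.07572],
  },
};

// Sum the prior and each known Tag's log-probability, per class.
function scorePost(tags, model) {
  const scores = model.Priors.slice();
  for (const tag of new Set(tags)) {
    // Assumption: Tags that were never counted contribute nothing.
    if (!Object.prototype.hasOwnProperty.call(model.Tags, tag)) continue;
    model.Tags[tag].forEach((logProb, i) => {
      scores[i] += logProb;
    });
  }
  return scores;
}

// Standard scaling: (value - mean) / standard deviation, per class.
function standardScale(scores, means, stds) {
  return scores.map((score, i) => (score - means[i]) / stds[i]);
}

const postScores = scorePost(['tag', 'unknown_tag'], bayesModel);
// Per-class statistics as reported under "Where problems arise".
const classMeans = [-307.23473004, -314.77733785, -270.50181329, -206.64016692, -8.65777954];
const classStds = [150.89012448, 154.35610454, 134.04458405, 101.16143351, 6.68529216];
const scaledScores = standardScale(postScores, classMeans, classStds);
console.log(postScores, scaledScores);
```

A Post with a single known Tag lands far above the per-class means, which already hints at how strongly the raw scores depend on the number of Tags.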
Things To Consider
- A Post is a set of Tags (Events)
  - Tag order doesn't matter
  - A Tag either is in a Post, or it isn't
- The number of Tags in a Post is variable
- Tags may imply another (insignificant problem, though)
  - 'red_shirt' implies 'shirt'
- Space is limited: After processing the Raw Data, only Counts is used, Bayes is calculated on startup, and when the user reacts to a new Post, its Tags are added to Counts and updated in Bayes
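The incremental update in the last point could look roughly like this. It assumes the per-Tag smoothing formula that matches the sample 'tag' row (add-one across classes, log10), so only the rows of the Tags in the new Post need recomputing. `recordReaction` is a hypothetical name, and Priors are deliberately not touched because their derivation is not shown here.

```javascript
// Assumed formula: add-one smoothing across the classes of one Tag row, then log10.
function smoothedLog10(tagCounts, alpha = 1) {
  const total = tagCounts.reduce((sum, count) => sum + count, 0);
  const denominator = total + alpha * tagCounts.length;
  return tagCounts.map((count) => Math.log10((count + alpha) / denominator));
}

// Hypothetical update: add one reacted Post to Counts and refresh the affected Bayes rows.
function recordReaction(counts, bayes, tags, classIndex) {
  counts.Totals[classIndex] += 1;
  for (const tag of new Set(tags)) {
    if (!Object.prototype.hasOwnProperty.call(counts.Tags, tag)) {
      counts.Tags[tag] = counts.Totals.map(() => 0); // first time this Tag is seen
    }
    counts.Tags[tag][classIndex] += 1;
    bayes.Tags[tag] = smoothedLog10(counts.Tags[tag]); // only this row changes
  }
}

const liveCounts = {
  Totals: [671, 587, 1310, 3353, 66994],
  Tags: { tag: [0, 0, 0, 4, 41] },
};
const liveBayes = { Tags: { tag: smoothedLog10(liveCounts.Tags.tag) } };

// Index 3 is Dislike if the classes are ordered FavLike, Fav, Like, Dislike, None (an assumption).
recordReaction(liveCounts, liveBayes, ['tag', 'new_tag'], 3);
console.log(liveCounts.Tags, liveBayes.Tags);
```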
Where problems arise
- Reaction classes are severely imbalanced
  - None outnumbers all the others combined (None count is 67k, while the others together crack 4k)
- Tags are really sparse
  - While some Tags have thousands of occurrences, the majority only have a handful, having 0 counts for all but 1 or 2 Reaction classes
- Posts with lots of Tags will get really low log-probabilities
  - The mean for each Reaction: [-307.23473004, -314.77733785, -270.50181329, -206.64016692, -8.65777954]
  - And their standard deviation: [150.89012448, 154.35610454, 134.04458405, 101.16143351, 6.68529216]
  - Sorting Posts that way will essentially just sort on the number of Tags
  - This can be mitigated if we relativise values of the log-probs (I did this with Logistic Regression)
- Even with Logistic Regression, the final probabilities are still too close to the priors of each category
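For illustration, one common way to relativise per-class log scores is to normalise them against each other (a softmax in base 10). This is a swapped-in technique, not the Logistic Regression step described above, and it assumes the scores are base-10 logs, as the sample Bayes values suggest. `normalizeLog10Scores` is a hypothetical name.

```javascript
// Turn per-class log10 scores into relative weights that sum to 1.
function normalizeLog10Scores(scores) {
  const max = Math.max(...scores);
  // Subtracting the max first keeps the largest term at 10^0 = 1,
  // so very negative scores shrink towards 0 instead of breaking the sum.
  const shifted = scores.map((score) => Math.pow(10, score - max));
  const sum = shifted.reduce((acc, value) => acc + value, 0);
  return shifted.map((value) => value / sum);
}

// The per-class mean scores reported above.
const meanScores = [-307.23473004, -314.77733785, -270.50181329, -206.64016692, -8.65777954];
const relativeWeights = normalizeLog10Scores(meanScores);
console.log(relativeWeights);
```

With the mean scores above, None ends up with essentially all of the weight, because the gap to the next class is almost 200 orders of magnitude. The normalisation removes the dependence on absolute scale, but not the dominance of None.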
Previous Iterations
What I described above is where I'm currently at, but I've had one previous attempt that worked more or less decently:
I merged the categories into just two, one positive, one negative:
- Positive = 3 * FavLike + 2 * Fav + 1 * Like
- Negative = 1 * Dislike (At the time, I didn't have access to the None Posts)
Based on the Reaction to a Post, I added that Reaction's weight (3, 2 or 1 instead of a single count) to each of its Tags' counts.
Then, to evaluate, I just subtracted the log-probabilities of Positive and Negative.
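A sketch of that earlier two-class scheme, working from a five-class Tag row. The weights are the ones listed above; the add-one smoothing, the direction of the subtraction (Positive minus Negative), and the names `mergeRow`, `tagScore` and `postScore` are assumptions for illustration.

```javascript
// Collapse a [FavLike, Fav, Like, Dislike, None] row into weighted Positive/Negative counts.
function mergeRow(row) {
  const [favLike, fav, like, dislike] = row; // None is ignored, as in the earlier attempt
  return { positive: 3 * favLike + 2 * fav + 1 * like, negative: 1 * dislike };
}

// Difference of smoothed log-probabilities; above 0 leans Positive, below 0 leans Negative.
function tagScore(row) {
  const { positive, negative } = mergeRow(row);
  const denominator = positive + negative + 2; // add-one smoothing over two classes (assumed)
  return Math.log10((positive + 1) / denominator) - Math.log10((negative + 1) / denominator);
}

// Sum the per-Tag differences for every known Tag in the Post.
function postScore(tags, tagRows) {
  let score = 0;
  for (const tag of new Set(tags)) {
    if (Object.prototype.hasOwnProperty.call(tagRows, tag)) score += tagScore(tagRows[tag]);
  }
  return score;
}

// Made-up rows for illustration only.
const exampleRows = {
  beach: [2, 1, 3, 4, 100],
  tag: [0, 0, 0, 4, 41],
};
console.log(postScore(['beach', 'tag'], exampleRows));
```

Because the denominators cancel, each Tag contributes log10((positive + 1) / (negative + 1)), so Tags with balanced counts contribute close to nothing.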
Actual Question
Given all that, how can I improve the Reaction prediction of my Recommender System, so that I can sort Posts by each category?
Final Notes
If it still fits my restrictions (see Things To Consider), I may consider switching away from Naive Bayes entirely, but I'd prefer to keep using some of my existing work.
I'm currently looking into Complement Naive Bayes, so I'm already aware of that possibility.
I'm a programmer by trade, so I'd appreciate tips on how to implement your suggestions in code, and please limit the mathematics.
(I'm using JavaScript (Yes, really...), but I'm familiar with several languages: Python, Rust, C#, Java, Go, C, C++ etc)