I am building a hashtag recommendation system and am looking into the best ways to evaluate it.
The problem statement is: for a given hashtag, recommend the most relevant (3 or 5) hashtags to the user.
The dataset has one row per post; each row contains the post id and the hashtags used in that post.
| | post_id | hashtags |
|---|---|---|
| 1 | 100001 | #art #artgif #fanart #digitalArt |
These are the steps I have followed.
- Preprocessed the hashtag data.
- Trained a fastText model on the entire hashtag corpus.
- Generated word embeddings for all the hashtags.
- Used K-Nearest Neighbors on the embeddings to recommend hashtags (a simplified sketch of this pipeline follows the list).
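To make the pipeline concrete, here is a simplified sketch of roughly what I am doing, using gensim's FastText; the file name, column names, and hyperparameters are illustrative placeholders rather than my exact setup.

```python
import pandas as pd
from gensim.models import FastText

# Illustrative placeholders: "posts.csv" and the column names stand in for my real data.
df = pd.read_csv("posts.csv")  # columns: post_id, hashtags

# Preprocessing (simplified): split each post's hashtag string into a list of tokens,
# so each post becomes one "sentence" for fastText.
corpus = [str(tags).split() for tags in df["hashtags"]]

# Train fastText on the hashtag corpus.
model = FastText(
    sentences=corpus,
    vector_size=100,  # embedding dimension (illustrative)
    window=5,         # context window within a post
    min_count=1,
    sg=1,             # skip-gram
    epochs=10,
)

# Recommend: the K nearest hashtags to the query in embedding space
# (gensim's most_similar is a cosine-similarity nearest-neighbour lookup).
def recommend(query_tag, k=5):
    return [tag for tag, _ in model.wv.most_similar(query_tag, topn=k)]

print(recommend("#art", k=3))
```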
I am trying to evaluate the model using MAP@K.
So for each unique hashtag I take the top 3 or top 5 recommendations from the model and compare them with the hashtags that actually co-occurred with that hashtag.
I am treating the recommendation as a ranking task. Since a user has a finite amount of time and attention, we want to know not just which tags they might like, but also which are most liked or which we are most confident of. For this kind of task we want a metric that rewards having many "correct" (relevant) recommendations and rewards placing them earlier in the list (higher ranked). Hence MAP@K with K = 3 or 5 (the value of K is not finalised yet).
The table below shows how I am evaluating the recommendations for each query hashtag; a sketch of the MAP@K computation follows the table.
| | post_id | query_hashtag | hashtags | recommended_hashtags |
|---|---|---|---|---|
| 1 | 100001 | #art | #art #artgif #fanart #digitalArt | #amazingArt #artistic #artgif |
| 1 | 100001 | #artgif | #art #artgif #fanart #digitalArt | #fanArt #artistic #artgif |
| 1 | 100001 | #fanart | #art #artgif #fanart #digitalArt | #art #wallart #fans |
| 1 | 100001 | #digitalArt | #art #artgif #fanart #digitalArt | #crypto #nft #artgif |
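For reference, this is roughly how I compute AP@K per query hashtag and average it into MAP@K; in this sketch I exclude the query hashtag itself from its ground-truth set, and the function names are just for illustration.

```python
def ap_at_k(actual, recommended, k):
    """Average precision at K for one query hashtag.

    actual      -- set of hashtags that actually co-occurred with the query
    recommended -- ranked list of recommended hashtags
    """
    if not actual:
        return 0.0
    recommended = recommended[:k]
    hits, score = 0, 0.0
    for i, tag in enumerate(recommended):
        if tag in actual and tag not in recommended[:i]:
            hits += 1
            score += hits / (i + 1)  # precision at this rank, counted only at hits
    return score / min(len(actual), k)

def map_at_k(per_query, k):
    """Mean of AP@K over all query hashtags; per_query is a list of (actual_set, ranked_list)."""
    return sum(ap_at_k(a, r, k) for a, r in per_query) / len(per_query)

# First row of the table above: query #art, co-occurring tags {#artgif, #fanart, #digitalArt},
# and only #artgif is recommended, at rank 3.
actual = {"#artgif", "#fanart", "#digitalArt"}
recommended = ["#amazingArt", "#artistic", "#artgif"]
print(ap_at_k(actual, recommended, k=3))  # (1/3) / 3 ≈ 0.11
```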
I am basically looking for answers to 4 questions.
- Am I moving in the right direction to evaluate the hashtag recommendations?
- Should I calculate MAP@K on the entire dataset (which I am currently doing), or split the dataset into training and testing sets and calculate the metric on the test set? If I do split, should the hashtags in the test data also be kept unseen by the model during training? I am unable to figure this out (a possible post-level split is sketched after this list).
- What value of MAP@K is good enough for 5 recommendations? I am currently getting approximately 0.12 for MAP@5.
- Are there any other evaluation metrics that could help me understand the quality of the recommendations?
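For context on the second question, this is the kind of post-level split I have in mind but have not settled on; it reuses the `corpus` list of per-post hashtag tokens from the first sketch, and whether this protocol is appropriate is exactly what I am unsure about.

```python
from collections import defaultdict
from sklearn.model_selection import train_test_split

# Post-level split: the model would only see hashtag co-occurrences from training posts.
train_posts, test_posts = train_test_split(corpus, test_size=0.2, random_state=42)

# Ground truth for evaluation: hashtags that co-occur in the held-out posts.
test_cooccurrence = defaultdict(set)
for post in test_posts:
    for tag in post:
        test_cooccurrence[tag].update(t for t in post if t != tag)
```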