How to evaluate hashtag recommendation system using MAP@K and MAR@K?


I am working on building a hashtag recommendation system and am looking into the best ways to evaluate it.

The problem statement is: for a given hashtag, recommend the most relevant (3 or 5) hashtags to the user.

The dataset contains one post per row, with the post id and the hashtags that appear in that post.

| post_id | hashtags |
|---------|----------|
| 100001  | #art #artgif #fanart #digitalArt |

These are the steps I have followed.

  1. Preprocessed the hashtag data.
  2. Trained a fastText model on the entire hashtag corpus.
  3. Generated word embeddings for all the hashtags.
  4. Used K-Nearest Neighbors over the embeddings to recommend hashtags. (A rough sketch of this pipeline follows this list.)
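
Roughly, the pipeline looks like the sketch below. I am assuming gensim's FastText and scikit-learn's NearestNeighbors here (both assumptions on the tooling); the tiny corpus and the hyperparameters are only placeholders, not my actual values.

```python
# Sketch of steps 2-4, assuming gensim's FastText and scikit-learn's
# NearestNeighbors; corpus and hyperparameters are illustrative placeholders.
import numpy as np
from gensim.models import FastText
from sklearn.neighbors import NearestNeighbors

# Each post's hashtags form one "sentence" for the embedding model.
corpus = [
    ["#art", "#artgif", "#fanart", "#digitalart"],
    ["#crypto", "#nft", "#digitalart"],
    ["#art", "#wallart", "#fanart"],
]

# Step 2: train fastText on the hashtag corpus.
model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

# Step 3: collect the embedding of every hashtag in the vocabulary.
vocab = list(model.wv.index_to_key)
vectors = np.array([model.wv[tag] for tag in vocab])

# Step 4: k-nearest-neighbour lookup over the embeddings (cosine distance).
knn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(vectors)

def recommend(tag, k=5):
    """Return the k hashtags closest to `tag`, excluding the query itself."""
    _, idx = knn.kneighbors(model.wv[tag].reshape(1, -1), n_neighbors=k + 1)
    return [vocab[i] for i in idx[0] if vocab[i] != tag][:k]

print(recommend("#art", k=3))
```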

I am trying to evaluate the model using MAP@K.

So for each unique hashtag I take the top 3 or top 5 recommendations from the model and compare them against the hashtags that actually co-occurred with the query hashtag.

I am using MAP@K to evaluate the recommendations, treating recommendation as a ranking task. A user has a finite amount of time and attention, so we want to know not just three tags they might like, but which ones they are most likely to like, or which we are most confident about. For this kind of task we want a metric that rewards us for returning many relevant recommendations and for ranking them earlier in the list (higher ranked). Hence MAP@K (K = 3 or 5; the value of K is not finalised).
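
For reference, this is the AP@K / MAP@K computation I have in mind, written directly from the standard definition; the normalisation by min(|relevant tags|, K) is the usual convention, not something specific to my pipeline.

```python
# AP@K for a single query hashtag and MAP@K over all queries, written from the
# standard definition; `recommended` is the ranked model output, `relevant` is
# the set of hashtags that co-occurred with the query.
def apk(relevant, recommended, k=5):
    hits, score = 0, 0.0
    for rank, tag in enumerate(recommended[:k], start=1):
        if tag in relevant:
            hits += 1
            score += hits / rank          # precision at this cut-off
    return score / min(len(relevant), k) if relevant else 0.0

def mapk(all_relevant, all_recommended, k=5):
    return sum(apk(r, p, k) for r, p in zip(all_relevant, all_recommended)) / len(all_relevant)
```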

The table below shows how I am evaluating the recommendations for each query hashtag.

| post_id | query_hashtag | hashtags | recommended_hashtags |
|---------|---------------|----------|----------------------|
| 100001  | #art          | #art #artgif #fanart #digitalArt | #amazingArt #artistic #artgif |
| 100001  | #artgif       | #art #artgif #fanart #digitalArt | #fanArt #artistic #artgif |
| 100001  | #fanart       | #art #artgif #fanart #digitalArt | #art #wallart #fans |
| 100001  | #digitalArt   | #art #artgif #fanart #digitalArt | #crypto #nft #artgif |
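
As a sanity check, here is AP@3 for the first row above, under my assumption that the query hashtag itself is excluded from the set of relevant hashtags:

```python
# Worked AP@3 for the first table row (query #art). Excluding the query tag
# from the relevant set is my assumption about how to handle it.
relevant = {"#artgif", "#fanart", "#digitalArt"}            # co-occurring tags
recommended = ["#amazingArt", "#artistic", "#artgif"]       # ranked model output
hits, score = 0, 0.0
for rank, tag in enumerate(recommended, start=1):
    if tag in relevant:
        hits += 1
        score += hits / rank
print(score / min(len(relevant), 3))   # only #artgif hits, at rank 3 -> (1/3)/3 ≈ 0.11
```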

I am basically looking for answers to 4 questions.

  1. Am I moving in the right direction to evaluate the hashtag recommendations?
  2. Should I calculate MAP@K on the entire dataset (which I am currently doing), or split the dataset into training and test sets and calculate the metric on the test set? If I do split, should I also keep the test-set hashtags hidden from the model during training? I am unable to figure this out.
  3. What value of MAP@K is good enough for 5 recommendations? I am getting approximately 0.12 for MAP@5.
  4. Is there any other evaluation metric that can help me understand the quality of the recommendations?

There is 1 answer below.

Answer by Pat Ferrel

Answers:

  1. Perhaps; read on.
  2. "cross-validation" tests like MAP@k require that the data is split into "test" and "training" data. save 20% of the data for the "test" part then train the model on the rest. For the "test" set get a hashtag and make the query of the model. For every time the query returns a tag associated with the "test" datum you have a positive result. This allows you to calculate MAP@k. You can perform subsequent splits to use all data and combine the results but this is usually not necessary.
  3. There is no fixed "good" value for MAP@k. Compute MAP@k for a random recommender, and also use your dataset to build a "popular" hashtags recommender. Random and popular tags give you two more MAP@k numbers; these should be significantly lower than the recommender's MAP@k. The recommender's MAP@k can also serve as a baseline for future improvements, such as changes to the word embeddings: beating the baseline means you have a better recommender.
  4. Results with humans are the best metric, since a recommender is trying to guess what humans are interested in. This requires an A/B test of two variants, such as random vs. recs, or no recs vs. recs. Set the test up so that the app serves no recs or random recs; this is the "A" part, and "B" uses your recs. If you get significantly more clicks with "B", you have clearly improved results for your app. This assumes your app considers more clicks to be the thing to optimize; if you want to optimize time-on-site instead, replace the metric for the A/B test.
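
A minimal sketch of points 2 and 3, under the assumption that each post is a list of hashtags and that `recommend(tag, k)` is the trained model's query function; the sample posts and the baseline recommenders are placeholders, not the asker's actual code.

```python
# Split-and-baseline evaluation sketch for points 2 and 3.
import random
from collections import Counter

def apk(relevant, recommended, k):
    # Same AP@K as in the question's sketch.
    hits, score = 0, 0.0
    for rank, tag in enumerate(recommended[:k], start=1):
        if tag in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def mapk_on_posts(posts, recommend_fn, k=5):
    """Score every (query tag, post) pair in the held-out posts against the
    other tags that co-occur in the same post."""
    scores = []
    for tags in posts:
        for query in tags:
            relevant = set(tags) - {query}
            if relevant:
                scores.append(apk(relevant, recommend_fn(query, k), k))
    return sum(scores) / len(scores)

# 80/20 split over posts: train the embedding model on `train`, score on `test`.
posts = [
    ["#art", "#artgif", "#fanart", "#digitalArt"],
    ["#crypto", "#nft", "#digitalArt"],
    ["#art", "#wallart", "#fanart"],
    ["#fanart", "#artistic", "#amazingArt"],
    ["#nft", "#crypto", "#wallart"],
]
random.shuffle(posts)
split = int(0.8 * len(posts))
train, test = posts[:split], posts[split:]

# Baselines from point 3: random tags and globally popular tags.
train_tags = [t for tags in train for t in tags]
popular = [t for t, _ in Counter(train_tags).most_common()]

def recommend_random(query, k):
    pool = sorted(set(train_tags) - {query})
    return random.sample(pool, k=min(k, len(pool)))

def recommend_popular(query, k):
    return [t for t in popular if t != query][:k]

print("random  MAP@5:", mapk_on_posts(test, recommend_random, k=5))
print("popular MAP@5:", mapk_on_posts(test, recommend_popular, k=5))
# print("model   MAP@5:", mapk_on_posts(test, recommend, k=5))  # your trained recommender
```

The trained recommender's MAP@k should clearly beat both baselines; if it does not, the embeddings or the neighbour lookup need another look.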