Confusion about (Mean) Average Precision

In this question I asked for clarification about the precision-recall curve.

In particular, I asked whether we have to consider a fixed number of ranked results to draw the curve, or whether we can reasonably choose that number ourselves. According to the answer, the second option is correct.

However, now I have a big doubt about the Average Precision (AP) value: AP is used to estimate numerically how good our algorithm is for a given query. Mean Average Precision (MAP) is the average of the AP values over multiple queries.
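(In symbols: MAP = (AP(1) + ... + AP(Q)) / Q for a set of Q queries.)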

My doubt is this: if AP changes according to how many objects we retrieve, then we can tune this parameter to our advantage so that we show the best AP value possible. For example, supposing that the precision-recall curve performs wonderfully up to 10 elements and then horribly, we could "cheat" by computing the (M)AP value over only the first 10 elements.

I know this may sound confusing, but I couldn't find anything about it anywhere.

There are 2 answers below.

BEST ANSWER

AP is the area under the precision-recall curve, and the precision-recall curve is supposed to be computed over the entire returned ranked list.

It is not possible to cheat AP by tweaking the size of the returned ranked list. The precision-recall curve plots precision as a function of recall, where recall is the number of returned positives relative to the total number of positives in the ground truth, not relative to the number of positives in the returned list. So if you crop the list, all you are doing is cropping the precision-recall curve and discarding its tail.

Since AP is the area under that curve, cropping the list can only reduce AP, so there is no wisdom in tweaking the ranked list size: the maximal AP is achieved by returning the entire list. You can see this, for example, in the code you cited in your other question, where cropping the list simply corresponds to

for ( ; i<ranked_list.size(); ++i) {

changing to

for ( ; i<some_number; ++i) {

which results in fewer increments of ap (all increments are non-negative, since old_precision and precision are non-negative and recall is non-decreasing) and thus a smaller AP value.
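For concreteness, here is a minimal sketch of such an AP computation in the same spirit (the function signature and the names positives and total_positives are my own; the accumulation mirrors the cited code):

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Minimal sketch of an AP computation. total_positives is the number
// of positives in the ground truth, NOT the number of positives that
// happen to appear in the (possibly cropped) ranked list.
double average_precision(const std::vector<std::string>& ranked_list,
                         const std::set<std::string>& positives,
                         std::size_t total_positives) {
    double ap = 0.0, old_recall = 0.0, old_precision = 1.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < ranked_list.size(); ++i) {
        if (positives.count(ranked_list[i])) ++hits;
        double recall = static_cast<double>(hits) / total_positives;
        double precision = static_cast<double>(hits) / (i + 1);
        // Each increment is non-negative: recall never decreases and
        // both precisions are non-negative, so stopping the loop early
        // (i.e. cropping the list) can only lower the result.
        ap += (recall - old_recall) * (old_precision + precision) / 2.0;
        old_recall = recall;
        old_precision = precision;
    }
    return ap;
}

Calling this with any prefix of ranked_list but the same total_positives gives exactly the cropped-curve behaviour described above.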

In practice, for purely computational reasons, you might want to crop the list at some reasonable length, e.g. 10k, as AP is unlikely to change much: precision at such a large rank is likely to be close to 0, unless you have an unusually large number of positives.

Your confusion might be related to the way some popular functions, such as VLFeat's vl_pr, compute precision-recall curves: they assume you have provided the entire ranked list, and therefore compute the total number of positives by looking at the ranked list instead of the ground truth itself. So if you used vl_pr naively on cropped lists, you could indeed cheat it, but that would be an invalid computation.

I agree this is not 100% clear from the function's description, but if you examine the documentation in more detail, you'll see it mentions NUMNEGATIVES and NUMPOSITIVES; if you are giving an incomplete ranked list, you should set these two quantities to let the function know how to compute the precision-recall curve / AP properly. If you then plot different crops of a ranked list using vl_pr with the same NUMNEGATIVES and NUMPOSITIVES for all function calls, you'll see that the precision-recall curves are just crops of each other, as explained above. (I haven't checked this as I don't have MATLAB here, but I'm certain it's the case, and if it's not we should file a bug.)
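To express that pitfall in terms of the sketch above (again my own names, not VLFeat's actual API; cropped_list is a prefix of the full ranked list):

#include <cstdio>

// Reuses average_precision from the sketch above.
void compare_ap(const std::vector<std::string>& cropped_list,
                const std::set<std::string>& positives) {
    // Invalid: infer the positive count from the cropped list itself,
    // which is effectively what happens when a cropped list is given
    // without NUMPOSITIVES/NUMNEGATIVES: recall reaches 1.0 too early
    // and the AP is inflated.
    std::size_t positives_in_crop = 0;
    for (const std::string& id : cropped_list)
        if (positives.count(id)) ++positives_in_crop;
    double ap_cheated = average_precision(cropped_list, positives, positives_in_crop);

    // Valid: normalise by the ground-truth count, so the cropped curve
    // is just a prefix of the full curve and the AP can only shrink.
    double ap_valid = average_precision(cropped_list, positives, positives.size());

    std::printf("cheated AP = %f, valid AP = %f\n", ap_cheated, ap_valid);
}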

SECOND ANSWER

What you said is partially correct. If you get a reasonable MAP or AP over the top N retrieved documents, that's fine; it's not cheating, because your IR system is retrieving a good number of relevant documents within the top N results, even though it may still be missing some relevant ones. Note that for an IR system it is better to rank all the retrieved relevant documents highly, even if it cannot retrieve every relevant document, and that is what AP measures (a higher rank means rank 1 or 2 instead of 100 or 101).

Consider an example: you have two relevant documents, one returned at rank 1 and the other at rank 50. Now, if you compute MAP or AP over the top 10 returned documents, you must report the result as MAP@10 or AP@10. Generally, AP means average precision over all returned documents, but if you consider only the top N documents, your metric is AP@N instead of plain AP, and note that this is not cheating! However, if you compute AP@N and report it as AP, then you are giving only partial information to your readers.
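As a quick worked example, using the convention described below that unretrieved relevant documents contribute zero precision: the document at rank 1 contributes 1/1 = 1 and the document at rank 50 contributes 2/50 = 0.04, so AP over the full list is (1 + 0.04) / 2 = 0.52, while AP@10 = (1 + 0) / 2 = 0.5, since the second relevant document falls outside the top 10.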

An important fact about MAP: if a relevant document never gets retrieved, we take the precision corresponding to that document to be zero. When computing AP, we divide the accumulated precision by the total number of relevant documents. So, when you compute MAP@N or AP@N, it means you only care about the top N documents returned by the IR system. For example, I used MAP@100 in one of my research works.
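Here is a minimal sketch of AP@N under that convention (illustrative names, not from any particular library; note that this precision-at-relevant-ranks form differs slightly from the trapezoidal area form used in the accepted answer):

#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// AP@N: relevant documents missing from the top N contribute zero
// precision, and the sum is divided by the TOTAL number of relevant
// documents, not just those retrieved.
double average_precision_at_n(const std::vector<std::string>& ranked_list,
                              const std::set<std::string>& relevant,
                              std::size_t n) {
    std::size_t limit = std::min(n, ranked_list.size());
    std::size_t hits = 0;
    double sum_precision = 0.0;
    for (std::size_t i = 0; i < limit; ++i) {
        if (relevant.count(ranked_list[i])) {
            ++hits;
            // Precision at the rank where this relevant document appears.
            sum_precision += static_cast<double>(hits) / (i + 1);
        }
    }
    return relevant.empty() ? 0.0 : sum_precision / relevant.size();
}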

If you are still confused about AP or MAP, you can see my brief answer explaining them here. Hopefully it will help clarify things.