I was trying to calculate the AUC using MySQL for data in a table like the one below:
y p
1 0.872637
0 0.130633
0 0.098054
...
...
1 0.060190
0 0.110938
I came across the following SQL query, which gives the correct AUC score (I verified it against sklearn).
SELECT (sum(y*r) - 0.5*sum(y)*(sum(y)+1)) / (sum(y) * sum(1-y)) AS auc
FROM (
    SELECT y, row_number() OVER (ORDER BY p) AS r
    FROM probs
) t
Using pandas this can be done as follows:
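import numpy as np

temp = df.sort_values(by="p")            # df has columns y (0/1 label) and p (score)
temp["r"] = np.arange(1, len(temp) + 1)  # 1-based rank after sorting by p
n_pos = temp["y"].sum()                  # number of ones
n_neg = (1 - temp["y"]).sum()            # number of zeros
print(((temp["y"] * temp["r"]).sum() - 0.5 * n_pos * (n_pos + 1)) / (n_pos * n_neg))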
I do not understand how this method computes the AUC. Can somebody please give the intuition behind it?
I am already familiar with the trapezoidal method which involves summing the area of small trapezoids under the ROC curve.
Short answer: it is the Wilcoxon-Mann-Whitney statistic, see https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve. The page has a proof as well.
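To make the statistic concrete, here is a minimal sketch of the pairwise definition in plain Python: the share of (negative, positive) pairs in which the positive row has the higher score. The function name `pairwise_auc` is made up for illustration, and the five rows are only the rows visible in the question's table (the elided rows are not included), so the value is not the AUC of the full data.

```python
def pairwise_auc(y, p):
    """Fraction of (negative, positive) pairs where the positive has the higher score."""
    pos = [score for label, score in zip(y, p) if label == 1]
    neg = [score for label, score in zip(y, p) if label == 0]
    # Count pairs (t_0, t_1) with y(t_0) = 0, y(t_1) = 1 and p(t_0) < p(t_1).
    wins = sum(1 for s0 in neg for s1 in pos if s0 < s1)
    return wins / (len(pos) * len(neg))


# Illustrative data: only the rows shown in the question.
y = [1, 0, 0, 1, 0]
p = [0.872637, 0.130633, 0.098054, 0.060190, 0.110938]
auc = pairwise_auc(y, p)
print(auc)
```

This brute-force count is quadratic in the number of rows; the sorted-rank formula from the query gets the same number from a single sort.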
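The bottom part of your formula is identical to the bottom part of the formula in the wiki,

$$\mathrm{AUC}(f) = \frac{\sum_{t_0 \in D^0} \sum_{t_1 \in D^1} \mathbf{1}[f(t_0) < f(t_1)]}{|D^0| \cdot |D^1|},$$

since sum(y) is the number of ones, $|D^1|$, and sum(1-y) is the number of zeros, $|D^0|$. The top part is trickier. $f$ in the wiki corresponds to $p$ in your data, and $t_0$ and $t_1$ are row indexes in the data frame. Note that we first sort by $p$, which makes our life easier.

Note that the double sum may be decomposed as

$$\sum_{t_0 \in D^0} \sum_{t_1 \in D^1} \mathbf{1}[p(t_0) < p(t_1)] = \sum_{t_1:\, y(t_1)=1} \#\{t_0 : p(t_0) < p(t_1),\ y(t_0)=0\}.$$

Here $\#$ stands for the total number of such indexes.

For each row index $t_1$ (such that $y(t_1)=1$), how many $t_0$ are such that $p(t_0) < p(t_1)$ and $y(t_0)=0$? We know that there are exactly $t_1$ values of $p$ that are less than or equal to $p(t_1)$, because the values are sorted (rows are numbered from 1, and this assumes there are no ties in $p$). We conclude that

$$\#\{t_0 : p(t_0) < p(t_1),\ y(t_0)=0\} = t_1 - \#\{t_0 : t_0 \le t_1,\ y(t_0)=1\}.$$

Now imagine scrolling down the sorted dataframe. The first time we meet $y=1$, $\#\{t_0 : t_0 \le t_1,\ y(t_0)=1\} = 1$; the second time we meet $y=1$, the same quantity is 2; the third time, it is 3; and so on. Therefore, when we sum the equality over all indexes $t_1$ with $y(t_1)=1$, we get

$$\sum_{t_1:\, y(t_1)=1} \#\{t_0 : p(t_0) < p(t_1),\ y(t_0)=0\} = \sum_{t_1:\, y(t_1)=1} t_1 - (1 + 2 + \dots + n) = \sum_{t_1:\, y(t_1)=1} t_1 - \frac{n(n+1)}{2},$$

where $n$ is the total number of ones in the $y$ column. Now we need to do one more simplification. Note that

$$\sum_{t_1:\, y(t_1)=1} t_1 = \sum_{t_1:\, y(t_1)=1} t_1 \, y(t_1).$$

If $y(t_1)$ is not one, it is zero. Therefore,

$$\sum_{t_1:\, y(t_1)=1} t_1 = \sum_{t_1} t_1 \, y(t_1).$$

Plugging this into our formula and using that

$$n = \sum_{t} y(t)$$

finishes the proof of the formula you found.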
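The counting argument above can be checked numerically. The sketch below, in plain Python with made-up function names, walks down the sorted rows and compares, for each row with y=1, the direct count of zeros below it with "rank minus number of ones seen so far". It then evaluates the rank-sum formula from the query. It assumes there are no tied scores, and it uses only the five rows visible in the question, not the full table.

```python
def sorted_labels(y, p):
    """Labels reordered by ascending score p."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    return [y[i] for i in order]


def negatives_below_each_positive(y, p):
    """For every row with y=1 at 1-based rank t1, return (direct count, t1 - ones seen)."""
    labels = sorted_labels(y, p)
    result = []
    ones_seen = 0
    for t1, label in enumerate(labels, start=1):
        if label == 1:
            ones_seen += 1
            # Direct count of rows with y=0 ranked below this positive row.
            direct = sum(1 for lab in labels[:t1] if lab == 0)
            result.append((direct, t1 - ones_seen))
    return result


def rank_formula_auc(y, p):
    """AUC from the rank-sum formula used in the SQL query (no ties assumed)."""
    labels = sorted_labels(y, p)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r * label for r, label in enumerate(labels, start=1))
    return (rank_sum - 0.5 * n_pos * (n_pos + 1)) / (n_pos * n_neg)


# Illustrative data: only the rows shown in the question.
y = [1, 0, 0, 1, 0]
p = [0.872637, 0.130633, 0.098054, 0.060190, 0.110938]
counts = negatives_below_each_positive(y, p)
auc = rank_formula_auc(y, p)
print(counts)
print(auc)
```

Both entries of every pair in `counts` agree, which is the per-row identity in the proof; summing them and dividing by the number of (zero, one) pairs gives the value returned by `rank_formula_auc`.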
P.S. I think that posting this question on math or stats overflow would make more sense.