ROC and AUC
ROC, which stands for receiver operating characteristic, is a commonly used criterion for measuring the performance of a classifier.
Preliminary knowledge
Before introducing the concept of ROC, we first need to cover some basics. Take a binary classifier as an example (logistic regression, random forest, etc.); there are four possible outcomes when making a prediction.
- TP(true positive): predict a positive sample as positive
- FP(false positive): predict a negative sample as positive
- TN(true negative): predict a negative sample as negative
- FN(false negative): predict a positive sample as negative
Now let TP, FP, TN, FN denote the number of samples in each of these situations, and let 1 be the positive label and 0 the negative label. Then we get the following table:
Ground truth \ Predicted | 1 | 0 | Aggregate |
---|---|---|---|
1 | TP | FN | Actual positive (TP+FN) |
0 | FP | TN | Actual negative (FP+TN) |
Aggregate | Predicted positive (TP+FP) | Predicted negative (FN+TN) | Total (TP+FP+TN+FN) |
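To make the table concrete, here is a small Python sketch (plain Python, no particular library; the helper name is just for illustration) that tallies the four counts from ground-truth labels and hard predictions:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

print(confusion_counts([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # (2, 1, 1, 1)
```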
Now we give some definitions:
- TPR(True positive rate): $TP/(TP+FN)$
- FPR(False positive rate): $FP/(FP+TN)$
- TNR(True negative rate): $TN/(TN+FP)$
- FNR(False negative rate): $FN/(FN+TP)$
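Given those counts, the four rates follow directly from the definitions above; a minimal sketch (hypothetical helper, not tied to any library):

```python
def rates(tp, fp, tn, fn):
    """Compute TPR, FPR, TNR and FNR from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # true positive rate
    fpr = fp / (fp + tn)  # false positive rate
    tnr = tn / (tn + fp)  # true negative rate
    fnr = fn / (fn + tp)  # false negative rate
    return tpr, fpr, tnr, fnr

# Example: 8 TP, 3 FP, 7 TN, 2 FN
print(rates(tp=8, fp=3, tn=7, fn=2))  # (0.8, 0.3, 0.7, 0.2)
```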
How to draw a ROC
Generally speaking, ROC is a curve whose X-axis is FPR and Y-axis is TPR. Assume we have a test set $D$ consisting of $m_+$ positive samples and $m_-$ negative samples, and a binary classifier (e.g. logistic regression) that outputs the probability of a sample being positive. In practice we need to set a specific threshold $t$, and samples with probability higher than or equal to $t$ are regarded as positive. We can keep adjusting $t$ between $0$ and $1$ and calculate **FPR** and **TPR** for each $t$ as the coordinates of a point. All of these points together give us the ROC curve. Below is a concrete example:
# | Class | Score | # | Class | Score |
---|---|---|---|---|---|
1 | P | 0.99 | 11 | N | 0.55 |
2 | P | 0.97 | 12 | N | 0.50 |
3 | N | 0.80 | 13 | P | 0.49 |
4 | P | 0.75 | 14 | N | 0.47 |
5 | N | 0.70 | 15 | P | 0.45 |
6 | P | 0.69 | 16 | N | 0.40 |
7 | P | 0.65 | 17 | N | 0.35 |
8 | P | 0.63 | 18 | P | 0.33 |
9 | N | 0.62 | 19 | N | 0.22 |
10 | P | 0.55 | 20 | N | 0.11 |
Assume we now have 20 test samples (10 positive and 10 negative), where the column Class is the actual label and the column Score is the predicted probability from our classifier. We have sorted them in descending order of Score. The final ROC curve we get is shown below.
The procedure is as follows.
- First we set the threshold $t$ to 1; then no samples are treated as positive. In this case, $FPR=FP/(FP+TN)=0/(0+10)=0$ and $TPR=TP/(TP+FN)=0/(0+10)=0$, which gives us the first point A(0.0,0.0).
- Then we set $t$ to $0.99$, so the first sample is treated as positive. In this case, $FPR=FP/(FP+TN)=0/(0+10)=0$ and $TPR=TP/(TP+FN)=1/(1+9)=0.1$, which gives us the second point B(0.0,0.1).
- We continue setting $t$ to each score in descending order to get a series of points. Finally, when $t$ is $0$, all samples are predicted as positive. In this case, $FPR=FP/(FP+TN)=10/(10+0)=1$ and $TPR=TP/(TP+FN)=10/(10+0)=1$, which gives us the end point T(1.0,1.0).
This is how we draw a ROC curve given a test set and a binary classifier. Note that the ROC curve always goes through the points (0,0) and (1,1).
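The threshold sweep described above can be reproduced with a short Python sketch. The labels and scores below are the 20-sample table from this section (P=1, N=0); the `roc_points` helper is illustrative, not part of any library:

```python
# Labels and scores from the example table, already sorted by score in descending order.
labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.99, 0.97, 0.80, 0.75, 0.70, 0.69, 0.65, 0.63, 0.62, 0.55,
          0.55, 0.50, 0.49, 0.47, 0.45, 0.40, 0.35, 0.33, 0.22, 0.11]

def roc_points(labels, scores):
    """Sweep the threshold over every distinct score and collect (FPR, TPR) points."""
    m_pos = sum(labels)
    m_neg = len(labels) - m_pos
    # Start above the highest score (nothing predicted positive); the lowest score
    # already classifies every sample as positive, giving the end point (1, 1).
    thresholds = [1.0] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / m_neg, tp / m_pos))
    return points

print(roc_points(labels, scores)[:3])  # [(0.0, 0.0), (0.0, 0.1), (0.0, 0.2)]
```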
AUC
What is AUC
After we get the ROC curve of a classifier on a specific test set, how can we evaluate its performance? We use the so-called AUC, which is the area under the ROC curve! Generally speaking, the bigger the AUC is, the better the classifier performs. Specifically:
- If AUC=1, the classifier is perfect, which means it gives all positive samples higher probabilities than all negative samples. In this case there must also exist a threshold that makes the classifier's accuracy 100%.
- If 0.5<AUC<1, the classifier is better than random guessing. A proper threshold can make the classifier valuable.
- If AUC=0.5, the classifier is equivalent to random guessing.
- If AUC<0.5, the classifier is even worse than random guessing. However, in this case we can always improve its performance by flipping the polarity of its predictions.
How to calculate AUC
Through observation, we can see that the area under the curve can be divided into multiple trapeziums. For every two adjacent points $p_i(x_i,y_i)$ and $p_{i+1}(x_{i+1},y_{i+1})$ on the ROC curve, we draw perpendicular lines to the X-axis through them; then the four points $(x_{i},0)$, $(x_{i},y_i)$, $(x_{i+1},y_{i+1})$, $(x_{i+1},0)$ form a trapezium, as illustrated in the following figure.
Assume the test set $D$ consists of a positive sample set $D_{+}$ and a negative sample set $D_{-}$, with $|D|=m$, $|D_{+}|=m_{+}$, $|D_{-}|=m_{-}$. The ROC curve is formed by a series of points:
$$\{p_{1}(x_{1},y_{1}),\,p_{2}(x_{2},y_{2}),\,\dots,\,p_{i}(x_{i},y_{i}),\,\dots,\,p_{m}(x_{m},y_{m})\}.$$
Then AUC can be calculated as:
$$AUC=\frac{1}{2}\sum_{i=1}^{m-1}(x_{i+1}-x_{i})(y_{i+1}+y_{i}) $$
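This formula translates directly into a few lines of Python; a minimal sketch, assuming the points are already sorted by increasing FPR:

```python
def auc_trapezoid(points):
    """Sum trapezium areas between consecutive (FPR, TPR) points, sorted by FPR."""
    return sum(0.5 * (x2 - x1) * (y2 + y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Tiny check: a diagonal ROC curve (random guessing) has AUC 0.5.
print(auc_trapezoid([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))  # 0.5
```

Applied to the (FPR, TPR) points of the 20-sample example above, this sum works out to an AUC of 0.705.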
Another interpretation of AUC
We know that if the classifier gives all positive samples higher probabilities than all negative samples, then the AUC is 1. This suggests that AUC may be related to the quality of the ranking of the test set based on predicted probabilities. In fact, maximizing AUC is equivalent to minimizing the rank loss, which is defined as:
$$l_{rank}=\frac{1}{m_{+}m_{-}} \sum_{x_{+}\in D_{+}}\sum_{x_{-}\in D_{-}} \left( I\Big(f(x_{+}) \lt f(x_{-})\Big)+\frac{1}{2}I\Big(f(x_{+})=f(x_{-})\Big) \right)$$
$I$ is an indicator function which maps $true$ to $1$ and $false$ to $0$. $f$ is a function which maps a sample to its probability of being positive. We consider each pair of a positive and a negative sample: if the positive sample's probability is less than the negative sample's, we get "one" penalty, and if they are equal, we get "half" a penalty. Essentially, $l_{rank}$ is the area above the ROC curve. In other words:
$$ AUC = 1-l_{rank}$$
To illustrate this, notice that we move from $p_{i}(x_{i},y_{i})$ to $p_{i+1}(x_{i+1},y_{i+1})$ on the ROC curve when adjusting the threshold $t$ from $t_{i}$ to $t_{i+1}$, and a lower threshold makes more samples be treated as positive. In this process, $(x_{i+1}-x_{i})\ast m_{-}$ is the number of newly added $FP$ and $(y_{i+1}-y_{i})\ast m_{+}$ is the number of newly added $TP$. For example, when we adjust the threshold from $0.62$ to $0.55$, we move from point $J$ to point $K$: we gain $1$ ($(0.4-0.3)\ast 10=1$) $FP$, which is the $11^{th}$ sample, and $1$ ($(0.7-0.6)\ast 10=1$) $TP$, which is the $10^{th}$ sample. Now we can get:
$$1-l_{rank}=1-\frac{1}{m_{+}m_{-}} \sum_{x_{+}\in D_{+}}\sum_{x_{-}\in D_{-}} \left( I\Big(f(x_{+}) \lt f(x_{-})\Big)+\frac{1}{2}I\Big(f(x_{+})=f(x_{-})\Big) \right) $$ $$=\frac{1}{m_{+}m_{-}} \sum_{x_{+}\in D_{+}}\sum_{x_{-}\in D_{-}} \left( I\Big(f(x_{+}) \gt f(x_{-})\Big)+ I\Big(f(x_{+}) \lt f(x_{-})\Big)+I\Big(f(x_{+})=f(x_{-})\Big) \right)$$ $$-\frac{1}{m_{+}m_{-}} \sum_{x_{+}\in D_{+}}\sum_{x_{-}\in D_{-}} \left( I\Big(f(x_{+}) \lt f(x_{-})\Big)+\frac{1}{2}I\Big(f(x_{+})=f(x_{-})\Big) \right)$$ $$=\frac{1}{m_{+}m_{-}} \sum_{x_{+}\in D_{+}}\sum_{x_{-}\in D_{-}} \left( I\Big(f(x_{+}) \gt f(x_{-})\Big)+\frac{1}{2}I\Big(f(x_{+})=f(x_{-})\Big) \right)$$
We need to show that this is also the area under the curve. We can decompose the AUC in this way:
We can count the number of pairs of positive and negative samples in which $f(x_{+})\gt f(x_{-})$ in the following way. For example, from $p_{A}$ to $p_{C}$ we gain $2$ positive samples; how many negative samples have a lower probability? Horizontal line segments represent negative samples: from $p_{C}$ to $p_{D}$ we have $1$ negative sample, and from $p_{E}$ to $p_{F}$ we have $1$ negative sample. It is easy to see that there are $10$ negative samples with lower probability in total, so the number of pairs is $2\ast 10=20$. Dividing this by $m_{+}\ast m_{-}=100$ gives exactly the area of the bottom two rectangles.
It is also not hard to see why we add $\frac{1}{2}$ when $f(x_{+})=f(x_{-})$ if we consider the triangle under the line segment between $p_{J}$ and $p_{K}$. So, intuitively, we can see that $1-l_{rank}$ is just the $AUC$.
From this perspective, AUC can be understood as the probability that a randomly picked positive sample receives a higher predicted score than a randomly picked negative sample.
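This interpretation gives a second, equivalent way to compute AUC: compare every pair of a positive and a negative sample directly, counting ties as one half. A minimal sketch (the helper name is illustrative):

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = [1.0 if sp > sn else 0.5 if sp == sn else 0.0 for sp in pos for sn in neg]
    return sum(wins) / len(wins)

# Tiny check: one correctly ranked pair plus one tied pair -> (1 + 0.5) / 2 = 0.75
print(auc_pairwise([1, 1, 0], [0.9, 0.4, 0.4]))  # 0.75
```

On the 20-sample example this agrees with the trapezium computation, giving 0.705.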
Python script to draw ROC
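The original script is not preserved here, so below is a minimal, self-contained sketch that plots the ROC curve for the 20-sample example using matplotlib and scikit-learn's `roc_curve`/`auc`; the plotting choices (markers, labels) are just one reasonable option:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# 20-sample example from this post: ground-truth labels (P=1, N=0) and predicted scores.
labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.99, 0.97, 0.80, 0.75, 0.70, 0.69, 0.65, 0.63, 0.62, 0.55,
          0.55, 0.50, 0.49, 0.47, 0.45, 0.40, 0.35, 0.33, 0.22, 0.11]

# roc_curve sweeps the thresholds for us and returns the (FPR, TPR) points.
fpr, tpr, thresholds = roc_curve(labels, scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, marker="o", label=f"ROC curve (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")  # AUC = 0.5 baseline
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC curve for the 20-sample example")
plt.legend(loc="lower right")
plt.show()
```

Note that scikit-learn may drop thresholds that do not change the shape of the curve, so the plotted curve can have fewer vertices than the manual threshold sweep while still representing the same ROC curve and the same AUC.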