ROC Curves

2019-01-17

by Joseph Rickert

I have been thinking about writing a short post on R resources for working with Receiver Operating Characteristic (ROC) curves, but first I thought it would be nice to review the basics. In contrast to the usual (usual for data scientists anyway) machine learning point of view, I’ll frame the topic closer to its historical origins as a portrait of practical decision theory.

ROC curves were invented during WWII to help radar operators decide whether the signal they were getting indicated the presence of an enemy aircraft or was just noise. (O’Hara et al. specifically refer to the Battle of Britain, but I haven’t been able to track that down.)

I am relying on James Egan’s classic text (Signal Detection Theory and ROC Analysis) for the basic setup of the problem. It goes something like this: suppose there is an observed quantity (maybe the amplitude of the radar blip), X, that could indicate either the presence of a meaningful signal (e.g. from a Messerschmitt) embedded in noise, or just noise alone (geese). When viewing X in some small interval of time, we would like to establish a threshold or cutoff value, c, such that if X > c we can be pretty sure we are observing a signal and not just noise. The situation is illustrated in the little animation below.

library(tidyverse)
library(gganimate)  #for animation
library(magick)     # to put animations side by side

We model the noise alone as random draws from a N(0,1) distribution and signal plus noise as draws from N(s_mean, s_sd), and we compute two conditional probabilities as functions of the cutoff c: the probability of a “Hit”, P(X > c | a signal is present), and the probability of a “False Alarm”, P(X > c | noise only).

s_mean <- 2  # signal mean
s_sd <- 1.1   # signal standard deviation

x <- seq(-5,5,by=0.01) # range of signal
signal <- rnorm(100000,s_mean,s_sd)
noise <- rnorm(100000,0,1)

PX_n <- 1 - pnorm(x, mean = 0, sd = 1) # P(X > c | noise only) = False alarm rate
PX_sn <- 1 - pnorm(x, mean = s_mean, sd = s_sd) # P(X > c | signal plus noise) = Hit rate
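
As a quick sanity check on these two formulas, the sketch below fixes a single cutoff and compares the exact rates from pnorm() with the proportions observed in simulated draws. The cutoff c_cut = 1 is an arbitrary choice, the seed is added only for reproducibility, and the names c_cut, fa_exact, hit_exact, fa_sim and hit_sim are illustrative and not part of the original code.

```r
set.seed(123)    # assumption: seed added only so the numbers are reproducible
s_mean <- 2      # signal mean, as above
s_sd   <- 1.1    # signal standard deviation, as above
c_cut  <- 1      # hypothetical cutoff chosen for illustration

signal <- rnorm(100000, s_mean, s_sd)
noise  <- rnorm(100000, 0, 1)

# Exact rates from the normal CDF (upper tail = probability of exceeding the cutoff)
fa_exact  <- pnorm(c_cut, mean = 0, sd = 1, lower.tail = FALSE)
hit_exact <- pnorm(c_cut, mean = s_mean, sd = s_sd, lower.tail = FALSE)

# Empirical rates: share of simulated draws that exceed the cutoff
fa_sim  <- mean(noise > c_cut)
hit_sim <- mean(signal > c_cut)

round(c(fa_exact = fa_exact, fa_sim = fa_sim,
        hit_exact = hit_exact, hit_sim = hit_sim), 3)
```

With these settings the false alarm rate comes out near 0.16 and the hit rate near 0.82, and the simulated proportions should agree with the exact values to about two decimal places.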

We plot the noise and signal-plus-noise densities in the left panel of the animation, with a vertical line marking each value of the cutoff stored in threshold.

threshold <- data.frame(val = seq(from = .5, to = s_mean, by = .2))

dist <- 
  data.frame(signal = signal, noise = noise) %>% 
  gather(data, value) %>% 
  ggplot(aes(x = value, fill = data)) +
  geom_density(trim = TRUE, alpha = .5) +
  ggtitle("Conditional Distributions") +
  xlab("observed signal")  + 
  scale_fill_manual(values = c("pink", "blue"))

p1 <- dist + geom_vline(data = threshold, aes(xintercept = val), color = "red") +
            transition_manual(val)
p1 <- animate(p1)

And, we plot the ROC curve for our detection system in the right panel. Each point in this plot corresponds to one of the cutoff thresholds in the left panel.

df2 <- data.frame(x, PX_n, PX_sn)
roc <- ggplot(df2) +
  xlab("P(X > c | n)") + ylab("P(X > c | sn)") +
  geom_line(aes(PX_n, PX_sn)) +
  geom_abline(slope = 1) +
  ggtitle("ROC Curve") + 
  coord_equal()

q1 <- roc +
        geom_point(data = threshold, aes(1-pnorm(val),
                          1- pnorm(val, mean = s_mean, sd = s_sd)), 
                          color = "red") +
                          transition_manual(val)

q1 <- animate(q1)

(The slick trick of getting these two animation panels to line up in the same frame is due to a helper function, combine_gifs(), from Thomas Pedersen and Patrick Touche that can be found here. It is not defined in this post and must be loaded before running the next line.)

combine_gifs(p1,q1)

Notice that as the cutoff line moves further to the right, giving the decision maker a lower chance of raising a false alarm, the corresponding point moves down the ROC curve towards a lower Hit rate. This illustrates the fundamental tradeoff between hit rate and false alarm rate in the underlying decision problem. For any given problem, a decision algorithm or classifier will live on some ROC curve in false alarm / hit rate space. Improving the hit rate usually comes at the cost of more false alarms.
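
The tradeoff can also be tabulated directly. The sketch below evaluates both rates on the same grid of cutoffs used for the animation; the data frame name tradeoff and its column names are illustrative, not from the original code.

```r
s_mean <- 2      # signal mean, as in the simulation
s_sd   <- 1.1    # signal standard deviation, as in the simulation
cutoff <- seq(from = 0.5, to = s_mean, by = 0.2)   # same grid as the animation

tradeoff <- data.frame(
  cutoff      = cutoff,
  # P(X > c | noise only)
  false_alarm = pnorm(cutoff, mean = 0, sd = 1, lower.tail = FALSE),
  # P(X > c | signal plus noise)
  hit         = pnorm(cutoff, mean = s_mean, sd = s_sd, lower.tail = FALSE)
)

round(tradeoff, 3)
```

Reading down the table, both columns shrink together: each step to the right buys fewer false alarms at the price of fewer hits.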

The simulation code also lets you vary s_mean, the mean of the signal. Setting this to a large value (maybe 5) will sufficiently separate the signal from the noise, and you will get the kind of perfect-looking ROC curve you may be accustomed to seeing produced by your best classification models.
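
One way to see the effect of separation without redrawing the animation is to hold the false alarm rate fixed and ask what hit rate each signal mean delivers. This is a small sketch under assumptions: the 5% false alarm budget (fa_target) and the helper hit_at() are hypothetical and chosen only for illustration.

```r
s_sd      <- 1.1                                   # signal standard deviation, as above
fa_target <- 0.05                                  # hypothetical false alarm budget
c_cut     <- qnorm(fa_target, lower.tail = FALSE)  # cutoff with that false alarm rate under N(0, 1)

# Hit rate at the fixed cutoff for a given signal mean
hit_at <- function(s_mean) {
  pnorm(c_cut, mean = s_mean, sd = s_sd, lower.tail = FALSE)
}

round(sapply(c(s_mean_2 = 2, s_mean_5 = 5), hit_at), 3)
```

At the same false alarm rate, the hit rate is roughly 0.63 when s_mean is 2 and above 0.99 when s_mean is 5, which is why the second ROC curve hugs the upper left corner.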

The usual practice in machine learning applications is to compute the area under the ROC curve, AUC. This has become the “gold standard” for evaluating classifiers. Given a choice between different classification algorithms, data scientists routinely select the classifier with the highest AUC. The intuition behind this is compelling: given that the ROC curve is monotone increasing and, for any useful classifier, lies above the diagonal, the best possible curve will bend sharply into the upper left-hand corner and have an AUC approaching one (all of the area in ROC space).
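
For the simulated detector, the AUC could be computed numerically along the following lines, using the trapezoid rule on the exact ROC curve. This is a minimal sketch: the wider cutoff grid (-10 to 12) is an assumption made here so that the curve effectively reaches both corners of ROC space, and the names fa, hit and auc_trap are illustrative.

```r
s_mean <- 2
s_sd   <- 1.1
x <- seq(-10, 12, by = 0.01)   # assumption: wide grid of cutoffs covering both tails

fa  <- pnorm(x, mean = 0, sd = 1, lower.tail = FALSE)           # false alarm rate
hit <- pnorm(x, mean = s_mean, sd = s_sd, lower.tail = FALSE)   # hit rate

# Both rates fall as the cutoff rises, so reverse them to integrate
# from false alarm rate 0 up to 1
fa  <- rev(fa)
hit <- rev(hit)

# Trapezoid rule: width of each step times the average height of its two ends
auc_trap <- sum(diff(fa) * (head(hit, -1) + tail(hit, -1)) / 2)
round(auc_trap, 3)
```

For s_mean = 2 and s_sd = 1.1 this gives an AUC of about 0.91.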

Unfortunately, automatically calculating the AUC and selecting models on it discourages analysis of how the properties and weaknesses of ROC curves may pertain to the problem at hand. Keeping sight of the decision theory point of view may help to protect against the spell of mechanistic thinking encouraged by powerful algorithms. Although automatically selecting a classifier based on the value of the AUC may make good sense most of the time, things can go wrong. For example, it is not uncommon for analysts to interpret AUC as a measure of the accuracy of the classifier. But the AUC is not a measure of accuracy, as a little thought about the decision problem would make clear. The irony here is that there was a time, not too long ago, when people thought it was necessary to argue that the AUC is a better measure than accuracy for evaluating machine learning algorithms. For example, have a look at the 1997 paper by Andrew Bradley where he concludes that “…AUC be used in preference to overall accuracy for ‘single number’ evaluation of machine learning algorithms”.
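
To make the distinction concrete, here is a small sketch showing that accuracy depends on the chosen cutoff and on how often a signal is actually present, whereas the AUC of the detector is a single fixed number. The prevalence values (0.5 and 0.1), the three cutoffs, and the helper accuracy() are hypothetical choices for illustration.

```r
s_mean <- 2
s_sd   <- 1.1

# Accuracy at cutoff c_cut when a fraction `prev` of intervals contain a signal:
# correct decisions are hits on signal intervals plus correct rejections on noise
accuracy <- function(c_cut, prev) {
  hit <- pnorm(c_cut, mean = s_mean, sd = s_sd, lower.tail = FALSE)
  fa  <- pnorm(c_cut, mean = 0, sd = 1, lower.tail = FALSE)
  prev * hit + (1 - prev) * (1 - fa)
}

cutoffs <- c(0.5, 1, 2)   # hypothetical cutoffs
acc <- sapply(c(balanced = 0.5, rare = 0.1),
              function(p) accuracy(cutoffs, p))
rownames(acc) <- paste0("c=", cutoffs)
round(acc, 3)
```

The same detector, with the same ROC curve and therefore the same AUC, yields accuracies that differ by more than 0.2 across this table, and the cutoff that looks best when signals are common is not the one that looks best when they are rare.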

What does the AUC measure? For the binary classification problem of our simple signal processing example, a little calculus will show that the AUC is the probability that a randomly drawn interval with a signal present will produce a higher X value than a randomly drawn interval containing noise alone. See Hand (2009), and the very informative StackExchange discussion for the math.
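
This interpretation is easy to check by simulation. The sketch below pairs independent signal and noise draws and counts how often the signal value is larger, then compares the result with the closed form available when both distributions are normal, as in the simulation above. The seed and the names auc_mc and auc_exact are additions for illustration.

```r
set.seed(42)     # assumption: seed added only for reproducibility
s_mean <- 2
s_sd   <- 1.1
n      <- 100000

signal <- rnorm(n, s_mean, s_sd)
noise  <- rnorm(n, 0, 1)

# Monte Carlo estimate: fraction of pairs in which the signal draw exceeds the noise draw
auc_mc <- mean(signal > noise)

# Closed form for independent normals: signal - noise ~ N(s_mean, sqrt(s_sd^2 + 1)),
# so P(signal > noise) = pnorm(s_mean / sqrt(1 + s_sd^2))
auc_exact <- pnorm(s_mean / sqrt(1 + s_sd^2))

round(c(auc_mc = auc_mc, auc_exact = auc_exact), 3)
```

Both numbers should land near 0.91, which is also the area under the ROC curve for this detector.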

Also note that, in the paper just cited, Hand examines some of the deficiencies of the AUC. His discussion provides an additional incentive for keeping the decision theory tradeoff in mind when working with ROC curves. Hand concludes:

…it [AUC] is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules.

and goes on to propose the H measure for ranking classifiers. (See the R package hmeasure.) Following up on this will have to be an investigation for another day.

Our discussion in this post has taken us part way along just one path through the enormous literature on ROC curves, which could not be fully explored in a hundred posts. I will just mention that not long after its inception, ROC analysis was used to establish a conceptual framework for problems relating to sensation and perception in the field of psychophysics (Pelli and Farell (1995)) and thereafter applied to decision problems in Medical Diagnostics (Hajian-Tilaki (2013)), National Intelligence (McClelland (2011)) and just about any field that collects data to support decision making.

If you are interested in delving deeper into ROC curves, the references in the papers mentioned above may help to guide further exploration.