Finding the Needle in the Haystack

library(tidyverse)

One of the challenges in any kind of prediction problem is understanding the impact of a) not identifying the target and b) falsely identifying the target. To put it into context, if you are trying to use an algorithm to identify those with Ebola, what's the risk of missing someone (they could infect others with the disease) versus flagging someone who does not have the disease as having it (they get quarantined and have their life disrupted)? Which is worse? That isn't a statistical question; it is a contextual, even an ethical, question (ok, yes, you could also apply a loss function here and use it to find a global minimum should one exist, but then again that loss function is itself a choice).
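To make the loss-function aside concrete, here is a minimal sketch in R, where the 10:1 cost ratio is a made-up assumption purely for illustration:

# hypothetical asymmetric loss: assume a missed case (false negative)
# costs ten times as much as a false alarm (false positive)
expected_loss <- function(fn, fp, cost_fn = 10, cost_fp = 1) {
  cost_fn * fn + cost_fp * fp
}

expected_loss(fn = 2, fp = 15)  # 2 misses outweigh 15 false alarms: 35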

Typically we use two terms to talk about this problem: sensitivity (of all "targets," how many did the model correctly identify?) and specificity (of those that were not targets, what proportion were correctly identified as not a target?).
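In confusion-matrix terms both are simple ratios; the counts below are invented just to show the arithmetic:

# toy confusion-matrix counts, made up for illustration
tp <- 40; fn <- 10   # targets caught vs. targets missed
tn <- 900; fp <- 50  # non-targets cleared vs. falsely flagged

tp / (tp + fn)  # sensitivity: 0.8
tn / (tn + fp)  # specificity: ~0.95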

Data Generating Process

As always, I want to build some simulated data to understand this problem. Let's assume that we are trying to identify methanol levels in moonshine. Moonshine is typically homemade, illicit, high-alcohol spirits. The alcohol level is increased through distillation. Ethanol, the desired alcohol, boils at something like 78 deg C while methanol boils at around 65 deg C. Methanol has some toxic side effects, so you really don't want any in your cocktail. So let's make some fake data with some measurements.

n <- 500

# simulated process measurements
pot_temp <- rnorm(n, 78, 10)          # still pot temperature (deg C)
mash_wt <- rnorm(n, 50, 2)            # mash weight
ambient_temp <- rnorm(n, 30, 7)       # outside temperature (deg C)
ambient_humidity <- rnorm(n, 82, 12)  # outside relative humidity

# pots running below methanol's boiling point collect more methanol
methanol_content <- ifelse(pot_temp < 65,
                           .5 * mash_wt - ambient_humidity * ambient_humidity / 1000,
                           .2 * mash_wt - ambient_humidity * ambient_humidity / 1000)

moonshine_data <- tibble(pot_temp,
                         mash_wt,
                         ambient_temp,
                         ambient_humidity,
                         methanol_content)

Let’s also assume that methanol content greater than 17 is toxic (these numbers are completely made up).

moonshine_data <- moonshine_data %>%
  mutate(toxicity = ifelse(methanol_content > 17, 1, 0))

So let's see what we have in our data:

moonshine_data %>%
  count(toxicity) %>%
  mutate(perc = n / sum(n)) %>%
  mutate(perc = scales::percent(perc)) %>%
  knitr::kable(booktabs = TRUE, col.names = c("Toxicity", "Count", "Percent"),
               caption = "Incidence of Toxicity in Total Data Set")
Toxicity  Count  Percent
       0    456      91%
       1     44       9%

Incidence of Toxicity in Total Data Set

Yikes! We have a rare case problem. Given our data, what we are targeting doesn't happen very often. So let's see what happens when we try to model this:

dat_training <- sample_frac(moonshine_data, .7)       # 70% of rows for training
dat_testing <- setdiff(moonshine_data, dat_training)  # remaining 30% held out
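It is worth a quick check that the holdout keeps roughly the same class imbalance:

dat_testing %>%
  count(toxicity) %>%
  mutate(perc = scales::percent(n / sum(n)))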

We could build a very simple binomial regression, but a model that simply guesses that every sample will pass would be correct more than 90% of the time!
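To make that concrete, the accuracy of the always-pass guesser is just the share of non-toxic batches:

# a "model" that always predicts non-toxic is right whenever the batch is fine
mean(dat_training$toxicity == 0)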

dat_training %>%
  count(toxicity) %>%
  mutate(perc = n / sum(n)) %>%
  mutate(perc = scales::percent(perc)) %>%
  knitr::kable(booktabs = TRUE, col.names = c("Toxicity", "Count", "Percent"),
               caption = "Incidence of Toxicity in Training Data")
Toxicity  Count  Percent
       0    315      90%
       1     35      10%

Incidence of Toxicity in Training Data
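The model behind the results that follow is a penalized binomial regression fit with glmnet; a minimal sketch of the fit, assuming cv.glmnet with its defaults (the value printed next is the selected lambda.min):

library(glmnet)

x_train <- as.matrix(select(dat_training, ambient_humidity:pot_temp))
fit_1 <- cv.glmnet(x = x_train,
                   y = dat_training[["toxicity"]],
                   family = "binomial")
fit_1$lambda.min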

[1] 0.001554292

Let’s see what our model extracted:

coef(fit_1, s = "lambda.min")
5 x 1 sparse Matrix of class "dgCMatrix"
                          s1
(Intercept)      59.38473176
ambient_humidity -0.14553668
ambient_temp     -0.05737951
mash_wt          -0.06357545
pot_temp         -0.67084896
library(yardstick)

# predicted class labels and predicted probabilities for the holdout set
preds_1 <- predict(fit_1, newx = as.matrix(select(dat_testing,
                                     ambient_humidity:pot_temp)), type = "class")

preds_2 <- predict(fit_1, newx = as.matrix(select(dat_testing,
                                     ambient_humidity:pot_temp)), type = "response")

new_dat <- tibble(truth = dat_testing[["toxicity"]], pred = preds_1[, 1]) %>%
  mutate_all(factor, levels = c(1, 0), labels = c("Positive", "Negative")) %>%
  bind_cols(num_pred = preds_2[, 1])


table(new_dat$pred, new_dat$truth)
           Positive Negative
  Positive        7        3
  Negative        2      138
metrics(new_dat, truth, pred) %>%
  bind_rows(sens(new_dat, truth, pred)) %>%
  bind_rows(spec(new_dat, truth, pred)) %>%
  knitr::kable(digits = 2, caption = "Estimates from Elastic Net Fit")
.metric   .estimator  .estimate
accuracy  binary           0.97
kap       binary           0.72
sens      binary           0.78
spec      binary           0.98

Estimates from Elastic Net Fit

So the estimates look good everywhere except sensitivity: the model correctly identified only 78% of the truly toxic batches (7 of the 9 in the test set). Is this good enough? Well, for a consumer with no knowledge of the potential risk, I don't think so. The next step would be to tune the model to improve the sensitivity until it reaches an acceptable level. This could be as easy as lowering the classification threshold and accepting a higher false positive rate, as sketched below. We can then survey the whole trade-off with the ROC curve.
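A minimal sketch of that trade, using 0.2 as a purely illustrative (untuned) cutoff:

# flag anything with predicted probability of toxicity above 0.2
new_dat %>%
  mutate(pred_low_cut = factor(ifelse(num_pred > 0.2, "Positive", "Negative"),
                               levels = c("Positive", "Negative"))) %>%
  sens(truth, pred_low_cut)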

roc_curve(new_dat, truth = truth, num_pred) %>%
  ggplot(aes(1 - specificity, sensitivity)) +
  geom_line() +
  theme_minimal()
ROC Curve for Fitted Model
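For a single-number summary of the same curve, yardstick's roc_auc takes the same truth and probability columns:

roc_auc(new_dat, truth = truth, num_pred)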

Still not great. So then what do we do? Accept a higher false positive rate? Or do we build a better model with different data? What if we could inform the brewer about the process parameters that matter? Then we could fix the problem at the source. Perhaps that's the best bet…


Citation

BibTeX citation:

@online{dewitt2019,
  author = {Michael E. DeWitt},
  title = {Finding the Needle in the Haystack},
  date = {2019-06-09},
  url = {https://michaeldewittjr.com/articles/2019-06-09-finding-the-needle-in-the-haystack},
  langid = {en}
}

For attribution, please cite this work as:

Michael E. DeWitt. 2019. "Finding the Needle in the Haystack." June 9, 2019. https://michaeldewittjr.com/articles/2019-06-09-finding-the-needle-in-the-haystack