Visualizing and modeling relationships III

Lecture 12

Dr. Mine Çetinkaya-Rundel

Duke University
STA 113 - Fall 2023



Today’s goals

  • Fit and interpret models for predicting (classifying) binary outcomes

  • Define sensitivity, specificity, and ROC curves

  • Visualize decision boundaries for classification models

Logistic regression

So far in regression

  • Outcome: Numerical, Predictor: One numerical or one categorical with only two levels \(\rightarrow\) Simple linear regression

  • Outcome: Numerical, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Multiple linear regression

  • Outcome: Categorical with only two levels, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Logistic regression

  • Outcome: Categorical with any number of levels, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Generalized linear models – Not covered in STA 113FS

Data + packages


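library(tidyverse) # assumed: provides read_csv(), the dplyr verbs, and ggplot() used below

hp_spam <- read_csv("data/hp-spam.csv")

  • The dataset contains 4601 emails collected at Hewlett-Packard Labs, described by 58 variables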

  • Outcome: type

    • type = 1 is spam

    • type = 0 is non-spam

  • Predictors of interest:

    • capitalTotal: Number of capital letters in email

    • Word percentages are calculated as 100 * (number of times the word appears in the email) / (total number of words in the email)

      • george: Percentage of “george”s in email (these were George’s emails)

      • you: Percentage of “you”s in email
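As a quick check of the percentage formula, the sketch below computes the `you` percentage for a made-up one-line email. The email text and the whitespace-based word splitting are assumptions for illustration; the original dataset may have tokenized words differently.

```r
# Hypothetical email text, not taken from the HP data
email <- "you won a prize and you can claim it if you reply"

# Split on whitespace; a simplifying assumption about what counts as a word
words <- strsplit(tolower(email), "\\s+")[[1]]

# 100 * (times the word appears) / (total number of words)
pct_you <- 100 * sum(words == "you") / length(words)
pct_you
# 3 of the 12 words are "you", so this is 25
```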

Glimpse at data

What type of data is type? What type should it be in order to use logistic regression?

hp_spam |>
  select(type, george, capitalTotal, you)
# A tibble: 4,601 × 4
    type george capitalTotal   you
   <dbl>  <dbl>        <dbl> <dbl>
 1     1      0          278  1.93
 2     1      0         1028  3.47
 3     1      0         2259  1.36
 4     1      0          191  3.18
 5     1      0          191  3.18
 6     1      0           54  0   
 7     1      0          112  3.85
 8     1      0           49  0   
 9     1      0         1257  1.23
10     1      0          749  1.67
# ℹ 4,591 more rows
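
The glimpse shows `type` stored as a double (`<dbl>`), while a classification model generally expects a categorical outcome. A minimal sketch of the conversion on a few made-up rows follows; the toy values and the choice to recode the column in place are assumptions, not the lecture's own code.

```r
library(dplyr)

# Made-up rows with the same columns as the glimpse above
toy <- tibble(
  type         = c(1, 1, 0, 0),
  george       = c(0, 0, 1.2, 0.8),
  capitalTotal = c(278, 1028, 40, 65),
  you          = c(1.93, 3.47, 0, 1.1)
)

# Recode the 0/1 double as a factor so it is treated as categorical
toy <- toy |>
  mutate(type = as.factor(type))

levels(toy$type)
```

With the default level ordering, `0` (non-spam) is the first level and `1` (spam) the second.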

EDA: How much spam?

hp_spam |>
  count(type) |>
  mutate(p = n / sum(n))
# A tibble: 2 × 3
   type     n     p
  <dbl> <int> <dbl>
1     0  2788 0.606
2     1  1813 0.394

EDA: AM I SCREAMING? capitalTotal

ggplot(hp_spam, aes(x = capitalTotal)) +
  geom_histogram() # geom layer is cut off in the source; a histogram is an assumption

EDA: george, is that you?

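ggplot(hp_spam, aes(x = george)) +
  geom_histogram() # geom layer is cut off in the source; a histogram is an assumption

ggplot(hp_spam, aes(x = you)) +
  geom_histogram() # geom layer is cut off in the source; a histogram is an assumption

Logistic regression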

  • Logistic regression takes in a number of predictors and outputs the probability of a “success” (an outcome of 1) in a binary outcome variable.

  • The probability is related to the predictors via the logistic function (the inverse of the logit link),

\[ p(y_i = 1) = \frac{1}{1+\exp\left(-\sum_j \beta_j x_{ij}\right)}, \] whose output is in \((0,1)\) (a probability). Here \(i\) indexes emails and \(j\) indexes predictors.

  • In this modeling scheme, one typically finds \(\hat{\beta}\) by maximizing the likelihood function, a different objective function from the “least squares” objective used previously.

Logistic regression, visualized

Using data to estimate \(\beta_j\)

To proceed with building our email classifier, we will, as usual, use our data (pairs of outcome \(y_i\) and predictors \(x_{ij}\)) to estimate \(\beta\) (find \(\hat{\beta}\)) and obtain the model:

\[ p(y_i = 1) = \frac{1}{1+\exp\left(-\sum_j \hat{\beta}_j x_{ij}\right)} \]
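
To make the estimation step concrete, here is a minimal sketch that fits a logistic regression by maximum likelihood with base R's `glm()` and then reproduces one predicted probability by hand from the formula above. The simulated data, the seed, and the "true" coefficients are made up for illustration, and `glm()` is a stand-in; the lecture may use a different fitting interface.

```r
set.seed(113)

# Simulated stand-in data: one predictor and a 0/1 outcome (not the HP emails)
n <- 500
x <- rnorm(n)
true_p <- plogis(-0.5 + 1.5 * x) # plogis(z) = 1 / (1 + exp(-z))
y <- rbinom(n, size = 1, prob = true_p)
sim <- data.frame(x = x, y = factor(y))

# family = binomial fits by maximum likelihood (logit link by default);
# with a factor outcome, the second level ("1") is treated as the success
fit <- glm(y ~ x, data = sim, family = binomial)
beta_hat <- coef(fit)

# Predicted probability at x = 1, computed by hand from the estimates;
# the intercept is the coefficient on a constant predictor equal to 1
p_by_hand <- 1 / (1 + exp(-(beta_hat[["(Intercept)"]] + beta_hat[["x"]] * 1)))

# Same quantity from predict(); type = "response" returns a probability
p_predict <- predict(fit, newdata = data.frame(x = 1), type = "response")

all.equal(p_by_hand, unname(p_predict))
```

The two probabilities agree because `predict(..., type = "response")` applies the same logistic transformation to the estimated linear combination of predictors.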


Ultimate goal: Recreate the following visualization.


