Lecture 12
Duke University
STA 113 - Fall 2023
Fit and interpret models for predicting (classifying) binary outcomes
Define sensitivity, specificity, and ROC curves
Visualize decision boundaries for classification models
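As a preview of the second goal, sensitivity is the share of actual positives that a classifier labels positive, and specificity is the share of actual negatives that it labels negative. The sketch below computes both from two lists of 0/1 labels; the function name and the toy labels are illustrative assumptions, not taken from the spam data.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for 0/1 labels, where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Made-up labels for illustration only (1 = spam, 0 = non-spam).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

sens, spec = sensitivity_specificity(actual, predicted)
print(sens, spec)
```

An ROC curve plots sensitivity against 1 - specificity as the classification threshold varies.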
Outcome: Numerical, Predictor: One numerical or one categorical with only two levels \(\rightarrow\) Simple linear regression
Outcome: Numerical, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Multiple linear regression
Outcome: Categorical with only two levels, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Logistic regression
Outcome: Categorical with any number of levels, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Generalized linear models – Not covered in STA 113FS
4601 emails collected at Hewlett-Packard Labs, described by 58 variables
Outcome: type
type = 1 is spam
type = 0 is non-spam
Predictors of interest:
capitalTotal: Number of capital letters in email
Percentages are calculated as (100 * number of times the WORD appears in the email) / total number of words in email
george: Percentage of “george”s in email (these were George’s emails)
you: Percentage of “you”s in email
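The predictor definitions above can be sketched for a single email as follows. The function names and the tokenization rule (split on whitespace, strip surrounding punctuation, ignore case) are assumptions made for illustration; the dataset's own preprocessing may differ.

```python
def word_percentage(email_text, word):
    """Return 100 * (times `word` appears) / (total number of words)."""
    # Assumption: words are whitespace-separated tokens, compared case-insensitively
    # after stripping surrounding punctuation.
    tokens = [t.strip(".,!?;:\"'()").lower() for t in email_text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return 100 * tokens.count(word.lower()) / len(tokens)


def capital_total(email_text):
    """Count the uppercase letters in the email."""
    return sum(1 for ch in email_text if ch.isupper())


# A made-up email, not taken from the dataset.
email = "Hi George, are YOU free? George said you would call."

print(word_percentage(email, "george"))  # 2 of 10 words
print(word_percentage(email, "you"))     # 2 of 10 words
print(capital_total(email))
```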
What type of data is type? What type should it be in order to use logistic regression?
capitalTotal
george, is that you?
Logistic regression takes in a number of predictors and outputs the probability of a “success” (an outcome of 1) in a binary outcome variable.
The probability is related to the predictors via the logistic function (the inverse of the logit link function),
\[ p(y_i = 1) = \frac{1}{1+\exp\left(-\sum_j \beta_j x_{ij}\right)}, \] where \(i\) indexes emails and \(j\) indexes predictors, and whose output is in \((0,1)\) (a probability).
To proceed with building our email classifier, we will, as usual, use our data (pairs of outcome \(y_i\) and predictors \(x_{ij}\)) to estimate \(\beta\) (find \(\hat{\beta}\)) and obtain the model:
\[ p(y_i = 1) = \frac{1}{1+\exp\left(-\sum_j \hat{\beta}_j x_{ij}\right)}. \]
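To make the two steps concrete, here is a minimal sketch that estimates the coefficients and then evaluates the fitted probability. It assumes plain gradient ascent on the log-likelihood as the fitting method (statistical software typically uses a different algorithm), a leading 1 in each predictor vector to carry the intercept, and made-up toy data rather than the spam data. The names `predict_proba` and `fit_logistic` are hypothetical.

```python
import math


def predict_proba(beta, x):
    """P(y = 1) = 1 / (1 + exp(-sum of beta * x)); x[0] is 1 for the intercept."""
    eta = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Estimate beta by gradient ascent on the average log-likelihood."""
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)  # observed minus fitted probability
            for j, v in enumerate(xi):
                grad[j] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Made-up toy data, NOT the spam data: columns are (intercept, one predictor).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]

beta_hat = fit_logistic(X, y)
p_low = predict_proba(beta_hat, [1.0, 0.5])   # small predictor value
p_high = predict_proba(beta_hat, [1.0, 4.5])  # large predictor value
print(beta_hat, p_low, p_high)
```

Because the 1s in the toy data sit at larger predictor values, the estimated slope is positive and the fitted probability rises with the predictor while staying strictly between 0 and 1.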
ae-11-spam
Ultimate goal: Recreate the following visualization.
Reminder of instructions for getting started with application exercises:
ae-11-spam (repo name will be suffixed with your GitHub name).