Visualizing and modeling relationships I

Lecture 10

Dr. Mine Çetinkaya-Rundel

Duke University
STA 113 - Fall 2023

Warm-up

Coming up

  • HW 3 due next Thursday, to be posted later today

  • Due dates for the rest of the semester posted

  • Plans for next week

  • Any questions before we dive into the rest of the semester?

Today’s goals

  • What is a model?
  • Why do we model?
  • What is correlation?
  • How can we leverage visualizations to better understand and evaluate models?

Setup

library(tidyverse)
library(gt)

Visualizing relationships

mpg dataset

glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Two categorical variables

What type of plot would you use to visualize the relationship between two categorical variables?

ggplot(
  mpg, 
  aes(x = class, fill = drv)
  ) +
  geom_bar(position = "fill") +
  labs(
    x = "Class",
    y = "Proportion",
    fill = "Drive type",
    title = "Drive type vs. class"
  )
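
With position = "fill", every bar is rescaled to a total height of 1, so the y-axis shows within-class proportions. A minimal sketch of the numbers behind the bars, assuming the tidyverse is loaded as in Setup (the names drv_props and prop are illustrative choices, not part of the slides):

```r
library(tidyverse)

# Proportion of each drive type within each class:
# these are the segment heights in the filled bar plot
drv_props <- mpg |>
  count(class, drv) |>
  group_by(class) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

drv_props
```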

One categorical, one numerical variable

What type of plot would you use to visualize the relationship between one numerical and one categorical variable?

ggplot(
  mpg, 
  aes(x = class, y = cty)
  ) +
  geom_boxplot() +
  labs(
    x = "Class",
    y = "City mileage (MPG)",
    title = "City mileage vs. class"
  )

Two numerical variables

What type of plot would you use to visualize the relationship between two numerical variables?

ggplot(
  mpg, 
  aes(x = cty, y = hwy)
  ) +
  geom_point() +
  labs(
    x = "City mileage (MPG)",
    y = "Highway mileage (MPG)",
    title = "Highway vs. city mileage"
  )

Let’s look a little closer

Roughly how many points are there in the plot? How many points are there supposed to be? If there is a discrepancy, what explains it?
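
One way to check is to compare the number of rows with the number of distinct (cty, hwy) pairs. A minimal sketch, assuming the tidyverse is loaded as in Setup; n_cars and n_positions are illustrative names:

```r
library(tidyverse)

# Number of cars in the data: one point per row
n_cars <- nrow(mpg)

# Number of distinct (cty, hwy) pairs: positions actually visible in the plot
n_positions <- mpg |>
  distinct(cty, hwy) |>
  nrow()

c(cars = n_cars, visible_positions = n_positions)

# Which positions have the most cars stacked on top of each other?
mpg |>
  count(cty, hwy, sort = TRUE) |>
  head()
```

Because cty and hwy are recorded as whole numbers, many cars share the same coordinates and are drawn on top of one another.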

Adjust alpha

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5)

Jitter

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter()

Adjust alpha + jitter

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(alpha = 0.5)

Add more jitter

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(alpha = 0.5, width = 3, height = 3)

Bin the data

# geom_hex() requires the hexbin package to be installed
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_hex(bins = 15)

Contour 2D density

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_density_2d()

Filled contour 2D density

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_density_2d_filled()

Modelling

Modelling cars

  • What is the relationship between cars’ city and highway mileage?
  • What is your best guess for the highway MPG of a car that gets 20 MPG in the city?

Modelling

  • Use models to explain the relationship between variables and to make predictions
  • For now we will focus on linear models (but there are many, many other types of models too!)
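
As a preview for the city and highway mileage question, a linear model can be fit with base R's lm() and used for prediction with predict(). This is a minimal sketch rather than the workflow the course will necessarily adopt; hwy_fit, new_car, and pred_hwy are illustrative names:

```r
library(tidyverse)

# Fit a linear model with hwy as the outcome and cty as the predictor
hwy_fit <- lm(hwy ~ cty, data = mpg)

# Predicted highway mileage for a car that gets 20 MPG in the city
new_car <- tibble(cty = 20)
pred_hwy <- predict(hwy_fit, newdata = new_car)
pred_hwy
```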

Modelling vocabulary

  • Predictor (explanatory variable)
  • Outcome (response variable)
  • Regression line
    • Slope
    • Intercept
  • Correlation

Predictor (explanatory variable)

Predictor: cty (city mileage)

cty hwy
18 29
21 29
20 31
21 30
16 26
18 26
... ...

Outcome (response variable)

Outcome: hwy (highway mileage)

cty hwy
18 29
21 29
20 31
21 30
16 26
18 26
... ...

Regression line

Regression line: slope

Regression line: intercept
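
The regression line, its slope, and its intercept can be drawn and read off in R. A minimal sketch, assuming the tidyverse is loaded and taking hwy as the outcome and cty as the predictor; geom_smooth(method = "lm") is one common way to overlay the least-squares line, and p and line_coefs are illustrative names:

```r
library(tidyverse)

# Scatterplot with the least-squares regression line overlaid
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p

# Intercept: predicted hwy when cty is 0
# Slope: change in predicted hwy for each additional city MPG
line_coefs <- coef(lm(hwy ~ cty, data = mpg))
line_coefs
```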

Correlation

  • Ranges between -1 and 1.
  • Same sign as the slope.
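
Both properties can be checked on the mileage data. A minimal sketch using base R's cor(), which computes the Pearson correlation by default; the relationship slope = r × sd(y) / sd(x) is a standard fact about simple linear regression rather than something stated in these slides:

```r
library(tidyverse)

# Correlation between city and highway mileage
r <- cor(mpg$cty, mpg$hwy)
r

# Slope of the regression line of hwy on cty
slope <- coef(lm(hwy ~ cty, data = mpg))[["cty"]]

# The slope equals r rescaled by the ratio of standard deviations,
# so the two always share a sign
c(slope = slope, r_times_sd_ratio = r * sd(mpg$hwy) / sd(mpg$cty))
```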

Guess the correlation

Are you good at guessing correlation?

Play the game!

Application exercise

New computing access: Duke containers

  • Go to https://cmgr.oit.duke.edu/containers
  • Find STA 101 on the list, and reserve a container
  • Click on the STA 101 container under “My reservation”, then click on Login, then Start

Set up your SSH key

You will authenticate GitHub using SSH. Below is an outline of the authentication steps.

You only need to do this authentication process one time on a single system.

  • Type credentials::ssh_setup_github() into your console.
  • R will ask “No SSH key found. Generate one now?” You should enter 1 for yes.
  • You will generate a key. It will begin with “ssh-rsa….” R will then ask “Would you like to open a browser now?” You should enter 1 for yes.
  • You may be asked to provide your GitHub username and password to log in to GitHub. After entering this information, you should paste the key in and give it a name. You might name it in a way that indicates where the key will be used (e.g., sta113).

You can find more detailed instructions here if you’re interested.

Configure Git

Type the following lines of code in the console in RStudio, filling in your name and the email address associated with your GitHub account.

usethis::use_git_config(
  user.name = "Your name", 
  user.email = "Email associated with your GitHub account"
  )

For example, mine would be

usethis::use_git_config(
  user.name = "Mine Çetinkaya-Rundel", 
  user.email = "cetinkaya.mine@gmail.com"
  )

You are now ready to interact with GitHub via RStudio on the Duke Containers!

ae-09

  • Go to the course GitHub org and find your ae-09-fish repo (the repo name will be suffixed with your GitHub name).
  • Click on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.
  • In RStudio, go to File > New Project > Version Control > Git.
  • Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
  • Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
  • Click ae-09-fish.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.