Grammar of data wrangling

Lecture 5

Dr. Mine Çetinkaya-Rundel

Duke University
STA 113 - Fall 2023

Warm up

Reflection

What is one thing you learned from your reading or videos that was “new” to you? And what is one question you have from the reading, videos, or material we’ve covered so far, including the previous application exercise?

Announcements

  • HW 1 due today at 5 pm
  • Subsequent HW deadlines adjusted accordingly for the first part of the semester
  • Next week:
    • Tuesday: Project 1 workday
    • Thursday: Rubenstein Library rare collections visit

GitHub PATs on Posit Cloud

  • In the Console, run usethis::create_github_token() to create a new PAT or grab the one you created previously from a space you might have safely stored it (e.g., 1Password or similar)
  • In the Console, run gitcreds::gitcreds_set() and paste your PAT when prompted.
  • In the Terminal, run git config credential.helper store to make sure your PAT persists throughout the whole time you’re working on this assignment / Cloud project.

Grammar of data wrangling

Packages

library(tidyverse)
library(scales)

Data: Hotel bookings

  • Data from two hotels: one resort and one city hotel

  • Observations: Each row represents a hotel booking

hotels <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv")
Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date  (1): reservation_status_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Variables

names(hotels)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

glimpse()

glimpse(hotels)
Rows: 119,390
Columns: 32
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…

Plan

We…

have a single data frame

want to slice it, and dice it, and juice it, and process it, so we can plot it

Grammar of data wrangling

  • Start with a data frame
  • And then (|>)
  • Transform it with a function
  • Result in a data frame
df |>
  function(...)

Commonly used data wrangling functions

  • distinct()
  • select(), relocate()
  • arrange(), arrange(desc())
  • slice(), slice_head(), slice_tail(), slice_sample()
  • filter()
  • mutate()
  • summarize(), count()

distcint()

hotels |> 
  distinct(hotel)
# A tibble: 2 × 1
  hotel       
  <chr>       
1 Resort Hotel
2 City Hotel  
hotels |> 
  distinct(is_canceled)
# A tibble: 2 × 1
  is_canceled
        <dbl>
1           0
2           1

select() / relocate()

hotels |> 
  select(hotel)
# A tibble: 119,390 × 1
   hotel       
   <chr>       
 1 Resort Hotel
 2 Resort Hotel
 3 Resort Hotel
 4 Resort Hotel
 5 Resort Hotel
 6 Resort Hotel
 7 Resort Hotel
 8 Resort Hotel
 9 Resort Hotel
10 Resort Hotel
# ℹ 119,380 more rows
hotels |> 
  relocate(is_canceled)
# A tibble: 119,390 × 32
   is_canceled hotel        lead_time
         <dbl> <chr>            <dbl>
 1           0 Resort Hotel       342
 2           0 Resort Hotel       737
 3           0 Resort Hotel         7
 4           0 Resort Hotel        13
 5           0 Resort Hotel        14
 6           0 Resort Hotel        14
 7           0 Resort Hotel         0
 8           0 Resort Hotel         9
 9           1 Resort Hotel        85
10           1 Resort Hotel        75
# ℹ 119,380 more rows
# ℹ 29 more variables:
#   arrival_date_year <dbl>,
#   arrival_date_month <chr>,
#   arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, …

arrange() / arrange(desc())

hotels |> 
  arrange(adr) |>
  relocate(adr)
# A tibble: 119,390 × 32
     adr hotel     is_canceled lead_time
   <dbl> <chr>           <dbl>     <dbl>
 1 -6.38 Resort H…           0       195
 2  0    Resort H…           0       342
 3  0    Resort H…           0       737
 4  0    Resort H…           0        32
 5  0    Resort H…           0       111
 6  0    Resort H…           0         0
 7  0    Resort H…           0         8
 8  0    Resort H…           0         8
 9  0    Resort H…           1        57
10  0    Resort H…           0        57
# ℹ 119,380 more rows
# ℹ 28 more variables:
#   arrival_date_year <dbl>,
#   arrival_date_month <chr>,
#   arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, …
hotels |> 
  arrange(desc(adr)) |>
  relocate(adr)
# A tibble: 119,390 × 32
     adr hotel     is_canceled lead_time
   <dbl> <chr>           <dbl>     <dbl>
 1 5400  City Hot…           1        35
 2  510  City Hot…           0         0
 3  508  Resort H…           0         1
 4  452. City Hot…           0        81
 5  450  Resort H…           1       378
 6  437  Resort H…           1        59
 7  426. Resort H…           0        31
 8  402  Resort H…           0       104
 9  397. Resort H…           0        26
10  392  Resort H…           1       116
# ℹ 119,380 more rows
# ℹ 28 more variables:
#   arrival_date_year <dbl>,
#   arrival_date_month <chr>,
#   arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, …

slice()

hotels |>
  slice(1:4)
# A tibble: 4 × 32
  hotel        is_canceled lead_time arrival_date_year arrival_date_month
  <chr>              <dbl>     <dbl>             <dbl> <chr>             
1 Resort Hotel           0       342              2015 July              
2 Resort Hotel           0       737              2015 July              
3 Resort Hotel           0         7              2015 July              
4 Resort Hotel           0        13              2015 July              
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, …

slice_head() / slice_tail()

hotels |> 
  slice_head(n = 5)
# A tibble: 5 × 32
  hotel        is_canceled lead_time
  <chr>              <dbl>     <dbl>
1 Resort Hotel           0       342
2 Resort Hotel           0       737
3 Resort Hotel           0         7
4 Resort Hotel           0        13
5 Resort Hotel           0        14
# ℹ 29 more variables:
#   arrival_date_year <dbl>,
#   arrival_date_month <chr>,
#   arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, …
hotels |> 
  slice_tail(n = 5)
# A tibble: 5 × 32
  hotel      is_canceled lead_time
  <chr>            <dbl>     <dbl>
1 City Hotel           0        23
2 City Hotel           0       102
3 City Hotel           0        34
4 City Hotel           0       109
5 City Hotel           0       205
# ℹ 29 more variables:
#   arrival_date_year <dbl>,
#   arrival_date_month <chr>,
#   arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, …

slice_sample()

hotels |> 
  slice_sample(n = 5)
# A tibble: 5 × 32
  hotel        is_canceled lead_time arrival_date_year arrival_date_month
  <chr>              <dbl>     <dbl>             <dbl> <chr>             
1 Resort Hotel           1        84              2017 August            
2 City Hotel             1       181              2016 May               
3 City Hotel             0         0              2015 September         
4 City Hotel             1       433              2016 September         
5 Resort Hotel           0         0              2016 February          
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, …

filter()

hotels |>
  filter(adr > 1000) |>
  select(adr, hotel)
# A tibble: 1 × 2
    adr hotel     
  <dbl> <chr>     
1  5400 City Hotel
hotels |> 
  filter(babies > 5) |>
  select(adults, babies)
# A tibble: 2 × 2
  adults babies
   <dbl>  <dbl>
1      2     10
2      1      9

mutate()

hotels |>
  mutate(people = adults + babies + children) |>
  relocate(people) |>
  arrange(desc(people))
# A tibble: 119,390 × 33
   people hotel       is_canceled lead_time arrival_date_year arrival_date_month
    <dbl> <chr>             <dbl>     <dbl>             <dbl> <chr>             
 1     55 Resort Hot…           1       338              2015 October           
 2     50 Resort Hot…           1       336              2015 September         
 3     40 Resort Hot…           1       304              2015 September         
 4     27 Resort Hot…           1       349              2015 September         
 5     27 Resort Hot…           1       352              2015 September         
 6     26 Resort Hot…           1       333              2015 September         
 7     26 Resort Hot…           1       340              2015 September         
 8     26 Resort Hot…           1       347              2015 September         
 9     26 Resort Hot…           1       354              2015 September         
10     26 Resort Hot…           1       361              2015 October           
# ℹ 119,380 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …

summarize()

hotels |>
  summarize(mean_adr = mean(adr))
# A tibble: 1 × 1
  mean_adr
     <dbl>
1     102.
hotels |>
  group_by(hotel) |>
  summarize(mean_adr = mean(adr))
# A tibble: 2 × 2
  hotel        mean_adr
  <chr>           <dbl>
1 City Hotel      105. 
2 Resort Hotel     95.0

count()

hotels |>
  count(hotel)
# A tibble: 2 × 2
  hotel            n
  <chr>        <int>
1 City Hotel   79330
2 Resort Hotel 40060

Average cost of daily stay

What are the variables represented in this visualization?

Application exercise

ae-05

  • Go to the course GitHub org and find your ae-05-hotels (repo name will be suffixed with your GitHub name).
  • Clone the repo in Posit Cloud, and set up your PAT:
    • In the Console, run usethis::create_github_token() to create a new PAT or grab the one you created previously from a space you might have safely stored it (e.g., 1Password or similar)
    • In the Console, run gitcreds::gitcreds_set() and paste your PAT when prompted.
    • In the Terminal, run git config credential.helper store to make sure your PAT persists throughout the whole time you’re working on this assignment / Cloud project.
  • Open the Quarto document (.qmd) and follow along and complete the exercises.
  • Render, commit, and push your edits as you work through it