Lecture 5
Duke University
STA 113 - Fall 2023
What is one thing you learned from your reading or videos that was “new” to you? And what is one question you have from the reading, videos, or material we’ve covered so far, including the previous application exercise?
usethis::create_github_token()
to create a new PAT or grab the one you created previously from a space you might have safely stored it (e.g., 1Password or similar)gitcreds::gitcreds_set()
and paste your PAT when prompted.git config credential.helper store
to make sure your PAT persists throughout the whole time you’re working on this assignment / Cloud project.Data from two hotels: one resort and one city hotel
Observations: Each row represents a hotel booking
hotels <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv")
Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date (1): reservation_status_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "hotel" "is_canceled"
[3] "lead_time" "arrival_date_year"
[5] "arrival_date_month" "arrival_date_week_number"
[7] "arrival_date_day_of_month" "stays_in_weekend_nights"
[9] "stays_in_week_nights" "adults"
[11] "children" "babies"
[13] "meal" "country"
[15] "market_segment" "distribution_channel"
[17] "is_repeated_guest" "previous_cancellations"
[19] "previous_bookings_not_canceled" "reserved_room_type"
[21] "assigned_room_type" "booking_changes"
[23] "deposit_type" "agent"
[25] "company" "days_in_waiting_list"
[27] "customer_type" "adr"
[29] "required_car_parking_spaces" "total_of_special_requests"
[31] "reservation_status" "reservation_status_date"
glimpse()
Rows: 119,390
Columns: 32
$ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type <chr> "Transient", "Transient", "Transient", …
$ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
We…
have a single data frame
want to slice it, and dice it, and juice it, and process it, so we can plot it
distinct()
select()
, relocate()
arrange()
, arrange(desc())
slice()
, slice_head()
, slice_tail()
, slice_sample()
filter()
mutate()
summarize()
, count()
distcint()
select()
/ relocate()
# A tibble: 119,390 × 32
is_canceled hotel lead_time
<dbl> <chr> <dbl>
1 0 Resort Hotel 342
2 0 Resort Hotel 737
3 0 Resort Hotel 7
4 0 Resort Hotel 13
5 0 Resort Hotel 14
6 0 Resort Hotel 14
7 0 Resort Hotel 0
8 0 Resort Hotel 9
9 1 Resort Hotel 85
10 1 Resort Hotel 75
# ℹ 119,380 more rows
# ℹ 29 more variables:
# arrival_date_year <dbl>,
# arrival_date_month <chr>,
# arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>, …
arrange()
/ arrange(desc())
# A tibble: 119,390 × 32
adr hotel is_canceled lead_time
<dbl> <chr> <dbl> <dbl>
1 -6.38 Resort H… 0 195
2 0 Resort H… 0 342
3 0 Resort H… 0 737
4 0 Resort H… 0 32
5 0 Resort H… 0 111
6 0 Resort H… 0 0
7 0 Resort H… 0 8
8 0 Resort H… 0 8
9 0 Resort H… 1 57
10 0 Resort H… 0 57
# ℹ 119,380 more rows
# ℹ 28 more variables:
# arrival_date_year <dbl>,
# arrival_date_month <chr>,
# arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>, …
# A tibble: 119,390 × 32
adr hotel is_canceled lead_time
<dbl> <chr> <dbl> <dbl>
1 5400 City Hot… 1 35
2 510 City Hot… 0 0
3 508 Resort H… 0 1
4 452. City Hot… 0 81
5 450 Resort H… 1 378
6 437 Resort H… 1 59
7 426. Resort H… 0 31
8 402 Resort H… 0 104
9 397. Resort H… 0 26
10 392 Resort H… 1 116
# ℹ 119,380 more rows
# ℹ 28 more variables:
# arrival_date_year <dbl>,
# arrival_date_month <chr>,
# arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>, …
slice()
# A tibble: 4 × 32
hotel is_canceled lead_time arrival_date_year arrival_date_month
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
# ℹ 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, …
slice_head()
/ slice_tail()
# A tibble: 5 × 32
hotel is_canceled lead_time
<chr> <dbl> <dbl>
1 Resort Hotel 0 342
2 Resort Hotel 0 737
3 Resort Hotel 0 7
4 Resort Hotel 0 13
5 Resort Hotel 0 14
# ℹ 29 more variables:
# arrival_date_year <dbl>,
# arrival_date_month <chr>,
# arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, …
# A tibble: 5 × 32
hotel is_canceled lead_time
<chr> <dbl> <dbl>
1 City Hotel 0 23
2 City Hotel 0 102
3 City Hotel 0 34
4 City Hotel 0 109
5 City Hotel 0 205
# ℹ 29 more variables:
# arrival_date_year <dbl>,
# arrival_date_month <chr>,
# arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, …
slice_sample()
# A tibble: 5 × 32
hotel is_canceled lead_time arrival_date_year arrival_date_month
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 1 84 2017 August
2 City Hotel 1 181 2016 May
3 City Hotel 0 0 2015 September
4 City Hotel 1 433 2016 September
5 Resort Hotel 0 0 2016 February
# ℹ 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, …
filter()
mutate()
# A tibble: 119,390 × 33
people hotel is_canceled lead_time arrival_date_year arrival_date_month
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 55 Resort Hot… 1 338 2015 October
2 50 Resort Hot… 1 336 2015 September
3 40 Resort Hot… 1 304 2015 September
4 27 Resort Hot… 1 349 2015 September
5 27 Resort Hot… 1 352 2015 September
6 26 Resort Hot… 1 333 2015 September
7 26 Resort Hot… 1 340 2015 September
8 26 Resort Hot… 1 347 2015 September
9 26 Resort Hot… 1 354 2015 September
10 26 Resort Hot… 1 361 2015 October
# ℹ 119,380 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
summarize()
count()
What are the variables represented in this visualization?
ae-05
ae-05-hotels
(repo name will be suffixed with your GitHub name).usethis::create_github_token()
to create a new PAT or grab the one you created previously from a space you might have safely stored it (e.g., 1Password or similar)gitcreds::gitcreds_set()
and paste your PAT when prompted.git config credential.helper store
to make sure your PAT persists throughout the whole time you’re working on this assignment / Cloud project..qmd
) and follow along and complete the exercises.