library(tidyverse)
library(tidymodels)
<- read_csv("data/fish.csv") fish
Modelling fish weights with multiple predictors
For this application exercise, we will continue to work with data on fish. The dataset we will use, called fish
, is on two common fish species in fish market sales. We’re going to investigate the relationship between the weights and heights of fish, and later take into consider species as well.
The data dictionary is below:
variable | description |
---|---|
species |
Species name of fish |
weight |
Weight, in grams |
length_vertical |
Vertical length, in cm |
length_diagonal |
Diagonal length, in cm |
length_cross |
Cross length, in cm |
height |
Height, in cm |
width |
Diagonal width, in cm |
Interpreting multiple regression models
In the previous application exercise you saw that the model predicting weight from height and species was a better fit. In this section we will interpret the coefficients of this model.
<- linear_reg() |>
fish_whs_fit fit(weight ~ height + species, data = fish)
tidy(fish_whs_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -828. 69.7 -11.9 1.92e-16
2 height 95.2 4.54 21.0 5.10e-27
3 speciesRoach 343. 41.8 8.19 6.35e-11
- What does each row in the model output represent?
The first row is the intercept. The second row is the slope for height
. The third row is the slope for speciesRoach
. In this case the other species level, Bream, is the reference level.
Interpret the intercept and the slopes.
Intercept, -828: Bream fish that are 0 cm in height are expected to weigh, on average -828 grams. This value doesn’t make sense in context of the data.
height, 95.2: All else held constant, for each cm the height of fish is higher, weights of fish are expected, on average, to be higher by 95.2 grams.
speciesRoach, 343: All else held constant, Roach fish are expected, on averge to weigh 343 grams more than Bream fish, on average.
Write the model.
\[ \widehat{weight} = -828 + 95.2 \times height + 343 \times speciesRoach \]
Additive vs. interaction models
- Run the two code chunks below and create two separate plots. How are the two plots different than each other? Which plot does the model we fit above represent?
ggplot(fish, aes(x = height, y = weight, color = species)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(title = "Plot A - Interaction model")
`geom_smooth()` using formula = 'y ~ x'
<- augment(fish_whs_fit, new_data = fish)
fish_whs_aug
ggplot(
fish_whs_aug, aes(x = height, y = weight, color = species)
+
) geom_point(alpha = 0.5) +
geom_smooth(aes(y = .pred), method = "lm") +
labs(title = "Plot B - Additive model")
`geom_smooth()` using formula = 'y ~ x'
- Look back at Plot B. What assumption does the additive model make about the slopes between flipper length and body mass for each of the three islands?
The additive model assumes the same slope between height and weight of fish for the two species.
Choosing models
Rule of thumb: Occam’s Razor - Don’t overcomplicate the situation! We prefer the simplest best model.
- Choose a model using this principle.
# interaction
<- linear_reg() |>
fish_whs_int_fit fit(weight ~ height * species, data = fish)
tidy(fish_whs_int_fit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -942. 68.6 -13.7 9.20e-19
2 height 103. 4.48 22.9 1.61e-28
3 speciesRoach 673. 93.5 7.20 2.64e- 9
4 height:speciesRoach -39.9 10.4 -3.85 3.30e- 4
glance(fish_whs_int_fit)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.969 0.968 51.3 540. 1.31e-38 3 -293. 595. 605.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# additive
<- linear_reg() |>
fish_whs_fit fit(weight ~ height + species, data = fish)
tidy(fish_whs_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -828. 69.7 -11.9 1.92e-16
2 height 95.2 4.54 21.0 5.10e-27
3 speciesRoach 343. 41.8 8.19 6.35e-11
glance(fish_whs_fit)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.961 0.959 57.7 634. 3.07e-37 2 -300. 607. 615.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Choose the interaction model since it has a higher adjusted R-squared.
- What is R-squared? What is adjusted R-squared?
R-squared is the percent variability in the response that is explained by our model. (Can use when models have same number of variables for model selection)
Adjusted R-squared is similar, but has a penalty for the number of variables in the model. (Should use for model selection when models have different numbers of variables).