Lecture 16
Duke University
STA 113 - Fall 2023
One of the pictures is a photograph of a painting by Piet Mondrian while the other is a photograph of a drawing made by an IBM 7094 digital computer. Which of the two do you think was done by the computer?
A. M. Noll, “Human or machine: A subjective comparison of Piet Mondrian’s ‘Composition with Lines’ (1917) and a computer-generated picture,” The Psychological Record, vol. 16, pp. 1–10, 1966.
What is apophenia?
The tendency to perceive a connection or meaningful pattern between unrelated or random things.
Null hypothesis, \(H_0\): Defendant is innocent
Alternative hypothesis, \(H_A\): Defendant is guilty
Present the evidence: Collect data
Start with a null hypothesis, \(H_0\), that represents the status quo
Set an alternative hypothesis, \(H_A\), that represents the research question, i.e. what we’re testing for
Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (probability of observed or more extreme outcome given that the null hypothesis is true)
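The three steps above can be sketched with a small permutation test in base R. This is only an illustration: the choice of mtcars, the variables mpg and wt, the number of shuffles, and the seed are assumptions made for the example, not something taken from the lecture.

```r
# H0: mpg and wt are unrelated; HA: they are related.
set.seed(113)  # arbitrary seed, used only for reproducibility

# Observed test statistic: correlation in the real data
observed <- cor(mtcars$mpg, mtcars$wt)

# If H0 is true, shuffling mpg should not change the picture,
# so shuffled correlations describe what "no relationship" looks like
null_cors <- replicate(5000, cor(sample(mtcars$mpg), mtcars$wt))

# Two-sided p-value: share of shuffles at least as extreme as the observed value
p_value <- mean(abs(null_cors) >= abs(observed))
p_value
```

A small p-value means the observed correlation would be unusual if the null hypothesis were true.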
Type 1 error (false positive): Falsely convict an innocent / reject \(H_0\) when you shouldn’t
Type 2 error (false negative): Acquit a guilty defendant / fail to reject \(H_0\) when you shouldn’t
Costs of these errors vary based on the severity of the consequences
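A rough simulation can make the two error types concrete. The sketch below assumes a two-sample t-test, samples of size 30, a true difference of 0.5 standard deviations under the alternative, and a 5% significance level; all of these are illustrative choices rather than values from the lecture.

```r
set.seed(113)   # arbitrary seed, used only for reproducibility
alpha <- 0.05   # assumed significance level

# H0 is true: both samples come from the same distribution.
# Rejecting here is a Type 1 error (false positive).
p_null <- replicate(2000, t.test(rnorm(30), rnorm(30))$p.value)
type1_rate <- mean(p_null < alpha)

# H0 is false: the second group has a shifted mean.
# Failing to reject here is a Type 2 error (false negative).
p_alt <- replicate(2000, t.test(rnorm(30), rnorm(30, mean = 0.5))$p.value)
type2_rate <- mean(p_alt >= alpha)

c(type1 = type1_rate, type2 = type2_rate)
```

The Type 1 rate should land near the chosen significance level, while the Type 2 rate depends on the sample size and on how far the truth is from the null.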
Statistical
Visual
Plot of the real data is randomly embedded amongst a set of null plots
Matrix of plots is known as a lineup
Null plots are generated by a method consistent with the null hypothesis
The lineup is shown to an observer. If the observer can pick the real data as different from the others, this puts weight on the statistical significance of the structure in the plot.
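The lineup protocol described above can be sketched by hand in base R. This is a minimal illustration of the idea, not the nullabor implementation; the data set, the variables, the 20 panels, and the permutation null are assumptions chosen for the example.

```r
set.seed(113)              # arbitrary seed, used only for reproducibility
m <- 20                    # number of panels in the lineup
real_pos <- sample(m, 1)   # panel that holds the real data

# Build one data set per panel: permute mpg everywhere except the real panel
panels <- lapply(seq_len(m), function(i) {
  d <- mtcars[, c("mpg", "wt")]
  if (i != real_pos) d$mpg <- sample(d$mpg)  # null plot: break the mpg-wt link
  d$panel <- i
  d
})
lineup_data <- do.call(rbind, panels)

# Draw the matrix of plots, labelled only by panel number
op <- par(mfrow = c(4, 5), mar = c(1, 1, 2, 1))
for (i in seq_len(m)) {
  with(panels[[i]], plot(wt, mpg, main = i, axes = FALSE, pch = 16, cex = 0.6))
  box()
}
par(op)
```

The observer sees only the panel numbers; `real_pos` is kept aside and revealed after they have made their pick.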
The lineup() function from the nullabor package returns a set of generated null datasets and the real data embedded randomly among these null datasets
nullabor::lineup()
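Generating null plots: Provide a method of generation and let the lineup() function generate them, or generate them yourself and pass them to the lineup() function
Will the real mtcars please stand up?
Step 1. Permute the data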
Step 2. Peek at the permuted data
      mpg cyl disp  hp drat    wt  qsec vs am gear carb .sample
...1 10.4   6  160 110 3.90 2.620 16.46  0  1    4    4       1
...2 21.4   6  160 110 3.90 2.875 17.02  0  1    4    4       1
...3 14.7   4  108  93 3.85 2.320 18.61  1  1    4    1       1
...4 15.5   6  258 110 3.08 3.215 19.44  1  0    3    1       1
...5 30.4   8  360 175 3.15 3.440 17.02  0  0    3    2       1
...6 15.0   6  225 105 2.76 3.460 20.22  1  0    3    1       1
n = 20 by default
Step 3. Plot the permutations
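The three steps could look roughly like the following with nullabor and ggplot2. The permuted variable (mpg) is inferred from the printed rows, and the plotted pair (mpg against wt) and the seed are assumptions made for illustration; the slides do not show the original code.

```r
library(nullabor)
library(ggplot2)

set.seed(113)  # arbitrary seed, used only for reproducibility

# Step 1. Permute the data: null data sets shuffle mpg (assumed variable);
# the real data is placed at a random position among n = 20 samples
mtcars_lineup <- lineup(null_permute("mpg"), true = mtcars)

# Step 2. Peek at the permuted data: .sample identifies the panel
head(mtcars_lineup)

# Step 3. Plot the permutations: one panel per sample
ggplot(mtcars_lineup, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ .sample)
```

When it runs, lineup() prints an encrypted message; passing it to decrypt() reveals which panel holds the real data.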
The rorschach() function returns a set of null plots which are shown to observers to calibrate their eyes with variation
nullabor::rorschach()
Generating null plots: Provide a method of generation and let the rorschach() function generate them
Provide the true data set
Set n, total number of samples to generate (n = 20 by default)
Set p, probability of including true data with null data (p = 0 by default)
mtcars
Step 1. Permute the data
Step 2. Peek at the permuted data
   mpg cyl disp  hp drat    wt  qsec vs am gear carb .sample
1 10.4   6  160 110 3.90 2.620 16.46  0  1    4    4       1
2 21.4   6  160 110 3.90 2.875 17.02  0  1    4    4       1
3 14.7   4  108  93 3.85 2.320 18.61  1  1    4    1       1
4 15.5   6  258 110 3.08 3.215 19.44  1  0    3    1       1
5 30.4   8  360 175 3.15 3.440 17.02  0  0    3    2       1
6 15.0   6  225 105 2.76 3.460 20.22  1  0    3    1       1
n = 20 by default
Step 3. Plot the permutations
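A sketch of the same three steps with rorschach() follows. As before, permuting mpg and plotting it against wt are assumptions for illustration; n and p are set to their documented defaults so that only null data sets are generated.

```r
library(nullabor)
library(ggplot2)

set.seed(113)  # arbitrary seed, used only for reproducibility

# Step 1. Permute the data: with p = 0 the real data is never included
mtcars_null <- rorschach(null_permute("mpg"), true = mtcars, n = 20, p = 0)

# Step 2. Peek at the permuted data
head(mtcars_null)

# Step 3. Plot the permutations: every panel is a null plot
ggplot(mtcars_null, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ .sample)
```

Because every panel is pure noise, any pattern an observer thinks they see here is a useful reminder of how much structure chance alone can produce.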
In this particular case there’s nothing to decrypt since p (probability of including true data with null data) is set to 0
If p is higher than 0 and the true data is included, you get the decryption key
mtcars
The null_dist() function takes as input a variable name of the data and a particular distribution
null_dist() returns a function that generates a null data set given the data
The following histogram shows the distribution of heights of 507 physically active individuals (openintro::bdims$hgt). Do the heights of these individuals follow a normal distribution?
Which of the following is the plot of the real data? (Note: A different binwidth than the previous plot is used.)
Generate the null distribution
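One way the null distribution for the heights could be generated is sketched below. It assumes the openintro package is installed, that a lineup of 10 panels is wanted, and that a binwidth of 5 is acceptable; these are illustrative choices, not values confirmed by the slides.

```r
library(nullabor)
library(ggplot2)

set.seed(113)  # arbitrary seed, used only for reproducibility

# Keep only the height column of the real data
heights <- data.frame(hgt = openintro::bdims$hgt)

# Null data sets replace hgt with draws from a normal distribution
# whose parameters are estimated from the real data
hgt_lineup <- lineup(null_dist("hgt", dist = "normal"), true = heights, n = 10)

# One histogram per panel; the real data is hidden among them
ggplot(hgt_lineup, aes(x = hgt)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ .sample, nrow = 2)
```

If observers cannot pick out the real panel, the data give no visual evidence against normality.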
null_lm() takes as input a model specification formula as defined by lm() and a method for generating null residuals from the model
Three built-in methods offer different (and valid) ways to generate null data when fitting a linear model:
method = "pboot"
method = "boot"
method = "rotate"
null_lm() returns a function which, given the data, generates a null dataset
Data measures the diameter, height and volume of timber in 31 felled black cherry trees (datasets::trees)
Summary
── Data Summary ────────────────────────
                           Values
Name                       datasets::trees
Number of rows             31
Number of columns          3
_______________________
Column type frequency:
  numeric                  3
________________________
Group variables            None

── Variable type: numeric ──────────────────────────────
  skim_variable n_missing complete_rate mean    sd   p0  p25  p50  p75  p100 hist
1 Girth                 0             1 13.2  3.14  8.3 11.0 12.9 15.2  20.6 ▃▇▃▅▁
2 Height                0             1 76    6.37 63   72   76   80    87   ▃▃▆▇▃
3 Volume                0             1 30.2 16.4  10.2 19.4 24.2 37.3  77   ▇▅▁▂▁
Plot
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -6.63 0.800 -8.29 5.06e- 9
2 log(Girth) 1.98 0.0750 26.4 2.42e-21
3 log(Height) 1.12 0.204 5.46 7.81e- 6
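The coefficient table above is consistent with a log-log linear model for volume. The slides do not show the fitting code, so the formula below is an assumption inferred from the term names and from the scale of the fitted values; the name trees_fit matches the object used later in the deck.

```r
# Assumed model: log(Volume) regressed on log(Girth) and log(Height)
trees_fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = datasets::trees)

# Coefficient table in base R (estimate, std. error, t statistic, p-value)
round(coef(summary(trees_fit)), 3)
```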
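# Requires dplyr and tibble; trees_fit is the fitted linear model summarized above
# Add residuals and fitted values as columns next to the original data
trees_aug <- as_tibble(datasets::trees) |>
  mutate(
    .resid = residuals(trees_fit),
    .fitted = fitted(trees_fit)
  )
trees_aug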
# A tibble: 31 × 5
Girth Height Volume .resid .fitted
<dbl> <dbl> <dbl> <dbl> <dbl>
1 8.3 70 10.3 0.0219 2.31
2 8.6 65 10.3 0.0343 2.30
3 8.8 63 10.2 0.0138 2.31
4 10.5 72 16.4 -0.0106 2.81
5 10.7 81 18.8 -0.0430 2.98
6 10.8 83 19.7 -0.0420 3.02
7 11 66 15.6 -0.0557 2.80
8 11 75 18.2 -0.0443 2.95
9 11.1 80 22.6 0.0822 3.04
10 11.2 75 19.9 0.00926 2.98
# ℹ 21 more rows
Which one is the real residuals plot?
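A residual lineup like this one could be produced roughly as follows. The model formula, the "rotate" method, and the 10 panels are assumptions chosen for illustration; the sketch recomputes the augmented data in base R so that it runs on its own.

```r
library(nullabor)
library(ggplot2)

set.seed(113)  # arbitrary seed, used only for reproducibility

# Fit the assumed log-log model and attach residuals and fitted values
trees_fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = datasets::trees)
trees_aug <- datasets::trees
trees_aug$.resid <- residuals(trees_fit)
trees_aug$.fitted <- fitted(trees_fit)

# Null data sets get residuals generated under the model by null_lm();
# the real residuals are placed at a random position
resid_lineup <- lineup(
  null_lm(log(Volume) ~ log(Girth) + log(Height), method = "rotate"),
  true = trees_aug,
  n = 10
)

# Residuals vs. fitted values, one panel per sample
ggplot(resid_lineup, aes(x = .fitted, y = .resid)) +
  geom_point() +
  facet_wrap(~ .sample, nrow = 2)
```

If the real panel stands out, the residuals show structure that the model does not account for.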
\[ P(X \ge x) = 1 - P(X \le x - 1) = \sum_{k = x}^n {n \choose k} \frac{(m - 1)^{n - k}}{m^n} \]
pbinom(x, n, p)
We did three lineup tests today, where \(x\) is the number of people who spotted the real data, \(n\) is the number of people voting, and \(m\) is the number of plots in the lineup
Let’s calculate the p-values
mtcars (n = 20)
bdims$hgt (n = 10)
trees model (n = 10)
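The p-value for a lineup is a binomial upper-tail probability, which base R can compute with pbinom(). In the sketch below, the helper name lineup_pvalue and the vote counts are hypothetical; the real counts from class would replace them. It assumes each voter picks one of the m panels at random under the null hypothesis.

```r
# P(X >= x) when X ~ Binomial(n, 1/m):
#   x = number of people who picked the real data
#   n = number of people voting
#   m = number of plots in the lineup
lineup_pvalue <- function(x, n, m) {
  pbinom(x - 1, size = n, prob = 1 / m, lower.tail = FALSE)
}

# Hypothetical example: 5 of 12 voters pick the real plot in a 20-panel lineup
p_tail <- lineup_pvalue(x = 5, n = 12, m = 20)

# Cross-check by summing the binomial probabilities directly
p_sum <- sum(dbinom(5:12, size = 12, prob = 1 / 20))

c(p_tail = p_tail, p_sum = p_sum)
```

Note that pbinom(x, n, p) on its own returns P(X ≤ x), so the upper tail needs either `x - 1` with `lower.tail = FALSE` or one minus the lower tail.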