HW 4

Important

This homework is due Thursday, Nov 9 at 5:00 pm ET.

Getting Started

  • Go to the sta113-f23 organization on GitHub. Click on the repo with the prefix hw-4. It contains the starter documents you need to complete the homework assignment.
  • Click on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.
  • In RStudio, go to File > New Project > Version Control > Git.
  • Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
  • Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
  • Click hw-4.qmd to open the template Quarto file. This is where you will write up your code and narrative for the homework.

Packages

library(tidyverse)
library(tidymodels)
library(openintro)

Data

The Human Freedom Index attempts to summarize the idea of "freedom" through many different variables for countries around the globe. It serves as a rough objective measure of the relationships between different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The Human Freedom Index is a report co-published annually by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.

In this assignment, you'll be analyzing data from the Human Freedom Index reports. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.

The assignment consists of two parts:

  • Part 1: Work with older, but well-documented data to fit and interpret a model.

  • Part 2: Work with more recent, but not well-documented data to ask a question of interest to you and answer it with a model (or models) and a visualization (or visualizations).

Guidelines + tips

Your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Note

Do not let R output answer the question for you unless the question specifically asks for just a plot. For example, if the question asks for the number of columns in the data set, please type out the number of columns. You may lose points if you do not.

Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Use informative titles and axis labels.
  • Turn in an organized, well-formatted document.

Exercises

Part 1: HFI in 2016

The dataset for Part 1 is in the openintro package and it’s called hfi. See the data dictionary with ?hfi.

Exercise 1

The hfi dataset spans many years, but we are only interested in data from 2016. Filter the data frame for the year 2016 and assign the result to a data frame named hfi_2016. What are the dimensions of the 2016 dataset? What does each row represent?
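As a reminder of the general pattern, here is a minimal sketch of filtering a data frame and checking its dimensions. It deliberately uses the built-in mtcars dataset rather than hfi, and the name four_cyl is made up for illustration; adapt the idea to the exercise yourself.

```r
library(tidyverse)

# mtcars ships with R; it stands in for hfi purely to show the syntax
four_cyl <- mtcars |>
  filter(cyl == 4)   # keep only the rows where cyl equals 4

# dim() returns the number of rows, then the number of columns
dim(four_cyl)
```

Remember that the answer itself should still be typed out in your narrative, not left to the R output.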

Exercise 3

What type of plot would you use to display the relationship between the personal freedom score, pf_score, and pf_expression_control? Plot this relationship using the variable pf_expression_control as the predictor. Does the relationship look linear? If you knew a country's pf_expression_control (its score out of 10 for political pressures and controls on media content, with 0 being the most pressure), would you be comfortable using a linear model to predict the personal freedom score? If the relationship looks linear, calculate the correlation between pf_expression_control and pf_score.
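The sketch below shows one way to plot two numeric variables against each other and compute their correlation. It uses mtcars (not hfi) as a stand-in, with wt as the predictor and mpg as the outcome; the object names and plot labels are illustrative assumptions.

```r
library(tidyverse)

# Predictor on the x-axis, outcome on the y-axis
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel efficiency vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )
scatter

# Correlation between the two variables, returned as a one-row data frame
wt_mpg_cor <- mtcars |>
  summarize(r = cor(wt, mpg))
wt_mpg_cor
```

If a variable has missing values, cor() returns NA unless you pass an argument such as use = "complete.obs".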

Exercise 4

Fit a model predicting pf_score from pf_expression_control and display a tidy output of the model. Then interpret the intercept and the slope of the model in the context of the data.
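As a syntax refresher, this is a minimal sketch of fitting a simple linear regression with tidymodels and displaying a tidy summary. It uses mtcars as a stand-in dataset, and the name fit_mpg is hypothetical.

```r
library(tidymodels)

# linear_reg() uses the "lm" engine by default
fit_mpg <- linear_reg() |>
  fit(mpg ~ wt, data = mtcars)

# One row per term: estimate, std.error, statistic, p.value
fit_tidy <- tidy(fit_mpg)
fit_tidy
```

In this stand-in model, the intercept is the predicted mpg for a car with a weight of 0, and the slope is the predicted change in mpg for each additional 1000 lbs of weight. Your interpretation should use the units and context of the freedom data instead.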

Exercise 5

Create a residuals vs. predicted plot and evaluate the model fit.
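A residuals vs. predicted plot can be built from the output of augment(). The sketch below again uses mtcars as a stand-in; fit_mpg and fit_aug are illustrative names, not part of the starter files.

```r
library(tidymodels)

fit_mpg <- linear_reg() |>
  fit(mpg ~ wt, data = mtcars)

# augment() adds .pred (and .resid, since the outcome is in new_data)
fit_aug <- augment(fit_mpg, new_data = mtcars)

ggplot(fit_aug, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at zero
  labs(
    title = "Residuals vs. predicted values",
    x = "Predicted miles per gallon",
    y = "Residual"
  )
```

When evaluating the fit, look for residuals scattered randomly around zero with no clear pattern or fanning.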

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Part 2

In Part 2, you’ll use data released in 2022, which comes from https://www.cato.org/human-freedom-index/2022. The dataset is in your data folder: human-freedom-index-2022. The most recent year this dataset captures is 2020. Note that there isn’t a data dictionary associated with this dataset, so you’ll need to match variables to the data dictionary for the old data or look up information online. Working with real data can be tedious!

Exercise 6

Split the data into training and testing datasets in a 3:1 ratio. Report how many observations are in each split.
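A 3:1 split corresponds to putting three quarters of the rows in the training set. Here is a minimal sketch using initial_split() from tidymodels on the stand-in mtcars dataset; the seed value and object names are arbitrary choices for illustration.

```r
library(tidymodels)

set.seed(113)  # arbitrary seed so the random split is reproducible

car_split <- initial_split(mtcars, prop = 3/4)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Number of observations in each split
nrow(car_train)
nrow(car_test)
```

Setting a seed before splitting means you get the same training and testing rows every time you render the document.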

Exercise 7

Focus on the most recent year, 2020, and fit a model predicting hf_score, human freedom score, from a number of predictors (at least 2) of your choosing. Interpret the intercept and the slopes in the context of the data.

Exercise 8

Focus on the most recent year, 2020, again and fit a model predicting a categorical version of hf_score, human freedom score, from a number of predictors (at least 2) of your choosing. You can decide how to make hf_score categorical: it could be below/above 5, or some other cutoff value. Then, using your model, make predictions for the testing data and calculate the false positive and false negative rates of your predictions. Visualize this information however you see fit.
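The overall workflow (make the outcome categorical, fit on training data, predict on testing data, compute error rates) might look like the sketch below. It uses the mpg dataset from ggplot2 as a stand-in, and the cutoff of 25, the predictors, the choice of "high" as the positive class, and all object names are assumptions made for illustration.

```r
library(tidymodels)

# Turn a numeric outcome into a two-level factor using a cutoff
cars <- ggplot2::mpg |>
  mutate(high_hwy = factor(if_else(hwy > 25, "high", "low"),
                           levels = c("high", "low")))

set.seed(113)  # arbitrary seed for a reproducible split
cars_split <- initial_split(cars, prop = 3/4, strata = high_hwy)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

# Logistic regression with two predictors (default "glm" engine)
fit_class <- logistic_reg() |>
  fit(high_hwy ~ displ + cyl, data = cars_train)

# augment() adds .pred_class plus one probability column per level
cars_pred <- augment(fit_class, new_data = cars_test)

# Treating "high" as the positive class:
# false positive rate = predicted high among truly low
# false negative rate = predicted low among truly high
rates <- cars_pred |>
  summarize(
    fp_rate = sum(.pred_class == "high" & high_hwy == "low") / sum(high_hwy == "low"),
    fn_rate = sum(.pred_class == "low" & high_hwy == "high") / sum(high_hwy == "high")
  )
rates
```

Which class counts as "positive" is your decision, so state it explicitly in your write-up; swapping it swaps the two rates.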

Exercise 9

Focus on a country (or a small number of countries) of your choosing, and create a visualization that shows how one or more indices of interest have changed over the years. You do not need to fit a model for this question; it’s primarily about your visualization. In a few sentences, describe the goal of your visualization as well as any insights you get from it.
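One common way to show change over time is a line plot with one line per series. The sketch below uses the economics_long dataset from ggplot2 as a stand-in for the freedom data; the two series chosen and the name trend_data are illustrative assumptions.

```r
library(tidyverse)

# Keep a small number of series so the plot stays readable
trend_data <- economics_long |>
  filter(variable %in% c("psavert", "uempmed"))

ggplot(trend_data, aes(x = date, y = value, color = variable)) +
  geom_line() +
  labs(
    title = "Personal savings rate and median unemployment duration over time",
    x = "Date",
    y = "Value",
    color = "Series"
  )
```

For the freedom data, the time variable would go on the x-axis and each country or index would map to color or to facets.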

Render, commit, and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Review

Before you call it “done”,

  • make sure you’ve answered all questions (or, at least, have not skipped any accidentally) and
  • review the Section 1.3 section and make any edits/corrections needed.

You can change answers to exercises you’ve “completed” and commit and push again. We will only grade the final version of your assignment. You can commit and push as many times as you like until the deadline.

Submission

You do not have to do anything special to “submit” your assignment. We will close the repos to further pushes at the deadline, and will take your work as of that time point as your submission.

If you need to submit late for any reason, contact the professor.

Grading

  • Exercises 1-9: 10 points each
  • Workflow + formatting: 10 points
  • Total: 100 points