A note on workspaces: Before we get started with the lab, let’s take a moment to review our R Markdown workflow and remind ourselves about workspaces in R. The workspace of the Console and the workspace of your R Markdown document are not the same. Therefore, if you define a variable only in the Console and then try to use that variable in your R Markdown document, you’ll get an error when you knit. This might seem frustrating at first, but it is actually a feature that helps you in the long run. To ensure that your report is fully reproducible, everything that is used in the report must be defined in the report, and not somewhere else.

It is your responsibility, and an important learning goal of this course, that you master the skills for creating fully reproducible data analysis reports. Below are some tips for achieving this goal:

• Always work in your R Markdown document, and not in the Console.
• Knit early, and often, always checking that the resulting document contains everything you expected it to contain.

Your reproducible lab report: You can find the template for this lab in your RStudio Cloud space: Lab 4 - Hypothesis Testing. Remember that all of your code and answers go in this document.

The Hot Hands Phenomenon

Basketball players who make several baskets in succession are described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which contradicts the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events. This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.

We do not expect to resolve this controversy today. However, in this lab we’ll apply one approach to answering questions like this. The goals for this lab are to (1) think about the effects of independent and dependent events, (2) learn how to simulate shooting streaks in R, and (3) compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.

Getting Started

Our investigation will focus on the performance of one player: Stephen Curry of the Golden State Warriors. Curry was the NBA Most Valuable Player for both the 2014 and 2015 seasons, the second time by unanimous vote. Maybe that’s because of hot hands? We’ll be looking at some data on Curry that I pulled from NBA Stats.

Let’s read in the data and see what we have.

library(tidyverse) # provides read_csv, the pipe, the dplyr verbs, and ggplot2

curry_data <- read_csv("data/curry_data.csv")

Let’s take a look at this new data set: curry_data. Try using View(curry_data) in your console (or clicking on the data frame in the Environment tab). This data frame has information about every shot that Steph Curry took during the 2014 regular season. There are 1341 observations and 14 variables, where every row records a single shot taken. The SHOT_MADE variable in this dataset indicates whether the shot scored (TRUE) or was a miss (FALSE).

As always, let’s do a little bit of exploratory data analysis to make sure we understand our data. Let’s do two things. First, let’s compute Curry’s average shooting success. Second, let’s use the group_by and summarise functions from the tidyverse to look at his shooting success from game to game.

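mean_shooting_prob <- curry_data %>%
summarise(shooting_prob = mean(SHOT_MADE)) %>% # overall proportion of shots made
pull() # extract the single value from the one-column data frame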

by_game_data <- curry_data %>%
group_by(GAME_ID) %>%
summarise(shooting_prob = mean(SHOT_MADE)) # proportion of shots made in each game

ggplot(by_game_data, aes(x = shooting_prob)) +
geom_histogram(bins = 15) +
geom_vline(aes(xintercept = mean_shooting_prob), color = "gold", size = 2)
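To see what the group_by and summarise steps above are doing, here is a minimal sketch on a tiny made-up data frame. The name toy_shots and its values are assumptions for illustration only; they are not Curry's data. The key idea is that taking the mean of a TRUE/FALSE column gives the proportion of TRUE values.

```r
library(dplyr)

# Toy data, made up for illustration (NOT Curry's actual shots)
toy_shots <- tibble(
  GAME_ID   = c("A", "A", "A", "A", "B", "B"),
  SHOT_MADE = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# mean() of a logical vector is the proportion of TRUE values
overall_prob <- mean(toy_shots$SHOT_MADE)

# One row per game, holding that game's shooting proportion
toy_by_game <- toy_shots %>%
  group_by(GAME_ID) %>%
  summarise(shooting_prob = mean(SHOT_MADE))

overall_prob
toy_by_game
```

In this toy example four of the six shots are made overall, while game A has three makes in four shots and game B has one make in two shots, so the per-game proportions differ from the overall one.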

1. Describe the distribution that we’re seeing. What proportion of his shots does Curry make on average, and how does this vary from game to game?

Now let’s use this data to ask whether Curry’s shots show evidence of hot hands. One way we can approach this question is to look at whether Curry is more likely to make his next shot if he has just made his previous shot.

2. Come up with a Null Hypothesis and an Alternative Hypothesis that we can use to ask whether Curry’s shots provide evidence for the hot hands phenomenon.

We can use the lag function to make our life a lot easier. This will let us compare each row in the data frame with the row that came just before it. Because the data are already arranged chronologically, the row just before each shot is the shot that Curry took immediately before it.

Let’s try this out.

lag_data <- curry_data %>%
group_by(GAME_ID) %>%
mutate(lag_shot = lag(SHOT_MADE))

I’ll select just the three relevant variables to make looking at the data easier, and then print out the first 10 rows.

lag_data %>%
select(GAME_ID, SHOT_MADE, lag_shot) %>%
head(10)
## # A tibble: 10 x 3
## # Groups:   GAME_ID [1]
##    GAME_ID    SHOT_MADE lag_shot
##    <chr>      <lgl>     <lgl>
##  1 0021400014 TRUE      NA
##  2 0021400014 FALSE     TRUE
##  3 0021400014 TRUE      FALSE
##  4 0021400014 TRUE      TRUE
##  5 0021400014 FALSE     TRUE
##  6 0021400014 FALSE     FALSE
##  7 0021400014 FALSE     FALSE
##  8 0021400014 FALSE     FALSE
##  9 0021400014 FALSE     FALSE
## 10 0021400014 TRUE      FALSE

You can see that the lag_shot column contains exactly the value that is in SHOT_MADE in the previous row.
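As a small illustration of the shifting behaviour described above, here is a sketch that applies lag to a short made-up vector rather than to the Curry data. The names made and previous are assumptions for this example. The function is called as dplyr::lag to make sure we get the dplyr version and not the one in base R's stats package.

```r
library(dplyr)

# A made-up sequence of four shots (TRUE = made, FALSE = missed)
made <- c(TRUE, FALSE, TRUE, TRUE)

# lag() shifts every value one position later and pads the front with NA
previous <- dplyr::lag(made)

previous
```

Each element of previous is the element of made that sits one position earlier, and the first element is NA.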

3. Why did I group_by game first before calling the lag function? Hint: The NA you see in the first row means that there is no value that came before.

We can now use the new lag_shot variable to group shots by whether they followed a successful or a missed shot, which lets us test our Alternative Hypothesis. Let’s do one more quick piece of exploratory data analysis to see how Curry’s shooting percentage differs following successful vs. unsuccessful shots.

eda_hothands <- lag_data %>%
group_by(lag_shot) %>%
summarise(shooting_prob = mean(SHOT_MADE))

4. What do you see when you run this code? Does it look like we see evidence for the hot hands phenomenon? Also, what does the value for shooting_prob for NA tell us?

Simulations in R

Now we’re ready to test our hypothesis formally. We need to generate the Null Distribution: the differences in shooting success following successful vs. unsuccessful shots that we should expect to see if there is no hot hands phenomenon, that is, if Curry shoots identically following successful and unsuccessful shots. We’ll do this through sampling, just like we did in lecture, using the sample function.

So we want to keep everything about the shooting data identical (e.g., how many shots Curry took and how many of them were successful). The only thing we change is to randomly determine which of these shots came after successful shots.
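The shuffling idea can be sketched with the sample function on a handful of made-up labels. The names labels and shuffled, the counts, and the seed are all assumptions for illustration. Calling sample with no size argument returns a random permutation, so the number of labels of each type stays exactly the same and only their order changes.

```r
set.seed(1) # arbitrary seed so the shuffle is repeatable

# Made-up labels: 3 shots taken after a make, 2 shots taken after a miss
labels <- c(rep("Hot", 3), rep("Not", 2))

# sample() with no size argument returns a random permutation of its input
shuffled <- sample(labels)

# The counts of each label are unchanged by shuffling
table(labels)
table(shuffled)
```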

# Shooting proportion in each game, split by whether the previous shot was made
lag_data %>%
group_by(lag_shot, GAME_ID) %>%
summarise(mean = mean(SHOT_MADE))

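# Number of shots taken after shots that were made
hot_shots <- lag_data %>%
filter(lag_shot) %>% # rows where lag_shot is NA are dropped
nrow() # count the number of rows

# Number of shots made after shots that were made
# (hot_made is an assumed, illustrative name)
hot_made <- lag_data %>%
filter(lag_shot, SHOT_MADE) %>%
nrow()

# Number of shots taken after shots that were missed
not_shots <- lag_data %>%
filter(!lag_shot) %>%
nrow()

# Number of shots made after shots that were missed
# (not_made is an assumed, illustrative name)
not_made <- lag_data %>%
filter(!lag_shot, SHOT_MADE) %>%
nrow()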

simulate_null <- function() {

# Make a list with the right number of shots of each type
shots <- c(rep("Hot", hot_shots), rep("Not", not_shots))

# randomly select the made shots from this list

# Compute the difference shot success between hot and not shots

}
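The two commented steps in the skeleton above could be filled in along the following lines. This is only a sketch under stated assumptions, not the official solution: the counts are made up (they are not Curry's numbers), and the names made_total, simulate_null_sketch, null_diffs, observed_diff, and percentile are hypothetical. The idea is to draw the made shots at random from the list of labels, and then compare the proportion of "Hot" shots that were made with the proportion of "Not" shots that were made.

```r
set.seed(42) # arbitrary seed for repeatability

# Made-up counts for illustration only; these are NOT Curry's numbers
hot_shots <- 60  # shots taken after a make
not_shots <- 40  # shots taken after a miss
made_total <- 50 # total makes among those 100 shots (hypothetical name)

simulate_null_sketch <- function() {
  # One label per shot, as in the skeleton
  shots <- c(rep("Hot", hot_shots), rep("Not", not_shots))

  # Under the null, every shot is equally likely to be one of the makes
  made <- sample(shots, made_total)

  # Difference in shooting proportion: after a make minus after a miss
  sum(made == "Hot") / hot_shots - sum(made == "Not") / not_shots
}

# Repeat the simulation many times to build up a null distribution
null_diffs <- replicate(1000, simulate_null_sketch())

# Hypothetical observed difference, used only to show the percentile step
observed_diff <- 0.05
percentile <- mean(null_diffs <= observed_diff)
```

Because the makes are assigned without regard to the labels, the simulated differences should be centred near zero. The last line computes the share of simulated differences at or below the observed one, which is one way to express where the observed value falls in the null distribution.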
3. Test your hypothesis by modifying the simulate_null function to generate your own null distribution, and compare it to the empirical mean. In your response, make both a plot like you did in Exercise 5, and tell me what percentile the empirical data falls in like you did in Exercise 6.