Your reproducible lab report: You can find the template for this lab in your RStudio Cloud space: Lab 7 - Inference for Numerical Data. Remember that all of your code and answers go in this document.
Since 2005, the American Community Survey has polled about 3.5 million households each year. We will work with a random sample of 2000 observations from the 2012 ACS.
As always, let’s get the data from the course website.
Below is the codebook for this dataset:
income: Yearly income (wages and salaries)
employment: Employment status, not in labor force, unemployed, or employed
hrs_work: Weekly hours worked
race: Race, White, Black, Asian, or other
gender: Gender, male or female
citizens: Whether respondent is a US citizen or not
time_to_work: Travel time to work
lang: Language spoken at home, English or other
married: Whether respondent is married or not
edu: Education level, hs or lower, college, or grad
disability: Whether respondent is disabled or not
birth_qrtr: Quarter in which respondent is born, jan thru mar, apr thru jun, jul thru sep, or oct thru dec
Note that this dataset contains some people who are not in the labor force or not employed. First, let’s subset the dataset for those who are employed. We will call this new dataset acs_emp, short for “employed”. Remember that we use the filter function for subsetting the data based on attributes stored in a variable.
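As a minimal sketch of this subsetting step, the block below builds a small made-up data frame in place of the real acs data and keeps the employed rows with filter. The values are invented for illustration; only the column names and the "employed" level come from the codebook.

```r
library(dplyr)

# Toy stand-in for acs (hypothetical values, not the real survey data)
acs <- data.frame(
  income     = c(52000, 0, 31000, 0, 78000, 12000),
  employment = c("employed", "not in labor force", "employed",
                 "unemployed", "employed", "employed")
)

# Keep only the rows whose employment status is "employed"
acs_emp <- acs %>%
  filter(employment == "employed")

# Number of employed respondents in the toy data
nrow(acs_emp)
```

Comparing nrow(acs_emp) with nrow(acs) is one way to see how many rows the filter kept.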
What percent of the respondents in the original sample (acs) are employed?
Next let’s take a look at the income distribution by gender. The first step would be to create a visualization:
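One possible visualization is a side-by-side boxplot of income by gender. The sketch below assumes the ggplot2 package and uses made-up incomes in place of acs_emp; a histogram faceted by gender would be another reasonable choice.

```r
library(ggplot2)

# Toy stand-in for acs_emp (hypothetical values)
acs_emp <- data.frame(
  gender = c("male", "male", "male", "female", "female", "female"),
  income = c(60000, 45000, 75000, 40000, 50000, 60000)
)

# Side-by-side boxplots: one box per gender, income on the y-axis
p <- ggplot(acs_emp, aes(x = gender, y = income)) +
  geom_boxplot()

p
```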
We can also obtain summary statistics such as means, standard deviations, and sample sizes.
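A sketch of how these summaries might be computed with group_by and summarise, again on made-up data standing in for acs_emp (the column names mean_income, sd_income, and n are illustrative choices):

```r
library(dplyr)

# Toy stand-in for acs_emp (hypothetical values)
acs_emp <- data.frame(
  gender = c("male", "male", "male", "female", "female", "female"),
  income = c(60000, 45000, 75000, 40000, 50000, 60000)
)

# One row per gender with the mean, standard deviation, and group size
income_summary <- acs_emp %>%
  group_by(gender) %>%
  summarise(
    mean_income = mean(income),
    sd_income   = sd(income),
    n           = n()
  )

income_summary
```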
We can use R’s t.test function to make statistical inferences about differences like this using the t-distribution.
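To illustrate the general shape of a two-sample call, the sketch below passes two made-up numeric vectors to t.test. The vectors male_income and female_income are hypothetical; with the lab data they would come from the income column of two filtered data frames. Note that, by default, t.test does not assume equal variances in the two groups (Welch's test).

```r
# Hypothetical incomes for two groups (not the real survey data)
male_income   <- c(62000, 48000, 75000, 53000, 90000)
female_income <- c(45000, 51000, 38000, 60000, 47000)

# Two-sample t-test for the difference in means (first group minus second)
result <- t.test(male_income, female_income, conf.level = 0.95)

result$conf.int  # confidence interval for the difference in means
result$p.value   # p-value for the null hypothesis of no difference
```

Printing result shows the test statistic, degrees of freedom, p-value, and interval together.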
Use t.test to find the 95% confidence interval for the difference between the average incomes of males and females, and interpret this interval. HINT: You might want to make two separate data frames using the filter function: one that has just the males and one that has just the females.
Based on this interval, is there a statistically significant difference between the average incomes of men and women? Why, or why not?
There is a clear difference between the average salaries of men and women, but could some, or all, of this difference be attributed to a variable other than gender? Remember that we call such variables confounding variables. We will evaluate whether hrs_work is a confounder for the relationship between gender and income.
Let’s start by just looking at how many hours employees work each week.
Well, it sure looks like many of the employees work 40 hours, but not all. Describe the shape and spread of this distribution.
Do we have a reason to think that the people in our sample come from a population of workers that isn’t just full time employees? Perform a one-sample t-test to determine if the mean number of hours worked is different from what you would expect if everyone were a full time employee. Hint: Think about what your Null Hypothesis should be when you call t.test.
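For reference, a one-sample call to t.test takes the hypothesized mean through its mu argument. The sketch below uses made-up weekly hours and sets mu to 40, which is an assumption that a full time week is 40 hours; pick the value that matches your own null hypothesis.

```r
# Hypothetical weekly hours worked (not the real survey data)
hrs <- c(40, 40, 35, 20, 45, 40, 38, 25, 40, 50)

# One-sample t-test of H0: population mean = 40 (assumed full time week)
result <- t.test(hrs, mu = 40)

result$estimate  # sample mean
result$p.value   # two-sided p-value against mu = 40
```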
Ok, since it looks like we might have some part time workers, let’s see if this matters for the conclusion we drew earlier. First convert the hrs_work variable to a categorical variable (with levels "full time" or "part time") so that we can use methods we have learned so far in the course to run the analysis. (Later in the course we will learn how to work with numerical explanatory variables in a regression model setting.)
We want to create a new variable, say emp_type, with levels "full time" or "part time" depending on whether the employee works 40 hours or more per week or less than 40 hours, respectively. Remember, we create a new variable with the mutate function.
The if_else() function has three arguments: a logical test, return values for TRUE elements of test, and return values for FALSE elements of test. In this case, emp_type will be coded as "full time" for observations where hrs_work is greater than or equal to 40, and as "part time" otherwise.
To find out what percent of the sample is full vs. part time, we turn to summary statistics:
Here we first grouped the data by the new emp_type variable, and then we calculated proportions of full and part time employees by first counting how many there are in each group (n()), and then ungrouping the data and dividing the total in each group by the total in all groups.
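The two steps described above (creating emp_type with mutate and if_else, then computing the proportion in each group) might be sketched as follows. The hours are made up, and the names emp_summary and prop are illustrative choices rather than names taken from the lab.

```r
library(dplyr)

# Toy stand-in for acs_emp (hypothetical values)
acs_emp <- data.frame(hrs_work = c(40, 20, 45, 39, 60, 40, 10, 50))

# Code 40 or more hours as "full time", everything else as "part time"
acs_emp <- acs_emp %>%
  mutate(emp_type = if_else(hrs_work >= 40, "full time", "part time"))

# Count each group, then divide each count by the overall total
emp_summary <- acs_emp %>%
  group_by(emp_type) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(prop = n / sum(n))

emp_summary
```

With a single grouping variable, summarise already returns an ungrouped result, so ungroup() here is a harmless safeguard that makes the intent explicit.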
Create a bar plot of the distribution of the emp_type variable, and also include the summary statistics you calculated above in your answer. What percent of the sample are full time and what percent are part time employees?
Are women more heavily represented among full time employees or part time employees? Answer this question using summary statistics (code provided below) and a visualization.
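One way such summary statistics could be computed is to count each combination of emp_type and gender and then take proportions within each employment type. This is a sketch on made-up data, not necessarily the code originally provided with the lab; the names gender_by_type and prop are illustrative.

```r
library(dplyr)

# Toy stand-in for acs_emp (hypothetical values)
acs_emp <- data.frame(
  emp_type = c("full time", "full time", "full time",
               "part time", "part time", "part time"),
  gender   = c("male", "male", "female", "female", "female", "male")
)

# Count each emp_type/gender pair, then compute the share of each
# gender within its employment type
gender_by_type <- acs_emp %>%
  count(emp_type, gender) %>%
  group_by(emp_type) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

gender_by_type
```

Within each emp_type the prop values sum to 1, so the rows for women can be compared directly across the two employment types.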
Create two subsets of the acs_emp dataset: one for full time employees and one for part time employees. No interpretation is needed for this question, just the code is sufficient.
Use a hypothesis test to evaluate whether there is a difference in average incomes of full time male and female employees, and also include a confidence interval (at the equivalent confidence level) estimating the magnitude of the average income difference.
Use a hypothesis test to evaluate whether there is a difference in average incomes of part time male and female employees, and also include a confidence interval (at the equivalent confidence level) estimating the magnitude of the average income difference.
What do your findings from these hypothesis tests suggest about whether or not working full or part time might be a confounding variable in the relationship between gender and income?
Pick another numerical variable from the dataset to be your response variable, and also pick a categorical explanatory variable (can be one we used before). Conduct the appropriate hypothesis test to compare means of the response variable across two levels of the explanatory variable. Make sure to state your research question, and interpret your conclusion in context of the dataset. Note that you can use the complete acs dataset, the subsetted acs_emp dataset, or another subset that you create.
This lab is created and released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This lab is adapted from a lab created for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.