# PSYC 20100 - Psychological Statistics (Autumn 2018)

## Learning Objectives


### Unit 1:

##### Suggested exercises: (End of chapter exercises from OpenIntro Statistics with Randomization)
• Part 1 – Designing studies: 1.1, 1.3, 1.9, 1.11, 1.13, 1.19
• Part 2 – Examining Data: 1.25, 1.27, 1.29, 1.37, 1.39, 1.41, 1.47

Suggested readings: Sections 1.1 and 1.2 of OpenIntro Statistics with Randomization and Simulation

• Learning Objective (LO) 1. Identify variables as numerical and categorical.
If the variable is numerical, further classify it as continuous or discrete based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively. If the variable is categorical, determine if it is ordinal based on whether or not the levels have a natural ordering.

• LO 2. Define associated variables as variables that show some relationship with one another. Further categorize this relationship as positive or negative association, when possible.

• LO 3. Define variables that are not associated as independent.
Test yourself: Give one example of each type of variable you have learned.

Suggested readings: Sections 1.3 and 1.5 of OpenIntro Statistics with Randomization and Simulation

• LO 4. Identify the explanatory variable in a pair of variables as the variable suspected of affecting the other. However, note that labeling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if there is an association identified between the two variables.

• LO 5. Classify a study as observational or experimental, then determine and explain whether the study’s results can be generalized to the population and whether the results suggest correlation or causation between the quantities studied.
If random sampling has been employed in data collection, the results should be generalizable to the target population. If random assignment has been employed in study design, the results suggest causality.

• LO 6. Question confounding variables and sources of bias in a given study.
Test yourself: Describe when a study’s results can be generalized to the population at large and when causation can be inferred.
Explain why random sampling allows for generalizability of results.
Explain why random assignment allows for making causal conclusions.

Suggested reading: Section 1.6 of OpenIntro Statistics with Randomization and Simulation

• LO 7. Use scatterplots for describing the relationship between two numerical variables, making sure to note the direction (positive or negative), form (linear or non-linear), and the strength of the relationship as well as any unusual observations that stand out.

• LO 8. When describing the distribution of a numerical variable, mention its shape, center, and spread, as well as any unusual observations.

• LO 9. Note that there are three commonly used measures of center and spread:
• center: mean (the arithmetic average), median (the midpoint), mode (the most frequent observation)
• spread: standard deviation (variability around the mean), range (max - min), interquartile range (middle 50% of the distribution)
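
As a minimal sketch of LO 9, the snippet below computes each measure of center and spread with Python's standard `statistics` module. The `scores` list is made-up data for illustration only, and the quartiles follow the module's default convention, which may differ slightly from the one used in the textbook.

```python
import statistics

# Hypothetical exam scores (made up for illustration).
scores = [62, 70, 70, 75, 78, 81, 85, 88, 94]

# Measures of center
mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # midpoint of the sorted data
mode = statistics.mode(scores)      # most frequent observation

# Measures of spread
sd = statistics.stdev(scores)                  # sample standard deviation (n - 1 denominator)
value_range = max(scores) - min(scores)        # max - min
q1, _, q3 = statistics.quantiles(scores, n=4)  # quartile cut points (default method)
iqr = q3 - q1                                  # width of the middle 50% of the data

print(mean, median, mode, sd, value_range, iqr)
```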

• LO 10. Identify the shape of a distribution as symmetric, right skewed, or left skewed, and unimodal, bimodal, multimodal, or uniform.

• LO 11. Use histograms and box plots to visualize the shape, center, and spread of numerical distributions, and intensity maps for visualizing the spatial distribution of the data.

• LO 12. Define a robust statistic (e.g. median, IQR) as a statistic that is not heavily affected by skewness and extreme outliers, and determine when such statistics are more appropriate measures of center and spread compared to other similar statistics.
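
One way to see what "robust" means in LO 12 is to replace a single observation with an extreme value and compare how much each statistic moves. The sketch below uses made-up income data and a hypothetical helper named `iqr`; it is an illustration, not a formal definition.

```python
import statistics

def iqr(values):
    """Interquartile range using the statistics module's default quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical incomes in thousands of dollars (made up for illustration).
incomes = [32, 35, 38, 40, 41, 43, 45, 47, 50]
with_outlier = incomes[:-1] + [500]  # swap the largest value for an extreme one

# How far does each statistic move when the outlier is introduced?
mean_shift = statistics.mean(with_outlier) - statistics.mean(incomes)
median_shift = statistics.median(with_outlier) - statistics.median(incomes)
sd_shift = statistics.stdev(with_outlier) - statistics.stdev(incomes)
iqr_shift = iqr(with_outlier) - iqr(incomes)

print(mean_shift, median_shift, sd_shift, iqr_shift)
```

In this particular example the median and IQR do not move at all, while the mean and standard deviation change substantially, which is why the median and IQR are preferred for skewed data or data with extreme outliers.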

Suggested reading: Section 1.7 of OpenIntro Statistics with Randomization and Simulation

• LO 13. Use frequency tables and bar plots to describe the distribution of one categorical variable.

• LO 14. Use side-by-side box plots for assessing the relationship between a numerical and a categorical variable.

### Unit 2:

##### Suggested exercises: (End of chapter exercises from OpenIntro Statistics with Randomization)
• Part 1 – Randomization and Hypothesis Testing: 2.1, 2.3, 2.5, 2.9
• Part 2 – The Normal Distribution: 2.11, 2.13, 2.15, 2.17, 2.19, 2.23, 2.29, 2.31
• Part 3 - Applying the Normal model: 2.35, 2.37

### Unit 3:

##### Suggested exercises: (End of chapter exercises from OpenIntro Statistics with Randomization)
• Part 1 – Inference for a single proportion: 3.1, 3.5, 3.11, 3.19
• Part 2 – Difference of two proportions: 3.23, 3.25, 2.15, 2.17, 2.19, 2.23, 2.29, 2.31
• Part 3 - One-sample means with t-tests: 4.3, 4.5
• Part 4 - Paired data: 4.9, 4.13
• Part 5 - Difference of two means: 4.15, 4.17, 4.23

Suggested reading: Section 3.1 of OpenIntro Statistics with Randomization and Simulation

• Learning Objective (LO) 1. Define population proportion $p$ (parameter) and sample proportion $\hat{p}$ (point estimate).

• LO 2. Calculate the sampling variability of the proportion, the standard error, as $SE=\sqrt{\frac{p\left(1-p\right)}{n}}$, where $p$ is the population proportion. Note that when the population proportion $p$ is not known (almost always), this can be estimated using the sample proportion, $SE=\sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}}$.

• LO 3. Recognize that the Central Limit Theorem (CLT) is about the distribution of point estimates, and that given certain conditions, this distribution will be nearly normal. In the case of the proportion, the CLT tells us that if the observations in the sample are independent and the sample size is sufficiently large (checked using the success/failure condition: $np\geq10$ and $n\left(1-p\right)\geq10$), then the distribution of the sample proportion will be nearly normal, centered at the true population proportion and with a standard error of $SE=\sqrt{\frac{p\left(1-p\right)}{n}}$.
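
The standard error in LO 2 and the success/failure check in LO 3 can be sketched as two small functions. The function names and the values of `p` and `n` below are hypothetical and chosen only for illustration.

```python
import math

def proportion_se(p, n):
    """Standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

def success_failure_ok(p, n, threshold=10):
    """Success/failure condition: np >= 10 and n(1 - p) >= 10."""
    return n * p >= threshold and n * (1 - p) >= threshold

# Hypothetical population proportion and sample size.
p, n = 0.30, 200
se = proportion_se(p, n)
condition_met = success_failure_ok(p, n)

# A case where the condition fails: only about 2 expected successes.
condition_fails = success_failure_ok(0.02, 100)

print(se, condition_met, condition_fails)
```

When the condition is met (and observations are independent), the sample proportions would be expected to follow a nearly normal distribution centered at `p` with spread `se`.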

• LO 4. Remember that confidence intervals are calculated as $\text{point estimate} \pm \text{margin of error}$ and test statistics are calculated as $\text{test statistic} = \frac{\text{point estimate} - \text{null value}}{\text{standard error}}$.

• LO 5. Note that the standard error calculation for the confidence interval and the hypothesis test are different when dealing with proportions, since in the hypothesis test we need to assume that the null hypothesis is true – remember: p-value = P(observed or more extreme test statistic $\vert$ $H_{0}$ true).
• For confidence intervals use $\hat{p}$ (observed sample proportion) when calculating the standard error and checking the success/failure condition: $SE=\sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}}$.
• For hypothesis tests use $p_{0}$ (null value) when calculating the standard error and checking the success/failure condition: $SE=\sqrt{\frac{p_{0}\left(1-p_{0}\right)}{n}}$.
• LO 7. Explain why when calculating the required minimum sample size for a given margin of error at a given confidence level, we use $\hat{p}=0.5$ if there are no previous studies suggesting a more accurate estimate.
• Conceptually: When there is no additional information, 50% chance of success is a good guess for events with only two outcomes (success or failure).
• Mathematically: Using $\hat{p}=0.5$ yields the most conservative (highest) estimate for the required sample size.
• Test yourself:
Suppose 10% of U Chicago students smoke. You collect many random samples of 100 U Chicago students at a time, and calculate a sample proportion ($\hat{p}$) for each sample, indicating the proportion of students in that sample who smoke. What would you expect the distribution of these $\hat{p}$s to be? Describe its shape, center, and spread.
Suppose you want to construct a confidence interval with a margin of error no more than 4% for the proportion of U Chicago students who smoke. How would your calculation of the required sample size change if you don’t know anything about the smoking habits of U Chicago students vs. if you have a reliable previous study estimating that about 10% of U Chicago students smoke?
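
The distinction in LO 5 (which proportion goes into the standard error) and the sample-size reasoning in LO 7 can be sketched as follows. This is an illustrative sketch rather than a prescribed procedure: the function names and all numeric inputs are hypothetical, and the critical value $z^{\star}$ comes from the standard library's `NormalDist`.

```python
import math
from statistics import NormalDist

def z_star(confidence):
    """Critical z value for a two-sided interval at the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def proportion_ci(p_hat, n, confidence=0.95):
    """Confidence interval: the standard error uses the observed p_hat."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star(confidence) * se
    return p_hat - margin, p_hat + margin

def proportion_z(p_hat, p0, n):
    """Hypothesis test statistic: the standard error uses the null value p0."""
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

def required_n(margin, confidence=0.95, p_guess=0.5):
    """Smallest n with z* * sqrt(p(1 - p) / n) <= margin, rounded up."""
    z = z_star(confidence)
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)

# Hypothetical study: 56% successes in a sample of 400, null value 0.5.
low, high = proportion_ci(0.56, 400)
z = proportion_z(0.56, 0.5, 400)

# Hypothetical planning problem: 3% margin of error at 95% confidence.
n_no_prior = required_n(0.03)                 # no prior information: p = 0.5
n_with_prior = required_n(0.03, p_guess=0.2)  # a prior estimate of 20%

print((low, high), z, n_no_prior, n_with_prior)
```

Because $p(1-p)$ is largest at $p=0.5$, `n_no_prior` is never smaller than the sample size obtained from any other guess, which is the sense in which 0.5 is the conservative choice.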

Suggested reading: Section 3.2 of OpenIntro Statistics with Randomization and Simulation

• LO 8. Note that the calculation of the standard error of the distribution of the difference in two independent sample proportions is different for a confidence interval and a hypothesis test.
• confidence interval and hypothesis test when $H_{0}: p_{1}-p_{2}$ = some value other than 0:
$SE_{\left(\hat{p}_{1}-\hat{p}_{2}\right)} = \sqrt{\frac{\hat{p}_{1}\left(1-\hat{p}_{1}\right)}{n_{1}} + \frac{\hat{p}_{2}\left(1-\hat{p}_{2}\right)}{n_{2}}}$
• hypothesis test when $H_{0}: p_{1}-p_{2} = 0$:
$SE_{\left(\hat{p}_{1}-\hat{p}_{2}\right)} = \sqrt{\frac{\hat{p}_{pool}\left(1-\hat{p}_{pool}\right)}{n_{1}} + \frac{\hat{p}_{pool}\left(1-\hat{p}_{pool}\right)}{n_{2}}}$
• where $\hat{p}_{pool}$ is the overall rate of success:
$\hat{p}_{pool}=\frac{number\, of\, successes\, in\, group\, 1\, + number\, of\, successes\, in\, group\, 2}{n_{1} + n_{2}}$
• LO 9. Note that the reason for the difference in calculations of standard error is the same as in the case of the single proportion: when the null hypothesis claims that the two population proportions are equal, we need to take that into consideration when calculating the standard error for the hypothesis test, and use a common proportion for both samples.
Test yourself:
Suppose a 95% confidence interval for the difference between male and female U Chicago students who smoke (male - female) is (-0.08,0.11). Interpret this interval, making sure to incorporate into your interpretation a comparative statement about the two genders of U Chicago students. Does the above interval suggest a significant difference between the true proportions of smokers in the two groups?
Suppose you had a sample of 100 male U Chicago students where 11 of them smoke, and a sample of 80 female U Chicago students where 10 of them smoke. Calculate $\hat{p}_{pool}$. When and why do we use $\hat{p}_{pool}$ in calculation of the standard error for the difference between two sample proportions?
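
The two standard error formulas in LO 8 can be compared side by side in a short sketch. The function names and the counts below are hypothetical and are not taken from the exercises above.

```python
import math

def se_unpooled(x1, n1, x2, n2):
    """SE for a confidence interval (or a test with a nonzero null difference)."""
    p1, p2 = x1 / n1, x2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_pooled(x1, n1, x2, n2):
    """SE for a hypothesis test of H0: p1 - p2 = 0, using the pooled proportion."""
    p_pool = (x1 + x2) / (n1 + n2)  # overall rate of success
    return math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)

# Hypothetical counts: 45 of 150 successes in group 1, 30 of 150 in group 2.
x1, n1, x2, n2 = 45, 150, 30, 150

se_ci = se_unpooled(x1, n1, x2, n2)
se_ht = se_pooled(x1, n1, x2, n2)

# Test statistic for H0: p1 - p2 = 0 uses the pooled standard error.
z = (x1 / n1 - x2 / n2 - 0) / se_ht

print(se_ci, se_ht, z)
```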

Suggested reading: Section 4.1 of OpenIntro Statistics with Randomization and Simulation

• LO 10. Use the $t$-distribution for inference on a single mean, difference of paired (dependent) means, and difference of independent means.

• LO 11. Explain why the $t$-distribution helps make up for the additional variability introduced by using $s$ (sample standard deviation) in calculation of the standard error, in place of $\sigma$ (population standard deviation).

• LO 12. Describe how the $t$-distribution is different from the normal distribution, and what “heavy tail” means in this context.

• LO 13. Note that the $t$-distribution has a single parameter, degrees of freedom, and as the degrees of freedom increases this distribution approaches the normal distribution.

• LO 15. Use a $t$-statistic, with degrees of freedom $df=n−1$ for inference for a population mean:
• CI: $\bar{x} \pm t^{\star}_{df}SE$ and HT: $T_{df} = \frac{\bar{x}-\mu}{SE}$ where $SE = \frac{s}{\sqrt{n}}$
• LO 16. Describe how to obtain a p-value for a t-test and a critical t-score $\left(t^{\star}_{df}\right)$ for a confidence interval.
Test yourself:
• What is the $t^{\star}$ for a 95% confidence interval for the mean, where the sample size is 13?
• What is the p-value for a hypothesis test where the alternative hypothesis is two-sided, the sample size is 20, and the test statistic, T, is calculated to be 1.75?
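
For LO 16, a critical value $t^{\star}_{df}$ and a two-sided p-value are normally read from a table or from statistical software. The sketch below is one way to approximate both using only the Python standard library, by numerically integrating the $t$ density; the function names, the integration settings, and the inputs (a hypothetical sample of size 25 and test statistic 2.1) are all assumptions made for illustration. If SciPy is available, `scipy.stats.t.ppf` and `scipy.stats.t.sf` provide the same quantities directly.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_central_area(t, df, steps=2000):
    """Area under the t density between -t and t (Simpson's rule, even steps)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # integral from 0 to t, doubled by symmetry

def two_sided_p_value(t_stat, df):
    """P(|T| >= |t_stat|) for a two-sided alternative."""
    return 1 - t_central_area(abs(t_stat), df)

def t_star(confidence, df):
    """Critical value whose central area equals `confidence`, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical example: n = 25, so df = n - 1 = 24.
df = 25 - 1
critical = t_star(0.95, df)
p_value = two_sided_p_value(2.1, df)

print(critical, p_value)
```

With a very large `df`, `t_star(0.95, df)` approaches the normal critical value of about 1.96, in line with LO 13.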

Suggested reading: Section 4.2 of OpenIntro Statistics with Randomization and Simulation

• LO 17. Define observations as paired if each observation in one dataset has a special correspondence or connection with exactly one observation in the other data set.

• LO 18. Carry out inference for paired data by first subtracting the paired observations from each other, and then treating the set of differences as a new numerical variable on which to do inference (such as a confidence interval or hypothesis test for the average difference).

• LO 19. Calculate the standard error of the difference between means of two paired (dependent) samples as $SE=\frac{s_{diff}}{\sqrt{n_{diff}}}$ and use this standard error in hypothesis testing and confidence intervals comparing means of paired (dependent) groups.

• LO 20. Use a t-statistic, with degrees of freedom $df=n_{diff}−1$ for inference for the difference in two paired (dependent) means:
• CI: $\bar{x}_{diff} \pm t^{\star}_{df}SE$ and HT: $T_{df} = \frac{\bar{x}_{diff}-\mu_{diff}}{SE}$ where $SE = \frac{s_{diff}}{\sqrt{n_{diff}}}$. Note that $\mu_{diff}$ is often 0, since often $H_{0}: \mu_{diff} = 0$.
• LO 21. Recognize that a good interpretation of a confidence interval for the difference between two parameters includes a comparative statement (mentioning which group has the larger parameter).

• LO 22. Recognize that a confidence interval for the difference between two parameters that doesn’t include 0 is in agreement with a hypothesis test where the null hypothesis that sets the two parameters equal to each other is rejected.
Test yourself:
• 20 cardiac patients’ blood pressure is measured before taking a medication, and after. For a given patient, are the before and after blood pressure measurements dependent (paired) or independent?
• A random sample of 100 students was obtained and then randomly assigned into two equal-sized groups. One group went on a roller coaster while the other went in a simulator at an amusement park. Afterwards their blood pressure measurements were taken. Are the measurements dependent (paired) or independent?
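
The paired procedure in LO 18 through LO 20 reduces to one-sample inference on the differences. A minimal sketch follows, using made-up before/after measurements for eight hypothetical subjects.

```python
import math
import statistics

# Hypothetical paired measurements (made up for illustration).
before = [140, 152, 138, 160, 147, 155, 149, 143]
after = [135, 150, 136, 151, 145, 150, 147, 141]

# Step 1: subtract within each pair to get a single variable of differences.
diffs = [b - a for b, a in zip(before, after)]

# Step 2: treat the differences as one numerical sample.
n_diff = len(diffs)
x_bar_diff = statistics.mean(diffs)
s_diff = statistics.stdev(diffs)

# Step 3: standard error, degrees of freedom, and test statistic for H0: mu_diff = 0.
se = s_diff / math.sqrt(n_diff)
df = n_diff - 1
t_stat = (x_bar_diff - 0) / se

print(diffs, x_bar_diff, se, df, t_stat)
```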

Suggested reading: Section 4.3 of OpenIntro Statistics with Randomization and Simulation

• LO 23. Calculate the standard error of the difference between means of two independent samples as $SE=\sqrt{\frac{s^{2}_{1}}{n_{1}} + \frac{s^{2}_{2}}{n_{2}}}$ and use this standard error in hypothesis testing and confidence intervals comparing means of independent groups.

• LO 24. Use a t-statistic, with degrees of freedom $df=min(n_{1}−1,n_{2}−1)$ for inference for the difference in two independent means:

• CI: $\left(\bar{x}_{1} - \bar{x}_{2}\right) \pm t^{\star}_{df}SE$ and HT: $T_{df} = \frac{\left(\bar{x}_{1} - \bar{x}_{2}\right) - \left(\mu_{1} - \mu_{2}\right)}{SE}$ where $SE = \sqrt{\frac{s^{2}_{1}}{n_1} + \frac{s^{2}_{2}}{n_2}}$.
Note that $\mu_{1}-\mu_{2}$ is often 0, since often $H_{0}: \mu_{1} - \mu_{2}= 0$.
• Test yourself:
Describe how the two sample means test is different from the paired means test, both conceptually and in terms of the calculation of the standard error.
A 95% confidence interval for the difference between the number of calories consumed by mature and juvenile cats $\left(\mu_{mat}−\mu_{juv}\right)$ is (80 calories, 100 calories). Interpret this interval, and determine if it suggests a significant difference between the two means.
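
The independent-samples calculation in LO 23 and LO 24 can be sketched as below. The function name `two_sample_t` and both data sets are hypothetical. The degrees of freedom follow the conservative rule stated in LO 24, $df=\min(n_{1}-1, n_{2}-1)$; statistical software often uses a different (Welch) approximation, so results may differ slightly.

```python
import math
import statistics

def two_sample_t(sample1, sample2, null_diff=0.0):
    """Return (t statistic, df, SE) for the difference of two independent means."""
    n1, n2 = len(sample1), len(sample2)
    # statistics.variance is the sample variance s^2 (n - 1 denominator).
    se = math.sqrt(statistics.variance(sample1) / n1
                   + statistics.variance(sample2) / n2)
    df = min(n1 - 1, n2 - 1)  # conservative degrees of freedom
    point_estimate = statistics.mean(sample1) - statistics.mean(sample2)
    t_stat = (point_estimate - null_diff) / se
    return t_stat, df, se

# Hypothetical measurements from two unrelated groups (made up for illustration).
group_a = [10, 12, 14, 16, 18]
group_b = [8, 9, 10, 11, 12, 13, 14]

t_stat, df, se = two_sample_t(group_a, group_b)
print(t_stat, df, se)
```

Unlike the paired case, the two samples may have different sizes and each contributes its own variance term to the standard error.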