Language is one of the most distinctive and important abilities that mark us as uniquely human: every typically developing human child learns language, and no member of any other species on the planet does. What’s more, children appear to be uniquely adept at learning language, effortlessly acquiring complex linguistic structures that elude frustrated adults in language courses around the world. But are children really better at learning language than adults? Decades of research in language acquisition have attempted to discover whether (and why) there may be a critical period in language acquisition: an age after which the ability to learn declines dramatically.

A recent paper from Hartshorne, Tenenbaum, & Pinker (2018) attempted to address this question with a large-scale study of over 600,000 second language learners across the world. They developed a language game that uses grammatical questions to determine which dialect of English the people who play the game speak. They used the results of this game, in conjunction with demographic information about the people who played it, to study how level of English proficiency changes with both the amount of time people have spent speaking English and the age at which they started learning.

In your project, you will do your own investigation of their data, answering questions that you find compelling about the process of language acquisition. You might be interested in understanding how people in different countries learn differently, how difficult American versus British English is to learn, how different features of language (e.g. noun identification vs. tense agreement) vary in their ease of learning, etc. Any question that you can pose as a statistical inference is fair game.

As part of this project you will complete exploratory data analysis (EDA), inference, and modeling. You’ve had a chance to explore each of these methods individually, and now you’ll be asked to put them all together and apply them to say something interesting about a new dataset.

This data is stored in the feather format, a format that is faster to read in than comma-separated values (.csv) and that also maintains information about the type of each column in the dataset. To read it, you will need to load the feather package.
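As a minimal sketch of what reading a feather file looks like, the example below writes a tiny made-up data frame to a temporary file and reads it back. The toy columns and the temporary path are assumptions for illustration only; for the project you would pass the path of the course data file to `read_feather()` instead.

```r
# install.packages("feather")  # run once if the package is not installed
library(feather)

# Tiny made-up data frame (NOT the course data), used so the example runs
# without the project file.
toy <- data.frame(
  age = c(23L, 41L),
  native_english = c(TRUE, FALSE)
)

path <- tempfile(fileext = ".feather")  # hypothetical location
write_feather(toy, path)

# Reading works the same way for the real file: read_feather("<your path>")
loaded <- read_feather(path)
sapply(loaded, class)  # column types are stored in the file itself
```

Unlike a .csv file, the integer and logical columns come back with their types intact, so there is no need to re-specify column types after loading.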


We’ll be using a subset of the complete Which English data: Information about people who completed the language test in 2014. There are a total of 121 different variables you could possibly analyze.

Some of these variables code for demographic information about the participants (e.g. gender), some code for the language background of participants (e.g. native_english), and others code for their performance on the language task (e.g. q1). Some of these variables might be more useful than others for understanding variability in language learning, and some might not be useful at all. It is up to you to decide which variables are meaningful and which should be omitted from your analysis and writeup.

You might also choose to omit certain observations or restructure some of the variables to make them suitable for answering your research questions. When you are fitting a model you should also be careful about collinearity, as some of these variables may be dependent on each other.
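One hedged way to check for collinearity before fitting a model is to look at pairwise correlations among candidate predictors. The sketch below uses simulated data only; `years_of_english` and `score` are hypothetical variables constructed for illustration, not columns of the dataset.

```r
set.seed(1)
n <- 200

# Simulated data only; years_of_english is a hypothetical derived variable
# that is fully determined by the other two predictors.
toy <- data.frame(
  age           = round(runif(n, min = 18, max = 60)),
  english_start = round(runif(n, min = 0, max = 17))
)
toy$years_of_english <- toy$age - toy$english_start
toy$score <- 0.02 * toy$years_of_english + rnorm(n, sd = 0.3)

# Pairwise correlations between candidate predictors
round(cor(toy[, c("age", "english_start", "years_of_english")]), 2)

# With an exactly dependent predictor, lm() cannot estimate all three slopes
# and reports NA for one of them.
fit <- lm(score ~ age + english_start + years_of_english, data = toy)
coef(fit)
```

In real data the dependence is usually only approximate, so the warning sign is a large correlation and unstable coefficient estimates rather than an outright NA.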


  1. date: Date when the test was taken
  2. time: Time when the test was taken
Participant Demographic Information:
  1. gender: The participant’s gender
  2. age: The participant’s age
  3. psychiatric_disorder: TRUE if the participant self-identified as having a psychiatric disorder
  4. education: The participant’s educational attainment
  5. all_countries: All countries the participant has lived in
  6. current_country: Country the participant currently lives in
  7. canada_region: Region of Canada the participant lives in (or NA)
  8. us_region: Region of the US the participant lives in (or NA)
  9. uk_region: Region of the UK the participant lives in (or NA)
  10. ireland_region: Region of Ireland the participant lives in (or NA)
  11. uk_constituency: UK constituency the participant lives in (or NA)
Participant Language Information:
  1. native_languages: The participant’s native language(s)
  2. primary_languages: The participant’s primary language(s)
  3. english_start: Age at which the participant began speaking English
  4. english_country_years: Number of years the participant lived in an English-speaking country
  5. english_country_percent: The percentage of the participant’s life spent in English-speaking countries
  6. english_home: TRUE if the participant speaks English at home
  7. ebonics: TRUE if the participant speaks the Ebonics dialect of English (or NA if no answer given)
  8. english_type: whether the participant speaks a little English, a lot, or is a monolingual speaker (NA if missing).
  9. speaker_category: whether the speaker is a foreign speaker of English, a late speaker, or a native speaker
English Proficiency:
  1. q1-q35_5: Questions from the online language test. TRUE if the participant answered them correctly. Full descriptions of the questions can be found here. Grouping together subsets of these (e.g. q1-q8) might be interesting for your analysis.
  2. correct: The proportion of questions the participant answered correctly
  3. logit: An empirical logit transformation of the correct column. This makes the data more amenable for analyses that require approximately Normally distributed data.
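To illustrate grouping a subset of questions and transforming the result, here is a minimal sketch on made-up responses. The four toy question columns are invented, and the formula used, log((k + 0.5) / (n - k + 0.5)) for k correct out of n, is one common definition of the empirical logit; it is an assumption here and may differ from how the logit column in the dataset was computed.

```r
# Made-up responses for three participants on four questions (NOT real data)
toy <- data.frame(
  q1 = c(TRUE, TRUE, FALSE),
  q2 = c(TRUE, FALSE, FALSE),
  q3 = c(TRUE, TRUE, TRUE),
  q4 = c(TRUE, TRUE, FALSE)
)

question_cols <- paste0("q", 1:4)   # the subset of questions to group
n_questions   <- length(question_cols)
n_correct     <- unname(rowSums(toy[, question_cols]))

# Proportion correct on this subset of questions
toy$subset_correct <- n_correct / n_questions

# Empirical logit (assumed form): adding 0.5 keeps the value finite even
# when a participant answers every question correctly or incorrectly.
toy$subset_logit <- log((n_correct + 0.5) / (n_questions - n_correct + 0.5))

toy[, c("subset_correct", "subset_logit")]
```

The first toy participant answers every question correctly, so a plain logit of the proportion would be infinite; the 0.5 adjustment is what keeps the transformed value usable in a regression.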

Report Content

  1. Introduction (10 Points): Outline your main research question(s). Lay out the expectations that the reader should have about what is going to happen in the rest of the document. Maybe preview some of the conclusions you find.

  2. EDA (30 Points): Do some exploratory data analysis to tell some “interesting” stories about English proficiency or other variables in the data. You want us to see that you know when to use which kind of plot (histogram vs. box and whiskers vs. scatterplot etc). Maybe also try looking not just at relationships between two variables, but also, for instance, at relationships between two variables while controlling for a third.
    To get full credit, you’ll need to present at least 3 EDAs. Points will be deducted if you use an inappropriate EDA for the data you’re looking at.

  3. Inference (30 Points): Try to answer your research questions using the methods of inferential statistics (e.g. t-tests, difference between two proportions, sampling analysis like we did in the first half of the class, etc.). These questions could be used to shed some light on your choice of the “best” linear model (below). Think about what kind of test is appropriate for the kind of data you’re looking at. For example, distributions of time measures (age or english_country_years) might be skewed, so you may need to use a sampling-based method for them instead of a normal approximation.
    For full credit, present at least 3 inferential tests. It’s not critical that you do one of each kind. The important thing is to do some analyses that lead sensibly into the next section. You don’t need to analyze exactly all of the same variables, but do something that helps you tell a coherent story. E.g. this section could have some analyses that relate demographic variables to language information, and the next section could relate some of these variables to proficiency.

  4. Modeling (20 Points): Develop some simple and multiple regression models to try to understand some of the key variables in the dataset; probably the most interesting ones are correctness, english_years, and age. Maybe focus on a particular kind of question (e.g. picture identification: q1-q8)? Can you predict these from other variables in the dataset? Do they predict each other? End with a model that you think is the “best” model for explaining one of these variables. This model should be good by standard statistical metrics (i.e. include variables that reliably predict variability in the variable of interest), and also ideally theoretically plausible (i.e. you can try to tell a coherent story about it).
    For full credit, you need to make an argument for each variable you include, and possibly for a few variables you exclude (if they are really similar to variables you did include), e.g. “we used this set of demographic variables instead of this other set because they are collinear and the set we used is more predictive.”

  5. Conclusion (10 Points): A brief summary of your findings from the previous sections, without repeating your statements from earlier, as well as a discussion of what you have learned about the data and your research question(s). You should also discuss any shortcomings of your current study (either due to data collection or methodology) and include ideas for possible future research.
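As a rough illustration of how the Inference and Modeling sections above can feed into each other, the sketch below runs a permutation test (one sampling-based alternative to a normal approximation) and then compares a simple and a multiple regression. It uses simulated data only: the effect sizes are made up, and borrowing the column names english_home, english_start, and logit is purely for readability.

```r
set.seed(42)
n <- 150

# Simulated data only (NOT the real Which English data); effects are made up.
toy <- data.frame(
  english_home  = rep(c(TRUE, FALSE), each = n),
  english_start = runif(2 * n, min = 0, max = 30)
)
toy$logit <- 3 - 0.05 * toy$english_start + 0.3 * toy$english_home +
  rnorm(2 * n, sd = 0.5)

# Inference: permutation test for a difference in mean logit between groups.
observed_diff <- mean(toy$logit[toy$english_home]) -
  mean(toy$logit[!toy$english_home])
perm_diffs <- replicate(2000, {
  shuffled <- sample(toy$english_home)  # break any link between group and score
  mean(toy$logit[shuffled]) - mean(toy$logit[!shuffled])
})
p_value <- mean(abs(perm_diffs) >= abs(observed_diff))

# Modeling: compare a simple regression with a multiple regression.
simple   <- lm(logit ~ english_start, data = toy)
multiple <- lm(logit ~ english_start + english_home, data = toy)
anova(simple, multiple)          # F-test: does english_home add anything?
summary(multiple)$adj.r.squared  # adjusted R^2 of the larger model
```

The same pattern (test a group difference, then ask whether that variable still matters once other predictors are in the model) is one way to make the inference section lead sensibly into the modeling section.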

Finally, don’t stress out! We know that people came into this class with varying levels of R and general programming experience, which is why we are encouraging you to work with a partner. But we’ll also be sensitive to that when grading. This assignment is intended to show you (and us) that you learned something about the practice of statistics in this class. You don’t need to demonstrate that you learned everything.

Report format

R Markdown: All code used to generate the statistics and plots should be organized and submitted in an R Markdown document.

You can find the template for the project in the course RStudio Cloud space here: Final Project. Remember that all of your code and answers go in this document.

Teamwork and grading

You will be allowed to work with one other member of the class and submit the project together. You can either choose your partner, or if you’d prefer we can choose a partner for you. You’re also welcome to work alone if you prefer.

Both partners will receive the same grade.

Honor code

You may not discuss this project in any way with anyone outside your team, besides the professor and TA. Failure to abide by this policy will result in a 0 for all teams involved.

You can still of course ask general coding questions on Piazza. But please don’t post a large block of code and then ask “why doesn’t this knit?”


This project is an opportunity to apply what you have learned about descriptive statistics, graphical methods, correlation and regression, and hypothesis testing and confidence intervals.

The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show that you are proficient at using R at a basic level and that you are proficient at interpreting and presenting the results.

You might consider critiquing your own method, such as issues pertaining to the reliability of the data and the appropriateness of the statistical analysis you used within the context of this specific data set.