Appendix A — Example Datasets

I use several datasets throughout this book to develop concepts and to provide examples. Below, I describe each dataset and the variables it contains. You should become familiar with these datasets.

If you are taking this course from me, you will have access to these datasets through Posit Cloud. You can also download a ZIP file of these datasets here. You should unzip this folder to your own computer and place it somewhere you can easily access it. The datasets are in an RData format which can be loaded into R with the load command or by clicking the dataset from the Files tab in RStudio.
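As a minimal sketch of what the load command does, the example below saves a small stand-in data frame in RData format and then restores it. The object name example and the temporary file path are hypothetical and are not the names used in the ZIP file.

```r
# Create a small stand-in data frame and save it in RData format.
# "example" and the temporary file are hypothetical, for illustration only.
example <- data.frame(state = c("A", "B"), violent_rate = c(350.2, 410.7))
path <- file.path(tempdir(), "example.RData")
save(example, file = path)

# Remove the object, then restore it with load(). load() recreates every
# object stored in the file under its original name and returns those names.
rm(example)
loaded_names <- load(path)
print(loaded_names)
str(example)
```

Because load restores objects under the names they were saved with, you do not assign its result to a new name the way you would with a function that reads a CSV file.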

Serious researchers should be aware that I have imputed missing values and in some cases injected some randomness into variables for pedagogical purposes. The datasets should not be used for research purposes. For those interested, the fully reproducible code used to produce these analytical datasets is available here.

Crimes

The crimes data contain information on crime rates and demographic variables for all fifty US states and the District of Columbia. The crime rates are averaged over the years 2014-2018 and come from the FBI’s Uniform Crime Reports (UCR). The UCR is a program where local law enforcement agencies all report crime statistics to the FBI and these are aggregated into final crime statistics. For our purposes, we are dividing crimes into two main categories of violent and property crime.

The demographic characteristics come from the American Community Survey (ACS) between the years 2014 and 2018. The ACS is an annual sample of the US population. To get a large enough sample in each state to calculate reliable statistics (with little sampling error), I combine five years of data that are “centered” on 2016.

Here is a full description of all variables in the dataset that we will use.

violent_rate: violent crimes per 100,000 population within each state. This includes the crimes of murder, rape, robbery, and aggravated assault. By dividing the number of crimes by the population size, we avoid the problem of larger population states having more crimes because of a larger population. This is often called the crime “rate.”
property_rate: property crimes per 100,000 population. This includes the crimes of burglary, larceny, and motor vehicle theft.
median_age: Median age of a state’s population.
percent_male: Percent of a state population that is male.
percent_lhs: The percent of the state population over the age of 25 without a high school diploma.
median_income: Median household income in a state. This is measured in thousands of 2018 US dollars (i.e. 35 means $35,000). We are taking the income of each household (meaning all members of that household combined) rather than individual level income. For most purposes, this is thought to be a better measure because consumption and savings are typically organized at the household level.
unemploy_rate: Unemployment rate in the state. The unemployment “rate” is really just a percentage. It is the percentage of individuals who are not working but want to work among all those in the labor force (those who are working or looking for work).
poverty_rate: Poverty rate in the state. The poverty “rate” is also really just a percentage. It is the percent of individuals living below the poverty line. The poverty line is a number developed by the federal government. It was originally developed in the 1960s and is adjusted for inflation every year. Many people critique the poverty line as being too low because, although it tracks inflation, it has not kept pace with rising living standards.
gini: A measure of income inequality in the state. The gini coefficient is a widely used measure of how unequally income is distributed. If gini is zero, then everyone has exactly the same income. If gini is 100, then one person makes all the money and everyone else zero. The higher the gini coefficient, the more income inequality exists.
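To make the rate and gini definitions above concrete, here is a minimal sketch with made-up numbers. The counts, populations, and incomes are hypothetical, and the gini function uses a standard sorted-rank formula, which is an assumption on my part and not necessarily how the values in the dataset were calculated.

```r
# Hypothetical counts for two made-up states; not values from the crimes data.
crimes <- c(1200, 48000)
population <- c(600000, 12000000)

# Crimes per 100,000 population puts states of different sizes on one scale.
rate_per_100k <- crimes / population * 100000
rate_per_100k

# Gini coefficient on a 0-100 scale, using the sorted-rank formula:
# G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending.
gini <- function(income) {
  x <- sort(income)
  n <- length(x)
  g <- 2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
  100 * g
}

gini(rep(50, 10))          # everyone has the same income
gini(c(rep(0, 99), 100))   # one person has all of the income
```

The second state has forty times as many crimes as the first but only twice the rate. In the last line, the result falls just short of 100 because with n people the largest possible value of this formula is 100 × (n − 1)/n.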

Earnings

This data has information on the hourly wages of US workers in 2018. The data here are extracted from Current Population Survey data via IPUMS. I used the earnings data from the outgoing rotation groups (ORG) for each month of the CPS. Each household in the CPS is part of a rolling panel in which they are in for four months, out for eight months, and back in for four months. In the fourth and eighth months of inclusion, they are given additional questions as part of the outgoing rotation group. The hourly wage of salaried workers is assessed by a question on hours worked in a typical week and earnings in the prior week.

I limited the data only to those individuals between the ages of 18 and 65 in order to capture the age range of the typical worker. The dataset contains the following variables:

wages: The hourly wage for the respondent. For workers who report being paid hourly, this value is based on a direct question that asked for respondents’ hourly wages. For individuals in salaried positions, this value was derived by dividing the earnings from the previous week by the hours worked in the previous week. Anyone who reported a wage of less than one dollar is removed. Any wage higher than $99.99 is top-coded as $99.99.
age: age of the respondent in years.
gender: Male or Female.
race: The respondent’s racial identification recoded from two separate questions on race and Hispanic origin into the following categories: White, Black, Latino, Asian, Indigenous, and Other/Multiple races. The Indigenous category includes American Indians, Pacific Islanders, and Alaska Natives.
marstat: The respondent’s current marital status: never married, married, divorced or separated, and widowed.
education: The respondent’s highest educational attainment: no high school diploma, high school diploma, associate’s degree, bachelor’s degree, graduate degree. The last category includes master’s degrees, professional degrees, and doctoral degrees.
occup: The broad occupational category of the respondent. In the actual CPS data, there are hundreds of different occupations listed. For our purposes, I have simplified this into a broader (and smaller) set of occupational categories that we will use for the analysis. Here are the categories of the occupational variable, along with some examples of specific occupations:

Managers: Human Resources Managers, Operations Managers
Business/Finance Specialist: Claims Adjusters, Compliance Officers, Accountants, Tax Preparers
STEM: Computer Programmers, Civil Engineers, Biological Scientists
Doctors: Dentists, Surgeons, Optometrists
Legal: Lawyers, Judges, Paralegals
Education: Preschool and Kindergarten Teachers, Librarians
Arts, Design, and Media: Artists, Dancers and Choreographers, Writers and Authors
Other Healthcare: Registered Nurses, Physical Therapists, Dental Hygienists
Social Services: Clergy, Social Workers
Service: Waiters and Waitresses, Barbers, Bartenders
Sales: Cashiers, Telemarketers
Administrative Support: Bank Tellers, Data Entry Keyers, Receptionists
Manual: Carpenters, Logging Workers, Mining Machine Operators, Small Engine Mechanics

nchild: Number of own children living in the household with the respondent.
foreign_born: A variable indicating whether the respondent is foreign born or not. Recorded as “Yes” or “No”.
earn_type: This variable indicates whether the respondent reported being paid hourly wages or by salary.
earningwt: A technical weighting variable for use with any CPS analysis of earnings. This variable is only used in the advanced chapters.
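The construction of the wages variable described above can be sketched as follows. This is only an illustration under stated assumptions: the rows are made up, and the column names hourly_paid, weekly_earn, and weekly_hours, as well as the “Wage” and “Salary” labels, are hypothetical stand-ins for the raw CPS fields.

```r
# Hypothetical respondents; all columns except wages and earn_type are
# illustrative names, and the earn_type labels are assumed.
cps <- data.frame(
  earn_type    = c("Wage", "Salary", "Salary", "Wage", "Salary"),
  hourly_paid  = c(15.50, NA, NA, 0.75, NA),
  weekly_earn  = c(NA, 1200, 6000, NA, 900),
  weekly_hours = c(NA, 40, 50, NA, 45),
  stringsAsFactors = FALSE
)

# Hourly workers: use the directly reported wage.
# Salaried workers: divide weekly earnings by weekly hours.
cps$wages <- ifelse(cps$earn_type == "Wage",
                    cps$hourly_paid,
                    cps$weekly_earn / cps$weekly_hours)

# Remove wages below one dollar, then top-code anything above $99.99.
cps <- cps[cps$wages >= 1, ]
cps$wages <- pmin(cps$wages, 99.99)
cps[, c("earn_type", "wages")]
```

Top-coding with pmin keeps high earners in the data but caps their recorded value, so averages computed from wages will understate pay at the very top.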

Movies

The movie data contain information about 4,343 movies produced between 2000 and 2021. The data come from the Internet Movie Database and have been supplemented with extra information from the Open Movie Database. I have limited the total number of movies in the following ways:

I have restricted the dataset to English language movies produced in the US (they may be filmed elsewhere).
I have restricted movie runtime to movies that are at least 80 minutes long and no longer than 3.5 hours. The 80 minute benchmark is the lower limit for movies that the Screen Actors Guild considers “feature” films.
I have restricted the dataset to movies that received at least 500 votes on the Internet Movie Database.
I have restricted the dataset to movies that received a maturity rating between G and R.
Movies must have valid responses on all variables in the Open Movie Database and must have made at least $100,000 domestically at the box office.
I have excluded documentaries.
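Several of the restrictions above amount to a row filter, which might look something like the sketch below. The records and the votes and documentary columns are hypothetical, and the language and country restrictions are left out because I am not assuming how those fields are stored.

```r
# Hypothetical raw records; every value here is made up.
raw <- data.frame(
  title           = c("A", "B", "C", "D", "E", "F", "G"),
  runtime         = c(95, 70, 120, 230, 101, 90, 150),
  votes           = c(12000, 800, 350, 5000, 900, 4000, 650),
  maturity_rating = c("PG-13", "R", "PG", "R", "NC-17", "PG", "G"),
  box_office      = c(45.2, 0.3, 12.0, 80.1, 2.5, 1.1, 0.4),
  documentary     = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

keep <- raw$runtime >= 80 & raw$runtime <= 210 &         # 80 minutes to 3.5 hours
  raw$votes >= 500 &                                     # at least 500 votes
  raw$maturity_rating %in% c("G", "PG", "PG-13", "R") &  # rated G through R
  raw$box_office >= 0.1 &                                # $100,000, in millions
  !raw$documentary                                       # no documentaries

movies <- raw[keep, ]
movies$title
```

Only movies A and G pass every condition. Each of the others fails exactly one restriction, which makes it easy to see what each line of the filter does.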

Here are the variables we have for each movie:

year: The calendar year of the film’s release.
runtime: The length of the movie in minutes.
maturity_rating: The movie’s MPA maturity rating (G, PG, PG-13, or R).
genre: The genre of the film. This is a tricky variable to create. In actuality, movies could be listed in up to three genres in the IMDB. For example, “No Country for Old Men” is listed in the genres of crime, drama, and thriller while “Lord of the Rings: Return of the King” is listed as action, adventure, and fantasy. This is probably the best way to treat genres, but for our purposes it adds a lot of complexity. Therefore, I have recoded movies into a single “best” genre based on a decision rule where certain genres trump all others on an ordered basis. For example, comedy trumps romance, so romantic comedies will always show up in this dataset as comedies. The ordering of this system is Animation > Family > Western > Biography > Musical > Horror > Sci-Fi/Fantasy > Comedy > Sport > Romance > Action > Thriller > Mystery > Drama > All Others. For the most part, this system works well, but you may notice some odd discrepancies for a few movies.
box_office: Gross domestic (US only) box office returns for the movie in millions of US dollars. These are not adjusted for inflation.
rating_imdb: The average score (between 1 and 10) for a movie provided by IMDB users.
metascore: The movie’s metascore rating from Metacritic. The metascore is a curated weighted average of reviewer scores from a variety of sources.
awards: The number of Oscar awards that the movie received. This includes Oscars that go to individual actors (leading and supporting), as well as more general awards (best screenplay, editing, cinematography, etc.), and best picture overall.
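The decision rule behind the genre variable can be sketched as a lookup against the priority ordering given above. This is a minimal illustration, not the code used to build the dataset: the function name best_genre, the “Other” label for the last category, and the step that folds Sci-Fi and Fantasy together are all assumptions.

```r
# Priority ordering taken from the genre description; earlier entries win.
priority <- c("Animation", "Family", "Western", "Biography", "Musical",
              "Horror", "Sci-Fi/Fantasy", "Comedy", "Sport", "Romance",
              "Action", "Thriller", "Mystery", "Drama")

# best_genre() is a hypothetical helper: it returns the highest-priority
# genre among those listed for a movie, or "Other" if none are in the list.
best_genre <- function(genres) {
  # Assumed step: fold the separate Sci-Fi and Fantasy labels together.
  genres[genres %in% c("Sci-Fi", "Fantasy")] <- "Sci-Fi/Fantasy"
  ranks <- match(genres, priority)
  if (all(is.na(ranks))) {
    return("Other")
  }
  priority[min(ranks, na.rm = TRUE)]
}

best_genre(c("Romance", "Comedy"))
best_genre(c("Action", "Adventure", "Fantasy"))
best_genre(c("Crime", "Drama", "Thriller"))
```

Under this sketch, a romantic comedy comes out as Comedy, an action/adventure/fantasy movie as Sci-Fi/Fantasy, and a crime/drama/thriller as Thriller, because Crime is not in the ordering and Thriller outranks Drama.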

Politics

This data comes from the 2016 American National Election Study (ANES). The ANES is a survey of the American electorate that is conducted every two years. The study collects information on a variety of political attitudes and voting behaviors. For our purposes, we are going to primarily look at respondents’ votes for president and attitudes on three issues: (1) birthright citizenship, (2) gay marriage, and (3) global warming. The variables we will look at are:

brcitizen: Respondents were asked whether they would support a proposal to change the US Constitution to remove birthright citizenship (citizenship automatically granted to individuals born in the US regardless of their parents’ citizenship status). Respondents could either favor, oppose, or neither favor nor oppose.
gaymarriage: Respondents were asked for their position on gay marriage and were given the choices of “no legal recognition”, “civil union (but no marriage)”, or “support gay marriage.”
globalwarm: A question on whether the respondent believes that anthropogenic global warming is happening. I constructed this variable from two separate questions. The first question asks whether respondents think that global warming has been happening with the options being that it “probably has” or “probably has not.” The second question asks whether respondents thought that global warming was caused by human activity (either entirely or partially). I combine these into a single dichotomous variable where individuals either think the earth is warming from human activity or that it is not warming from human activity, where the latter category includes people who think it isn’t warming at all and people who think it is warming but not because of human activity.
party: The political party with which the respondent identifies. This does not necessarily mean that a respondent is officially registered with a given party.
relig: The respondent’s religion. This category is based on the combination of people’s statement about the kind of services they typically attend along with several non-exclusive yes/no questions about their religion (e.g. evangelical, Pentecostal, agnostic, atheist).
age: The age of the respondent.
gender: The respondent’s self-reported gender, recorded as “Male”, “Female”, or “Other.”
race: The racial identification of the respondent. Respondents could write in multiple races, but to keep it simple, we will combine the small number of individuals who reported multiple races with those who listed “Other” as their race.
educ: The education of the respondent. This is recorded as an ordinal variable. The “Some college” response indicates individuals who have attended college (including 2-year programs) but have not earned a BA.
income: The family income of the respondent in 1000s of dollars. Respondents did not give actual dollar amounts here but rather indicated which bracket of income (e.g. $20,000-30,000) they fell within. For the purposes of our class, I randomly select an actual value within this bracket for each respondent.
workstatus: The work status of the respondent. Respondents could either be working, unemployed, or out of the labor force. The last category refers to people who are not employed and not currently looking for work, whereas unemployed indicates a person who is not employed and is currently looking for work.
military: Whether the respondent has ever served or is currently serving in the US military.
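Two of the variables above, globalwarm and income, are constructed rather than taken directly from the survey. A minimal sketch of both constructions follows. The column names happening, human_caused, bracket_low, and bracket_high, the “Yes”/“No” labels, and the use of a uniform draw within each bracket are all assumptions made for illustration.

```r
set.seed(42)  # make the random draws repeatable

# Hypothetical responses; bracket bounds are in thousands of dollars.
anes <- data.frame(
  happening    = c("Probably has", "Probably has", "Probably has not"),
  human_caused = c(TRUE, FALSE, NA),
  bracket_low  = c(20, 50, 100),
  bracket_high = c(30, 60, 125),
  stringsAsFactors = FALSE
)

# "Yes" only if warming is happening AND human activity is a cause.
# %in% TRUE treats a missing follow-up answer as not human-caused.
anes$globalwarm <- ifelse(anes$happening == "Probably has" &
                            anes$human_caused %in% TRUE,
                          "Yes", "No")

# Draw one value per respondent, uniformly within their own bracket.
anes$income <- runif(nrow(anes), min = anes$bracket_low,
                     max = anes$bracket_high)
anes[, c("globalwarm", "income")]
```

The “No” category mixes two different groups, those who think there is no warming and those who think warming is not human-caused, which matches the description of the dichotomous variable above.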

Popularity

This data comes from the National Longitudinal Study of Adolescent to Adult Health (Add Health), conducted by the Carolina Population Center at UNC-Chapel Hill and supported by a grant from the National Institute of Child Health and Human Development. The first wave of the study, which we are using, surveyed adolescents between 7th and 12th grade in the 1994-95 school year. One of the particularly valuable features of the Add Health survey is that many respondents were in the “saturation sample”, which sampled all students at 16 schools. In this saturation sample, students were asked who their friends and sexual partners were, which allows researchers to construct network maps of adolescent social systems.

We will use this saturation sample to look at a basic measure of that network that estimates students’ popularity. This measure, which is called “in degree” in the network analysis literature, measures the number of times a student was nominated as a friend by other students in the school. We will treat it as a simple proxy measure of a student’s popularity. We can then look at what other student characteristics were positively or negatively associated with a student’s popularity.
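To illustrate what in degree means, here is a minimal sketch that counts friend nominations from a small made-up list. The student names and the from/to column names are hypothetical.

```r
# Each row is one nomination: "from" named "to" as a friend.
nominations <- data.frame(
  from = c("Ana", "Ben", "Cal", "Dee", "Ana", "Ben"),
  to   = c("Ben", "Cal", "Ben", "Ben", "Cal", "Ana"),
  stringsAsFactors = FALSE
)
students <- c("Ana", "Ben", "Cal", "Dee")

# In degree: how many times each student appears as the person nominated.
# Using a factor with all students as levels keeps those with zero nominations.
in_degree <- table(factor(nominations$to, levels = students))
in_degree
```

Ben receives three nominations, Cal two, Ana one, and Dee none. Note that the count ignores whom a student nominates: Dee nominated Ben, but nobody nominated Dee.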

Here is a full description of all variables in the dataset that we will use.

grade: Student’s grade in school.
race: A six-category nominal variable indicating the race that the student best thought described them when asked to choose a single race: white, black, Latino, Asian, American Indian, other.
gender: Student’s gender. Students were only reported as male or female.
nominations: The number of friend nominations received from other students at the same school. This is the measure of popularity that we will use.
alcoholuse: Students who reported drinking at least once or twice a month in the last twelve months were treated as “Drinkers” and all other students as “Non-drinkers.”
smoker: Students who reported smoking more than 5 cigarettes in the past 30 days were treated as “Smokers” and all others as “Non-smokers.”
pseudo_gpa: Students were asked for the most recent letter grade in four course types: math, language arts, science, and social studies. This variable was constructed by calculating GPA from those four responses.
honor_society: Whether the student was in honor society or not. Recorded as “Yes” or “No.”
bandchoir: Whether the student was in band or choir. Recorded as “Yes” or “No.”
nsports: The number of different school sports a student reported participating in. Students who reported more than six sports were top-coded at the value of six.
parent_income: Parent’s household income measured in thousands of dollars.
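The pseudo_gpa and nsports constructions described above can be sketched as follows. The grade-point scale (A = 4 down to D = 1), the subject column names (including treating the fourth subject as social studies), the sports_reported column, and the choice to average over non-missing grades are all assumptions for illustration.

```r
# Assumed point values for each letter grade.
grade_points <- c(A = 4, B = 3, C = 2, D = 1)

# Hypothetical students; subject columns hold the most recent letter grade.
students <- data.frame(
  math            = c("A", "B", "C"),
  language_arts   = c("A", "C", "D"),
  science         = c("B", "B", NA),
  social_studies  = c("A", "C", "C"),
  sports_reported = c(2, 8, 0),
  stringsAsFactors = FALSE
)
subjects <- c("math", "language_arts", "science", "social_studies")

# Convert letters to points column by column, then average across subjects.
points <- sapply(students[subjects], function(g) unname(grade_points[g]))
students$pseudo_gpa <- rowMeans(points, na.rm = TRUE)

# Top-code the number of sports at six.
students$nsports <- pmin(students$sports_reported, 6)
students[, c("pseudo_gpa", "nsports")]
```

The first student has three A grades and one B, giving a pseudo GPA of 3.75. The second student reported eight sports and is recorded as six.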

Sex

The sex data come from the General Social Survey (GSS) for the years between 2014 and 2021. The GSS is a survey of attitudes that is conducted every two years by the National Opinion Research Center (NORC). In addition to many other questions, respondents were asked a question about the frequency of sexual activity. We will examine that variable and its relationship to social and demographic characteristics such as age, education, and marital status. Here are the variables we will look at:

sexf: A quantitative variable indicating the frequency of sexual activity as the number of sexual encounters per year. The sexual frequency response was originally coded as an ordinal scale variable in which respondents were given a set of options from less to more sexual activity in the previous year. To have more quantitative data to work with, I have recoded this ordinal variable into a quantitative variable by randomly assigning everyone a value around the mean of their ordinal response. This creates more noise in the dataset but should produce results that are generally consistent with the original ordinal scale.
gender: The gender of the respondent.
age: The age of the respondent. The GSS only surveys adults aged 18 years and older.
marital: Marital status of the respondent: Never married, married, divorced, and widowed. A small number of “married, but separated” individuals are treated as “divorced” here.
sexorient: Sexual orientation of the respondent: heterosexual, gay or lesbian, or bisexual.
relig: Religious affiliation of the respondent. Protestants have been divided into “Mainline” and “Evangelical” based on a coding of specific denominations used by the GSS.
educ: Years of education for the respondent.
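As a rough illustration of the kind of recoding described for sexf, the sketch below turns an ordinal response into a count by drawing a random value around an assumed yearly mean for each category. The category labels, the assumed means, and the use of a Poisson draw are hypothetical choices of mine and are not the procedure used to build the dataset.

```r
set.seed(1)  # make the random draws repeatable

# Assumed yearly means for three illustrative categories:
# monthly is 12 times per year, weekly is 52 times per year.
category_means <- c("Not at all" = 0,
                    "About once a month" = 12,
                    "About once a week" = 52)

# Hypothetical ordinal responses.
response <- c("Not at all", "About once a month", "About once a week",
              "About once a month", "About once a week")

# Draw a count around each respondent's category mean.
sexf <- rpois(length(response), lambda = category_means[response])
data.frame(response, sexf)
```

Respondents in the same category end up with different values, which is the added noise mentioned in the description of sexf, while the category averages stay close to the assumed means in a large sample.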

Titanic

The titanic data contain information on all 1,309 passengers aboard the Titanic. The data do not include information about the crew. The data primarily come from the online database, Encyclopedia Titanica. Here are the variables we will look at:

survival: Did the passenger survive?
sex: The reported sex of the passenger.
age: The age of the passenger. This variable is reported in whole numbers for those over one year old and as a decimal (based on months of age) for infants under a year of age.
agegroup: A categorical variable indicating whether the person was an adult or a child. I have constructed this variable from the age variable. The cutoff for adults is sixteen years of age.
pclass: There were three passenger classes: First, second, and third (also known as steerage). To give some pop culture references, Rose was first class, and Jack was third class. Most of the passengers were in third class.
fare: The fare paid for the ticket, measured in British pounds.
family: The number of family members traveling with the passenger. These family members can be parents, spouses, siblings, or children.
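The agegroup variable can be rebuilt from age with a single comparison, as in the minimal sketch below. The ages are made up, and treating passengers who are exactly sixteen as adults is an assumption, since the description above does not say which side of the cutoff they fall on.

```r
# Hypothetical ages; 0.75 represents an infant of nine months (9 / 12).
titanic <- data.frame(age = c(0.75, 8, 15, 16, 42))

# Assumed rule: sixteen and older is an adult, younger is a child.
titanic$agegroup <- factor(ifelse(titanic$age >= 16, "Adult", "Child"),
                           levels = c("Child", "Adult"))
table(titanic$agegroup)
```

With these five made-up passengers, three are classified as children and two as adults.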