1 Understanding Data
In this first module, we will cover what it actually means to have “data” and give a broad overview of what kinds of things we can do with data. Data are the foundation of any statistical analysis and most data that we use in the social sciences consist of variables measured on some observations. In the next two sections, we will learn more about these concepts.
Slides for this module can be found here.
What Does Data Look Like?
The data that we look at typically take the format of a “spreadsheet” with rows and columns. Table 1.1 below shows some characteristics of four randomly drawn passengers from the Titanic, in this type of spreadsheet format.

Table 1.1: Four randomly drawn passengers from the Titanic

survival | sex | age | agegroup | pclass | fare | family |
---|---|---|---|---|---|---|
Survived | Female | 24.0000 | Adult | First | 69.3000 | 0 |
Died | Male | 24.0000 | Adult | Third | 7.7958 | 0 |
Survived | Male | 0.9167 | Child | First | 151.5500 | 3 |
Died | Male | 60.0000 | Adult | First | 26.5500 | 0 |
Clearly, we can see variation in who survived and died, the passenger classes they were housed in, gender, and age. We also have a measure of the fare they paid for the trip (in English pounds) and the number of family members traveling with them. To understand how to think about data, we need to understand the concepts of an observation and a variable and the distinction between them.
The observations
The observations are what you have on the rows of your dataset. In the Titanic example, the observations are individual passengers on board the Titanic, but observations can take many different forms.
We use the term unit of analysis to distinguish what kind of observation you have in your dataset. If you are interviewing individual people and recording their responses, then the unit of analysis is individual people. If you are collecting cross-national data by country, then the unit of analysis would be a country. If you are analyzing data on the “best colleges in the US” then the unit of analysis is a university/college. The most common unit of analysis that we will see in this course is an individual person, but several of our datasets involve other units of analysis and it is important to keep in mind that an observation can be many different kinds of things.
The variables
The variables are what you have on the columns of your dataset. Variables measure specific attributes of your observations. If you conduct a survey of individual people and ask them for their age, gender, and education, then these three attributes would be recorded as variables in your dataset. We refer to these attributes as “variables” because they can take different values across the observations. If you were to conduct a survey of individual people and ask your respondents if they are human, then you probably wouldn’t have a proper variable because everyone would respond “yes” and therefore you would observe no variation.
There are two major types of variables. Some variables measure quantities of something and thus can be represented by a number. We refer to these as quantitative variables. Other variables indicate a category to which the observation belongs. We refer to these as categorical variables.
Quantitative variables
Quantitative variables measure quantities of something. A person’s height, a worker’s hourly wage, the number of children that a woman has birthed, a country’s gross domestic product, a US state’s poverty rate, and the percent of a university’s student body that are women are all examples of quantitative variables. They can all be represented by a number which indicates how much of the thing the observation has.
Quantitative variables can be further divided into two important sub-types: discrete and continuous variables. Discrete variables can logically only take certain values within a range. The most common example of a discrete variable is a count variable. The number of children that a woman has birthed is an example of a count variable. This number can only take the value of whole numbers (integers) such as 0, 1, 2, 3, and so on. It would make no sense for a respondent to say they had given birth to 2.5 children. Count variables are discrete variables because only whole numbers are logical responses.
On the other hand, continuous variables can logically take any value within a range. A person’s height is an example of a continuous variable. It is true that we typically measure height only down to a certain level of precision, such as inches. We might think that if we were to measure a person’s height in inches, it would only take whole number values and therefore be a discrete variable. However, limitations in measurement precision don’t define whether a variable is continuous or discrete. Rather, the distinction is whether the value could be logically measured to any degree of accuracy. We often measure height out to half inches, and we could imagine that if we have a precise enough measurement instrument, we could measure a person’s height out to any decimal level that we desired. So, it is perfectly sensible for someone to say they were 69.825467 inches tall, even though we might think they are being a bit tedious.
In both the discrete and continuous cases, you might notice that I said “within a range.” Depending on the variable, there are also often logical limits on minimum and maximum values. For example, you can’t have negative children or height. While we have no exact upper limits to the values that either variable can take, we would likely think we have a data coding error if we saw a report of a 20-foot person or a woman who gave birth to 50 children. In general, both discrete and continuous variables can be limited in the range of values that they can take. What distinguishes them from each other is what values they can logically take within that limited range.
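In R, both sub-types are stored as numeric vectors; the distinction is a logical one about what values make sense. As a minimal illustration with made-up values (these are not from any course dataset):

children <- c(0, 2, 1, 3, 0)        # discrete count variable: only whole numbers make sense
height <- c(64.5, 69.825467, 71.0)  # continuous variable: any value within the range is possible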
Categorical variables
Categorical variables are not represented by numerical quantities but rather by a set of mutually exclusive categories to which observations can belong. The gender, race, political party affiliation, and highest educational degree of a person, the public/private status of a university, and the passenger class of a passenger on the Titanic are all examples of categorical variables.
Categorical variables can also be sub-divided into two sub-types. Ordinal variables are categorical variables in which the categories have an explicit and logical ordered structure, while nominal variables are categorical variables in which the categories are unordered. Highest educational degree is an example of an ordinal variable because it is ordered such that Graduate Degree > BA Degree > AA Degree > High School Diploma > No degree. Passenger class is also an ordinal variable that starts in Third class (or steerage, think Leonardo DiCaprio) and ends in First class (think Kate Winslet), with Second class in between.
Race, gender, and political party affiliation are all examples of nominal variables because the categories of these variables have no logical ordering. While some people might have their own political party preferences, these sorts of normative evaluations of categories are irrelevant. For the same reason, even the variable of survival on the Titanic is a nominal variable. We don’t judge the value of life and death, we just record it!
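In R, categorical variables are typically stored as factors, and an ordinal variable can be stored as an ordered factor. Here is a minimal sketch with invented values (not drawn from the actual Titanic data):

# a nominal variable: the categories have no logical ordering
survival <- factor(c("Survived", "Died", "Survived"))
# an ordinal variable: the levels argument spells out the ordering
pclass <- factor(c("Third", "First", "Second"),
                 levels = c("Third", "Second", "First"),
                 ordered = TRUE)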
Using course example datasets in R
Throughout this book and the accompanying slides, I will use several example datasets. If you are taking this course from me, you will also access these datasets to complete assignments. More information about these datasets can be found in Appendix A, but here I provide a brief overview of each dataset. You should take the time to familiarize yourself with all of the details in Appendix A.
- Crimes
- The crimes data contain information on violent and property crime rates and demographic variables for all fifty US states and the District of Columbia. The crime rates are averaged over the years 2014-2018 and come from the FBI’s Uniform Crime Reports (UCR). The demographic variables include information on poverty rates, education levels, income inequality, and median income.
- Earnings
- This data has information on the hourly wages of US workers in 2018. The data here are extracted from the Current Population Survey. We will use it to look at the relationship between a variety of demographic variables and how much a person earns.
- Movies
- The movie data contain information about 4,343 full feature English language movies produced in the US between 2000 and 2021. The data come from the Internet Movie Database and have been supplemented with extra information from the Open Movie Database. Variables include box office returns, movie runtime, maturity ratings, and viewer and critic review scores.
- Politics
- This data comes from the 2016 American National Election Study (ANES). The ANES is a survey of the American electorate that is conducted every two years. The study collects information on a variety of political attitudes and voting behaviors, including each respondent’s presidential vote.
- Popularity
- This data comes from the National Longitudinal Study of Adolescent to Adult Health (Add Health), conducted by the Carolina Population Center at UNC-Chapel Hill and supported by a grant from the National Institute of Child Health and Human Development. The sample we will use includes all students in 16 high schools in 1994-95. Students were asked to nominate their friends in schools and we will use the number of friend nominations received as an indicator of popularity.
- Sex
- The sex data come from the General Social Survey (GSS) for the years between 2014 and 2021. Respondents were asked about their sexual frequency and we will use their responses to explore the relationship between sexual frequency and a variety of demographic variables.
- Titanic
- The titanic data contain information on all 1,309 passengers aboard the Titanic. The data do not include information about the crew. The data primarily come from the online database, Encyclopedia Titanica.
If you are taking this course from me, you will have access to these datasets through a Posit Cloud project. You can also get the datasets directly here.
The dataset files all have an *.RData extension. You can load a dataset into RStudio in one of two ways. From within RStudio, you can click on the file in your Files tab and you will be prompted to load the file. Alternatively, you can use the load command to load the file. For example, if I wanted to load the file sex.RData, I would type the following into the R console:
load("sex.RData")
load is an example of an R command (or function), which I will have more to say about in the next module.
This command will only work if the file is in the working directory shown at the top of the R console. Typically, your working directory will be the top-level directory of the projects we work in. If, for example, your dataset files are in a subdirectory of the working directory called input, you would instead need to type:
load("input/sex.RData")
If you alternatively click on the dataset to load it, R will run the load command with the correct path to the data. This can be useful if you are having trouble providing the correct path to your data.
Once the load is successful, you will see an object titled sex in your Environment tab. To take a glance at this dataset, we can just type its name into the R prompt:
sex
# A tibble: 11,785 × 7
sexf gender age marital sexorient relig educ
<dbl> <fct> <dbl> <fct> <fct> <fct> <dbl>
1 29 Male 53 Divorced Heterosexual Catholic 16
2 138 Female 26 Married Heterosexual Catholic 16
3 26 Male 59 Divorced Heterosexual Evangelical Protestant 13
4 50 Female 56 Married Heterosexual Catholic 16
5 44 Female 74 Married Heterosexual Catholic 17
6 68 Female 56 Married Heterosexual Mainline Protestant 17
7 1 Male 63 Married Heterosexual Catholic 12
8 97 Male 34 Married Heterosexual Catholic 17
9 29 Female 37 Never married Heterosexual Catholic 10
10 112 Female 30 Married Heterosexual Catholic 15
# … with 11,775 more rows
By default, only the first ten rows will be shown. You can also click on the dataset in your Environment tab to open up a data viewer in RStudio that gives you a fuller picture.
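If you prefer to work from the console, calling the View function should open that same spreadsheet-style viewer:

View(sex)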
I can see that this dataset contains multiple variables. What if I want to reference a specific variable in this dataset? In R, we do this with the following syntax:
dataset_name$variable_name
To reference a particular variable, we use the dollar sign after the dataset name and then follow that with the variable name. We will use this syntax a lot in this course. For example, let’s say we want to calculate the mean (which we will learn more about in the next module) of the educ variable. It turns out that there is a mean function that expects us to feed in a variable. So,
mean(sex$educ)
[1] 14.07849
If you type this command in RStudio, you will notice that RStudio will start offering possible variable names as soon as you enter the $.
One of the most common problems students first have when learning R looks something like:
Error in mean(sex$educ) : object 'sex' not found
What is R trying to tell you here? It’s telling you that your sex dataset was not found. Remember that you have to load your dataset as a first step. It seems simple, but I guarantee you will forget sometimes (I sure do). Just remember that, when you see error messages like this one, don’t panic. It’s a simple mistake and R is telling you exactly what the problem is. Just load the dataset and you should be able to continue on.
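In this case, loading the dataset and re-running the earlier command resolves the error:

load("sex.RData")
mean(sex$educ)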
What Can We Do With Data?
We now understand the structure of data better, but what do social scientists do with this kind of data? In the first part of this course, we will learn three fundamental data analysis tasks: analysis of the distribution of a single variable, measuring association, and statistical inference. In the final part of the course, we will build on these fundamentals to learn how to build more complex statistical models.
How is a variable distributed?
Sometimes, we just want to understand what a single variable “looks like.” We may be interested in its “average” value or we may want to know something else, like how spread out the values of the variable are. Alternatively, we may just want to get some visual sense of where the values of this variable fall. In these cases, we calculate univariate (Latin for “one variable”) statistics on the distribution of a variable and create figures that graphically show us what these distributions look like. Typically, univariate statistics aren’t as interesting to social scientists as the measures of association discussed below, but even if our ultimate goal is something more complex, it is always a good idea to start our analysis by examining univariate statistics and looking at distributions to understand all of the variables in our research.
In some cases, the calculation of a univariate statistic is the important question at hand. For example, when political pollsters try to figure out who is going to win an election, they are very much interested in the univariate distribution of support for each candidate, which gives the proportions of likely voters who intend to vote for each candidate. Here are some other questions we could ask about the distribution of variables in our datasets (a brief R sketch follows this list):
- How much variability is there in the amount of money that movies make?
- What percent of passengers survived the Titanic disaster?
- What is the average age of voters in the United States?
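As a preview of what these calculations look like in R, here is a minimal sketch for the Titanic data. It assumes the dataset object is named titanic and uses the variable names shown in Table 1.1; the na.rm = TRUE argument simply drops any missing values.

mean(titanic$fare, na.rm = TRUE)     # the "average" fare paid, in pounds
sd(titanic$fare, na.rm = TRUE)       # how spread out the fares are
prop.table(table(titanic$survival))  # the proportion of passengers who survived and died
hist(titanic$fare)                   # a figure showing the full distribution of fares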
Measuring association
Social scientists are often most interested in the relationships, or association, between two or more variables. These associations allow us to test hypotheses about causal relationships between underlying social constructs. For example, we might be interested in whether divorce affects children’s well-being. In this case, we would want to examine the relationship between a categorical variable indicating whether a child’s parents were divorced and some measure of their well-being, such as feelings of stress, academic performance, etc. Here are some questions about association we could ask in our data:
- Did the probability of surviving the Titanic depend on passenger class? (categorical and categorical)
- Do the earnings of movies vary by genre? (quantitative and categorical)
- Is income inequality in a state related to its crime rate? (quantitative and quantitative)
How we measure association depends very much on whether the variables are categorical or quantitative. Later this term, we will learn different techniques for measuring association between two quantitative variables, two categorical variables, and a categorical and quantitative variable. So, you should always ask yourself “what kind of variables do I have?” before trying to measure association to make sure you are using the right methods.
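As a sketch of what this can look like for the first question above (two categorical variables), here is one way we might cross-tabulate passenger class and survival, again assuming the Titanic data are loaded as an object named titanic with the variable names from Table 1.1:

# row proportions: the share of each passenger class that survived or died
prop.table(table(titanic$pclass, titanic$survival), margin = 1)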
Making statistical inferences
If I told you that, in a sample of twenty people, brown-eyed individuals make $5 more than all other eye colors combined, would you believe I was capturing something real? You probably shouldn’t, because in a sample of twenty people, odd results like this can easily happen just by random chance. If I told you I observed this phenomenon in a well-drawn sample of 20,000 individuals, you would probably be more likely to believe me.
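To see how easily such a pattern can appear by chance in a small sample, here is a small simulation sketch (the variable names and numbers are invented for illustration). We generate wages that have nothing to do with eye color and then compare the group means:

set.seed(42)
eye_color <- sample(c("brown", "other"), size = 20, replace = TRUE)
wage <- rnorm(20, mean = 25, sd = 10)  # wages drawn independently of eye color
tapply(wage, eye_color, mean)          # the group means will often differ by several dollars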
The statistical concept underlying our intuition here is called statistical inference. We often draw samples of observations from a larger population and want to know what is happening in the population. Statistical inference is the technique of quantifying how uncertain we are about whether our sample data are similar to the population or not. When you hear press reports on political polls use the term “margin of error,” they are referring to statistical inference.
Many introductory statistics courses focus most of their attention on statistical inference, partly because it is more abstract and complex. However, statistical inference is always secondary to the basic descriptive measures of univariate, bivariate, and multivariate statistics. Therefore, I spend considerably less time on this topic than most statistics courses do, so that we can focus on the more important stuff.
Building Models
Although our basic measures of association are useful, the most common tool in social science analysis is a statistical model in which the user can specify the relationships between variables by some kind of mathematical function. In the final module of the course, we will learn how to build basic versions of these models that allow us to examine the relationships between multiple quantitative and categorical variables. This module will build on our work in the previous modules. We will specifically focus on two uses of statistical models.
First, statistical models allow us to “control” for other variables when we look at the association between any two variables. Controlling for other variables is important because these other variables may confound the relationship we want to measure. For example, we may be interested in the relationship between marital status (e.g. never married, married, widowed, divorced) and sexual frequency in our GSS data. However, these different groups vary significantly in their age. Never married individuals are much younger than all of the other groups and widowed individuals are much older. Given the fact that sexual frequency tends to decline with age (something we will show later in this term), it seems problematic to just compare the average sexual frequency across these groups because this advantages the never-married and disadvantages the widowed. In this case, age confounds the relationship between marital status and sexual frequency. Statistical models will give us tools to account for this problem and to get a better estimate of the relationship between marital status and sexual frequency, net of this confounding influence.
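As a rough preview, here is a minimal sketch of what this looks like in R using the sex dataset loaded earlier. The lm function fits a linear model (we will cover it properly in the final module), and adding age to the formula “controls” for age when comparing sexual frequency across marital statuses:

model_simple     <- lm(sexf ~ marital, data = sex)        # no control for age
model_controlled <- lm(sexf ~ marital + age, data = sex)  # marital status differences, net of age
summary(model_controlled)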
Second, statistical models will allow us to account for how the relationship between two variables might differ depending on the context of a third variable. This is what we call an interaction. For example, let’s say we were interested in the relationship between the number of sports played and a student’s popularity (measured by friend nominations) in our example data. Because of gender norms, we might suspect this relationship to be different for boys and girls. We can use statistical models to empirically examine whether this suspicion is correct. This kind of contextualization is an important component of sociological practice.
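Interactions are specified in much the same way. Because the variable names in the popularity data are not listed here, this sketch asks an analogous question with the sex dataset instead: does the relationship between age and sexual frequency differ by gender?

# the * operator includes age, gender, and their interaction in the model
model_interaction <- lm(sexf ~ age * gender, data = sex)
summary(model_interaction)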
Observational Data, Experimental Thinking
Much of the data that we use in sociology is observational rather than experimental. In an experimental design, the researcher randomly selects subjects to receive some sort of treatment and then observes how this treatment affects some outcome. Thus, the researcher engages in systematic manipulation to observe a response. In observational data, the researcher does not directly manipulate the environment but rather just observes and records the social setting as it is.
Experimental data can be more powerful than observational data because the random assignment of a treatment through researcher manipulation strengthens claims of causation. If the researcher observes a relationship between the treatment and outcome, they know that it must either be causal or a result of random chance because the assignment of the treatment was randomly determined. In observational data, the relationship between any two variables can also be a spurious relationship. Spuriousness occurs when another confounding variable or variables produce the relationship between the two observed variables rather than them directly causing each other. The example above about marital status and sexual frequency is a simple example of this problem. If we note that widows have less sex than other people, we may be tempted to think that something about being widowed reduces someone’s sexual drive or their interactions with others. However, the more obvious explanation is that widows tend to be quite a bit older than other marital status groups and older people have less sex. Age is generating a spurious relationship between widowhood and sexual frequency. This spuriousness problem is the reason for the frequent claim that “correlation is not causation.”
There are two different philosophical approaches to the statistical analysis of observational data where spuriousness can be a problem. The first approach treats our data and methods in a pseudo-experimental manner. The goal of this approach is to try to find ways to mimic the experimental design approach with observational data. At a basic level, this can include “controlling” for other variables (which we will learn) and can extend to a variety of techniques of causal modeling that are intended to use some feature of the data to recover causation (which are beyond the scope of what we will learn in this course).
The second approach treats statistical analysis as a way to describe observed data in a formal, systematic, and replicable way. The goal is to establish to what extent the data are consistent with competing theories that seek to understand the outcome in question, rather than to mimic the experimental approach. Although quantitative and qualitative approaches are often seen as philosophically different approaches, this approach to observational data shares many features with more purely qualitative approaches to data analysis. This is the approach that I take in this course.