Below is a list of common commands that we use in the class, along with some examples. You can view a help file in RStudio for each command by searching for the command name in the help tab in the lower right panel. You can also just type the name of the command preceded by a “?” into the console. For example, if you wanted to understand how table works, type:
?table
The list here does not contain information about making plots in R. That information is in Appendix C.
Calculate the mean of a quantitative variable. Remember that this command will not work for categorical variables.
mean(earnings$wages)
[1] 24.27601
None of the example datasets that we examine have missing values, but it is important to recognize that if a variable has missing values, then the mean command and many of the other commands listed here will return NA by default rather than a mean. To calculate the mean only for the non-missing cases, you need to add the na.rm=TRUE argument.
mean(earnings$wages, na.rm=TRUE)
[1] 24.27601
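As a minimal sketch of this behavior, the vector x below is hypothetical (it is not one of the course datasets) and contains one missing value. The median and sd commands accept the same na.rm=TRUE argument.

```r
# hypothetical data with one missing value
x <- c(10, 20, NA, 30)

mean(x)                  # returns NA because of the missing value
mean(x, na.rm = TRUE)    # returns 20, the mean of 10, 20, and 30
median(x, na.rm = TRUE)  # returns 20
sd(x, na.rm = TRUE)      # returns 10
```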
Calculate the median of a quantitative variable. Remember that this command will not work for categorical variables.
median(earnings$wages)
[1] 19.21667
Calculate the standard deviation of a quantitative variable. Remember that this command will not work for categorical variables.
sd(earnings$wages)
[1] 16.23676
Calculate the interquartile range of a quantitative variable. Remember that this command will not work for categorical variables.
IQR(earnings$wages)
[1] 17
Calculate percentiles of a distribution. Remember that this command will not work for categorical variables. By default, the quantile command will return the minimum, the three quartiles, and the maximum (0%, 25%, 50%, 75%, and 100%). If you want different percentiles, you will have to specify the probs argument.
quantile(earnings$wages)
0% 25% 50% 75% 100%
1.00000 13.00000 19.21667 30.00000 99.99000
#get the 10th and 90th percentile instead
quantile(earnings$wages, probs = c(0.1,0.9))
10% 90%
10.000 47.596
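The IQR and quantile commands are closely related: the interquartile range is the distance between the 25th and 75th percentiles. The sketch below checks this on a small hypothetical vector x (not one of the course datasets) and also shows how seq can build a longer probs argument.

```r
# hypothetical data
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# 25th and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.75))

# the difference between them matches the IQR command
iqr_by_hand <- unname(q["75%"] - q["25%"])
all.equal(iqr_by_hand, IQR(x))   # TRUE

# every 10th percentile from the 10th to the 90th
deciles <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
deciles
```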
Calculate the absolute frequencies of the categories of a categorical variable.
table(popularity$race)
                          White          Black/African American
                           2630                            1148
                         Latino          Asian/Pacific Islander
                            405                             162
American Indian/Native American                           Other
                             26                              26
Calculate the proportions (i.e. relative frequencies) of the categories of a categorical variable. This command must be run on the output from a table command. You can do that in one command by nesting the table command inside the prop.table command.
prop.table(table(popularity$race))
                          White          Black/African American
                    0.598135092                     0.261087105
                         Latino          Asian/Pacific Islander
                    0.092108256                     0.036843302
American Indian/Native American                           Other
                    0.005913123                     0.005913123
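The nesting can also be written in two steps, which may make it easier to see what prop.table receives. This is only a sketch on a hypothetical variable named pets, not one of the course datasets.

```r
# hypothetical categorical variable
pets <- factor(c("dog", "cat", "dog", "bird", "dog", "cat"))

# step 1: absolute frequencies
counts <- table(pets)

# step 2: proportions, computed from the table output
props <- prop.table(counts)

sum(props)               # the proportions always sum to 1
round(100 * props, 1)    # the same information as percentages
```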
Provide a summary of a variable, either categorical or quantitative.
summary(earnings$wages)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 13.00 19.22 24.28 30.00 99.99
summary(popularity$race)
                          White          Black/African American
                           2630                            1148
                         Latino          Asian/Pacific Islander
                            405                             162
American Indian/Native American                           Other
                             26                              26
You can also feed in an entire dataset to get a summary of each variable.
summary(earnings)
     wages            age           gender             race
 Min.   : 1.00   Min.   :18.00   Male  :74384   White         :97948
 1st Qu.:13.00   1st Qu.:30.00   Female:71263   Black         :14259
 Median :19.22   Median :40.00                  Latino        :21116
 Mean   :24.28   Mean   :40.84                  Asian         : 8347
 3rd Qu.:30.00   3rd Qu.:52.00                  Indigenous    : 1872
 Max.   :99.99   Max.   :64.00                  Other/Multiple: 2105
               marstat                 education
 Never Married     :46629   No HS Diploma   : 9614
 Married           :79001   HS Diploma      :64717
 Divorced/Separated:17987   AA Degree       :16543
 Widowed           : 2030   Bachelors Degree:35455
                            Graduate Degree :19318
                     occup            nchild        foreign_born
 Manual                     :31458   Min.   :0.0000   No :122220
 Service                    :24383   1st Qu.:0.0000   Yes: 23427
 Administrative Support     :17937   Median :0.0000
 Sales                      :13619   Mean   :0.8489
 Manager                    :11642   3rd Qu.:2.0000
 Business/Finance Specialist:11036   Max.   :9.0000
 (Other)                    :35572
  earn_type       earningwt
 Salary:59397   Min.   :  778.3
 Wage  :86250   1st Qu.: 5459.7
                Median :11929.3
                Mean   :10612.4
                3rd Qu.:14564.6
                Max.   :47137.4
The table command can also be used to create a two-way table, although further work is needed to extract useful information from it.
table(movies$genre, movies$maturity_rating)
G PG PG-13 R
Action 0 7 174 230
Animation 35 188 10 7
Biography 0 36 142 180
Comedy 2 87 499 594
Drama 1 19 82 199
Family 17 192 21 0
Horror 0 2 124 287
Musical 0 13 82 68
Mystery 0 2 21 55
Other 0 0 0 0
Romance 0 14 92 124
Sci-Fi/Fantasy 0 17 322 103
Sport 2 10 24 14
Thriller 0 1 43 175
Western 0 0 8 18
Calculate the conditional distributions from a two-way table. The first argument here must be a two-way table output from the table command. It is very important that you also add a second argument that indicates the direction in which you want the conditional distributions. A value of 1 will give you distributions conditional on the row variable (each row sums to one) and a value of 2 will give you distributions conditional on the column variable (each column sums to one).
prop.table(table(movies$genre, movies$maturity_rating), 1)
G PG PG-13 R
Action 0.000000000 0.017031630 0.423357664 0.559610706
Animation 0.145833333 0.783333333 0.041666667 0.029166667
Biography 0.000000000 0.100558659 0.396648045 0.502793296
Comedy 0.001692047 0.073604061 0.422165821 0.502538071
Drama 0.003322259 0.063122924 0.272425249 0.661129568
Family 0.073913043 0.834782609 0.091304348 0.000000000
Horror 0.000000000 0.004842615 0.300242131 0.694915254
Musical 0.000000000 0.079754601 0.503067485 0.417177914
Mystery 0.000000000 0.025641026 0.269230769 0.705128205
Other
Romance 0.000000000 0.060869565 0.400000000 0.539130435
Sci-Fi/Fantasy 0.000000000 0.038461538 0.728506787 0.233031674
Sport 0.040000000 0.200000000 0.480000000 0.280000000
Thriller 0.000000000 0.004566210 0.196347032 0.799086758
Western 0.000000000 0.000000000 0.307692308 0.692307692
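A quick way to check which direction you asked for is to sum the rows or columns of the result. The sketch below uses two short hypothetical vectors, genre and rating, rather than the movies dataset.

```r
# hypothetical data: nine movies with a genre and a maturity rating
genre  <- c("Comedy", "Comedy", "Comedy", "Drama", "Drama",
            "Horror", "Horror", "Horror", "Horror")
rating <- c("PG", "PG-13", "R", "PG-13", "R",
            "R", "R", "PG-13", "R")

tab <- table(genre, rating)

# conditional on the row variable (genre): each row sums to 1
by_row <- prop.table(tab, 1)
rowSums(by_row)

# conditional on the column variable (rating): each column sums to 1
by_col <- prop.table(tab, 2)
colSums(by_col)
```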
Calculate a statistic (e.g. mean, median, sd, IQR) for a quantitative variable across the categories of a categorical variable. The first argument should be the quantitative variable. The second argument should be the categorical variable. The third argument should be the name of the command that will calculate the desired statistic.
tapply(movies$runtime, movies$maturity_rating, mean)
G PG PG-13 R
94.84211 101.00170 109.74148 106.48442
tapply(movies$runtime, movies$maturity_rating, median)
G PG PG-13 R
95 98 107 104
tapply(movies$runtime, movies$maturity_rating, sd)
G PG PG-13 R
10.51324 13.08490 16.98815 16.07952
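Any additional arguments given to tapply after the third one are passed along to the statistic command, which is how na.rm=TRUE can be used when a variable has missing values. The sketch below illustrates this with hypothetical runtime and rating vectors, not the movies dataset.

```r
# hypothetical data: runtimes in minutes, with one missing value
runtime <- c(90, 100, 110, 95, 105, 120, 130, NA)
rating  <- c("PG", "PG", "PG", "R", "R", "R", "R", "R")

# without na.rm, the group containing the missing value returns NA
means_raw <- tapply(runtime, rating, mean)
means_raw

# na.rm = TRUE is passed through to mean for every group
means_clean <- tapply(runtime, rating, mean, na.rm = TRUE)
means_clean   # PG is 100 and R is 112.5
```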
Calculate the correlation coefficient between two quantitative variables.
cor(movies$rating_imdb, movies$metascore)
[1] 0.7183734
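Unlike mean, the cor command does not take an na.rm argument. If either variable has missing values, the use argument controls how they are handled. The sketch below uses hypothetical vectors x and y and keeps only the complete pairs.

```r
# hypothetical data with one missing value in x
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 5, 4, 5)

cor(x, y)                                 # returns NA
r_complete <- cor(x, y, use = "complete.obs")
r_complete                                # uses only the four complete pairs

# the order of the two variables does not matter
all.equal(r_complete, cor(y, x, use = "complete.obs"))   # TRUE
```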
Return the number of observations in a dataset.
nrow(politics)
[1] 4237
Calculate the t-value needed for a confidence interval. For a 95% confidence interval, the first argument should always be 0.975. The second argument should be the appropriate degrees of freedom for the statistic and dataset.
qt(0.975, nrow(politics)-1)
[1] 1.960524
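To show where this t-value fits, here is a sketch of a 95% confidence interval for a mean, built as the mean plus or minus the t-value times the standard error. The vector x is hypothetical, and the final line is only a cross-check against R's built-in t.test command.

```r
# hypothetical sample
x <- c(12, 15, 9, 20, 14, 11, 17, 13)
n <- length(x)

se     <- sd(x) / sqrt(n)    # standard error of the mean
t_crit <- qt(0.975, n - 1)   # t-value for a 95% interval, n-1 degrees of freedom

ci <- mean(x) + c(-1, 1) * t_crit * se
ci

# cross-check: should match the interval reported by t.test
all.equal(ci, as.numeric(t.test(x)$conf.int))   # TRUE
```

For a different confidence level, only the first argument changes. For example, a 90% interval leaves 5% in each tail and would use qt(0.95, n - 1).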
Calculate the p-value for a two-sided hypothesis test. The first argument should always be the negative version of the t-statistic (that is, the negative of its absolute value) and the second argument should be the appropriate degrees of freedom for the statistic and dataset. Because pt returns only the area in the lower tail, you should multiply the result by two to get the two-sided p-value.
2*pt(-2.1, nrow(politics)-1)
[1] 0.03578783
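Wrapping the t-statistic in -abs() guarantees that a negative value is passed to pt whether the statistic itself came out positive or negative. The sketch below tests a hypothetical null value mu0 against a hypothetical sample x, and cross-checks the result against R's built-in t.test command.

```r
# hypothetical sample and hypothetical null value for the mean
x   <- c(12, 15, 9, 20, 14, 11, 17, 13)
mu0 <- 10
n   <- length(x)

# t-statistic: distance from the null value in standard errors
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# two-sided p-value: lower tail of -|t|, doubled
p_value <- 2 * pt(-abs(t_stat), n - 1)
p_value

# cross-check: should match the p-value reported by t.test
all.equal(p_value, t.test(x, mu = mu0)$p.value)   # TRUE
```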
Run a linear model. The first argument should always be a formula of the form dependent~independent1+independent2+... To simplify the writing of variable names, it is often useful to specify a second argument, data, that identifies the dataset being used. Then you don’t have to include dataset_name$ in the formula.
Remember to always put the dependent (y) variable on the left hand side of the equation.
#simple model with one independent variable
model_simple <- lm(wages~age, data=earnings)
#same simple model but recenter age on 45 years of age
model_recenter <- lm(wages~I(age-45), data=earnings)
#a model with multiple independent variables, both quantitative and qualitative
model_multiple <- lm(wages~I(age-45)+education+race+gender+nchild, data=earnings)
#a model like the previous but also with interaction between gender and nchild
model_interaction <- lm(wages~I(age-45)+education+race+gender*nchild, data=earnings)
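Two pieces of formula syntax in these models can be checked directly. The sketch below uses R's built-in mtcars dataset as a stand-in (an assumption made only so the example runs without the course data): the * operator expands to both main effects plus their interaction, and recentering with I() changes the intercept but not the slope.

```r
# a*b is shorthand for a + b + a:b, so these two models are identical
m_star <- lm(mpg ~ wt * hp, data = mtcars)
m_long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
all.equal(coef(m_star), coef(m_long))   # TRUE

# recentering the independent variable on 3 (wt is in 1000s of pounds)
m_raw <- lm(mpg ~ wt, data = mtcars)
m_ctr <- lm(mpg ~ I(wt - 3), data = mtcars)

# the slope is unchanged
slope_raw <- unname(coef(m_raw)[2])
slope_ctr <- unname(coef(m_ctr)[2])
all.equal(slope_raw, slope_ctr)         # TRUE

# the new intercept is the predicted value when wt equals 3
int_ctr  <- unname(coef(m_ctr)[1])
pred_at3 <- unname(coef(m_raw)[1] + 3 * coef(m_raw)[2])
all.equal(int_ctr, pred_at3)            # TRUE
```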
Once a model object is created, information can be extracted with either the coef command, which just reports the slopes and intercept, or a full summary command, which gives more information.
coef(model_interaction)
              (Intercept)               I(age - 45)       educationHS Diploma
               17.3568021                 0.2242916                 4.5382688
       educationAA Degree educationBachelors Degree  educationGraduate Degree
                7.4288321                16.2657784                23.0187910
                raceBlack                raceLatino                 raceAsian
               -3.4176245                -2.1133582                 0.5641751
           raceIndigenous        raceOther/Multiple              genderFemale
               -1.5198248                -0.4331997                -4.3777137
                   nchild       genderFemale:nchild
                1.2629571                -0.7490706
summary(model_interaction)
Call:
lm(formula = wages ~ I(age - 45) + education + race + gender *
nchild, data = earnings)
Residuals:
Min 1Q Median 3Q Max
-43.638 -7.779 -2.198 4.568 90.578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.356802 0.159721 108.669 < 2e-16 ***
I(age - 45) 0.224292 0.002858 78.471 < 2e-16 ***
educationHS Diploma 4.538269 0.154451 29.383 < 2e-16 ***
educationAA Degree 7.428832 0.181143 41.011 < 2e-16 ***
educationBachelors Degree 16.265778 0.164396 98.943 < 2e-16 ***
educationGraduate Degree 23.018791 0.178161 129.202 < 2e-16 ***
raceBlack -3.417625 0.123798 -27.607 < 2e-16 ***
raceLatino -2.113358 0.109491 -19.302 < 2e-16 ***
raceAsian 0.564175 0.157602 3.580 0.000344 ***
raceIndigenous -1.519825 0.321284 -4.730 2.24e-06 ***
raceOther/Multiple -0.433200 0.303134 -1.429 0.152987
genderFemale -4.377714 0.090203 -48.532 < 2e-16 ***
nchild 1.262957 0.043476 29.049 < 2e-16 ***
genderFemale:nchild -0.749071 0.063268 -11.840 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.74 on 145633 degrees of freedom
Multiple R-squared: 0.284, Adjusted R-squared: 0.2839
F-statistic: 4443 on 13 and 145633 DF, p-value: < 2.2e-16
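The printed summary can also be taken apart if you need individual numbers from it. The sketch below again uses R's built-in mtcars dataset as a stand-in for the course data. It pulls out the coefficient table and the R-squared value, and shows that the reported p-values are the two-sided values that pt would give for each t value.

```r
m <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(m)

# coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)
coef_table

# R-squared as a single number
s$r.squared

# the p-value column equals 2*pt(-|t|, residual degrees of freedom)
p_by_hand <- 2 * pt(-abs(coef_table[, "t value"]), df.residual(m))
all.equal(p_by_hand, coef_table[, "Pr(>|t|)"])   # TRUE
```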
The round command rounds numbers to a given number of decimal places. By default, it will round to whole numbers, but you can specify the number of decimal places in the second argument.
100*round(prop.table(table(movies$genre)), 3)
Action Animation Biography Comedy Drama
9.5 5.5 8.2 27.2 6.9
Family Horror Musical Mystery Other
5.3 9.5 3.8 1.8 0.0
Romance Sci-Fi/Fantasy Sport Thriller Western
5.3 10.2 1.2 5.0 0.6
Sort a vector of numbers from smallest to largest (default), or largest to smallest (with the additional argument decreasing=TRUE).
sort(100*round(prop.table(table(movies$genre)), 3), decreasing = TRUE)
Comedy Sci-Fi/Fantasy Action Horror Biography
27.2 10.2 9.5 9.5 8.2
Drama Animation Family Romance Thriller
6.9 5.5 5.3 5.3 5.0
Musical Mystery Sport Western Other
3.8 1.8 1.2 0.6 0.0
sort(100*round(prop.table(table(movies$genre)), 3))
Other Western Sport Mystery Musical
0.0 0.6 1.2 1.8 3.8
Thriller Family Romance Animation Drama
5.0 5.3 5.3 5.5 6.9
Biography Action Horror Sci-Fi/Fantasy Comedy
8.2 9.5 9.5 10.2 27.2
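The same round and sort steps can be seen on a smaller scale. The named vector p below is hypothetical, standing in for the output of prop.table; note that sort keeps each name attached to its value.

```r
round(3.14159)       # returns 3
round(3.14159, 2)    # returns 3.14

# hypothetical proportions for three categories
p <- c(Comedy = 0.2718, Drama = 0.0693, Western = 0.0061)

# round to three decimal places, then convert to percentages
pct <- 100 * round(p, 3)
pct

sort(pct)                      # smallest to largest: Western, Drama, Comedy
sort(pct, decreasing = TRUE)   # largest to smallest: Comedy, Drama, Western
```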