Appendix B — Common R Commands

Below is a list of common commands that we use in the class, along with some examples. You can view a help file in RStudio for each command by searching for the command name in the help tab in the lower right panel. You can also just type the name of the command preceded by a “?” into the console. For example, if you wanted to understand how table works, type:

?table

The list here does not contain information about making plots in R. That information is in Appendix C.

Univariate statistics

mean

Calculate the mean of a quantitative variable. Remember that this command will not work for categorical variables.

mean(earnings$wages)
[1] 24.27601

None of the example datasets that we examine have missing values, but it is important to recognize that if you have missing values in a variable then the mean command and many of these other commands will return NA by default rather than a mean. To calculate the mean only for the non-missing cases, you need to add the na.rm=TRUE argument.

mean(earnings$wages, na.rm=TRUE)
[1] 24.27601
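
Because the earnings data has no missing values, the two results above are identical. As a minimal sketch of the difference, the following uses a small made-up vector (wages_example is hypothetical and not part of the course datasets) that contains one missing value.

```r
# Hypothetical wage values with one missing observation (not from earnings)
wages_example <- c(12.5, 20, NA, 31.5)

# Without na.rm, the missing value propagates and the result is NA
mean(wages_example)
# [1] NA

# With na.rm=TRUE, the mean is taken over the three non-missing values
mean(wages_example, na.rm=TRUE)
# [1] 21.33333
```

The same na.rm=TRUE argument is accepted by median, sd, IQR, and quantile.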

median

Calculate the median of a quantitative variable. Remember that this command will not work for categorical variables.

median(earnings$wages)
[1] 19.21667

sd

Calculate the standard deviation of a quantitative variable. Remember that this command will not work for categorical variables.

sd(earnings$wages)
[1] 16.23676

IQR

Calculate the interquartile range of a quantitative variable. Remember that this command will not work for categorical variables.

IQR(earnings$wages)
[1] 17

quantile

Calculate percentiles of a distribution. Remember that this command will not work for categorical variables. By default, the quantile command returns the minimum (0%), the quartiles (25%, 50%, and 75%), and the maximum (100%). If you want different percentiles, you will have to specify the probs argument.

quantile(earnings$wages)
      0%      25%      50%      75%     100% 
 1.00000 13.00000 19.21667 30.00000 99.99000 
#get the 10th and 90th percentile instead
quantile(earnings$wages, probs = c(0.1,0.9))
   10%    90% 
10.000 47.596 
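
The quantile and IQR commands are closely related: the interquartile range is the distance between the 25th and 75th percentiles. The sketch below checks this on a small made-up vector (x is hypothetical, chosen only for illustration).

```r
# Hypothetical values, used only to illustrate the relationship
x <- c(2, 4, 6, 8, 10, 12, 14, 16)

# Request only the 25th and 75th percentiles
quartiles <- quantile(x, probs = c(0.25, 0.75))
quartiles

# The IQR computed by hand is the 75th percentile minus the 25th
iqr_by_hand <- unname(quartiles["75%"] - quartiles["25%"])
iqr_by_hand

# The IQR command gives the same number
IQR(x)
```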

table

Calculate the absolute frequencies of the categories of a categorical variable.

table(popularity$race)

                          White          Black/African American 
                           2630                            1148 
                         Latino          Asian/Pacific Islander 
                            405                             162 
American Indian/Native American                           Other 
                             26                              26 

prop.table

Calculate the proportions (i.e. relative frequencies) of the categories of a categorical variable. This command must be run on the output from a table command. You can do that in one command by nesting the table command inside the prop.table command.

prop.table(table(popularity$race))

                          White          Black/African American 
                    0.598135092                     0.261087105 
                         Latino          Asian/Pacific Islander 
                    0.092108256                     0.036843302 
American Indian/Native American                           Other 
                    0.005913123                     0.005913123 
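
To see what prop.table is doing, it can help to compute the proportions by hand. The following sketch uses a small made-up categorical variable (group_example is hypothetical) and shows that prop.table gives the same result as dividing each count by the total count.

```r
# Hypothetical categorical variable with three categories
group_example <- factor(c("A", "B", "A", "C", "A", "B"))

# Absolute frequencies
counts <- table(group_example)
counts

# Proportions from prop.table
props <- prop.table(counts)
props

# The same proportions computed by hand: each count divided by the total
props_by_hand <- counts / sum(counts)
props_by_hand
```

The proportions across all categories always sum to one.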

summary

Provide a summary of a variable, either categorical or quantitative.

summary(earnings$wages)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   13.00   19.22   24.28   30.00   99.99 
summary(popularity$race)
                          White          Black/African American 
                           2630                            1148 
                         Latino          Asian/Pacific Islander 
                            405                             162 
American Indian/Native American                           Other 
                             26                              26 

You can also feed in an entire dataset to get a summary of each variable.

summary(earnings)
     wages            age           gender                  race      
 Min.   : 1.00   Min.   :18.00   Male  :74384   White         :97948  
 1st Qu.:13.00   1st Qu.:30.00   Female:71263   Black         :14259  
 Median :19.22   Median :40.00                  Latino        :21116  
 Mean   :24.28   Mean   :40.84                  Asian         : 8347  
 3rd Qu.:30.00   3rd Qu.:52.00                  Indigenous    : 1872  
 Max.   :99.99   Max.   :64.00                  Other/Multiple: 2105  
                                                                      
               marstat                 education    
 Never Married     :46629   No HS Diploma   : 9614  
 Married           :79001   HS Diploma      :64717  
 Divorced/Separated:17987   AA Degree       :16543  
 Widowed           : 2030   Bachelors Degree:35455  
                            Graduate Degree :19318  
                                                    
                                                    
                         occup           nchild       foreign_born
 Manual                     :31458   Min.   :0.0000   No :122220  
 Service                    :24383   1st Qu.:0.0000   Yes: 23427  
 Administrative Support     :17937   Median :0.0000               
 Sales                      :13619   Mean   :0.8489               
 Manager                    :11642   3rd Qu.:2.0000               
 Business/Finance Specialist:11036   Max.   :9.0000               
 (Other)                    :35572                                
  earn_type       earningwt      
 Salary:59397   Min.   :  778.3  
 Wage  :86250   1st Qu.: 5459.7  
                Median :11929.3  
                Mean   :10612.4  
                3rd Qu.:14564.6  
                Max.   :47137.4  
                                 

Bivariate statistics

table

The table command can also be used to create a two-way table by giving it two categorical variables, although further work needs to be done to extract useful information from the two-way table.

table(movies$genre, movies$maturity_rating)
                
                   G  PG PG-13   R
  Action           0   7   174 230
  Animation       35 188    10   7
  Biography        0  36   142 180
  Comedy           2  87   499 594
  Drama            1  19    82 199
  Family          17 192    21   0
  Horror           0   2   124 287
  Musical          0  13    82  68
  Mystery          0   2    21  55
  Other            0   0     0   0
  Romance          0  14    92 124
  Sci-Fi/Fantasy   0  17   322 103
  Sport            2  10    24  14
  Thriller         0   1    43 175
  Western          0   0     8  18

prop.table

Calculate the conditional distributions from a two-way table. The first argument here must be a two-way table output from the table command. It is very important that you also add a second argument that indicates the direction in which you want the conditional distributions calculated. 1 will give you distributions conditional on the row variable and 2 will give you distributions conditional on the column variable.

prop.table(table(movies$genre, movies$maturity_rating), 1)
                
                           G          PG       PG-13           R
  Action         0.000000000 0.017031630 0.423357664 0.559610706
  Animation      0.145833333 0.783333333 0.041666667 0.029166667
  Biography      0.000000000 0.100558659 0.396648045 0.502793296
  Comedy         0.001692047 0.073604061 0.422165821 0.502538071
  Drama          0.003322259 0.063122924 0.272425249 0.661129568
  Family         0.073913043 0.834782609 0.091304348 0.000000000
  Horror         0.000000000 0.004842615 0.300242131 0.694915254
  Musical        0.000000000 0.079754601 0.503067485 0.417177914
  Mystery        0.000000000 0.025641026 0.269230769 0.705128205
  Other                                                         
  Romance        0.000000000 0.060869565 0.400000000 0.539130435
  Sci-Fi/Fantasy 0.000000000 0.038461538 0.728506787 0.233031674
  Sport          0.040000000 0.200000000 0.480000000 0.280000000
  Thriller       0.000000000 0.004566210 0.196347032 0.799086758
  Western        0.000000000 0.000000000 0.307692308 0.692307692
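
A quick way to confirm which direction you conditioned on is to check which set of proportions sums to one. The sketch below uses a tiny made-up two-way table (the genre and rating vectors are hypothetical, not the movies data).

```r
# Hypothetical data: six movies classified by genre and rating
genre  <- c("Comedy", "Comedy", "Comedy", "Drama", "Drama", "Horror")
rating <- c("PG", "R", "R", "PG", "R", "R")
tab <- table(genre, rating)

# Conditional on the row variable: each row sums to one
by_row <- prop.table(tab, 1)
rowSums(by_row)

# Conditional on the column variable: each column sums to one
by_col <- prop.table(tab, 2)
colSums(by_col)
```

A category with no observations at all has a total of zero, so its conditional distribution is undefined. This is the case for the Other row in the movies output above.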

tapply

Calculate a statistic (e.g. mean, median, sd, IQR) for a quantitative variable across the categories of a categorical variable. The first argument should be the quantitative variable. The second argument should be the categorical variable. The third argument should be the name of the command that will calculate the desired statistic.

tapply(movies$runtime, movies$maturity_rating, mean)
        G        PG     PG-13         R 
 94.84211 101.00170 109.74148 106.48442 
tapply(movies$runtime, movies$maturity_rating, median)
    G    PG PG-13     R 
   95    98   107   104 
tapply(movies$runtime, movies$maturity_rating, sd)
       G       PG    PG-13        R 
10.51324 13.08490 16.98815 16.07952 
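
Any extra arguments given to tapply after the third one are passed along to the command that calculates the statistic. This matters when the quantitative variable has missing values. The sketch below uses made-up vectors (runtime_example and rating_example are hypothetical) to show na.rm=TRUE being passed through to mean.

```r
# Hypothetical runtimes with one missing value in the R category
runtime_example <- c(90, 100, NA, 120, 110)
rating_example  <- c("PG", "PG", "R", "R", "R")

# Without na.rm, the category containing the missing value returns NA
means_default <- tapply(runtime_example, rating_example, mean)
means_default

# na.rm=TRUE is handed on to mean, so each category uses its non-missing cases
means_narm <- tapply(runtime_example, rating_example, mean, na.rm=TRUE)
means_narm
```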

cor

Calculate the correlation coefficient between two quantitative variables.

cor(movies$rating_imdb, movies$metascore)
[1] 0.7183734
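
Like mean, the cor command returns NA if either variable has missing values. Rather than na.rm, cor takes a use argument. The sketch below uses made-up vectors (x and y are hypothetical) and use="complete.obs", which keeps only the observations where both variables are observed.

```r
# Hypothetical paired values; the fourth observation is missing on x
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 7, 8, 9)

# The missing value makes the default result NA
cor(x, y)

# Drop observations that are incomplete on either variable
r_complete <- cor(x, y, use = "complete.obs")
r_complete

# Same as removing the fourth observation from both variables by hand
r_by_hand <- cor(x[-4], y[-4])
r_by_hand
```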

Statistical inference

nrow

Return the number of observations in a dataset.

nrow(politics)
[1] 4237

qt

Calculate the t-value needed for a confidence interval. For a 95% confidence interval, the first argument should always be 0.975, because a 95% interval leaves 2.5% of the distribution in each tail. The second argument should be the appropriate degrees of freedom for the statistic and dataset.

qt(0.975, nrow(politics)-1)
[1] 1.960524
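
As a rough illustration of how the t-value is used, the sketch below builds a 95% confidence interval for a mean from a small made-up sample (x is hypothetical). It assumes the statistic is a sample mean, so the standard error is the standard deviation divided by the square root of the sample size and the degrees of freedom are the sample size minus one. Other statistics need a different standard error and degrees of freedom.

```r
# Hypothetical sample of five values
x <- c(4, 8, 6, 5, 7)
n <- length(x)

# Standard error of the mean (assumes the statistic is a sample mean)
se <- sd(x) / sqrt(n)

# t-value for a 95% confidence interval with n-1 degrees of freedom
t_value <- qt(0.975, n - 1)

# Lower and upper bounds: mean minus/plus t-value times standard error
ci <- mean(x) + c(-1, 1) * t_value * se
ci

# For a different confidence level, put half of the leftover probability
# in each tail, e.g. 0.995 for a 99% interval
t_value_99 <- qt(1 - (1 - 0.99) / 2, n - 1)
```

The 99% t-value is larger than the 95% one, so the resulting interval is wider.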

pt

Calculate the p-value for a hypothesis test. The first argument should always be the negative of the absolute value of the t-statistic, and the second argument should be the appropriate degrees of freedom for the statistic and dataset. Remember that pt returns only the lower-tail probability, so for a two-sided test you should multiply it by two.

2*pt(-2.1, nrow(politics)-1)
[1] 0.03578783
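
One way to make sure the first argument is always negative is to wrap the t-statistic in abs and put a minus sign in front. The sketch below uses made-up values (t_stat and df_example are hypothetical) and shows that the two-sided p-value is the same whether the t-statistic is positive or negative.

```r
# Hypothetical t-statistic and degrees of freedom
t_stat <- 2.1
df_example <- 50

# Two-sided p-value: twice the lower-tail area below -|t|
p_positive <- 2 * pt(-abs(t_stat), df_example)
p_positive

# A negative t-statistic of the same size gives the same p-value
p_negative <- 2 * pt(-abs(-t_stat), df_example)
p_negative

# Equivalent calculation using the upper tail above |t|
p_upper <- 2 * pt(abs(t_stat), df_example, lower.tail = FALSE)
p_upper
```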

Linear models

lm

Run a linear model. The first argument should always be a formula of the form dependent~independent1+independent2+.... To simplify the writing of variable names, it is often useful to specify a second argument data that identifies the dataset being used. Then you don’t have to include dataset_name$ in the formula.

Remember to always put the dependent (y) variable on the left-hand side of the formula.

#simple model with one independent variable
model_simple <- lm(wages~age, data=earnings)
#same simple model but recenter age on 45 years of age
model_recenter <- lm(wages~I(age-45), data=earnings)
#a model with multiple independent variables, both quantitative and qualitative
model_multiple <- lm(wages~I(age-45)+education+race+gender+nchild, data=earnings)
#a model like the previous but also with interaction between gender and nchild
model_interaction <- lm(wages~I(age-45)+education+race+gender*nchild, data=earnings)
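
In a formula, the * operator is shorthand for including both variables on their own plus their interaction, which is written with a colon. The sketch below illustrates this using mtcars, a dataset built into R, rather than the course data. The choice of variables (mpg, wt, and am) is only for illustration.

```r
# Interaction written with the * shorthand
model_star <- lm(mpg ~ wt * am, data = mtcars)

# The same model written out in full: both main effects plus wt:am
model_long <- lm(mpg ~ wt + am + wt:am, data = mtcars)

# Both versions report the same terms and the same estimates
coef(model_star)
coef(model_long)
```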

Once a model object is created, information can be extracted with either the coef command which just reports the slopes and intercept, or a full summary command which gives more information.

coef(model_interaction)
              (Intercept)               I(age - 45)       educationHS Diploma 
               17.3568021                 0.2242916                 4.5382688 
       educationAA Degree educationBachelors Degree  educationGraduate Degree 
                7.4288321                16.2657784                23.0187910 
                raceBlack                raceLatino                 raceAsian 
               -3.4176245                -2.1133582                 0.5641751 
           raceIndigenous        raceOther/Multiple              genderFemale 
               -1.5198248                -0.4331997                -4.3777137 
                   nchild       genderFemale:nchild 
                1.2629571                -0.7490706 
summary(model_interaction)

Call:
lm(formula = wages ~ I(age - 45) + education + race + gender * 
    nchild, data = earnings)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.638  -7.779  -2.198   4.568  90.578 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               17.356802   0.159721 108.669  < 2e-16 ***
I(age - 45)                0.224292   0.002858  78.471  < 2e-16 ***
educationHS Diploma        4.538269   0.154451  29.383  < 2e-16 ***
educationAA Degree         7.428832   0.181143  41.011  < 2e-16 ***
educationBachelors Degree 16.265778   0.164396  98.943  < 2e-16 ***
educationGraduate Degree  23.018791   0.178161 129.202  < 2e-16 ***
raceBlack                 -3.417625   0.123798 -27.607  < 2e-16 ***
raceLatino                -2.113358   0.109491 -19.302  < 2e-16 ***
raceAsian                  0.564175   0.157602   3.580 0.000344 ***
raceIndigenous            -1.519825   0.321284  -4.730 2.24e-06 ***
raceOther/Multiple        -0.433200   0.303134  -1.429 0.152987    
genderFemale              -4.377714   0.090203 -48.532  < 2e-16 ***
nchild                     1.262957   0.043476  29.049  < 2e-16 ***
genderFemale:nchild       -0.749071   0.063268 -11.840  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.74 on 145633 degrees of freedom
Multiple R-squared:  0.284, Adjusted R-squared:  0.2839 
F-statistic:  4443 on 13 and 145633 DF,  p-value: < 2.2e-16
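
Two further points can be sketched with mtcars, a dataset built into R (used here only so the example is self-contained, not because it is one of the course datasets). First, the data argument is a convenience: writing dataset_name$ in the formula gives the same estimates with longer coefficient names. Second, applying coef to the summary returns the coefficient table as a matrix, which is useful if you need the standard errors or t-values as numbers.

```r
# Using the data argument
model_data <- lm(mpg ~ wt, data = mtcars)

# Writing out the dataset name in the formula instead
model_dollar <- lm(mtcars$mpg ~ mtcars$wt)

# Same estimates; only the coefficient names differ
coef(model_data)
coef(model_dollar)

# Coefficient table as a matrix: estimates, standard errors,
# t-values, and p-values
coef_table <- coef(summary(model_data))
coef_table

# Pull out a single number, e.g. the standard error of the slope on wt
coef_table["wt", "Std. Error"]
```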

Miscellaneous

round

Round numbers to a given number of decimal places. By default, it will round to whole numbers, but you can specify the number of decimal places in the second argument.

100*round(prop.table(table(movies$genre)), 3)

        Action      Animation      Biography         Comedy          Drama 
           9.5            5.5            8.2           27.2            6.9 
        Family         Horror        Musical        Mystery          Other 
           5.3            9.5            3.8            1.8            0.0 
       Romance Sci-Fi/Fantasy          Sport       Thriller        Western 
           5.3           10.2            1.2            5.0            0.6 
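
The example above rounds the proportions to three decimal places and then multiplies by 100, which is why the percentages show one decimal place. A minimal sketch with a single made-up number (x is hypothetical) shows the effect of the second argument.

```r
# Hypothetical proportion
x <- 2/3

# Default: round to a whole number
round(x)
# [1] 1

# Round to two decimal places
round(x, 2)
# [1] 0.67

# Round to three decimal places, then convert to a percentage
100 * round(x, 3)
# [1] 66.7
```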

sort

Sort a vector of numbers from smallest to largest (default), or largest to smallest (with additional argument decreasing=TRUE).

sort(100*round(prop.table(table(movies$genre)), 3), decreasing = TRUE)

        Comedy Sci-Fi/Fantasy         Action         Horror      Biography 
          27.2           10.2            9.5            9.5            8.2 
         Drama      Animation         Family        Romance       Thriller 
           6.9            5.5            5.3            5.3            5.0 
       Musical        Mystery          Sport        Western          Other 
           3.8            1.8            1.2            0.6            0.0 
sort(100*round(prop.table(table(movies$genre)), 3))

         Other        Western          Sport        Mystery        Musical 
           0.0            0.6            1.2            1.8            3.8 
      Thriller         Family        Romance      Animation          Drama 
           5.0            5.3            5.3            5.5            6.9 
     Biography         Action         Horror Sci-Fi/Fantasy         Comedy 
           8.2            9.5            9.5           10.2           27.2
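
Notice that the category labels travel with the numbers when a table is sorted. The same is true of any named vector, as this sketch with made-up values shows (x is hypothetical).

```r
# Hypothetical named vector
x <- c(b = 3, a = 1, c = 2)

# Smallest to largest (default); the names follow their values
sorted_up <- sort(x)
sorted_up

# Largest to smallest
sorted_down <- sort(x, decreasing = TRUE)
sorted_down
```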