This appendix provides ggplot2 example R code and output for all of the graphs that we might use this term. For further information, I highly recommend Kieran Healy’s Data Visualization book and Hadley Wickham’s ggplot2 book.
All the examples provided will use the standard example datasets that we have been working with throughout the term.
Barplots
Barplots are used to show the distribution of a single categorical variable.
You only need to specify the variable you want in the x aesthetic.
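As a minimal sketch of a barplot, the code below uses the mpg dataset that ships with ggplot2 as a stand-in for the course datasets. The dataset and the class variable are my choices for illustration only. Only the x aesthetic is specified, and geom_bar counts the cases in each category.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the course datasets here
p_bar <- ggplot(mpg, aes(x=class))+
  geom_bar()+   # counts the number of cases in each category
  labs(x="vehicle class", y="number of cars")+
  theme_bw()
p_bar
```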
The geom you want is geom_histogram. You can vary the binwidth size with the binwidth argument. I am also using the col argument to change the color of my column borders.
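The histogram described above might look like the following sketch. The faithful dataset from base R (waiting times between geyser eruptions) is an illustrative stand-in for the course data, and the bin width of 5 and the black borders are arbitrary choices.

```r
library(ggplot2)

# faithful is a base R dataset used here only for illustration
p_hist <- ggplot(faithful, aes(x=waiting))+
  geom_histogram(binwidth=5,    # width of each bin, in minutes
                 col="black")+  # color of the column borders
  labs(x="waiting time between eruptions (minutes)", y="count")+
  theme_bw()
p_hist
```

Trying a few different values of binwidth is usually worthwhile, because the apparent shape of the distribution can change with the bin size.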
You can also calculate density instead of count, if you prefer, by adding the y=after_stat(density) option to the aesthetics. Density is the proportion of cases in a bin, divided by the bin width. If you plot a histogram with density, you can also overlay a smoothed line that approximates the distribution, called a kernel density.
I use y=after_stat(density) to get density instead of absolute count. In practice the histogram will look the same, but this allows me to overlay the kernel density figure.
The geom_density geom will plot a kernel density smoother, which basically just smooths out a histogram. I made it semi-transparent with alpha=0.5 so that you can see the histogram beneath it.
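A sketch of the density histogram with a kernel density overlay follows, again using the base R faithful dataset as a stand-in. The grey fill is my assumption: alpha only affects the filled area, so some fill color is needed for the semi-transparency to be visible.

```r
library(ggplot2)

p_dens <- ggplot(faithful, aes(x=waiting,
                               y=after_stat(density)))+  # density, not count
  geom_histogram(binwidth=5, col="black")+
  geom_density(fill="grey", alpha=0.5)+  # semi-transparent kernel density
  labs(x="waiting time between eruptions (minutes)", y="density")+
  theme_bw()
p_dens
```

Because density is the proportion in each bin divided by the bin width, the areas of the bars sum to one, which puts the histogram on the same scale as the kernel density curve.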
Boxplots
You can also plot quantitative distributions using a boxplot.
We specify the variable we want with y not x. I also find it useful to set x="" to avoid odd tickmark labels on the x-axis.
The geom we want is geom_boxplot. I like making the outlier.color a bright color.
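A minimal boxplot sketch under the same assumptions (the mpg dataset from ggplot2 standing in for the course data, and red as an arbitrary bright outlier color) could look like this:

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x="",      # empty string avoids odd x-axis tick labels
                         y=hwy))+   # the quantitative variable goes on y
  geom_boxplot(outlier.color="red")+
  labs(x=NULL, y="highway miles per gallon")+
  theme_bw()
p_box
```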
Comparative Barplots
Comparative barplots allow us to compare the distribution of two categorical variables. Basically, we plot the conditional distribution of one of these variables across the other variable.
By faceting
The first approach to the comparative barplot is to use faceting to make multiple panels.
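One way to sketch the faceting approach is shown below. The mpg dataset, the choice of drv and year as the two categorical variables, and the use of after_stat(prop) with group=1 to get proportions within each panel are all my assumptions for illustration, not necessarily the setup used elsewhere in this book.

```r
library(ggplot2)

p_facet <- ggplot(mpg, aes(x=drv,
                           y=after_stat(prop),  # proportion instead of count
                           group=1))+           # proportions computed per panel
  geom_bar()+
  facet_wrap(~year)+   # one panel per category of the second variable
  labs(x="drive train", y="proportion of cars")+
  theme_bw()
p_facet
```

Each panel then shows the conditional distribution of drive train within one model year, and the bars within a panel sum to one.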
The big change from a univariate boxplot is that we add the categorical variable as the x aesthetic. In general, it's a good idea to reorder the categories of that categorical variable from highest median to lowest, as I do with the reorder command.
The coord_flip command is not always required but is useful if the category labels of your x-axis are running into one another.
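The reordering and flipping described above can be sketched as follows, with the mpg dataset from ggplot2 as an illustrative stand-in. Note that reorder sorts categories in ascending order of the median. After coord_flip, that puts the category with the highest median at the top of the plot, so the panel reads from highest to lowest.

```r
library(ggplot2)

p_cbox <- ggplot(mpg, aes(x=reorder(class, hwy, median),  # sort classes by median hwy
                          y=hwy))+
  geom_boxplot()+
  coord_flip()+   # horizontal boxes keep category labels from colliding
  labs(x=NULL, y="highway miles per gallon")+
  theme_bw()
p_cbox
```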
Scatterplots
Scatterplots are used to examine the relationship between two quantitative variables.
ggplot(crimes, aes(x=unemploy_rate, y=property_rate))+
  geom_point()+
  labs(x="unemployment rate",
       y="property crimes per 100,000 population")+
  theme_bw()
We specify the independent variable with the x aesthetic and the dependent variable with the y aesthetic.
The geom we want is geom_point.
Semi-transparency
With large datasets, a scatterplot can have a lot of overplotting, with similar points next to or on top of each other, which makes it difficult to see the density of points. One way to address this is to add semi-transparency to geom_point.
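As a sketch of semi-transparency, the code below uses the diamonds dataset from ggplot2, chosen only because its roughly 54,000 rows produce heavy overplotting. The alpha value of 0.1 is an arbitrary starting point.

```r
library(ggplot2)

p_alpha <- ggplot(diamonds, aes(x=carat, y=price))+
  geom_point(alpha=0.1)+   # each point is 10% opaque, so dense areas look darker
  labs(x="carat", y="price (US dollars)")+
  theme_bw()
p_alpha
```

Smaller alpha values generally suit larger datasets, since more points must stack up before an area appears fully dark.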
Another option to help with overplotting problems is to replace geom_point with geom_jitter, which adds a little bit of random noise to each point. Often jittering and semi-transparency work well together.
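Jittering combined with semi-transparency might be sketched like this. The mpg dataset from ggplot2 is a stand-in; its mileage values are whole numbers, so many points fall on exactly the same spot without jittering.

```r
library(ggplot2)

p_jitter <- ggplot(mpg, aes(x=cty, y=hwy))+
  geom_jitter(alpha=0.5)+   # random noise plus semi-transparency
  labs(x="city miles per gallon", y="highway miles per gallon")+
  theme_bw()
p_jitter
```

Because the noise is random, the plot will look slightly different each time the code is run.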
I can also add a line to the plot with the geom_smooth command. If I specify method="lm", I will get a straight line. Otherwise the line will be allowed to bend to accommodate non-linearity.
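The two versions of geom_smooth can be compared in a short sketch, again with the mpg dataset from ggplot2 as an assumed stand-in for the course data:

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x=displ, y=hwy))+
  geom_point(alpha=0.5)+
  labs(x="engine displacement (liters)", y="highway miles per gallon")+
  theme_bw()

# straight line fit by ordinary least squares
p_lm <- p_base+geom_smooth(method="lm")

# default smoother, which is allowed to bend with the data
p_curve <- p_base+geom_smooth()

p_lm
p_curve
```

By default both versions also draw a shaded confidence band around the line.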