2/12/2012

Better graphics in R: intro to ggplot2

I believe visualization of data is extremely important and not emphasized quite enough in the social sciences. At the very least, exploratory visualization of a data set should be a part of any thorough analysis, even if it doesn't make it into the final paper. On this note, I want to showcase the use of ggplot2 as an amazing package for visualization in R. Although start-up costs are higher than using the base package for plots, the final product is better in my opinion (another common package for visualization of multivariate data is lattice).

Let's look at the "grammar" of ggplot2. The two basic types of input into ggplot2 are:
  • Aesthetics: x position, y position, shape / size / color of elements
  • Elements / geometric shapes: points, lines, bars, text, etc.
A ggplot is invoked using the ggplot() command. The aesthetics are typically passed to the plotting function using the aes() function. The logic behind ggplot2 is that once the aesthetics are passed through, each type of element is plotted as a separate layer. The layering property of ggplot2 will become apparent in the examples.
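As a minimal sketch of this layering idea, assuming R's built-in mtcars data set in place of the data used later in this post (the object names base and layered are only illustrative):

```r
library(ggplot2)

# Base object: the data plus the x/y aesthetics. No elements are drawn yet.
base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Each `+` adds one more layer on top of the previous ones.
layered <- base +
  geom_point(color = "darkblue") +  # layer 1: points
  geom_line(color = "grey50")       # layer 2: a line through the same data

# layer_data() returns the data ggplot2 computed for a single layer.
points_layer <- layer_data(layered, 1)
lines_layer  <- layer_data(layered, 2)
```

Typing layered at the console draws the plot. Both layers inherit the x and y aesthetics from the base object, which is why neither geom needs its own aes() call.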

There is also a quick plotting function within ggplot2 known as qplot(). It is an easier place to start because it is meant to resemble the syntax of the plot() command within the base package in R. Let's start with one example of qplot() and then shift to ggplot(), since the functionality is so much richer in the latter. (Note that recent ggplot2 releases, starting with 3.4.0, deprecate qplot() in favor of ggplot().)

I decided to draw upon Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Specifically, given that I just re-read some arguments on the connection between democracy and economic development (see Przeworski et al., 2000), I will focus on GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic). The variables are called GDP2000 and Polity3 in Norris' dataset, respectively.

We can create a simple scatterplot as follows:
qplot(log(GDP2000), Polity3, data=norris.data, 
ylab="Polity Score, 2000 (0=Authoritarian, 20=Democracy)", xlab="Log of Per Capita GDP (2000)", 
shape=I(19), size=I(2), color=I("darkblue"), alpha=I(.75))

This syntax should look quite similar to that of plot(). The main differences are that we use shape and size instead of pch and cex to control the point symbol and size, wrap each constant in I() to indicate that it's a user-inputted constant (as opposed to a function of the data), and specify alpha=.75 to create some transparency in the points. These are ggplot2 aesthetic names and they differ from those in plot(). Without I(), qplot() treats a value such as size=2 as an aesthetic mapped from the data and draws a legend for it to the right of the plot; that legend can be removed by adding + theme(legend.position="none") (written + opts(legend.position="none") in ggplot2 versions before 0.9.2). The result is:


Now, we can make essentially the same graph, but using ggplot():

plot2 <- ggplot(norris.data, aes(log(GDP2000), Polity3))

plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")

In this case, note how we first define plot2 from the data (norris.data) and the variables that provide the x location and y location (part of the aesthetics). Then, we call plot2 and add the points, the y-axis scale, and the x-axis scale as separate components joined by +. Note that we can pass layer-specific settings to the plotting tools (point shape, size, color, and transparency, in this case). The result is essentially the same as above, apart from the slightly different point size and transparency settings.

We can add a regression line to the previous plot as yet another layer. Here geom_smooth(method="lm") fits a linear regression of y on x, and se=FALSE suppresses the confidence band:
plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
geom_smooth(method="lm", se=FALSE, color="red")
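One caveat when drawing a fitted line: geom_abline() called without an intercept and slope draws the line y = x rather than a regression fit. The sketch below shows two ways to get a least-squares line, assuming the built-in mtcars data in place of norris.data (the object names fit, with_smooth, and with_abline are illustrative, not from the original analysis):

```r
library(ggplot2)

# Fit the regression explicitly so the coefficients can be reused.
fit <- lm(mpg ~ wt, data = mtcars)

# Option 1: let ggplot2 fit and draw the line as its own layer.
with_smooth <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red")

# Option 2: pass the fitted intercept and slope to geom_abline().
with_abline <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_abline(intercept = coef(fit)[[1]], slope = coef(fit)[[2]],
              color = "red")

# The points that geom_smooth() computed for the fitted line.
smooth_line <- layer_data(with_smooth, 2)
```

Both options draw the same fitted line. geom_smooth() limits it to the range of the data, while geom_abline() extends it across the whole panel.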



Suppose we want to distinguish the regions of the world in which the countries are located. We can make the color aesthetic a function of the Region8a variable in the data (a factor variable with 8 regions) as follows:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_point(shape=20, size=3) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
scale_color_discrete(name = "Regions")

Note that we passed color=Region8a as an argument to the initial aes() function. scale_color_discrete(name = "Regions") was invoked to change the name of the legend.
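The difference between mapping an aesthetic to the data and fixing it at a constant can be checked directly. This is a small sketch on the built-in mtcars data, with factor(cyl) standing in for a region variable (the names mapped and fixed are illustrative):

```r
library(ggplot2)

# Mapped: color sits inside aes(), so it varies with the data and
# ggplot2 adds a legend, renamed here via scale_color_discrete().
mapped <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_discrete(name = "Cylinders")

# Fixed: color sits outside aes(), so every point gets the same value
# and no legend is drawn.
fixed <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 3, color = "darkblue")

# Colours actually assigned to the points in each plot.
mapped_colours <- unique(layer_data(mapped, 1)$colour)
fixed_colours  <- unique(layer_data(fixed, 1)$colour)
```

In the mapped version there is one colour per factor level, while the fixed version uses the single constant supplied to geom_point().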


We can add country names just below the points in the previous plot as follows (again, separate layer):
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_point(shape=20, size=3) + 
geom_text(aes(label=Natmap), size=2.5, vjust=2.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
scale_color_discrete(name = "Regions")


If we just want country names and no points:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) + 
geom_text(aes(label=Natmap), size=2.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
scale_color_discrete(name = "Regions")

Finally, another way to make the aesthetics of the plot a function of the data is to control the size of the plotted elements. Let's make the size of the country labels proportional to the amount of official development aid the countries received in 2002 as a proportion of their GDP (Aid2002 variable in Norris' data). We do so by specifying size=Aid2002 in the original aesthetics:

ggplot(norris.data, aes(log(GDP2000), Polity3, size=Aid2002)) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
geom_text(aes(label=Natmap), color="black") +
labs(size="2002 Aid as \n % of GDP")
