ggplot2 is an amazing package for visualization in R. Although start-up costs are higher than using the base package for plots, the final product is better in my opinion (another common package for visualization of multivariate data is lattice).

Let's look at the "grammar" of ggplot2. The 2 basic types of input into ggplot2 are:
- Aesthetics: x position, y position, shape / size / color of elements
- Elements / geometric shapes: points, lines, bars, text, etc.
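
To make these two kinds of input concrete, here is a minimal sketch. It uses R's built-in mtcars data rather than the dataset used later in this post, and the object name p is just an illustrative choice of mine. The aesthetics go inside aes(), and each geometric element is added as its own geom_*() layer.

```r
library(ggplot2)

# Aesthetics: map car weight to x and fuel economy to y.
# mtcars ships with R and is used here only as a stand-in dataset.
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Elements: each geom_*() call adds one layer that reuses the x/y aesthetics.
p <- p +
  geom_point(color = "darkblue") +  # points layer
  geom_line(color = "grey60")       # lines layer over the same x/y positions

print(p)
```

Dropping either geom_*() call removes only that layer, which is the sense in which the plot is built up piece by piece.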
Plots are built starting from the ggplot() command. The aesthetics are typically passed to the plotting function using the aes() function. The logic behind ggplot2 is that once the aesthetics are passed through, each type of element is plotted as a separate layer. The layering property of ggplot2 will become apparent in the examples.

There is also a quick plotting function within ggplot2 known as qplot(). It is an easier place to start because it is meant to resemble the syntax of the plot() command within the base package in R. Let's start with one example of qplot() and then shift to ggplot()
since the functionality is so much richer in the latter.

I decided to draw upon Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Specifically, given that I just re-read some arguments on the connection between democracy and economic development (see Przeworski et al., 2000), I will focus on GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic). The variables are called GDP2000 and Polity3 in Norris' dataset, respectively.
We can create a simple scatterplot as follows:
library(ggplot2)
qplot(log(GDP2000), Polity3, data=norris.data,
      ylab="Polity Score, 2000 (0=Authoritarian, 20=Democracy)",
      xlab="Log of Per Capita GDP (2000)",
      pch=19, size=2, color=I("darkblue"), alpha=.75) +
  theme(legend.position="none")

This syntax should be quite similar to that of plot(), the only real differences being that we use size instead of cex to control point size, wrap the color name in I() to indicate that it's a user-supplied constant (as opposed to a function of the data), and specify alpha=.75 to create some transparency in the points. These are ggplot2 aesthetic names and they differ from those in plot(). Also note that we must add + theme(legend.position="none") (older ggplot2 releases used opts() here) to remove the legend that would otherwise show up to the right of the plot.

Now, we can make the exact same graph, but using
ggplot():

plot2 <- ggplot(norris.data, aes(log(GDP2000), Polity3))
plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) +
  scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
  scale_x_continuous("Log of Per Capita GDP (2000)")

In this case, note how we first define plot2 from the data (norris.data) and the variables which provide the x location and y location (part of the aesthetics). Then, we call plot2 and graph the points, y-axis, and x-axis, all as separate layers. Note that we can pass layer-specific aesthetics to the plotting tools (point shape, size, and color, in this case). The result is the same as above.

We can add a regression line to the previous plot as yet another layer as follows:
plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) +
  scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
  scale_x_continuous("Log of Per Capita GDP (2000)") +
  geom_smooth(method="lm", se=FALSE, color="red")
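
As a hedged sketch of what a regression layer involves, the example below draws a least-squares line in two ways: by fitting it explicitly with lm() and handing the coefficients to geom_abline(), and by letting geom_smooth(method = "lm") fit it inside the layer. It uses the built-in mtcars data as a stand-in, and fit, base, p_abline, and p_smooth are illustrative names of mine, not part of the analysis in this post.

```r
library(ggplot2)

# Fit the least-squares line explicitly (mtcars is a stand-in dataset).
fit <- lm(mpg ~ wt, data = mtcars)

base <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(color = "darkblue")

# Option 1: pass the fitted intercept and slope to geom_abline().
p_abline <- base +
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["wt"]],
              color = "red")

# Option 2: let ggplot2 fit the same linear model inside the layer.
p_smooth <- base +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red")

print(p_abline)
print(p_smooth)
```

The two lines have the same slope and intercept. The visible difference is that geom_abline() extends across the whole panel, while geom_smooth() only spans the range of the data.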

Suppose we want to show which region of the world each country is located in. We can make the color aesthetic a function of the Region8a variable in the data (a factor variable with 8 regions) as follows:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
  geom_point(shape=20, size=3) +
  scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
  scale_x_continuous("Log of Per Capita GDP (2000)") +
  scale_color_discrete(name = "Regions")

Note that specifically we passed color=Region8a as an argument to the initial aes() function. scale_color_discrete(name = "Regions") was invoked to change the name of the legend.

We can add country names just below the points in the previous plot as follows (again, as a separate layer):
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
  geom_point(shape=20, size=3) +
  geom_text(aes(label=Natmap), size=2.5, vjust=2.5) +
  scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
  scale_x_continuous("Log of Per Capita GDP (2000)") +
  scale_color_discrete(name = "Regions")
If we just want country names and no points:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
  geom_text(aes(label=Natmap), size=2.5) +
  scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
  scale_x_continuous("Log of Per Capita GDP (2000)") +
  scale_color_discrete(name = "Regions")
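
If you don't have the Norris data at hand, the same ideas (color mapped to a factor, a text layer for labels, and a renamed legend) can be tried on the built-in mtcars data. This is only a sketch under my own assumptions: car_data, the model column, and the "Cylinders" legend title are names I made up for illustration.

```r
library(ggplot2)

# Stand-in data: take car names from the row names and make cyl a factor.
car_data <- transform(mtcars, model = rownames(mtcars), cyl = factor(cyl))

p <- ggplot(car_data, aes(wt, mpg, color = cyl)) +  # color mapped to a factor
  geom_point(shape = 20, size = 3) +
  # Text layer: vjust greater than 1 pushes each label below its point.
  geom_text(aes(label = model), size = 2.5, vjust = 2.5) +
  scale_color_discrete(name = "Cylinders")          # rename the legend

print(p)
```

Because color is set in the initial aes() call, both the points and the labels inherit it. Removing the geom_point() line leaves only the colored labels.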

Finally, another example of how we can make aesthetics of the plot a function of the data is by controlling the size of the plotted elements. Let's make the size of the plotted labels proportional to the amount of official development aid the countries received in 2002 as a proportion of their GDP (Aid2002 variable in Norris' data). We do so by specifying size=Aid2002 in the original aesthetics:

ggplot(norris.data, aes(log(GDP2000), Polity3, size=Aid2002)) +
  scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
  scale_x_continuous("Log of Per Capita GDP (2000)") +
  geom_text(aes(label=Natmap), color="black") +
  labs(size="2002 Aid as \n % of GDP")
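
The difference between making an aesthetic a function of the data and fixing it at a constant can be seen side by side in the sketch below. It again uses mtcars as a stand-in, and p_mapped and p_set are illustrative names of mine.

```r
library(ggplot2)

# Mapped: size is inside aes(), so it varies with horsepower and gets a legend.
p_mapped <- ggplot(mtcars, aes(wt, mpg, size = hp)) +
  geom_point(color = "darkblue", alpha = 0.5) +
  labs(size = "Horsepower")

# Set: size is outside aes(), so every point is the same size and no
# size legend is drawn.
p_set <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(color = "darkblue", alpha = 0.5, size = 3)

print(p_mapped)
print(p_set)
```

The same rule applies to color, shape, and alpha: inside aes() they are driven by a variable, outside they are constants for that layer.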