I believe visualization of data is extremely important and not emphasized quite enough in the social sciences. At the very least, exploratory visualization of a data set should be a part of any thorough analysis, even if it doesn't make it into the final paper. On this note, I want to showcase the use of
ggplot2
as an amazing package for visualization in R. Although start-up costs are higher than using the base package for plots, the final product is better in my opinion (another common package for visualization of multivariate data is
lattice
).
Let's look at the "grammar" of
ggplot2
. The two basic types of input into
ggplot2
are:
- Aesthetics: x position, y position, shape / size / color of elements
- Elements / geometric shapes: points, lines, bars, text, etc.
A ggplot is invoked using the
ggplot()
command. The aesthetics are typically passed to the plotting function using the
aes()
function. The logic behind
ggplot2
is that once the aesthetics are passed through, each type of element is plotted as a separate layer. The
layering property of
ggplot2
will become apparent in the examples.
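As a minimal sketch of this grammar, the snippet below builds a base object from data and aesthetics and then adds three layers to it. The data frame `toy` and its columns are made up purely for illustration and are not part of Norris' dataset.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1),
  label = c("a", "b", "c", "d", "e")
)

# Base object: data plus aesthetics, no layers yet
base_plot <- ggplot(toy, aes(x = x, y = y))

# Each geom_*() call adds one layer on top of the base object
layered <- base_plot +
  geom_point(color = "darkblue") +
  geom_line(color = "grey50") +
  geom_text(aes(label = label), vjust = -1)

print(layered)
```

Because the aesthetics live in the base object, every layer inherits the same x and y positions unless a layer overrides them with its own `aes()`.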
There is also a quick plotting function within
ggplot2
known as
qplot()
. It is an easier place to start because it is meant to resemble the syntax of the
plot()
command within the base package in R. Let's start with one example of
qplot()
and then shift to
ggplot()
since the functionality is so much richer in the latter.
I decided to draw upon
Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Specifically, given that I just re-read some arguments on the connection between democracy and economic development (
see Przeworski et al., 2000), I will focus on GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic). The variables are called GDP2000 and Polity3 in Norris' dataset, respectively.
We can create a simple scatterplot as follows:
qplot(log(GDP2000), Polity3, data=norris.data,
ylab="Polity Score, 2000 (0=Authoritarian, 20=Democracy)", xlab="Log of Per Capita GDP (2000)",
pch=19, size=2, color=I("darkblue"), alpha=.75) + theme(legend.position="none")
This syntax should look quite similar to that of
plot()
, the main differences being that we use
size
instead of
cex
to control point size, wrap the color name in
I()
to indicate that it's a user-supplied constant (as opposed to a function of the data), and specify
alpha=.75
to create some transparency in the points. These are
ggplot2
aesthetic names and they differ from those in
plot()
. Also note that we must add
+ theme(legend.position="none")
to remove the legend that would otherwise show up to the right of the plot. Older versions of ggplot2 used opts() for this; it has since been replaced by theme(). The result is:
Now, we can make essentially the same graph, but using
ggplot()
:
plot2 <- ggplot(norris.data, aes(log(GDP2000), Polity3))
plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)")
In this case, note how we first define
plot2
as the data (norris.data) and the variables which provide the x location and y location (part of the aesthetics). Then, we call
plot2
and add the points, the y-axis scale, and the x-axis scale as separate components. Note that we can pass layer-specific settings to the plotting tools (point shape, size, color, and transparency, in this case). The result is essentially the same as above; only the point settings differ slightly from the qplot() call.
We can add a regression line to the previous plot as yet another layer as follows:
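# geom_smooth(method="lm") fits and draws the least-squares line;
# geom_abline() with no arguments would only draw the line y = x
plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)") +
geom_smooth(method="lm", se=FALSE, color="red")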
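One way to see exactly which line is being drawn is to fit the model yourself with `lm()` and hand its coefficients to `geom_abline()`, which draws a line from an explicit intercept and slope. The following is a minimal sketch under that assumption; the data frame `toy` is made up for illustration and stands in for the Norris data.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  x = 1:10,
  y = c(3.2, 4.8, 7.1, 9.3, 10.9, 13.2, 14.8, 17.1, 19.0, 21.2)
)

# Fit a simple linear regression and extract its coefficients
fit <- lm(y ~ x, data = toy)
coefs <- coef(fit)

# Draw the points, then the fitted line as a separate layer
fit_plot <- ggplot(toy, aes(x, y)) +
  geom_point(color = "darkblue") +
  geom_abline(intercept = coefs[["(Intercept)"]],
              slope = coefs[["x"]],
              color = "red")

print(fit_plot)
```

Fitting the model outside the plot also leaves `fit` available for inspecting with `summary()`.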
Suppose we want to distinguish the countries by world region. We can make the color aesthetic a function of the Region8a variable in the data (a factor variable with 8 regions) as follows:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_point(shape=20, size=3) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)") +
scale_color_discrete(name = "Regions")
Note that specifically we passed
color=Region8a
as an argument to the initial
aes()
function.
scale_color_discrete(name = "Regions")
was invoked to change the name of the legend.
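The distinction between mapping an aesthetic to a variable (inside `aes()`) and setting it to a constant (outside `aes()`) can be sketched as follows. The data frame `toy` and its `group` column are hypothetical and only serve to illustrate the idea.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(4, 3, 2, 1),
  group = factor(c("north", "north", "south", "south"))
)

# Mapped: color inside aes() varies with the data and produces a legend
mapped <- ggplot(toy, aes(x, y, color = group)) +
  geom_point(size = 3)

# Set: color outside aes() is a constant applied to every point, no legend
fixed <- ggplot(toy, aes(x, y)) +
  geom_point(size = 3, color = "darkblue")

# layer_data() returns the aesthetics actually computed for a layer
mapped_colours <- unique(layer_data(mapped)$colour)
fixed_colours <- unique(layer_data(fixed)$colour)
```

Here `mapped_colours` holds one color per level of `group`, while `fixed_colours` holds only the single constant that was set.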
We can add country names just below the points in the previous plot as follows (again, separate layer):
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_point(shape=20, size=3) +
geom_text(aes(label=Natmap), size=2.5, vjust=2.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)") +
scale_color_discrete(name = "Regions")
If we just want country names and no points:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_text(aes(label=Natmap), size=2.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)") +
scale_color_discrete(name = "Regions")
Finally, another way to make the aesthetics of the plot a function of the data is to control the size of the plotted elements. Let's make the size of the country names proportional to the amount of official development aid the countries received in 2002 as a proportion of their GDP (the Aid2002 variable in Norris' data). We do so by specifying
size=Aid2002
in the original aesthetics:
ggplot(norris.data, aes(log(GDP2000), Polity3, size=Aid2002)) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)") +
geom_text(aes(label=Natmap), color="black") +
labs(size="2002 Aid as \n % of GDP")