2/25/2012

Trellis graphs in ggplot2

Trellis graphs are an informative way of visualizing relationships between variables conditional on other variable(s). In R, one can make trellis plots either through the lattice package, or ggplot2. In this post, I'll show how to make such graphs in ggplot2.

I'm still going to use the Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Suppose we want to plot the relationship between GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic), conditional upon the region (Region8a variable).

The more common way to do so is using the lattice package in R:

library(lattice)
xyplot(Polity3 ~ log(GDP2000) | Region8a, data = norris.data,
xlab = "Logarithm of GDP (PPP annual, WB)",
ylab = "Polity Score (0=Full Authoritarianism, 20=Full Democracy)",
scales="free", pch=19, panel = function(x, y) {
panel.xyplot(x,y,pch=19, cex=0.5, col="black")
panel.lmline(x,y,col="red", lty=2)
})

The result is:

Note that the scales="free" option fits xlim and ylim for each region independently.  panel.xyplot() plots the points and  panel.lmline() fits the linear regression line for each region. Now, suppose we wanted to do this in ggplot. We are going to make use of the facet_wrap() command to create a trellis plot.

The syntax will be:
ggplot(norris.data, aes(log(GDP2000), Polity3)) +
geom_point(shape=20, size=3) +
facet_wrap(~Region8a, ncol=4, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_smooth(method="lm", color="red")

The result is:

One could remove the confidence intervals by including se=FALSE within the geom_smooth() function. One could also force the regression lines to extrapolate beyond the data by including fullrange=TRUE within the geom_smooth() function. Finally, to change the color of the confidence interval from the default gray, we can specify including fill="blue", for example, within the geom_smooth() function. Note that in general, the geom_smooth() function is quite versatile and allows a variety of smoothing methods: you can specify lm, glm, gam, loess, or rlm (not exhaustive list) for the method argument.

An alternative, more flexible, way to do this is to first create a dataframe using the plyr package that contains the intercepts and slopes for the linear regression for each region. So instead of automatically passing down the aesthetics on to the geom_smooth() function as in the previous example, we can manually specify the aesthetics of the lines we want to draw using the geom_abline() function.

library(plyr)
reg.df <- ddply(norris.data, .(Region8a), function(i)
lm(Polity3 ~ log(GDP2000), data = i)\$coefficients[1:2])
names(reg.df) <- "intercept"
names(reg.df) <- "logGDP2000"

Then we make the trellis plot as:
ggplot(norris.data, aes(log(GDP2000), Polity3)) +
geom_point(shape=20, size=3) +
facet_wrap(~Region8a, ncol=4, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_abline(aes(intercept = intercept, slope = logGDP2000), color = "red", data = reg.df)

A slightly different example of how to create trellis plots comes from the time series version of the Pippa Norris data. Suppose we want to look at the relationship between economic development and democracy over time just within easy country Central and Eastern Europe. One way to do so (as an exploratory tool at least) would be as follows:

CEE.lattice <- ggplot(norris.ts.CEE.data[norris.ts.CEE.data\$Year>=1980,], aes(logGDP_WB, Polity1, color=Year)) +
geom_path(color="black") +
geom_point(shape=20, size=3) +
facet_wrap(~Nation, ncol=5, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
opts(title="Relationship between Econ. Development and Democracy in Central and Eastern Europe")

The result is:

Note that scale_colour_gradient(breaks=c(1980,1990,1995,2000), low="yellow", high="blue") colors the points with a gradient from yellow to blue based on the year. I manually set the breaks in the gradient. Also, note that I use geom_path() to connect the points using the order of the observations. This is different than the geom_line() command, which would connect the points in order of the x-variable, which is GDP in this case (and not time).

Finally, and this holds across all of ggplot, if we wanted to make the background layer of the plots white instead of gray, we could simply pass the following argument to ggplot as a layer:

CEE.lattice + theme_bw()

The result is:

1 comment:

1. Very useful post. Thanks