Code à la Mode: ggplot2

Showing posts with label ggplot2. Show all posts

7/10/2013

Plotting regression coefficients with confidence intervals in ggplot2

A graphical approach to displaying regression coefficients / effect sizes across multiple specifications can often be significantly more powerful and intuitive than presenting a regression table. Moreover, we can easily express uncertainty in the form of confidence intervals around our estimates. As a quick example, suppose that we wanted to compare the effect of British colonial status upon country-level corruption across multiple specifications and two methods (OLS and WLS) from the following paper: Treisman, Daniel. 2000. "The causes of corruption: a cross-national study," Journal of Public Economics 76: 399-457.

Specifically, the dependent variable is TI98, the perceived corruption score calculated by Transparency International for 1998. The variable whose effect we seek is an indicator that equals 1 if the country is a former British colony or the UK, and 0 otherwise. I took the coefficients and associated standard errors on British colonial status from Tables 2 and 3 across the 5 different specifications where TI98 is the dependent variable. I then entered them into a data frame with the following structure:

coef se method specification lb ub

1 -1.99 1.01 WLS 1 -3.969564 -0.01043638

2 -1.56 0.59 WLS 2 -2.716379 -0.40362125

3 -1.25 0.52 WLS 3 -2.269181 -0.23081873

4 -1.20 0.54 WLS 4 -2.258381 -0.14161945

5 -1.04 0.79 WLS 5 -2.588372 0.50837155

6 -1.25 0.81 OLS 1 -2.837571 0.33757083

7 -1.08 0.54 OLS 2 -2.138381 -0.02161945

8 -0.98 0.53 OLS 3 -2.018781 0.05878091

9 -0.82 0.58 OLS 4 -1.956779 0.31677911

10 -1.06 0.96 OLS 5 -2.941565 0.82156543

Note that I calculated the upper bound (ub) and lower bound (lb) of the 95% confidence interval using the standard errors provided in the table (I assumed normality holds due to the Central Limit Theorem, which may be questionable in some specifications given small sample sizes).

I then generated the following plot:

Here is the code for making this plot in ggplot2 from the dataframe I provided above:

pd <- position_dodge(width=0.2,height=NULL)

ggplot(treisman, aes(specification,coef, color=method)) +
geom_point(aes(shape=method),size=4, position=pd) + 
scale_color_manual(name="Method",values=c("coral","steelblue")) + 
scale_shape_manual(name="Method",values=c(17,19)) + 
theme_bw() + 
scale_x_continuous("Specification", breaks=1:length(specification), labels=specification) + 
scale_y_continuous("Estimated effect of being a former British colony or the UK on TI98") +
geom_errorbar(aes(ymin=lb,ymax=ub),width=0.1,position=pd)

The geom_errorbar() function plots the confidence intervals. Note that I use the position_dodge() function to horizontally shift the coefficients and confidence intervals for the same specifications for clarity. The height=NULL option can be omitted. The color and shape for the legend is controlled manually.

What would happen if I just set name="Method" for the scale_color_manual command, but left out the scale_shape_manual command, letting it be automatically determined:

ggplot(treisman, aes(specification,coef, color=method)) +
geom_point(aes(shape=method),size=4, position=pd) + 
scale_color_manual(name="Method",values=c("coral","steelblue")) + 
theme_bw() + 
scale_x_continuous("Specification", breaks=1:length(specification), labels=specification) + 
scale_y_continuous("Estimated effect of being a former British colony or the UK on TI98")   + 
geom_errorbar(aes(ymin=lb,ymax=ub),width=0.1,position=pd)

This would be the plot:

This happens because I also set the shape of the points to be determined by the method variable, just as for color. I thus I need to manually give the same name to both scales, or else otherwise they are automatically broken up into two legends, one manual titled "Method" and one automatic title "method".

What if I wanted to reorder the ordering of the methods in the plot; that is, if we wanted WLS to be plotted first, then OLS?

This can be achieved with the following command before running the first block of code on this page.

df$method <- reorder(df$method,rep(1:2,each=5))

The result is the following:

Finally, suppose that we wanted to customize the x-axis labels by tilting them diagonally and changing them to a dark grey. Adding the following extra piece of code to the blocks of code above would accomplish that:

+ theme(axis.text.x=element_text(angle=45,color="darkgray"))

4/07/2012

Plotting side-by-side in ggplot2

Here's a quick example of plotting histograms next to one another in ggplot2. I wanted to plot the estimated propensity scores for treated and control units for the Lalonde non-experimental data.

First, generate the estimate propensity scores using the fitted values from a logit model where the dependent variable is the treatment indicator. The dataframe (stripped of the outcome) is called dta.nooutcome.

glm1  <- glm(TREAT~MARR+NODEGREE+BLACK+HISPANIC+EDUC+AGE+RE74+RE75+U74+U75, family=binomial, data=dta.nooutcome)
pscore1 <- glm1$fitted

Specifically, we can make two histograms for the distribution of the propensity scores corresponding to the two treatment levels.

plot1<- ggplot() + 
geom_histogram(aes(x = pscore1[dta.nooutcome$TREAT==1]), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="red") + 
scale_y_continuous(name="Count") + 
scale_x_continuous(name="Estimated Propensity Score", limits=c(0,1)) + 
theme_bw() + 
opts(title="Active Treatment Units")

plot2<-ggplot() + 
geom_histogram(aes(x = pscore1[dta.nooutcome$TREAT==0]), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="blue") + 
scale_y_continuous(name="Count") + 
scale_x_continuous(name="Estimated Propensity Score", limits=c(0,1)) + 
theme_bw() + 
opts(title="Control Treatment Units")

To arrange the two plots next to each other, we can use the gridExtra library.

grid.newpage()
pushViewport(viewport(layout=grid.layout(2,1)))
print(plot1, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plot2, vp=viewport(layout.pos.row=2, layout.pos.col=1))

Note that one can pass the number of rows and columns to the grid.layout() option. Without explicitly specifying width and height of each cell, Viewport automatically arranges the plots.

The result is:

If we want to add numbers to the top of each bar to make it easier to see the number of units in each propensity score bucket, we can modify the existing plots as follows:

plot1a <- plot1 + 
geom_text(aes(x=ggplot_build(plot1)$data[[1]]$x, y=ggplot_build(plot1)$data[[1]]$count , label = sprintf("%2.0f", ggplot_build(plot1)$data[[1]]$count)), position = position_dodge(width = 1), size = 3.5, vjust = -0.3, colour = "grey10")
plot2a <- plot2 + 
geom_text(aes(x=ggplot_build(plot2)$data[[1]]$x, y=ggplot_build(plot2)$data[[1]]$count , label = sprintf("%2.0f", ggplot_build(plot2)$data[[1]]$count)), position = position_dodge(width = 1), size = 3.5, vjust = -0.3, colour = "grey10")

grid.newpage()
pushViewport(viewport(layout=grid.layout(2,1)))
print(plot1a, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plot2a, vp=viewport(layout.pos.row=2, layout.pos.col=1))

Note that we extract the counts of units in each of the histogram bins using the ggplot_build() function and then subsetting the data. Alternatively, one could go about this whole exercise by first manually putting together a dataframe with the bins and the associated counts and passing it to the geom_bar() and geom_text() functions within the ggplot package.

The result is:

Density Plots and Histograms in ggplot2

In this post, I'm going to go through how to make plots of distributions (either density plots or histograms) in ggplot2. I'm going to draw upon examples of Fisherian testing in the context of causal inference, but the examples should be completely understandable without knowledge of Fisher's approach to inference.

For example, suppose that we want to plot the randomization distribution of a test statistic (difference in means in this case) under the sharp null of zero treatment effect across all units. The possible values of the test statistic across 1000 randomly sampled randomizations are in the vector test.stat. Moreover, the actually observed test statistic is tau.hat. Not only do we want to plot the density, but we want to shade in the area under the density that corresponds to the one-sided p-value of this test statistic.

We can first turn the vector into a data frame that contains the values as x and the corresponding density as y:

dd <- with(density(test.stat),data.frame(x,y))

Then plot in ggplot:

ggplot(data = dd, mapping = aes(x = x, y = y)) +
geom_line(color="black") + layer(data = dd, mapping = aes(x=ifelse(x < tau.hat,x,tau.hat), y=y), geom = "area", geom_params=list(fill="red",alpha=.3)) + 
scale_y_continuous(limits = c(0,max(dd$y)), name="Density") + 
scale_x_continuous(name="Difference in Means") + 
geom_vline(aes(xintercept=tau.hat.1a), color="red", linetype="dashed") + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-42, y2=0.015, text2="T[obs]"), color=I("red"), parse=TRUE) + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-60, y2=0.0025, text2="p-value = 0.041"), size=3)

Several comments on the syntax passed to the ggplot() function above:

The layer() function with the geom="area" option fills the area under the distribution corresponding to the one-sided p-value (probability of observing a value of the test statistic at least as small as the one we observe).
I manually pass the location of the text into the geom_text() function in this case by creating a data frame within the actual geom_text() function.
Note that in order for ggplot to write expressions with text, such as subscripts which I use above, I pass the parse=TRUE option to geom_text()

Here's another example that involves plotting the distribution of p-values under the sharp null of a constant additive treatment effect for each unit. I wanted to plot the values of the resultant test statistics on the x-axis, the p-values for the sharp null on the y-axis, and then shade the 95% Fisherian confidence interval. In this case, tau.vector contains the values of the test statistic (difference in means). p.vector is a corresponding vector that contains the one-sided exact Fisher p-value for the sharp null hypothesis that the effect is 0. Note that I placed a dotted horizontal red line at a p-value of 0.025 to obtain approximately 95% coverage on my confidence interval.

ggplot(data = CI.2c, mapping = aes(x = tau.vector, y = p.vector)) +
geom_line(color="black") + 
scale_y_continuous(limits = c(0,max(CI.2c$p.vector)),name="p-value") +
scale_x_continuous(name="Difference in Means") + 
layer(data = CI.2c, mapping = aes(x=ifelse((lb.2c < tau.vector & tau.vector < ub.2c),tau.vector,lb.2c), y=p.vector), geom = "area", geom_params=list(fill="yellow",alpha=.3)) + 
geom_hline(aes(yintercept=0.025), color="red", linetype="dashed") + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-100, y2=0.1, text2="95% CI: [-83,6.33]"), size=5)

Here's a very simple histogram example. Suppose that pval.5 contains a vector of p-values across all possible assignment vectors. We want to plot the distribution of Fisher's exact p-values, which should be uniformly distributed between 0 and 1 under the null (and generally when dealing with continuous test statistics). We can do so simply using ggplot as:

ggplot() + 
geom_histogram(aes(x = pval.5), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="red") +
scale_y_continuous(name="Count") + 
scale_x_continuous(name="p-value")

4/02/2012

Factor Orderings in ggplot2

One of the occasionally annoying features of R and thus ggplot2 is dealing with factors. In this post, I'll go through how to handle ordering of factors in ggplot2 and the manual assignment of colors to those categories.

For this example, I'm going to use data collected by Daniel Ziblatt on the European revolutions of 1848. Specifically, for each county in the data, we can code the county as having a revolution, having a constitutional concession, or no revolution. For each county, we also know distance from Paris (in kilometers) and the price shock in wheat in 1847. These are meant to test arguments made about the locations of of the 1848 revolutions: that countries that experienced them were closer to Paris and the French Revolution (diffusion argument) and/or revolutions were the result of economic hardships.

An initial attempt at plotting this may look something like this:

ggplot(my.data, aes(Distance, PriceSpike, color=Revolution)) +
geom_point(shape=20, size=5) +  
geom_text(aes(label=Country), size=3, hjust=-0.15) +  
scale_x_continuous("Distance from Paris (km)") + 
scale_y_continuous("Price Shock in 1847 \n (Price of Wheat in 1847/Average Price of Wheat 1830-1845)") + 
theme_bw()  +
opts(title="Food Price Shock and Distance from Paris in the Revolutions of 1848") + 
opts(legend.position=c(.12,.9), legend.background=theme_rect(fill="white"))

The result is:

Note that the factor variable that we are plotting is Revolution and ggplot2 automatically orders it in the legend (alphabetically) and assigns it default colors. Also note that I manually positioned the legend within the plotting space.

What if we wanted "Revolution" to come first in the legend, followed by "Constitutional Concession", then "No Revolution"? Moreover, what if we wanted the color for "Revolution" to be red and the color for "No Revolution" to be green?

We can create an additional vector of the length of the number of observations in the my.data dataframe called Revolution.order:

Revolution.order <- my.data$Revolution
Revolution.order[Revolution.order=="Revolution"] <- 1
Revolution.order[Revolution.order=="Constitutional Concession"] <- 2
Revolution.order[Revolution.order=="No Revolution"] <- 3
Revolution.order <- as.integer(Revolution.order)

This vector now contains the ordering of the factor variable which we can feed into ggplot through the reorder() function:

ggplot(my.data, aes(Distance, PriceSpike, color=reorder(Revolution,Revolution.order))) +
geom_point(shape=20, size=5) +  
geom_text(aes(label=Country), size=3, hjust=-0.15) + 
scale_x_continuous("Distance from Paris (km)") + 
scale_y_continuous("Price Shock in 1847 \n (Price of Wheat in 1847/Average Price of Wheat 1830-1845)") +  
theme_bw() + 
opts(title="Food Price Shock and Distance from Paris in the Revolutions of 1848") + 
scale_color_manual(name="", values=c("#D7301F", "#0570B0", "#2CA25F")) + 
opts(legend.position=c(.12,.9), legend.background=theme_rect(fill="white"))

Note that we can manually supply the colors for the factor through scale_color_manual().

The result is:

2/25/2012

Trellis graphs in ggplot2

Trellis graphs are an informative way of visualizing relationships between variables conditional on other variable(s). In R, one can make trellis plots either through the lattice package, or ggplot2. In this post, I'll show how to make such graphs in ggplot2.

I'm still going to use the Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Suppose we want to plot the relationship between GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic), conditional upon the region (Region8a variable).

The more common way to do so is using the lattice package in R:

library(lattice)
xyplot(Polity3 ~ log(GDP2000) | Region8a, data = norris.data, 
xlab = "Logarithm of GDP (PPP annual, WB)", 
ylab = "Polity Score (0=Full Authoritarianism, 20=Full Democracy)", 
scales="free", pch=19, panel = function(x, y) {
panel.xyplot(x,y,pch=19, cex=0.5, col="black")
panel.lmline(x,y,col="red", lty=2)
})

The result is:

Note that the scales="free" option fits xlim and ylim for each region independently. panel.xyplot() plots the points and panel.lmline() fits the linear regression line for each region. Now, suppose we wanted to do this in ggplot. We are going to make use of the facet_wrap() command to create a trellis plot.

The syntax will be:

ggplot(norris.data, aes(log(GDP2000), Polity3)) +  
geom_point(shape=20, size=3) + 
facet_wrap(~Region8a, ncol=4, scales="free") + 
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") + 
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_smooth(method="lm", color="red")

The result is:

One could remove the confidence intervals by including se=FALSE within the geom_smooth() function. One could also force the regression lines to extrapolate beyond the data by including fullrange=TRUE within the geom_smooth() function. Finally, to change the color of the confidence interval from the default gray, we can specify including fill="blue", for example, within the geom_smooth() function. Note that in general, the geom_smooth() function is quite versatile and allows a variety of smoothing methods: you can specify lm, glm, gam, loess, or rlm (not exhaustive list) for the method argument.

An alternative, more flexible, way to do this is to first create a dataframe using the plyr package that contains the intercepts and slopes for the linear regression for each region. So instead of automatically passing down the aesthetics on to the geom_smooth() function as in the previous example, we can manually specify the aesthetics of the lines we want to draw using the geom_abline() function.

library(plyr)
reg.df <- ddply(norris.data, .(Region8a), function(i)
               lm(Polity3 ~ log(GDP2000), data = i)$coefficients[1:2])
names(reg.df)[2] <- "intercept"
names(reg.df)[3] <- "logGDP2000"

Then we make the trellis plot as:

ggplot(norris.data, aes(log(GDP2000), Polity3)) +
geom_point(shape=20, size=3) +
facet_wrap(~Region8a, ncol=4, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_abline(aes(intercept = intercept, slope = logGDP2000), color = "red", data = reg.df)

A slightly different example of how to create trellis plots comes from the time series version of the Pippa Norris data. Suppose we want to look at the relationship between economic development and democracy over time just within easy country Central and Eastern Europe. One way to do so (as an exploratory tool at least) would be as follows:

CEE.lattice <- ggplot(norris.ts.CEE.data[norris.ts.CEE.data$Year>=1980,], aes(logGDP_WB, Polity1, color=Year)) +
geom_path(color="black") +
geom_point(shape=20, size=3) +
facet_wrap(~Nation, ncol=5, scales="free") +
scale_colour_gradient(breaks=c(1980,1990,1995,2000), low="yellow", high="blue") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
opts(title="Relationship between Econ. Development and Democracy in Central and Eastern Europe")

The result is:

Note that scale_colour_gradient(breaks=c(1980,1990,1995,2000), low="yellow", high="blue") colors the points with a gradient from yellow to blue based on the year. I manually set the breaks in the gradient. Also, note that I use geom_path() to connect the points using the order of the observations. This is different than the geom_line() command, which would connect the points in order of the x-variable, which is GDP in this case (and not time).

Finally, and this holds across all of ggplot, if we wanted to make the background layer of the plots white instead of gray, we could simply pass the following argument to ggplot as a layer:

CEE.lattice + theme_bw()

The result is:

2/12/2012

Better graphics in R: intro to ggplot2

I believe visualization of data is extremely important and not emphasized quite enough in the social sciences. At the very least, exploratory visualization of a data set should be a part of any thorough analysis, even if it doesn't make it into the final paper. On this note, I want to showcase the use of ggplot2 as an amazing package for visualization in R. Although start-up costs are higher than using the base package for plots, the final product is better in my opinion (another common package for visualization of multivariate data is lattice).

Let's look at the "grammar" of ggplot2. The 2 basic types of input into ggplot2 are:

Aesthetics: x position, y position, shape / size / color of elements
Elements / geometric shapes: points, lines, bars, text, etc.

A ggplot is invoked using the ggplot2() command. The aesthetics are typically passed to the plotting function using the aes() function. The logic behind ggplot2 is that once the aesthetics are passed through, each type of element is plotted as a separate layer. The layering property of ggplot2 will become apparent in the examples.

There is also a quick plotting function withing ggplot2 known as qplot(). It is an easier place to start because it is meant to resemble the syntax of the plot() command within the base package in R. Let's start with one example of qplot() and then shift to ggplot() since the functionality is so much richer in the latter.

I decided to draw upon Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Specifically, given that I just re-read some arguments on the connection between democracy and economic development (see Przeworski et al., 2000), I will focus on GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic). The variables are called GDP2000 and Polity3 in Norris' dataset, respectively.

We can create a simple scatterplot as follows:

qplot(log(GDP2000), Polity3, data=norris.data, 
ylab="Polity Score, 2000 (0=Authoritarian, 20=Democracy)", xlab="Log of Per Capita GDP (2000)", 
pch=19, size=2, color=I("darkblue"), alpha=.75) + opts(legend.position="none")

This syntax should be quite similar to that of plot(), the only real difference being that we use size instead of cex to control point size, use an I() around the color name to indicate that it's a user-inputted constant (as opposed to a function of the data), and finally specify alpha=.75 to create some transparency in the points. These are ggplot aesthetic names and they differ from those in plot(). Also note that we must add + opts(legend.position="none") to remove the legend that would otherwise show up to the right of the plot. The result is:

Now, we can make the exact same graph, but using ggplot():

plot2 <- ggplot(norris.data, aes(log(GDP2000), Polity3))

plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")

In this case, note how we first define plot2 as the data (norris.data) and the variables which provide the x location and y location (part of the aesthetics). Then, we call plot2 and graph the points, y-axis, and x-axis, all as separate layers. Note that we can pass layer-specific aesthetics to the plotting tools (point shape, size, and color, in this case). The result is the same as above.

We can add a regression line to the previous plot as yet another layer as follows:

plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)")
 + scale_x_continuous("Log of Per Capita GDP (2000)") + 
geom_abline(color="red")

Suppose we want to demarcate the regions in the world the countries are located in. We can make the color aesthetic a function of the Region8a variable in the data (factor variable with 8 regions) as follows:

ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_point(shape=20, size=3) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
scale_color_discrete(name = "Regions")

Note that specifically we passed color=Region8a as an argument to the initial aes() function. scale_color_discrete(name = "Regions" was invoked to change the name of the legend.

We can add country names just below the points in the previous plot as follows (again, separate layer):

ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +geom_point(shape=20, size=3) + 
geom_text(aes(label=Natmap), size=2.5, vjust=2.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")+ scale_color_discrete(name = "Regions")

If we just want country names and no points:

ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a))+ 
geom_text(aes(label=Natmap), size=2.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")+  
scale_color_discrete(name = "Regions")

Finally, another example of how we can make aesthetics of the plot a function of the data is by controlling the size of the plotted elements. Let's make the size of the points proportion to the amount of official development aid the countries received in 2002 as a proportion of their GDP (Aid2002 variable in Norris' data). We do so by specifying size=Aid2002 in the original aesthetics:

ggplot(norris.data, aes(log(GDP2000), Polity3, size=Aid2002)) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
geom_text(aes(label=Natmap), color=I("black")) +
labs(size="2002 Aid as \n % of GDP")