Code à la Mode: April 2012

4/07/2012

Plotting side-by-side in ggplot2

Here's a quick example of plotting histograms next to one another in ggplot2. I wanted to plot the estimated propensity scores for treated and control units for the Lalonde non-experimental data.

First, generate the estimate propensity scores using the fitted values from a logit model where the dependent variable is the treatment indicator. The dataframe (stripped of the outcome) is called dta.nooutcome.

glm1  <- glm(TREAT~MARR+NODEGREE+BLACK+HISPANIC+EDUC+AGE+RE74+RE75+U74+U75, family=binomial, data=dta.nooutcome)
pscore1 <- glm1$fitted

Specifically, we can make two histograms for the distribution of the propensity scores corresponding to the two treatment levels.

plot1<- ggplot() + 
geom_histogram(aes(x = pscore1[dta.nooutcome$TREAT==1]), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="red") + 
scale_y_continuous(name="Count") + 
scale_x_continuous(name="Estimated Propensity Score", limits=c(0,1)) + 
theme_bw() + 
opts(title="Active Treatment Units")

plot2<-ggplot() + 
geom_histogram(aes(x = pscore1[dta.nooutcome$TREAT==0]), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="blue") + 
scale_y_continuous(name="Count") + 
scale_x_continuous(name="Estimated Propensity Score", limits=c(0,1)) + 
theme_bw() + 
opts(title="Control Treatment Units")

To arrange the two plots next to each other, we can use the gridExtra library.

grid.newpage()
pushViewport(viewport(layout=grid.layout(2,1)))
print(plot1, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plot2, vp=viewport(layout.pos.row=2, layout.pos.col=1))

Note that one can pass the number of rows and columns to the grid.layout() option. Without explicitly specifying width and height of each cell, Viewport automatically arranges the plots.

The result is:

If we want to add numbers to the top of each bar to make it easier to see the number of units in each propensity score bucket, we can modify the existing plots as follows:

plot1a <- plot1 + 
geom_text(aes(x=ggplot_build(plot1)$data[[1]]$x, y=ggplot_build(plot1)$data[[1]]$count , label = sprintf("%2.0f", ggplot_build(plot1)$data[[1]]$count)), position = position_dodge(width = 1), size = 3.5, vjust = -0.3, colour = "grey10")
plot2a <- plot2 + 
geom_text(aes(x=ggplot_build(plot2)$data[[1]]$x, y=ggplot_build(plot2)$data[[1]]$count , label = sprintf("%2.0f", ggplot_build(plot2)$data[[1]]$count)), position = position_dodge(width = 1), size = 3.5, vjust = -0.3, colour = "grey10")

grid.newpage()
pushViewport(viewport(layout=grid.layout(2,1)))
print(plot1a, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plot2a, vp=viewport(layout.pos.row=2, layout.pos.col=1))

Note that we extract the counts of units in each of the histogram bins using the ggplot_build() function and then subsetting the data. Alternatively, one could go about this whole exercise by first manually putting together a dataframe with the bins and the associated counts and passing it to the geom_bar() and geom_text() functions within the ggplot package.

The result is:

Density Plots and Histograms in ggplot2

In this post, I'm going to go through how to make plots of distributions (either density plots or histograms) in ggplot2. I'm going to draw upon examples of Fisherian testing in the context of causal inference, but the examples should be completely understandable without knowledge of Fisher's approach to inference.

For example, suppose that we want to plot the randomization distribution of a test statistic (difference in means in this case) under the sharp null of zero treatment effect across all units. The possible values of the test statistic across 1000 randomly sampled randomizations are in the vector test.stat. Moreover, the actually observed test statistic is tau.hat. Not only do we want to plot the density, but we want to shade in the area under the density that corresponds to the one-sided p-value of this test statistic.

We can first turn the vector into a data frame that contains the values as x and the corresponding density as y:

dd <- with(density(test.stat),data.frame(x,y))

Then plot in ggplot:

ggplot(data = dd, mapping = aes(x = x, y = y)) +
geom_line(color="black") + layer(data = dd, mapping = aes(x=ifelse(x < tau.hat,x,tau.hat), y=y), geom = "area", geom_params=list(fill="red",alpha=.3)) + 
scale_y_continuous(limits = c(0,max(dd$y)), name="Density") + 
scale_x_continuous(name="Difference in Means") + 
geom_vline(aes(xintercept=tau.hat.1a), color="red", linetype="dashed") + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-42, y2=0.015, text2="T[obs]"), color=I("red"), parse=TRUE) + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-60, y2=0.0025, text2="p-value = 0.041"), size=3)

Several comments on the syntax passed to the ggplot() function above:

The layer() function with the geom="area" option fills the area under the distribution corresponding to the one-sided p-value (probability of observing a value of the test statistic at least as small as the one we observe).
I manually pass the location of the text into the geom_text() function in this case by creating a data frame within the actual geom_text() function.
Note that in order for ggplot to write expressions with text, such as subscripts which I use above, I pass the parse=TRUE option to geom_text()

Here's another example that involves plotting the distribution of p-values under the sharp null of a constant additive treatment effect for each unit. I wanted to plot the values of the resultant test statistics on the x-axis, the p-values for the sharp null on the y-axis, and then shade the 95% Fisherian confidence interval. In this case, tau.vector contains the values of the test statistic (difference in means). p.vector is a corresponding vector that contains the one-sided exact Fisher p-value for the sharp null hypothesis that the effect is 0. Note that I placed a dotted horizontal red line at a p-value of 0.025 to obtain approximately 95% coverage on my confidence interval.

ggplot(data = CI.2c, mapping = aes(x = tau.vector, y = p.vector)) +
geom_line(color="black") + 
scale_y_continuous(limits = c(0,max(CI.2c$p.vector)),name="p-value") +
scale_x_continuous(name="Difference in Means") + 
layer(data = CI.2c, mapping = aes(x=ifelse((lb.2c < tau.vector & tau.vector < ub.2c),tau.vector,lb.2c), y=p.vector), geom = "area", geom_params=list(fill="yellow",alpha=.3)) + 
geom_hline(aes(yintercept=0.025), color="red", linetype="dashed") + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-100, y2=0.1, text2="95% CI: [-83,6.33]"), size=5)

Here's a very simple histogram example. Suppose that pval.5 contains a vector of p-values across all possible assignment vectors. We want to plot the distribution of Fisher's exact p-values, which should be uniformly distributed between 0 and 1 under the null (and generally when dealing with continuous test statistics). We can do so simply using ggplot as:

ggplot() + 
geom_histogram(aes(x = pval.5), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="red") +
scale_y_continuous(name="Count") + 
scale_x_continuous(name="p-value")

4/02/2012

Factor Orderings in ggplot2

One of the occasionally annoying features of R and thus ggplot2 is dealing with factors. In this post, I'll go through how to handle ordering of factors in ggplot2 and the manual assignment of colors to those categories.

For this example, I'm going to use data collected by Daniel Ziblatt on the European revolutions of 1848. Specifically, for each county in the data, we can code the county as having a revolution, having a constitutional concession, or no revolution. For each county, we also know distance from Paris (in kilometers) and the price shock in wheat in 1847. These are meant to test arguments made about the locations of of the 1848 revolutions: that countries that experienced them were closer to Paris and the French Revolution (diffusion argument) and/or revolutions were the result of economic hardships.

An initial attempt at plotting this may look something like this:

ggplot(my.data, aes(Distance, PriceSpike, color=Revolution)) +
geom_point(shape=20, size=5) +  
geom_text(aes(label=Country), size=3, hjust=-0.15) +  
scale_x_continuous("Distance from Paris (km)") + 
scale_y_continuous("Price Shock in 1847 \n (Price of Wheat in 1847/Average Price of Wheat 1830-1845)") + 
theme_bw()  +
opts(title="Food Price Shock and Distance from Paris in the Revolutions of 1848") + 
opts(legend.position=c(.12,.9), legend.background=theme_rect(fill="white"))

The result is:

Note that the factor variable that we are plotting is Revolution and ggplot2 automatically orders it in the legend (alphabetically) and assigns it default colors. Also note that I manually positioned the legend within the plotting space.

What if we wanted "Revolution" to come first in the legend, followed by "Constitutional Concession", then "No Revolution"? Moreover, what if we wanted the color for "Revolution" to be red and the color for "No Revolution" to be green?

We can create an additional vector of the length of the number of observations in the my.data dataframe called Revolution.order:

Revolution.order <- my.data$Revolution
Revolution.order[Revolution.order=="Revolution"] <- 1
Revolution.order[Revolution.order=="Constitutional Concession"] <- 2
Revolution.order[Revolution.order=="No Revolution"] <- 3
Revolution.order <- as.integer(Revolution.order)

This vector now contains the ordering of the factor variable which we can feed into ggplot through the reorder() function:

ggplot(my.data, aes(Distance, PriceSpike, color=reorder(Revolution,Revolution.order))) +
geom_point(shape=20, size=5) +  
geom_text(aes(label=Country), size=3, hjust=-0.15) + 
scale_x_continuous("Distance from Paris (km)") + 
scale_y_continuous("Price Shock in 1847 \n (Price of Wheat in 1847/Average Price of Wheat 1830-1845)") +  
theme_bw() + 
opts(title="Food Price Shock and Distance from Paris in the Revolutions of 1848") + 
scale_color_manual(name="", values=c("#D7301F", "#0570B0", "#2CA25F")) + 
opts(legend.position=c(.12,.9), legend.background=theme_rect(fill="white"))

Note that we can manually supply the colors for the factor through scale_color_manual().

The result is: