8/01/2013

Bootstrap Confidence Interval Methods in R

This post briefly sketches out the types of bootstrapped confidence intervals commonly used, along with code in R for how to calculate them from scratch. Specifically, I focus on nonparametric confidence intervals. The post is structured around the list of bootstrap confidence interval methods provided by Canty et al. (1996). This is just a quick introduction into the world of bootstrapping - for an excellent R package in R for doing all sorts of bootstrapping, see the boot package by Brian Ripley.

Suppose that x_1, x_2, ..., x_n \sim f(\theta), where f(\cdot) is some arbitrary probability distribution. We are interested in estimating \theta. Let \hat{\theta} be the point estimate of the parameter of interest obtained from the original dataset.
Note that these are nonparameteric confidence intervals (that is, we don't assume that f is a particular parametric family).
The example comes from page 201 of DiCiccio and Efron (1996), which contains a dataset on five exams across 22 students (with some missing data). The parameter of interest is the maximum eigenvalue of the empirical covariance matrix. You may download the data here. The code for estimating the max eigenvalue is provided at the end of this post. The function is called calculate.max.eigen.
Using this function on the dataset, we obtain \hat{\theta}=216.2.
Now, suppose that we bootstrap the dataset M times and for each of the datasets, calculate \hat{\theta}^*_{m}, which is the point estimate of \theta for the mth bootstrapped dataset. The empirical distribution of \hat{\theta}^* is the bootstrap distribution and is an approximation to the sampling distribution for \hat{\theta}.

To create M=10000 bootstrap replicates in R:
BS.eigen <- function(data){
	index <- sample(1:nrow(data),size=nrow(data),replace=TRUE)
	bs.data <- data[index,]
	return(calculate.max.eigen(bs.data))
}
bs.sampling <- replicate(10000,BS.eigen(data),simplify=TRUE)
Here is the distribution of these estimates (with the median of the bootstrapped estimates denoted with the vertical blue line and \hat{\theta} denoted with a vertical red line):


We can then make use of the bootstrap distribution in a number of ways to obtained bootstrapped confidence intervals.
  • Normal intervals:
\hat{\theta} \pm z_{\alpha/2} \sigma_{bs},
where z_{\alpha/2} is the \alpha/2-quantile of the standard normal distribution and  \sigma_{bs} is the standard deviation of the bootstrap distribution. These intervals are justified by asymptotic theory (Central Limit Theorem).

In R:
theta.hat + qnorm(0.025)*sd(bs.sampling)
theta.hat + qnorm(0.975)*sd(bs.sampling)
For our example, we obtain a confidence interval of [202.64,1050.01].

  • Transformed normal intervals:
t^{-1}\{t(\hat{\theta}) \pm z_{\alpha/2} \sigma_{tbs}\},
where t(\cdot) is a variance-stabilizing square root transformation and \sigma_{tbs} is the standard deviation of the transformed bootstrap distribution (distribution of \hat{\theta}^*).

In R:
(sqrt(theta.hat) + qnorm(0.025)*sd(sqrt(bs.sampling)))^2
(sqrt(theta.hat) + qnorm(0.975)*sd(sqrt(bs.sampling)))^2
For our example, we obtain a confidence interval of [263.67, 1143.33].

  • Basic bootstrap intervals:
\left[2\hat{\theta}- \hat{\theta}^*_{m=(1-\alpha/2)M}, 2\hat{\theta}- \hat{\theta}^*_{m=(\alpha/2)M}\right]


In R:
2*theta.hat - quantile(bs.sampling,0.975)
2*theta.hat - quantile(bs.sampling,0.025)
For our example, we obtain a confidence interval of [186.45, 1018.62].


  • Transformed basic bootstrap intervals:basic bootstrap intervals with similar transformation to the transformed normal intervals.

In R:
(2*sqrt(theta.hat) - quantile(sqrt(bs.sampling),0.975))^2
(2*sqrt(theta.hat) - quantile(sqrt(bs.sampling),0.025))^2
For our example, we obtain a confidence interval of [302.76, 1208.00].


  • Percentile confidence intervals:
\left[\hat{\theta}^*_{m=(1-\alpha/2)M}, \hat{\theta}^*_{m=(\alpha/2)M}\right]


In R:
quantile(bs.sampling,0.975)
quantile(bs.sampling,0.025)
For our example, we obtain a confidence interval of [233.93, 1066.10].


  • BCa confidence intervals: refinement on the percentile confidence interval method, designed to increase accuracy. Note that BCa reduces to standard percentile confidence intervals if the bootstrap distribution is unbiased (median of the distribution is equal to the original point estimate) and the acceleration (skewness) is zero.
Step 1: Calculate the bias-correction \hat{z}_0, which gives the standard normal quantile function of the proportion of bootstrapped estimates less than the original point estimate:
\hat{z}_0 = \Phi^{-1} \left\{\dfrac{\sum \{\hat{\theta}^* < \hat{\theta} \}}{M} \right\}
In R:
z0 <- qnorm(mean(bs.sampling < theta.hat))
For our example, \hat{z}_0 is 0.194, which indicates a positive bias correction. This follows from the fact that 57.7% of \hat{\theta}^*s are below the original point estimate \hat{\theta} (downward bias). We can see this from the plot above since the median of the bootstrap distribution is to the left of the point estimate from the original dataset.

Step 2: Calculate the acceleration a, which has an interpretation of skewness (more specifically, it measures how rapidly standard error changes on a normalized scale). A non-parametric estimate of a is:
\hat{a} = \dfrac{1}{6}\dfrac{\sum_{i=1}^n (\hat{\theta}-\hat{\theta}_{(i)})^3}{(\sum_{i=1}^n (\hat{\theta}-\hat{\theta}_{(i)})^2)^{3/2}},
where \hat{\theta}_{(i)} is the jackknifed estimate of the parameter, obtained by omitting observation i from the data.

In R:
jk.theta <- sapply(1:nrow(data),function(x){
	jk.data <- data[-x,]
	return(calculate.max.eigen(jk.data))
})

a <- sum((theta.hat-jk.theta)^3)/(6*sum((theta.hat-jk.theta)^2)^(3/2))

For our example, we obtain a=0.103.

Step 3: The bias-corrected \alpha/2 endpoints for the percentile bootstrap confidence interval are then calculated as:
\hat{\theta}_{BC_a}[\alpha/2] =  \hat{G}^{-1} \left( \Phi \left(z_0 + \dfrac{z_0+z_{\alpha/2}}{1-a(z_0+z_{\alpha/2})} \right)\right),
where \hat{G}^{-1}(\cdot) is the quantile function of the bootstrap distribution.

In R:
q.lb <- pnorm(z0+(z0+qnorm(0.025))/(1-a*(z0+qnorm(0.025))))
q.ub <- pnorm(z0+(z0+qnorm(0.975))/(1-a*(z0+qnorm(0.975))))

quantile(bs.sampling,q.lb)
quantile(bs.sampling,q.ub)

We find that the BCa percentiles we should be using are actually 9.70% and 99.85% instead of 2.5% and 97.5% for a 95% confidence interval. Thus, we obtain a confidence interval of [322.0132,1320.345].


References:
Canty, A.J., A.C. Davison and D.V. Hinkley. (1996). "Comment on 'Bootstrap Confidence Intervals'."Statistical Science 11(3):214-219.
DiCiccio, Thomas J. and Bradley Efron. (1996). "Bootstrap Confidence Intervals." Statistical Science 11(3):189-228.
Efron, B. and Tibshirani R. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.
Hall. (1988). "Theoretical comparison of bootstrap confidence intervals (with discussion)." Ann. Statist. 16: 927-985.


Code for calculating maximum eigenvalue:
library(reshape2)
calculate.max.eigen <- function(df){
	n <- nrow(df)
	k <- ncol(df)
	df.melt <- melt(df)
	colnames(df.melt) <- c("test","score")
	df.melt[,3] <- as.factor(rep(1:n,times=k))
    colnames(df.melt)[3] <- "student"
	lm.out <- lm(score~test+student,data=df.melt)
	score.hat <- predict(lm.out, df.melt)
	data.hat <- matrix(data=score.hat, nrow=n, ncol=k, byrow=FALSE)
	demeaned.hat <- apply(data.hat,MARGIN=1, function(x) x-colMeans(data.hat))
	return(max(eigen(demeaned.hat%*%t(demeaned.hat)/n, only.values=TRUE)$values))
}


Code for density plot:
library(ggplot2)

df <- data.frame(bs.theta=bs.sampling)
ggplot(df, aes(x=bs.theta)) + 
geom_density(alpha=.2, fill="coral", color="steelblue") +
theme_bw() +
geom_vline(aes(xintercept=theta.hat), color="red", size=1) +
geom_vline(aes(xintercept=median(bs.sampling)), color="steelblue", size=1) +
scale_x_continuous("Bootstrapped Estimate of Max Eigenvalue") +
scale_y_continuous("Density")
}

7/10/2013

Plotting regression coefficients with confidence intervals in ggplot2

A graphical approach to displaying regression coefficients / effect sizes across multiple specifications can often be significantly more powerful and intuitive than presenting a regression table. Moreover, we can easily express uncertainty in the form of confidence intervals around our estimates. As a quick example, suppose that we wanted to compare the effect of British colonial status upon country-level corruption across multiple specifications and two methods (OLS and WLS) from the following paper: Treisman, Daniel. 2000. "The causes of corruption: a cross-national study," Journal of Public Economics 76: 399-457.

Specifically, the dependent variable is TI98, the perceived corruption score calculated by Transparency International for 1998. The variable whose effect we seek is an indicator that equals 1 if the country is a former British colony or the UK, and 0 otherwise. I took the coefficients and associated standard errors on British colonial status from Tables 2 and 3 across the 5 different specifications where TI98 is the dependent variable. I then entered them into a data frame with the following structure:

      coef    se  method     specification       lb                ub
1  -1.99   1.01     WLS             1           -3.969564 -0.01043638
2  -1.56  0.59    WLS             2           -2.716379 -0.40362125
3  -1.25   0.52    WLS             3           -2.269181 -0.23081873
4  -1.20  0.54    WLS             4           -2.258381 -0.14161945
5  -1.04  0.79    WLS             5           -2.588372  0.50837155
6  -1.25  0.81      OLS             1           -2.837571  0.33757083
7  -1.08  0.54     OLS             2           -2.138381 -0.02161945
8  -0.98  0.53    OLS             3           -2.018781  0.05878091
9  -0.82  0.58    OLS             4           -1.956779  0.31677911
10 -1.06  0.96    OLS             5           -2.941565  0.82156543

Note that I calculated the upper bound (ub) and lower bound (lb) of the 95% confidence interval using the standard errors provided in the table (I assumed normality holds due to the Central Limit Theorem, which may be questionable in some specifications given small sample sizes).

I then generated the following plot:


Here is the code for making this plot in ggplot2 from the dataframe I provided above:
pd <- position_dodge(width=0.2,height=NULL)

ggplot(treisman, aes(specification,coef, color=method)) +
geom_point(aes(shape=method),size=4, position=pd) + 
scale_color_manual(name="Method",values=c("coral","steelblue")) + 
scale_shape_manual(name="Method",values=c(17,19)) + 
theme_bw() + 
scale_x_continuous("Specification", breaks=1:length(specification), labels=specification) + 
scale_y_continuous("Estimated effect of being a former British colony or the UK on TI98") +
geom_errorbar(aes(ymin=lb,ymax=ub),width=0.1,position=pd)
The geom_errorbar() function plots the confidence intervals. Note that I use the position_dodge() function to horizontally shift the coefficients and confidence intervals for the same specifications for clarity. The height=NULL option can be omitted. The color and shape for the legend is controlled manually.

What would happen if I just set name="Method" for the scale_color_manual command, but left out the scale_shape_manual command, letting it be automatically determined:
ggplot(treisman, aes(specification,coef, color=method)) +
geom_point(aes(shape=method),size=4, position=pd) + 
scale_color_manual(name="Method",values=c("coral","steelblue")) + 
theme_bw() + 
scale_x_continuous("Specification", breaks=1:length(specification), labels=specification) + 
scale_y_continuous("Estimated effect of being a former British colony or the UK on TI98")   + 
geom_errorbar(aes(ymin=lb,ymax=ub),width=0.1,position=pd)
This would be the plot:



This happens because I also set the shape of the points to be determined by the method variable, just as for color. I thus I need to manually give the same name to both scales, or else otherwise they are automatically broken up into two legends, one manual titled "Method" and one automatic title "method".

What if I wanted to reorder the ordering of the methods in the plot; that is, if we wanted WLS to be plotted first, then OLS?

This can be achieved with the following command before running the first block of code on this page.
df$method <- reorder(df$method,rep(1:2,each=5))
The result is the following:

Finally, suppose that we wanted to customize the x-axis labels by tilting them diagonally and changing them to a dark grey. Adding the following extra piece of code to the blocks of code above would accomplish that:
+ theme(axis.text.x=element_text(angle=45,color="darkgray"))

6/20/2013

Drawing DAGs: R Solution


Following up on the previous post, another way to construct DAGs is using R. I think the igraph package is one of the customizable ways to do so. This is a powerful package designed for the visualization and analysis of networks and offers much more functionality than you will use for DAGs.

The R equivalent to the instrumental variables DAG constructed in TeX in the preceding post is the following:
To construct this in R, first you want to create an igraph object (with directed edges) as follows:
g1 <- graph.formula(Z--+T, T--+Y,U--+T, U--+Y, U--+Z, Z--+Y)
If you just view the g1 object (by running g1), you will see the following:
IGRAPH DN-- 4 6 -- 
+ attr: name (v/c)
This denotes that this is a directed named (DN) network with 4 vertices (nodes) and 6 edges. The names of the vertices are stored as an attribute (and correspond to the variable names: Z, T, U, Y). Note that we can access the vertex name attribute as V(g1)$name. Running this will help us identify the order in which the vertices were entered into the igraph object:
> V(g1)$name
[1] "Z" "A" "Y" "U"
To give the vertices custom attributes:
V(g1)$label <- V(g1)$name
V(g1)$color <- c(rep("black",3),"white") 
V(g1)$size <- 7
V(g1)$label.cex <- 1.5
V(g1)$label.dist <- 2
V(g1)$label.color <- "black"
Now the edges (first see the edge sequence by running E(g1)):
E(g1)$color <- "black"
E(g1)$color[2] <- E(g1)$color[4] <- "red"
E(g1)$lty <- c(1,2,1,2,1,1)
E(g1)$label <- c(NA,"X",NA,"X",NA,NA)
E(g1)$label.color <- "red" 
E(g1)$label.cex <- 1.5
If you run plot(g1) now, you will note that the igraph library automatically arranges the vertices. To manually override this arrangement, run the following command:
tkplot(g1)
This opens an interactive graphing screen that allows you to manually move around the vertices and arrange them as you want. Now, while leaving the popup window open, extract the new coordinates for the vertices as:
coord.g1 <- tkplot.getcoords(1) 
(Note that the 1 refers to the number of the interactive window -- this one was the first I opened in this R session).

Now, pass these coordinates to the plotting of the igraph object (along with moving around the label location relative to the vertices):
plot(g1, layout=coord.g1,vertex.label.degree=c(pi,pi/2, 0, -pi/2)

Drawing DAGs: LaTeX Solution

Directed acyclic graphs (DAGs) are commonly used to represent causal relationships between variables across a wide variety of disciplines. For an excellent (and quite accessible) textbook on the topic, please see the book Causal Inference by Miguel Hernan and Jamie Robins.

In this post, I briefly explore how you can draw DAGs in LaTeX. In the subsequent post, I will show how to draw DAGs using R.

The first example of DAG is the common instrumental variable (IV) setup:


We seek to identify the effect of treatment T on outcome Y. However, this is confounded by an unmeasured variable U. The IV is denoted as Z. Technically, we do not need to put in the crossed out red edges between U and Z and Z and Y - absence of edges encodes independence relations. However, I included them to reinforce (and make explicit) the assumptions needed for identification of causal effects using IVs:
  • (Quasi)-exogeneity of the instrument (no path from U to Z)
  • Exclusion restriction (no direct path from Z to Y)
This was made using the TikZ package in LaTeX. I used the \usepackage{pgf,tikz} command at the beginning of my document.

The code to create the DAG above is:
\begin{tikzpicture}
% nodes %
\node[text centered] (z) {$Z$};
\node[right = 1.5 of z, text centered] (t) {$T$};
\node[right=1.5 of t, text centered] (y) {$Y$};
\node[draw, rectangle, dashed, above = 1 of t, text centered] (u) {$U$};

% edges %
\draw[->, line width= 1] (z) --  (t);
\draw [->, line width= 1] (t) -- (y);
\draw[->,red, line width= 1,dashed] (u) --node {X} (z);
\draw[->,line width= 1] (u) --(t);
\draw[->,line width= 1] (u) -- (y);
\draw[->, red, line width=1,dashed] (z) to  [out=270,in=270, looseness=0.5] node{X} (y);
\end{tikzpicture}

Note that I first create the nodes (corresponding to variables in the DAG), and then draw the directed edges between the nodes.

Another example of a DAG is a simple structural equation model where we want each edge to be marked with the parameter signifying the causal effect. For example:

In LaTeX:
\begin{tikzpicture}
% nodes %
\node[text centered] (t) {$T$};
\node[right = 1.5 of t, text centered] (m) {$M$};
\node[right=1.5 of m, text centered] (y) {$Y$};

% edges %
\draw[->, line width= 1] (t) -- node[above,font=\footnotesize]{$\beta$}  (m);
\draw [->, line width= 1] (m) -- node[above,font=\footnotesize]{$\gamma$}  (y);
\draw[->, line width=1] (t) to  [out=270,in=270, looseness=0.5] node[below,font=\footnotesize]{$\delta$} (y);
\end{tikzpicture}