7/10/2013

Plotting regression coefficients with confidence intervals in ggplot2

A graphical approach to displaying regression coefficients / effect sizes across multiple specifications can often be significantly more powerful and intuitive than presenting a regression table. Moreover, we can easily express uncertainty in the form of confidence intervals around our estimates. As a quick example, suppose that we wanted to compare the effect of British colonial status upon country-level corruption across multiple specifications and two methods (OLS and WLS) from the following paper: Treisman, Daniel. 2000. "The causes of corruption: a cross-national study," Journal of Public Economics 76: 399-457.

Specifically, the dependent variable is TI98, the perceived corruption score calculated by Transparency International for 1998. The variable whose effect we seek is an indicator that equals 1 if the country is a former British colony or the UK, and 0 otherwise. I took the coefficients and associated standard errors on British colonial status from Tables 2 and 3 across the 5 different specifications where TI98 is the dependent variable. I then entered them into a data frame with the following structure:

      coef    se  method     specification       lb                ub
1  -1.99   1.01     WLS             1           -3.969564 -0.01043638
2  -1.56  0.59    WLS             2           -2.716379 -0.40362125
3  -1.25   0.52    WLS             3           -2.269181 -0.23081873
4  -1.20  0.54    WLS             4           -2.258381 -0.14161945
5  -1.04  0.79    WLS             5           -2.588372  0.50837155
6  -1.25  0.81      OLS             1           -2.837571  0.33757083
7  -1.08  0.54     OLS             2           -2.138381 -0.02161945
8  -0.98  0.53    OLS             3           -2.018781  0.05878091
9  -0.82  0.58    OLS             4           -1.956779  0.31677911
10 -1.06  0.96    OLS             5           -2.941565  0.82156543

Note that I calculated the upper bound (ub) and lower bound (lb) of the 95% confidence interval using the standard errors provided in the table (I assumed normality holds due to the Central Limit Theorem, which may be questionable in some specifications given small sample sizes).

I then generated the following plot:


Here is the code for making this plot in ggplot2 from the dataframe I provided above:
pd <- position_dodge(width=0.2,height=NULL)

ggplot(treisman, aes(specification,coef, color=method)) +
geom_point(aes(shape=method),size=4, position=pd) + 
scale_color_manual(name="Method",values=c("coral","steelblue")) + 
scale_shape_manual(name="Method",values=c(17,19)) + 
theme_bw() + 
scale_x_continuous("Specification", breaks=1:length(specification), labels=specification) + 
scale_y_continuous("Estimated effect of being a former British colony or the UK on TI98") +
geom_errorbar(aes(ymin=lb,ymax=ub),width=0.1,position=pd)
The geom_errorbar() function plots the confidence intervals. Note that I use the position_dodge() function to horizontally shift the coefficients and confidence intervals for the same specifications for clarity. The height=NULL option can be omitted. The color and shape for the legend is controlled manually.

What would happen if I just set name="Method" for the scale_color_manual command, but left out the scale_shape_manual command, letting it be automatically determined:
ggplot(treisman, aes(specification,coef, color=method)) +
geom_point(aes(shape=method),size=4, position=pd) + 
scale_color_manual(name="Method",values=c("coral","steelblue")) + 
theme_bw() + 
scale_x_continuous("Specification", breaks=1:length(specification), labels=specification) + 
scale_y_continuous("Estimated effect of being a former British colony or the UK on TI98")   + 
geom_errorbar(aes(ymin=lb,ymax=ub),width=0.1,position=pd)
This would be the plot:



This happens because I also set the shape of the points to be determined by the method variable, just as for color. I thus I need to manually give the same name to both scales, or else otherwise they are automatically broken up into two legends, one manual titled "Method" and one automatic title "method".

What if I wanted to reorder the ordering of the methods in the plot; that is, if we wanted WLS to be plotted first, then OLS?

This can be achieved with the following command before running the first block of code on this page.
df$method <- reorder(df$method,rep(1:2,each=5))
The result is the following:

Finally, suppose that we wanted to customize the x-axis labels by tilting them diagonally and changing them to a dark grey. Adding the following extra piece of code to the blocks of code above would accomplish that:
+ theme(axis.text.x=element_text(angle=45,color="darkgray"))

6/20/2013

Drawing DAGs: R Solution


Following up on the previous post, another way to construct DAGs is using R. I think the igraph package is one of the customizable ways to do so. This is a powerful package designed for the visualization and analysis of networks and offers much more functionality than you will use for DAGs.

The R equivalent to the instrumental variables DAG constructed in TeX in the preceding post is the following:
To construct this in R, first you want to create an igraph object (with directed edges) as follows:
g1 <- graph.formula(Z--+T, T--+Y,U--+T, U--+Y, U--+Z, Z--+Y)
If you just view the g1 object (by running g1), you will see the following:
IGRAPH DN-- 4 6 -- 
+ attr: name (v/c)
This denotes that this is a directed named (DN) network with 4 vertices (nodes) and 6 edges. The names of the vertices are stored as an attribute (and correspond to the variable names: Z, T, U, Y). Note that we can access the vertex name attribute as V(g1)$name. Running this will help us identify the order in which the vertices were entered into the igraph object:
> V(g1)$name
[1] "Z" "A" "Y" "U"
To give the vertices custom attributes:
V(g1)$label <- V(g1)$name
V(g1)$color <- c(rep("black",3),"white") 
V(g1)$size <- 7
V(g1)$label.cex <- 1.5
V(g1)$label.dist <- 2
V(g1)$label.color <- "black"
Now the edges (first see the edge sequence by running E(g1)):
E(g1)$color <- "black"
E(g1)$color[2] <- E(g1)$color[4] <- "red"
E(g1)$lty <- c(1,2,1,2,1,1)
E(g1)$label <- c(NA,"X",NA,"X",NA,NA)
E(g1)$label.color <- "red" 
E(g1)$label.cex <- 1.5
If you run plot(g1) now, you will note that the igraph library automatically arranges the vertices. To manually override this arrangement, run the following command:
tkplot(g1)
This opens an interactive graphing screen that allows you to manually move around the vertices and arrange them as you want. Now, while leaving the popup window open, extract the new coordinates for the vertices as:
coord.g1 <- tkplot.getcoords(1) 
(Note that the 1 refers to the number of the interactive window -- this one was the first I opened in this R session).

Now, pass these coordinates to the plotting of the igraph object (along with moving around the label location relative to the vertices):
plot(g1, layout=coord.g1,vertex.label.degree=c(pi,pi/2, 0, -pi/2)

Drawing DAGs: LaTeX Solution

Directed acyclic graphs (DAGs) are commonly used to represent causal relationships between variables across a wide variety of disciplines. For an excellent (and quite accessible) textbook on the topic, please see the book Causal Inference by Miguel Hernan and Jamie Robins.

In this post, I briefly explore how you can draw DAGs in LaTeX. In the subsequent post, I will show how to draw DAGs using R.

The first example of DAG is the common instrumental variable (IV) setup:


We seek to identify the effect of treatment T on outcome Y. However, this is confounded by an unmeasured variable U. The IV is denoted as Z. Technically, we do not need to put in the crossed out red edges between U and Z and Z and Y - absence of edges encodes independence relations. However, I included them to reinforce (and make explicit) the assumptions needed for identification of causal effects using IVs:
  • (Quasi)-exogeneity of the instrument (no path from U to Z)
  • Exclusion restriction (no direct path from Z to Y)
This was made using the TikZ package in LaTeX. I used the \usepackage{pgf,tikz} command at the beginning of my document.

The code to create the DAG above is:
\begin{tikzpicture}
% nodes %
\node[text centered] (z) {$Z$};
\node[right = 1.5 of z, text centered] (t) {$T$};
\node[right=1.5 of t, text centered] (y) {$Y$};
\node[draw, rectangle, dashed, above = 1 of t, text centered] (u) {$U$};

% edges %
\draw[->, line width= 1] (z) --  (t);
\draw [->, line width= 1] (t) -- (y);
\draw[->,red, line width= 1,dashed] (u) --node {X} (z);
\draw[->,line width= 1] (u) --(t);
\draw[->,line width= 1] (u) -- (y);
\draw[->, red, line width=1,dashed] (z) to  [out=270,in=270, looseness=0.5] node{X} (y);
\end{tikzpicture}

Note that I first create the nodes (corresponding to variables in the DAG), and then draw the directed edges between the nodes.

Another example of a DAG is a simple structural equation model where we want each edge to be marked with the parameter signifying the causal effect. For example:

In LaTeX:
\begin{tikzpicture}
% nodes %
\node[text centered] (t) {$T$};
\node[right = 1.5 of t, text centered] (m) {$M$};
\node[right=1.5 of m, text centered] (y) {$Y$};

% edges %
\draw[->, line width= 1] (t) -- node[above,font=\footnotesize]{$\beta$}  (m);
\draw [->, line width= 1] (m) -- node[above,font=\footnotesize]{$\gamma$}  (y);
\draw[->, line width=1] (t) to  [out=270,in=270, looseness=0.5] node[below,font=\footnotesize]{$\delta$} (y);
\end{tikzpicture}

8/08/2012

Installing rgdal for R on Mac

rgdal provides an interface between R and the GDAL/OGR library, which provides extensive support for a variety of geospatial formats. It is extremely useful for data import and export tasks, particularly because it can read projection information (from .prj files). However, to use rgdal, one must install GDAL and other frameworks on your system first. This is a guide for how to install rgdal on a Mac.

1. Download and install the GDAL, PROJ, and GEOS frameworks: http://www.kyngchaos.com/software/frameworks. The easiest method is just to install all the frameworks using the GDAL 1.9 Complete installer.

2. Verify the directories in which the frameworks were installed. For me, the directories were as follows:
  • /Library/Frameworks/GDAL.framework/
  • /Library/Frameworks/GEOS.framework/
  • /Library/Frameworks/PROJ.framework/

3. Verify that GCC is installed. The easiest way to do so is by opening up Terminal and typing gcc --version. If you receive an error, it means that you are either missing GCC or Terminal can't find it. The easiest way to get GCC on a Mac is to install the Command Line Tools for XCode. This is what I did, and for me, when I run the command gcc --version, the output is "i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)".

4.  Download the latest version of rgdal from CRAN: http://cran.r-project.org/web/packages/rgdal/index.html.

5. Now, in Terminal, cd to the location of the tarball for rgdal and run the following command: 
R CMD INSTALL --configure-args='--with-gdal-config=/Library/Frameworks/GDAL.framework/unix/bin/gdal-config --with-proj-include=/Library/Frameworks/PROJ.framework/unix/include --with-proj-lib=/Library/Frameworks/PROJ.framework/unix/lib' rgdal_0.7-12.tar.gz

6. If the installation script runs without error in Terminal, you're ready to load the package in R using
library(rgdal).

Note: if upon attempting to load the rgdal library in R using you receive an error of the form "Error: package ‘rgdal’ is not installed for 'arch=x86_64'", this means that rgdal was installed for a different architecture of R (32-bit instead of 64-bit in this case). To install rgdal explicitly for the 64-bit architecture, you can modify Step 5 above by adding in the <code>--arch x86_64</code> option:
R --arch x86_64 CMD INSTALL --configure-args='--with-gdal-config=/Library/Frameworks/GDAL.framework/unix/bin/gdal-config --with-proj-include=/Library/Frameworks/PROJ.framework/unix/include --with-proj-lib=/Library/Frameworks/PROJ.framework/unix/lib' rgdal_0.7-12.tar.gz

8/07/2012

Installing MinionPro font for LaTeX on Mac OS X

This tutorial shows how to install the MinionPro font for LaTeX. Following these instructions should insure that you can compile LaTeX documents with the \usepackage{MinionPro} command (instead of using XeTeX). Throughout this explanation, the TeX tree that I refer to is located at: /Library/TeX/Root/texmf/. Note that this is equivalent to /usr/local/texlive/2012/texmf/ for my 2012 TeX distribution.

In Terminal, I created a variable for the TeX tree directory:
texmf_folder=/Library/TeX/Root/texmf/


Part 1: Install MnSymbol

1) Download MnSymbol package, which provides mathematical symbol font for Adobe MinionPro, from CRAN in the form of a zip file. Unzip the file to form a temporary directory (I just created a mnsymbol directory in ~/Downloads temporarily).

2) Using Terminal:
cd ~/Downloads/mnsymbol/tex
latex MnSymbol.ins 

3) Move contents from MnSymbol directory to appropriate locations of TeX tree:

First create the new directories in TeX tree:
sudo mkdir -p $texmf_folder/tex/latex/MnSymbol/
sudo mkdir -p $texmf_folder/fonts/source/public/MnSymbol/
sudo mkdir -p $texmf_folder/doc/latex/MnSymbol

Move MnSymbol.sty to /usr/local/texlive/2012/texmf/tex/latex/MnSymbol.sty:
sudo cp MnSymbol.sty $texmf_folder/tex/latex/MnSymbol/MnSymbol.sty 

Move all contents of directory source to /texmf/fonts/source/public/MnSymbol/:
cd ..
sudo cp source/* $texmf_folder/fonts/source/public/MnSymbol/

Move documentation:
sudo cp MnSymbol.pdf README $texmf_folder/doc/latex/MnSymbol/

4) To install the PostScript fonts into TeX tree:
sudo mkdir -p $texmf_folder/fonts/map/dvips/MnSymbol
sudo mkdir -p $texmf_folder/fonts/enc/dvips/MnSymbol
sudo mkdir -p $texmf_folder/fonts/type1/public/MnSymbol
sudo mkdir -p $texmf_folder/fonts/tfm/public/MnSymbol

sudo cp enc/MnSymbol.map $texmf_folder/fonts/map/dvips/MnSymbol/
sudo cp enc/*.enc $texmf_folder/fonts/enc/dvips/MnSymbol/
sudo cp pfb/*.pfb $texmf_folder/fonts/type1/public/MnSymbol/
sudo cp tfm/* $texmf_folder/fonts/tfm/public/MnSymbol/

5) Regenerate TeX tree file database:
sudo mktexlsr
sudo updmap-sys --enable MixedMap MnSymbol.map

The following file should now compile: mnsymbol-test.tex



Part 2: Install MinionPro

1) Locate MinionPro OpenType font files. On Mac OS X, fonts are generally located under /Library/Fonts/. However, on my computer the Adobe fonts are under /Applications/Adobe Reader.app/Contents/Resources/Resource/Font/ (alternatively sometimes /Library/Application Support/Adobe/Fonts).

Note that before proceeding, double-check that you have a working installation of lcdf-typetools, needed to convert the OpenType fonts. On my system, LCDF Typetools were already installed (can check in Terminal if otfinfo --help command works).

Figure out version of fonts using following command in Terminal:
otfinfo -v MinionPro-Regular.otf

Output for my fonts:
Version 2.015;PS 002.000;Core 1.0.38;makeotf.lib1.7.9032

2) Go to CTAN directory for MinionPro.
  • Download scripts.zip. This will contains convert.sh, a Unix script. Unzip as a temporary directory wherever you want (I placed in ~/Downloads).
  • Download the appropriate encoding file, corresponding to the version of the MinionPro font you are using. My MinionPro font is of version 2.000, so I downloaded enc-v2.000.zip. I left the zipfile in ~/Downloads.
  • Download metrics-base.zip (contains metrics for Regular, It, Bold, and BoldIt fonts) and, optionally, metrics-full.zip (contains additional metrics for Medium, Medium Italic, Semibold, and SemiboldItalic fonts). I left the zipfiles in ~/Downloads.

3) We need to convert the OpenType font files (.otf) to PostScript files (.pfb). In order to do this, copy the MinionPro OpenType files (MinionPro-Regular.otf, MinionPro-Bold.otf, MinionPro-It.otf, and MinionPro-BoldIt.otf) from pre-identified directory to ~Downloads/scripts/otf. 

Then in Terminal: 
cd ~/Downloads/scripts
./convert.sh

4) Copy PostScript fonts to TeX tree:
sudo mkdir -p $texmf_folder/fonts/type1/adobe/MinionPro/
sudo cp pfb/*.pfb $texmf_folder/fonts/type1/adobe/MinionPro/

5) Unzip enc-2.000.zip, metrics-base.zip, and metrics-full.zip files into the TeX directory:
cd $texmf_folder
sudo unzip ~/Downloads/enc-2.000.zip
sudo unzip ~/Downloads/metrics-base.zip
sudo unzip ~/Downloads/metrics-full.zip

6) Regenerate TeX tree file database:
sudo mktexlsr
sudo updmap-sys --enable MixedMap MinionPro.map

The following file should now compile: minionpro-test.tex


I would like to thank the following tutorials for help in figuring this out:
http://carlo-hamalainen.net/blog/2007/12/11/installing-minion-pro-fonts/
http://jklukas.blogspot.com/2010/02/installing-minionpro-tex-package.html

My favorite applications for OS X

I've recently upgraded to the mid-2012 version of the MacBook Pro, which has given me a chance to completely install all my programs from scratch, as well as re-evaluate the applications I use. After significant research, here are the the applications that I have found extremely useful for Mac OS X, by category:


Productivity and Organization Tools:
  • Evernote: great note-taking and web clipping utility that synchs with the web and across devices (including iPhone). Also check out Evernote Trunk for many additional, productivity-enhancing add-ons.
  • Papers2: pdf organization program (costs ~$50 with student discount)
  • 1Password: password manager (costs $50)
  • Transmit: Mac OS X FTP client

Text Editors:
  • Aquamacs: An editor for text, HTML, LaTeX, C++, Java, Python, R, Perl, Ruby, PHP, and other languages (I use this mostly for LaTeX and Python). Based on GNU Emacs platform. 
  • TextWrangler: professional HTML and text editor (and for other languages).

Graphics:
  • Seashore: open source image editor for Mac. It is simpler to use than GIMP because it seeks to provide basic image editing needs, not to replace professional image editing products.
  • Gimp: GNU Image Manipulation Program that is great open-source replacement for Photoshop. Useful for photo retouching, image composition and image authoring.
Note: Gimp requires X11, an application that implements the X Window System. However, Apple no longer provides X11 support for Mountain Lion directly (as X11.app). However, one can download XQuartz, which is an open source version of X11 that runs on Mountain Lion and allows to run all X11-based applications in Mac OS X.


PDF Reader:
  • Skim: PDF reader and note-taker. Integrates with Timestamper for Mac application developed for   recording timing of pdf (Beamer) slides during a lecture.

Music and Videos:
  • Spotify: stream music for free.
  • VLC: open source and cross-platform multimedia player.

Security:
  • ClamXav: free virus checker for Mac OS X. Based upon open source ClamAV engine.

Developer Tools:
  • XCode: Apple's developer environment. Includes GCC, the GNU Compiler Collection. I recommend using the Command Line Tools for XCode, which can be installed from within XCode using the "Downloads" pane within "Preferences". Command Line Tools provide UNIX-style development via Terminal by installing command line developer tools (as well as MAC OS X SDK frameworks). Note that these can be installed without installing XCode here.
  • Homebrew: command line package manager for Mac OS X.

4/07/2012

Plotting side-by-side in ggplot2

Here's a quick example of plotting histograms next to one another in ggplot2. I wanted to plot the estimated propensity scores for treated and control units for the Lalonde non-experimental data.

First, generate the estimate propensity scores using the fitted values from a logit model where the dependent variable is the treatment indicator. The dataframe (stripped of the outcome) is called dta.nooutcome.


glm1  <- glm(TREAT~MARR+NODEGREE+BLACK+HISPANIC+EDUC+AGE+RE74+RE75+U74+U75, family=binomial, data=dta.nooutcome)
pscore1 <- glm1$fitted

Specifically, we can make two histograms for the distribution of the propensity scores corresponding to the two treatment levels.

plot1<- ggplot() + 
geom_histogram(aes(x = pscore1[dta.nooutcome$TREAT==1]), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="red") + 
scale_y_continuous(name="Count") + 
scale_x_continuous(name="Estimated Propensity Score", limits=c(0,1)) + 
theme_bw() + 
opts(title="Active Treatment Units")

plot2<-ggplot() + 
geom_histogram(aes(x = pscore1[dta.nooutcome$TREAT==0]), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="blue") + 
scale_y_continuous(name="Count") + 
scale_x_continuous(name="Estimated Propensity Score", limits=c(0,1)) + 
theme_bw() + 
opts(title="Control Treatment Units") 

To arrange the two plots next to each other, we can use the gridExtra library.
grid.newpage()
pushViewport(viewport(layout=grid.layout(2,1)))
print(plot1, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plot2, vp=viewport(layout.pos.row=2, layout.pos.col=1))

Note that one can pass the number of rows and columns to the grid.layout() option. Without explicitly specifying width and height of each cell, Viewport automatically arranges the plots.

The result is:


If we want to add numbers to the top of each bar to make it easier to see the number of units in each propensity score bucket, we can modify the existing plots as follows:

plot1a <- plot1 + 
geom_text(aes(x=ggplot_build(plot1)$data[[1]]$x, y=ggplot_build(plot1)$data[[1]]$count , label = sprintf("%2.0f", ggplot_build(plot1)$data[[1]]$count)), position = position_dodge(width = 1), size = 3.5, vjust = -0.3, colour = "grey10")
plot2a <- plot2 + 
geom_text(aes(x=ggplot_build(plot2)$data[[1]]$x, y=ggplot_build(plot2)$data[[1]]$count , label = sprintf("%2.0f", ggplot_build(plot2)$data[[1]]$count)), position = position_dodge(width = 1), size = 3.5, vjust = -0.3, colour = "grey10")

grid.newpage()
pushViewport(viewport(layout=grid.layout(2,1)))
print(plot1a, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plot2a, vp=viewport(layout.pos.row=2, layout.pos.col=1))

Note that we extract the counts of units in each of the histogram bins using the ggplot_build() function and then subsetting the data. Alternatively, one could go about this whole exercise by first manually putting together a dataframe with the bins and the associated counts and passing it to the geom_bar() and geom_text() functions within the ggplot package.

The result is: