8/08/2012

Installing rgdal for R on Mac

rgdal provides an interface between R and the GDAL/OGR library, which supports a wide variety of geospatial formats. It is extremely useful for data import and export tasks, particularly because it can read projection information (from .prj files). However, to use rgdal, you must first install GDAL and the related frameworks on your system. This is a guide to installing rgdal on a Mac.

1. Download and install the GDAL, PROJ, and GEOS frameworks: http://www.kyngchaos.com/software/frameworks. The easiest method is just to install all the frameworks using the GDAL 1.9 Complete installer.

2. Verify the directories in which the frameworks were installed. For me, the directories were as follows:
  • /Library/Frameworks/GDAL.framework/
  • /Library/Frameworks/GEOS.framework/
  • /Library/Frameworks/PROJ.framework/

3. Verify that GCC is installed. The easiest way to do so is by opening Terminal and typing gcc --version. If you receive an error, it means that you are either missing GCC or Terminal can't find it. The easiest way to get GCC on a Mac is to install the Command Line Tools for Xcode. This is what I did, and for me, when I run gcc --version, the output is "i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)".

4.  Download the latest version of rgdal from CRAN: http://cran.r-project.org/web/packages/rgdal/index.html.

5. Now, in Terminal, cd to the location of the tarball for rgdal and run the following command: 
R CMD INSTALL --configure-args='--with-gdal-config=/Library/Frameworks/GDAL.framework/unix/bin/gdal-config --with-proj-include=/Library/Frameworks/PROJ.framework/unix/include --with-proj-lib=/Library/Frameworks/PROJ.framework/unix/lib' rgdal_0.7-12.tar.gz

6. If the installation script runs without error in Terminal, you're ready to load the package in R using
library(rgdal).
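As a quick sanity check that rgdal can see the GDAL/PROJ frameworks, you can run something like the following in R. The shapefile path and layer name here are placeholders for your own data, not files from this tutorial:

library(rgdal)
getGDALVersionInfo()                                   # should report the GDAL version from the framework
counties <- readOGR(dsn = "data", layer = "counties")  # hypothetical shapefile "counties" in a folder "data"
proj4string(counties)                                  # projection information read from the .prj file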

Note: if, upon attempting to load the rgdal library in R, you receive an error of the form "Error: package ‘rgdal’ is not installed for 'arch=x86_64'", this means that rgdal was installed for a different architecture of R (32-bit instead of 64-bit in this case). To install rgdal explicitly for the 64-bit architecture, modify Step 5 above by adding the --arch x86_64 option:
R --arch x86_64 CMD INSTALL --configure-args='--with-gdal-config=/Library/Frameworks/GDAL.framework/unix/bin/gdal-config --with-proj-include=/Library/Frameworks/PROJ.framework/unix/include --with-proj-lib=/Library/Frameworks/PROJ.framework/unix/lib' rgdal_0.7-12.tar.gz
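If you are unsure which architecture your R session is running, a quick check from within R (both of these are base R):

R.version$arch           # e.g. "x86_64" for a 64-bit build of R
.Machine$sizeof.pointer  # 8 on a 64-bit build, 4 on a 32-bit build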

8/07/2012

Installing MinionPro font for LaTeX on Mac OS X

This tutorial shows how to install the MinionPro font for LaTeX. Following these instructions should ensure that you can compile LaTeX documents with the \usepackage{MinionPro} command (instead of using XeTeX). Throughout this explanation, the TeX tree that I refer to is located at /Library/TeX/Root/texmf/. Note that this is equivalent to /usr/local/texlive/2012/texmf/ for my 2012 TeX distribution.

In Terminal, I created a variable for the TeX tree directory:
texmf_folder=/Library/TeX/Root/texmf/


Part 1: Install MnSymbol

1) Download the MnSymbol package, which provides a mathematical symbol font for Adobe MinionPro, from CTAN in the form of a zip file. Unzip the file into a temporary directory (I just created a mnsymbol directory in ~/Downloads temporarily).

2) Using Terminal:
cd ~/Downloads/mnsymbol/tex
latex MnSymbol.ins 

3) Move the contents of the MnSymbol directory to the appropriate locations in the TeX tree:

First create the new directories in TeX tree:
sudo mkdir -p $texmf_folder/tex/latex/MnSymbol/
sudo mkdir -p $texmf_folder/fonts/source/public/MnSymbol/
sudo mkdir -p $texmf_folder/doc/latex/MnSymbol

Move MnSymbol.sty to $texmf_folder/tex/latex/MnSymbol/ (i.e., /usr/local/texlive/2012/texmf/tex/latex/MnSymbol/ in my case):
sudo cp MnSymbol.sty $texmf_folder/tex/latex/MnSymbol/MnSymbol.sty 

Move the contents of the source directory to $texmf_folder/fonts/source/public/MnSymbol/:
cd ..
sudo cp source/* $texmf_folder/fonts/source/public/MnSymbol/

Move documentation:
sudo cp MnSymbol.pdf README $texmf_folder/doc/latex/MnSymbol/

4) To install the PostScript fonts into the TeX tree:
sudo mkdir -p $texmf_folder/fonts/map/dvips/MnSymbol
sudo mkdir -p $texmf_folder/fonts/enc/dvips/MnSymbol
sudo mkdir -p $texmf_folder/fonts/type1/public/MnSymbol
sudo mkdir -p $texmf_folder/fonts/tfm/public/MnSymbol

sudo cp enc/MnSymbol.map $texmf_folder/fonts/map/dvips/MnSymbol/
sudo cp enc/*.enc $texmf_folder/fonts/enc/dvips/MnSymbol/
sudo cp pfb/*.pfb $texmf_folder/fonts/type1/public/MnSymbol/
sudo cp tfm/* $texmf_folder/fonts/tfm/public/MnSymbol/

5) Regenerate TeX tree file database:
sudo mktexlsr
sudo updmap-sys --enable MixedMap MnSymbol.map

The following file should now compile: mnsymbol-test.tex



Part 2: Install MinionPro

1) Locate MinionPro OpenType font files. On Mac OS X, fonts are generally located under /Library/Fonts/. However, on my computer the Adobe fonts are under /Applications/Adobe Reader.app/Contents/Resources/Resource/Font/ (alternatively sometimes /Library/Application Support/Adobe/Fonts).

Note that before proceeding, double-check that you have a working installation of lcdf-typetools, which is needed to convert the OpenType fonts. On my system, LCDF Typetools was already installed (you can check in Terminal whether the otfinfo --help command works).

Figure out the version of your fonts using the following command in Terminal:
otfinfo -v MinionPro-Regular.otf

Output for my fonts:
Version 2.015;PS 002.000;Core 1.0.38;makeotf.lib1.7.9032

2) Go to the CTAN directory for MinionPro.
  • Download scripts.zip. This contains convert.sh, a Unix script. Unzip it into a temporary directory wherever you want (I placed it in ~/Downloads).
  • Download the appropriate encoding file, corresponding to the version of the MinionPro font you are using. My MinionPro font is version 2.000, so I downloaded enc-2.000.zip. I left the zipfile in ~/Downloads.
  • Download metrics-base.zip (contains metrics for Regular, It, Bold, and BoldIt fonts) and, optionally, metrics-full.zip (contains additional metrics for Medium, Medium Italic, Semibold, and SemiboldItalic fonts). I left the zipfiles in ~/Downloads.

3) We need to convert the OpenType font files (.otf) to PostScript files (.pfb). To do this, copy the MinionPro OpenType files (MinionPro-Regular.otf, MinionPro-Bold.otf, MinionPro-It.otf, and MinionPro-BoldIt.otf) from the directory identified in step 1 to ~/Downloads/scripts/otf.

Then in Terminal: 
cd ~/Downloads/scripts
./convert.sh

4) Copy PostScript fonts to TeX tree:
sudo mkdir -p $texmf_folder/fonts/type1/adobe/MinionPro/
sudo cp pfb/*.pfb $texmf_folder/fonts/type1/adobe/MinionPro/

5) Unzip the enc-2.000.zip, metrics-base.zip, and metrics-full.zip files into the TeX tree:
cd $texmf_folder
sudo unzip ~/Downloads/enc-2.000.zip
sudo unzip ~/Downloads/metrics-base.zip
sudo unzip ~/Downloads/metrics-full.zip

6) Regenerate TeX tree file database:
sudo mktexlsr
sudo updmap-sys --enable MixedMap MinionPro.map

The following file should now compile: minionpro-test.tex


Thanks to the following tutorials, which helped me figure this out:
http://carlo-hamalainen.net/blog/2007/12/11/installing-minion-pro-fonts/
http://jklukas.blogspot.com/2010/02/installing-minionpro-tex-package.html

My favorite applications for OS X

I've recently upgraded to the mid-2012 version of the MacBook Pro, which has given me a chance to install all my programs from scratch and re-evaluate the applications I use. After significant research, here are the applications that I have found extremely useful for Mac OS X, by category:


Productivity and Organization Tools:
  • Evernote: great note-taking and web clipping utility that syncs with the web and across devices (including the iPhone). Also check out Evernote Trunk for many additional, productivity-enhancing add-ons.
  • Papers2: pdf organization program (costs ~$50 with student discount)
  • 1Password: password manager (costs $50)
  • Transmit: Mac OS X FTP client

Text Editors:
  • Aquamacs: an editor for text, HTML, LaTeX, C++, Java, Python, R, Perl, Ruby, PHP, and other languages (I use it mostly for LaTeX and Python). Based on the GNU Emacs platform.
  • TextWrangler: a professional text and HTML editor, with support for many other languages.

Graphics:
  • Seashore: open source image editor for Mac. It is simpler to use than GIMP because it aims to cover basic image editing needs rather than to replace professional image editing products.
  • GIMP: the GNU Image Manipulation Program, a great open-source replacement for Photoshop. Useful for photo retouching, image composition, and image authoring.
Note: GIMP requires X11, an application that implements the X Window System. Apple no longer provides X11 support for Mountain Lion directly (as X11.app). However, you can download XQuartz, an open source version of X11 that runs on Mountain Lion and lets you run X11-based applications in Mac OS X.


PDF Reader:
  • Skim: PDF reader and note-taker. Integrates with the Timestamper for Mac application, which was developed for recording the timing of PDF (Beamer) slides during a lecture.

Music and Videos:
  • Spotify: stream music for free.
  • VLC: open source and cross-platform multimedia player.

Security:
  • ClamXav: free virus checker for Mac OS X. Based on the open source ClamAV engine.

Developer Tools:
  • Xcode: Apple's developer environment. Includes GCC, the GNU Compiler Collection. I recommend using the Command Line Tools for Xcode, which can be installed from within Xcode using the "Downloads" pane under "Preferences". The Command Line Tools enable UNIX-style development via Terminal by installing command line developer tools (as well as the Mac OS X SDK frameworks). Note that they can also be installed without installing Xcode.
  • Homebrew: command line package manager for Mac OS X.

4/07/2012

Plotting side-by-side in ggplot2

Here's a quick example of plotting histograms next to one another in ggplot2. I wanted to plot the estimated propensity scores for treated and control units for the Lalonde non-experimental data.

First, generate the estimated propensity scores using the fitted values from a logit model where the dependent variable is the treatment indicator. The dataframe (stripped of the outcome) is called dta.nooutcome.


glm1  <- glm(TREAT~MARR+NODEGREE+BLACK+HISPANIC+EDUC+AGE+RE74+RE75+U74+U75, family=binomial, data=dta.nooutcome)
pscore1 <- glm1$fitted

Specifically, we can make two histograms for the distribution of the propensity scores corresponding to the two treatment levels.

plot1<- ggplot() + 
geom_histogram(aes(x = pscore1[dta.nooutcome$TREAT==1]), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="red") + 
scale_y_continuous(name="Count") + 
scale_x_continuous(name="Estimated Propensity Score", limits=c(0,1)) + 
theme_bw() + 
opts(title="Active Treatment Units")

plot2<-ggplot() + 
geom_histogram(aes(x = pscore1[dta.nooutcome$TREAT==0]), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="blue") + 
scale_y_continuous(name="Count") + 
scale_x_continuous(name="Estimated Propensity Score", limits=c(0,1)) + 
theme_bw() + 
opts(title="Control Treatment Units") 

To arrange the two plots one above the other, we can use the viewport system from the grid package (which the gridExtra package builds on).
grid.newpage()
pushViewport(viewport(layout=grid.layout(2,1)))
print(plot1, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plot2, vp=viewport(layout.pos.row=2, layout.pos.col=1))

Note that the number of rows and columns is passed to grid.layout(). Without explicitly specifying the width and height of each cell, the viewport layout sizes the plots automatically.
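Alternatively, gridExtra's grid.arrange() wraps this viewport bookkeeping in a single convenience call; a minimal equivalent of the layout above:

library(gridExtra)
grid.arrange(plot1, plot2, ncol = 1)   # stacks the two plots in one column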

The result is:


If we want to add numbers to the top of each bar to make it easier to see the number of units in each propensity score bucket, we can modify the existing plots as follows:

plot1a <- plot1 + 
geom_text(aes(x=ggplot_build(plot1)$data[[1]]$x, y=ggplot_build(plot1)$data[[1]]$count , label = sprintf("%2.0f", ggplot_build(plot1)$data[[1]]$count)), position = position_dodge(width = 1), size = 3.5, vjust = -0.3, colour = "grey10")
plot2a <- plot2 + 
geom_text(aes(x=ggplot_build(plot2)$data[[1]]$x, y=ggplot_build(plot2)$data[[1]]$count , label = sprintf("%2.0f", ggplot_build(plot2)$data[[1]]$count)), position = position_dodge(width = 1), size = 3.5, vjust = -0.3, colour = "grey10")

grid.newpage()
pushViewport(viewport(layout=grid.layout(2,1)))
print(plot1a, vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plot2a, vp=viewport(layout.pos.row=2, layout.pos.col=1))

Note that we extract the counts of units in each histogram bin by calling ggplot_build() and subsetting the data it returns. Alternatively, one could first put together a dataframe with the bins and the associated counts manually and pass it to the geom_bar() and geom_text() functions, as in the sketch at the end of this post.

The result is:

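As for the alternative route mentioned above, here is a rough sketch; the manual binning with hist() simply mirrors the binwidth of 0.1 used earlier and is only illustrative:

breaks <- seq(0, 1, by = 0.1)
h <- hist(pscore1[dta.nooutcome$TREAT==1], breaks = breaks, plot = FALSE)
bins.df <- data.frame(mid = h$mids, count = h$counts)
ggplot(bins.df, aes(x = mid, y = count)) + 
geom_bar(stat = "identity", alpha = .25, color = "black", fill = "red") + 
geom_text(aes(label = count), vjust = -0.3, size = 3.5) + 
scale_x_continuous(name = "Estimated Propensity Score") + 
scale_y_continuous(name = "Count") + 
theme_bw()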
Density Plots and Histograms in ggplot2

In this post, I'm going to go through how to make plots of distributions (either density plots or histograms) in ggplot2. I'm going to draw upon examples of Fisherian testing in the context of causal inference, but the examples should be completely understandable without knowledge of Fisher's approach to inference.

For example, suppose that we want to plot the randomization distribution of a test statistic (difference in means in this case) under the sharp null of zero treatment effect across all units. The possible values of the test statistic across 1000 randomly sampled randomizations are in the vector test.stat. Moreover, the actually observed test statistic is tau.hat. Not only do we want to plot the density, but we want to shade in the area under the density that corresponds to the one-sided p-value of this test statistic.
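Just to make the setup concrete, here is one way such a randomization distribution might be constructed, assuming a hypothetical data frame df with outcome Y and binary treatment indicator W (the names are illustrative, not from the original analysis):

diff.means <- function(w, y) mean(y[w == 1]) - mean(y[w == 0])
tau.hat <- diff.means(df$W, df$Y)                              # observed test statistic
test.stat <- replicate(1000, diff.means(sample(df$W), df$Y))   # permute assignments under the sharp null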

We can first turn the vector into a data frame that contains the values as x and the corresponding density as y:
dd <- with(density(test.stat),data.frame(x,y))

Then plot in ggplot:
ggplot(data = dd, mapping = aes(x = x, y = y)) +
geom_line(color="black") + layer(data = dd, mapping = aes(x=ifelse(x < tau.hat,x,tau.hat), y=y), geom = "area", geom_params=list(fill="red",alpha=.3)) + 
scale_y_continuous(limits = c(0,max(dd$y)), name="Density") + 
scale_x_continuous(name="Difference in Means") + 
geom_vline(aes(xintercept=tau.hat), color="red", linetype="dashed") + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-42, y2=0.015, text2="T[obs]"), color=I("red"), parse=TRUE) + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-60, y2=0.0025, text2="p-value = 0.041"), size=3)


Several comments on the syntax passed to the ggplot() function above:
  • The layer() function with the geom="area" option fills the area under the distribution corresponding to the one-sided p-value (probability of observing a value of the test statistic at least as small as the one we observe).
  • I manually pass the location of the text into the geom_text() function in this case by creating a data frame within the actual geom_text() function.
  • Note that in order for ggplot to write expressions with text, such as the subscript I use above, I pass the parse=TRUE option to geom_text().


Here's another example that involves plotting the distribution of p-values under the sharp null of a constant additive treatment effect for each unit. I wanted to plot the values of the test statistic on the x-axis, the p-values for the sharp null on the y-axis, and then shade the 95% Fisherian confidence interval. In this case, tau.vector contains the values of the test statistic (difference in means), and p.vector is a corresponding vector containing the one-sided exact Fisher p-value for the sharp null of a constant additive effect equal to that value. Note that I placed a dashed horizontal red line at a p-value of 0.025 to obtain approximately 95% coverage on my confidence interval.

ggplot(data = CI.2c, mapping = aes(x = tau.vector, y = p.vector)) +
geom_line(color="black") + 
scale_y_continuous(limits = c(0,max(CI.2c$p.vector)),name="p-value") +
scale_x_continuous(name="Difference in Means") + 
layer(data = CI.2c, mapping = aes(x=ifelse((lb.2c < tau.vector & tau.vector < ub.2c),tau.vector,lb.2c), y=p.vector), geom = "area", geom_params=list(fill="yellow",alpha=.3)) + 
geom_hline(aes(yintercept=0.025), color="red", linetype="dashed") + 
geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-100, y2=0.1, text2="95% CI: [-83,6.33]"), size=5)




Here's a very simple histogram example. Suppose that pval.5 contains a vector of p-values across all possible assignment vectors. We want to plot the distribution of Fisher's exact p-values, which should be uniformly distributed between 0 and 1 under the null (and generally when dealing with continuous test statistics). We can do so simply using ggplot as:

ggplot() + 
geom_histogram(aes(x = pval.5), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="red") +
scale_y_continuous(name="Count") + 
scale_x_continuous(name="p-value")

4/02/2012

Factor Orderings in ggplot2

One of the occasionally annoying features of R and thus ggplot2 is dealing with factors. In this post, I'll go through how to handle ordering of factors in ggplot2 and the manual assignment of colors to those categories.

For this example, I'm going to use data collected by Daniel Ziblatt on the European revolutions of 1848. Specifically, for each country in the data, we can code the country as having a revolution, having a constitutional concession, or having no revolution. For each country, we also know the distance from Paris (in kilometers) and the price shock in wheat in 1847. These are meant to test arguments made about the locations of the 1848 revolutions: that countries that experienced them were closer to Paris and the French Revolution (the diffusion argument) and/or that revolutions were the result of economic hardship.

An initial attempt at plotting this may look something like this:

ggplot(my.data, aes(Distance, PriceSpike, color=Revolution)) +
geom_point(shape=20, size=5) +  
geom_text(aes(label=Country), size=3, hjust=-0.15) +  
scale_x_continuous("Distance from Paris (km)") + 
scale_y_continuous("Price Shock in 1847 \n (Price of Wheat in 1847/Average Price of Wheat 1830-1845)") + 
theme_bw()  +
opts(title="Food Price Shock and Distance from Paris in the Revolutions of 1848") + 
opts(legend.position=c(.12,.9), legend.background=theme_rect(fill="white"))


The result is:


Note that the factor variable that we are plotting is Revolution and ggplot2 automatically orders it in the legend (alphabetically) and assigns it default colors. Also note that I manually positioned the legend within the plotting space.

What if we wanted "Revolution" to come first in the legend, followed by "Constitutional Concession", then "No Revolution"? Moreover, what if we wanted the color for "Revolution" to be red and the color for "No Revolution" to be green?

We can create an additional vector, Revolution.order, with one entry per observation in the my.data dataframe (converted to character so that the reassignments below work even if Revolution is stored as a factor):
Revolution.order <- as.character(my.data$Revolution)
Revolution.order[Revolution.order=="Revolution"] <- 1
Revolution.order[Revolution.order=="Constitutional Concession"] <- 2
Revolution.order[Revolution.order=="No Revolution"] <- 3
Revolution.order <- as.integer(Revolution.order)


This vector now contains the ordering of the factor variable which we can feed into ggplot through the reorder() function:

ggplot(my.data, aes(Distance, PriceSpike, color=reorder(Revolution,Revolution.order))) +
geom_point(shape=20, size=5) +  
geom_text(aes(label=Country), size=3, hjust=-0.15) + 
scale_x_continuous("Distance from Paris (km)") + 
scale_y_continuous("Price Shock in 1847 \n (Price of Wheat in 1847/Average Price of Wheat 1830-1845)") +  
theme_bw() + 
opts(title="Food Price Shock and Distance from Paris in the Revolutions of 1848") + 
scale_color_manual(name="", values=c("#D7301F", "#0570B0", "#2CA25F")) + 
opts(legend.position=c(.12,.9), legend.background=theme_rect(fill="white"))

Note that we can manually supply the colors for the factor through scale_color_manual().

The result is:

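An equivalent route that avoids the helper vector is to set the factor levels directly before plotting; a minimal sketch:

my.data$Revolution <- factor(my.data$Revolution, levels=c("Revolution", "Constitutional Concession", "No Revolution"))

ggplot2 then orders the legend by these levels, and the values supplied to scale_color_manual() are matched to the levels in the same order.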

2/25/2012

Trellis graphs in ggplot2

Trellis graphs are an informative way of visualizing relationships between variables conditional on other variable(s). In R, one can make trellis plots either through the lattice package or through ggplot2. In this post, I'll show how to make such graphs in ggplot2.

I'm still going to use Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Suppose we want to plot the relationship between GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic), conditional on the region (the Region8a variable).

The more common way to do so is using the lattice package in R:

library(lattice)
xyplot(Polity3 ~ log(GDP2000) | Region8a, data = norris.data, 
xlab = "Logarithm of GDP (PPP annual, WB)", 
ylab = "Polity Score (0=Full Authoritarianism, 20=Full Democracy)", 
scales="free", pch=19, panel = function(x, y) {
panel.xyplot(x,y,pch=19, cex=0.5, col="black")
panel.lmline(x,y,col="red", lty=2)
})

The result is:

Note that the scales="free" option sets the x- and y-limits for each region independently. panel.xyplot() plots the points and panel.lmline() fits the linear regression line for each region. Now, suppose we wanted to do this in ggplot. We are going to make use of the facet_wrap() command to create a trellis plot.

The syntax will be:
ggplot(norris.data, aes(log(GDP2000), Polity3)) +  
geom_point(shape=20, size=3) + 
facet_wrap(~Region8a, ncol=4, scales="free") + 
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") + 
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_smooth(method="lm", color="red")

The result is:


One could remove the confidence intervals by including se=FALSE within the geom_smooth() function. One could also force the regression lines to extrapolate beyond the data by including fullrange=TRUE within the geom_smooth() function. Finally, to change the color of the confidence interval from the default gray, we can specify, for example, fill="blue" within the geom_smooth() function. Note that in general, the geom_smooth() function is quite versatile and allows a variety of smoothing methods: you can specify lm, glm, gam, loess, or rlm (not an exhaustive list) for the method argument.
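For instance, one could replace the geom_smooth() layer above with either of the following (purely illustrative variations):

geom_smooth(method="lm", se=FALSE, fullrange=TRUE, color="red")
geom_smooth(method="lm", color="red", fill="blue")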

An alternative, more flexible, way to do this is to first create a dataframe using the plyr package that contains the intercepts and slopes for the linear regression for each region. So instead of automatically passing down the aesthetics on to the geom_smooth() function as in the previous example, we can manually specify the aesthetics of the lines we want to draw using the geom_abline() function. 

library(plyr)
reg.df <- ddply(norris.data, .(Region8a), function(i)
               lm(Polity3 ~ log(GDP2000), data = i)$coefficients[1:2])
names(reg.df)[2] <- "intercept"
names(reg.df)[3] <- "logGDP2000"

Then we make the trellis plot as:
ggplot(norris.data, aes(log(GDP2000), Polity3)) +
geom_point(shape=20, size=3) +
facet_wrap(~Region8a, ncol=4, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_abline(aes(intercept = intercept, slope = logGDP2000), color = "red", data = reg.df)

A slightly different example of how to create trellis plots comes from the time series version of the Pippa Norris data. Suppose we want to look at the relationship between economic development and democracy over time within each country in Central and Eastern Europe. One way to do so (as an exploratory tool at least) would be as follows:

CEE.lattice <- ggplot(norris.ts.CEE.data[norris.ts.CEE.data$Year>=1980,], aes(logGDP_WB, Polity1, color=Year)) +
geom_path(color="black") +
geom_point(shape=20, size=3) +
facet_wrap(~Nation, ncol=5, scales="free") +
scale_colour_gradient(breaks=c(1980,1990,1995,2000), low="yellow", high="blue") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
opts(title="Relationship between Econ. Development and Democracy in Central and Eastern Europe")

The result is:

Note that scale_colour_gradient(breaks=c(1980,1990,1995,2000), low="yellow", high="blue") colors the points with a gradient from yellow to blue based on the year; I manually set the breaks in the gradient. Also, note that I use geom_path() to connect the points in the order of the observations. This is different from the geom_line() command, which would connect the points in order of the x-variable, which is GDP in this case (and not time).
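The distinction is easy to see with a toy data frame whose rows are not sorted by x (illustrative only):

toy <- data.frame(x = c(1, 3, 2), y = c(1, 2, 3))
ggplot(toy, aes(x, y)) + geom_path()   # connects rows as given: (1,1) -> (3,2) -> (2,3)
ggplot(toy, aes(x, y)) + geom_line()   # sorts by x first: (1,1) -> (2,3) -> (3,2)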

Finally, and this holds across all of ggplot, if we wanted to make the background layer of the plots white instead of gray, we could simply pass the following argument to ggplot as a layer:

CEE.lattice + theme_bw()



The result is:

2/12/2012

Better graphics in R: intro to ggplot2

I believe visualization of data is extremely important and not emphasized quite enough in the social sciences. At the very least, exploratory visualization of a data set should be part of any thorough analysis, even if it doesn't make it into the final paper. On this note, I want to showcase ggplot2 as an amazing package for visualization in R. Although the start-up costs are higher than with base graphics, the final product is, in my opinion, better (another common package for visualizing multivariate data is lattice).

Let's look at the "grammar" of ggplot2. The two basic types of input to ggplot2 are:
  • Aesthetics: x position, y position, shape / size / color of elements
  • Elements / geometric shapes: points, lines, bars, text, etc.
A plot is invoked using the ggplot() command. The aesthetics are typically passed to the plotting function using the aes() function. The logic behind ggplot2 is that once the aesthetics are passed through, each type of element is plotted as a separate layer. The layering property of ggplot2 will become apparent in the examples.

There is also a quick plotting function within ggplot2 known as qplot(). It is an easier place to start because it is meant to resemble the syntax of the plot() command in R's base package. Let's start with one example of qplot() and then shift to ggplot(), since the functionality is so much richer in the latter.

I decided to draw upon Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Specifically, given that I just re-read some arguments on the connection between democracy and economic development (see Przeworski et al., 2000), I will focus on GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic). The variables are called GDP2000 and Polity3 in Norris' dataset, respectively.

We can create a simple scatterplot as follows:
qplot(log(GDP2000), Polity3, data=norris.data, 
ylab="Polity Score, 2000 (0=Authoritarian, 20=Democracy)", xlab="Log of Per Capita GDP (2000)", 
pch=19, size=2, color=I("darkblue"), alpha=.75) + opts(legend.position="none")

This syntax should be quite similar to that of plot(), the only real difference being that we use size instead of cex to control point size, use an I() around the color name to indicate that it's a user-inputted constant (as opposed to a function of the data), and finally specify alpha=.75 to create some transparency in the points. These are ggplot aesthetic names and they differ from those in plot(). Also note that we must add + opts(legend.position="none") to remove the legend that would otherwise show up to the right of the plot. The result is:


Now, we can make the exact same graph, but using ggplot():

plot2 <- ggplot(norris.data, aes(log(GDP2000), Polity3))

plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")

In this case, note how we first define plot2 as the data (norris.data) and the variables which provide the x location and y location (part of the aesthetics). Then, we call plot2 and graph the points, y-axis, and x-axis, all as separate layers. Note that we can pass layer-specific aesthetics to the plotting tools (point shape, size, and color, in this case). The result is the same as above.

We can add a regression line to the previous plot as yet another layer as follows:
plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
geom_smooth(method="lm", se=FALSE, color="red")



Suppose we want to demarcate the regions of the world in which the countries are located. We can make the color aesthetic a function of the Region8a variable in the data (a factor variable with 8 regions) as follows:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_point(shape=20, size=3) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
scale_color_discrete(name = "Regions")

Note that we specifically passed color=Region8a as an argument to the initial aes() function. scale_color_discrete(name = "Regions") was invoked to change the name of the legend.


We can add country names just below the points in the previous plot as follows (again, separate layer):
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +geom_point(shape=20, size=3) + 
geom_text(aes(label=Natmap), size=2.5, vjust=2.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")+ scale_color_discrete(name = "Regions")


If we just want country names and no points:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a))+ 
geom_text(aes(label=Natmap), size=2.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")+  
scale_color_discrete(name = "Regions")

Finally, another example of how we can make the aesthetics of the plot a function of the data is by controlling the size of the plotted elements. Let's make the size of the country labels proportional to the amount of official development aid each country received in 2002 as a proportion of its GDP (the Aid2002 variable in Norris' data). We do so by specifying size=Aid2002 in the original aesthetics:

ggplot(norris.data, aes(log(GDP2000), Polity3, size=Aid2002)) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
geom_text(aes(label=Natmap), color=I("black")) +
labs(size="2002 Aid as \n % of GDP")

2/08/2012

Python and GML/XML Parsing

For my inaugural post, I will show the power and versatility of Python in addressing a conceptually straightforward problem that was a pain to solve through other means. Specifically, I had geospatial data for Poland in GML format (essentially XML, but for spatial coordinates). However, the GML schema was missing, and an initial attempt to read the file using QGIS resulted in latitude and longitude being reversed (Poland was consequently rotated and mapped onto the Arabian Peninsula).

I attempted to solve this in a variety of ways. First, I tried to use the ogr2ogr utility in the GDAL framework to convert the GML file to an ESRI shapefile as follows:

ogr2ogr --config GML_INVERT_AXIS_ORDER_IF_LAT_LONG NO -f "ESRI Shapefile" -a_srs "EPSG:4326" gminy.shp admin_gminy.aspx.xml

Yet, it was still reversing latitude and longitude. An attempt to use FME - a proprietary software package for geospatial data conversion - also failed because it could not find the schema file. 

In the end, the simplest solution was to write a quick Python script to parse the XML tree and swap the order of the coordinates. I wish I had just started with this, since it took all of 15 minutes to write and would have saved me a lot of time.

Note that the geographic coordinates in the XML file appear within the relevant elements as:

51.5174171815126 15.725049057718 51.5173922307042 15.7006014584274 51.5158010559858 15.6992569374652 ... 

My goal was to parse these elements and exchange the order of the coordinates so that the example above would be:

15.725049057718 51.5174171815126 15.7006014584274 51.5173922307042 15.6992569374652 51.5158010559858 ... 



Here is the Python script, which touches upon how to parse XML using the ElementTree module:

# import the ElementTree module for XML parsing and os for path handling
import os
from xml.etree import ElementTree as ET

# indicate xml file location (expanduser() resolves the leading "~")
file_xml = os.path.expanduser('~/Desktop/admin_gminy.aspx.xml')

# open up file and parse xml tree using ElementTree module
tree = ET.parse(file_xml)

# if you want to iterate over all the elements of the tree and print them 
# (warning: can be really long)
for t in tree.iter():
    print t

# the elements of the tree that store geographic coordinates in this 
#case have the tag '{http://www.opengis.net/gml}posList' 
# this for loop will iterate over every element with relevant tag
for t in tree.iter('{http://www.opengis.net/gml}posList'):
    # split string into list of coordinates
    full_cord = t.text.split()

    # create new list of NAs for reversed coordinates
    new_cord = ['NA'] * len(full_cord)

    # for each coordinate if it's odd we want to move it up by one position, 
    # if it's even we want to shift it back by one position
    # this essentially reverses the order of latitude & longitude
    for i in range(len(full_cord)):
        if i%2==0:
            new_cord[i+1] = full_cord[i]
        if i%2==1:
            new_cord[i-1] = full_cord[i]

    # update xml tree element with reversed coordinates
    t.text = " ".join(new_cord)

# export new tree
file_out = os.path.expanduser('~/Desktop/admin_gminy_rev.xml')
tree.write(file_out, encoding="utf-8", xml_declaration=None, method="xml")