2/25/2012

Trellis graphs in ggplot2

Trellis graphs are an informative way of visualizing relationships between variables conditional on other variable(s). In R, one can make trellis plots either through the lattice package, or ggplot2. In this post, I'll show how to make such graphs in ggplot2.

I'm still going to use the Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Suppose we want to plot the relationship between GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic), conditional upon the region (Region8a variable).

The more common way to do so is using the lattice package in R:

library(lattice)
xyplot(Polity3 ~ log(GDP2000) | Region8a, data = norris.data, 
xlab = "Logarithm of GDP (PPP annual, WB)", 
ylab = "Polity Score (0=Full Authoritarianism, 20=Full Democracy)", 
scales="free", pch=19, panel = function(x, y) {
panel.xyplot(x,y,pch=19, cex=0.5, col="black")
panel.lmline(x,y,col="red", lty=2)
})

The result is:

Note that the scales="free" option fits xlim and ylim for each region independently.  panel.xyplot() plots the points and  panel.lmline() fits the linear regression line for each region. Now, suppose we wanted to do this in ggplot. We are going to make use of the facet_wrap() command to create a trellis plot.

The syntax will be:
ggplot(norris.data, aes(log(GDP2000), Polity3)) +  
geom_point(shape=20, size=3) + 
facet_wrap(~Region8a, ncol=4, scales="free") + 
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") + 
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_smooth(method="lm", color="red")

The result is:


One could remove the confidence intervals by including se=FALSE within the geom_smooth() function. One could also force the regression lines to extrapolate beyond the data by including fullrange=TRUE within the geom_smooth() function. Finally, to change the color of the confidence interval from the default gray, we can specify including fill="blue", for example, within the geom_smooth() function. Note that in general, the geom_smooth() function is quite versatile and allows a variety of smoothing methods: you can specify lm, glm, gam, loess, or rlm (not exhaustive list) for the method argument.

An alternative, more flexible, way to do this is to first create a dataframe using the plyr package that contains the intercepts and slopes for the linear regression for each region. So instead of automatically passing down the aesthetics on to the geom_smooth() function as in the previous example, we can manually specify the aesthetics of the lines we want to draw using the geom_abline() function. 

library(plyr)
reg.df <- ddply(norris.data, .(Region8a), function(i)
               lm(Polity3 ~ log(GDP2000), data = i)$coefficients[1:2])
names(reg.df)[2] <- "intercept"
names(reg.df)[3] <- "logGDP2000"

Then we make the trellis plot as:
ggplot(norris.data, aes(log(GDP2000), Polity3)) +
geom_point(shape=20, size=3) +
facet_wrap(~Region8a, ncol=4, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_abline(aes(intercept = intercept, slope = logGDP2000), color = "red", data = reg.df)

A slightly different example of how to create trellis plots comes from the time series version of the Pippa Norris data. Suppose we want to look at the relationship between economic development and democracy over time just within easy country Central and Eastern Europe. One way to do so (as an exploratory tool at least) would be as follows:

CEE.lattice <- ggplot(norris.ts.CEE.data[norris.ts.CEE.data$Year>=1980,], aes(logGDP_WB, Polity1, color=Year)) +
geom_path(color="black") +
geom_point(shape=20, size=3) +
facet_wrap(~Nation, ncol=5, scales="free") +
scale_colour_gradient(breaks=c(1980,1990,1995,2000), low="yellow", high="blue") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
opts(title="Relationship between Econ. Development and Democracy in Central and Eastern Europe")

The result is:

Note that scale_colour_gradient(breaks=c(1980,1990,1995,2000), low="yellow", high="blue") colors the points with a gradient from yellow to blue based on the year. I manually set the breaks in the gradient. Also, note that I use geom_path() to connect the points using the order of the observations. This is different than the geom_line() command, which would connect the points in order of the x-variable, which is GDP in this case (and not time).

Finally, and this holds across all of ggplot, if we wanted to make the background layer of the plots white instead of gray, we could simply pass the following argument to ggplot as a layer:

CEE.lattice + theme_bw()



The result is:

2/12/2012

Better graphics in R: intro to ggplot2

I believe visualization of data is extremely important and not emphasized quite enough in the social sciences. At the very least, exploratory visualization of a data set should be a part of any thorough analysis, even if it doesn't make it into the final paper. On this note, I want to showcase the use of ggplot2 as an amazing package for visualization in R. Although start-up costs are higher than using the base package for plots, the final product is better in my opinion (another common package for visualization of multivariate data is lattice).

Let's look at the "grammar" of ggplot2. The 2 basic types of input into ggplot2 are:
  • Aesthetics: x position, y position, shape / size / color of elements
  • Elements / geometric shapes: points, lines, bars, text, etc.
A ggplot is invoked using the ggplot2() command. The aesthetics are typically passed to the plotting function using the aes() function. The logic behind ggplot2 is that once the aesthetics are passed through, each type of element is plotted as a separate layer. The layering property of ggplot2 will become apparent in the examples.

There is also a quick plotting function withing ggplot2 known as qplot(). It is an easier place to start because it is meant to resemble the syntax of the plot() command within the base package in R. Let's start with one example of qplot() and then shift to ggplot() since the functionality is so much richer in the latter.

I decided to draw upon Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Specifically, given that I just re-read some arguments on the connection between democracy and economic development (see Przeworski et al., 2000), I will focus on GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic). The variables are called GDP2000 and Polity3 in Norris' dataset, respectively.

We can create a simple scatterplot as follows:
qplot(log(GDP2000), Polity3, data=norris.data, 
ylab="Polity Score, 2000 (0=Authoritarian, 20=Democracy)", xlab="Log of Per Capita GDP (2000)", 
pch=19, size=2, color=I("darkblue"), alpha=.75) + opts(legend.position="none")

This syntax should be quite similar to that of plot(), the only real difference being that we use size instead of cex to control point size, use an I() around the color name to indicate that it's a user-inputted constant (as opposed to a function of the data), and finally specify alpha=.75 to create some transparency in the points. These are ggplot aesthetic names and they differ from those in plot(). Also note that we must add + opts(legend.position="none") to remove the legend that would otherwise show up to the right of the plot. The result is:


Now, we can make the exact same graph, but using ggplot():

plot2 <- ggplot(norris.data, aes(log(GDP2000), Polity3))

plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")

In this case, note how we first define plot2 as the data (norris.data) and the variables which provide the x location and y location (part of the aesthetics). Then, we call plot2 and graph the points, y-axis, and x-axis, all as separate layers. Note that we can pass layer-specific aesthetics to the plotting tools (point shape, size, and color, in this case). The result is the same as above.

We can add a regression line to the previous plot as yet another layer as follows:
plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)")
 + scale_x_continuous("Log of Per Capita GDP (2000)") + 
geom_abline(color="red")



Suppose we want to demarcate the regions in the world the countries are located in. We can make the color aesthetic a function of the Region8a variable in the data (factor variable with 8 regions) as follows:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_point(shape=20, size=3) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
scale_color_discrete(name = "Regions")

Note that specifically we passed color=Region8a as an argument to the initial aes() function. scale_color_discrete(name = "Regions" was invoked to change the name of the legend.


We can add country names just below the points in the previous plot as follows (again, separate layer):
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +geom_point(shape=20, size=3) + 
geom_text(aes(label=Natmap), size=2.5, vjust=2.5) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")+ scale_color_discrete(name = "Regions")


If we just want country names and no points:
ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a))+ 
geom_text(aes(label=Natmap), size=2.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)")+  
scale_color_discrete(name = "Regions")

Finally, another example of how we can make aesthetics of the plot a function of the data is by controlling the size of the plotted elements. Let's make the size of the points proportion to the amount of official development aid the countries received in 2002 as a proportion of their GDP (Aid2002 variable in Norris' data). We do so by specifying size=Aid2002 in the original aesthetics:

ggplot(norris.data, aes(log(GDP2000), Polity3, size=Aid2002)) + 
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") + 
scale_x_continuous("Log of Per Capita GDP (2000)") + 
geom_text(aes(label=Natmap), color=I("black")) +
labs(size="2002 Aid as \n % of GDP")

2/08/2012

Python and GML/XML Parsing

For my inaugural post, I will show the power and versatility of Python to address a conceptually straightforward problem that was a pain to solve through other means. Specifically, I had geospatial data for Poland in GML format (essentially XML but for spatial coordinates). However, the GML schema was missing and an initial attempt to read the file using QGIS resulted in latitude and longitude being reversed (and Poland was consequently rotated and mapped onto the Arab peninsula).

I attempted to solve this a variety of ways: I tried to use the ogr2ogr function in the GDAL framework to convert the GML file to an ESRI shapefile as follows:

ogr2ogr --config GML_INVERT_AXIS_ORDER_IF_LAT_LONG NO -f "ESRI Shapefile" -a_srs "EPSG:4326" gminy.shp admin_gminy.aspx.xml

Yet, it was still reversing latitude and longitude. An attempt to use FME - a proprietary software package for geospatial data conversion - also failed because it could not find the schema file. 

In the end, the quickest solution was to write a quick Python script to parse the XML tree and exchange the order of the coordinates. I wish I had just started with this since it took all of 15 minutes to write and would have saved me a lot of time.

Note that the geographic coordinates in the xml format appear within relevant elements as:

51.5174171815126 15.725049057718 51.5173922307042 15.7006014584274 51.5158010559858 15.6992569374652 ... 

My goal was to parse these elements and exchange the order of the coordinates so that the example above would be:

15.725049057718 51.5174171815126 15.7006014584274 51.5173922307042 15.6992569374652 51.5158010559858 ... 



Here is the Python script, which touches upon how to parse XML using the Element Tree module:

# import ElementTree module
from xml.etree import ElementTree as ET

# indicate xml file location
file_xml = '~/Desktop/admin_gminy.aspx.xml'

# open up file and parse xml tree using ElementTree module
tree = ET.parse(file_xml)

# if you want to iterate over all the elements of the tree and print them 
# (warning: can be really long)
for t in tree.iter():
    print t

# the elements of the tree that store geographic coordinates in this 
#case have the tag '{http://www.opengis.net/gml}posList' 
# this for loop will iterate over every element with relevant tag
for t in tree.iter('{http://www.opengis.net/gml}posList'):
    # split string into list of coordinates
    full_cord = t.text.split()

    # create new list of NAs for reversed coordinates
    new_cord = ['NA'] * len(full_cord)

    # for each coordinate if it's odd we want to move it up by one position, 
    # if it's even we want to shift it back by one position
    # this essentially reverses the order of latitude & longitude
    for i in range(len(full_cord)):
        if i%2==0:
            new_cord[i+1] = full_cord[i]
        if i%2==1:
            new_cord[i-1] = full_cord[i]

    # update xml tree element with reversed coordinates
    t.text = " ".join(new_cord)

# export new tree
file_out =  '~/Desktop/admin_gminy_rev.xml'
tree.write(file_out, encoding="utf-8", xml_declaration=None, method="xml")