## 2/25/2012

### Trellis graphs in ggplot2

Trellis graphs are an informative way of visualizing relationships between variables conditional on other variable(s). In R, one can make trellis plots either through the `lattice` package, or `ggplot2`. In this post, I'll show how to make such graphs in `ggplot2`.

I'm still going to use the Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Suppose we want to plot the relationship between GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic), conditional upon the region (Region8a variable).

The more common way to do so is using the `lattice` package in R:

```library(lattice)
xyplot(Polity3 ~ log(GDP2000) | Region8a, data = norris.data,
xlab = "Logarithm of GDP (PPP annual, WB)",
ylab = "Polity Score (0=Full Authoritarianism, 20=Full Democracy)",
scales="free", pch=19, panel = function(x, y) {
panel.xyplot(x,y,pch=19, cex=0.5, col="black")
panel.lmline(x,y,col="red", lty=2)
})
```

The result is:

Note that the `scales="free"` option fits `xlim` and `ylim` for each region independently.  `panel.xyplot()` plots the points and  `panel.lmline()` fits the linear regression line for each region. Now, suppose we wanted to do this in `ggplot`. We are going to make use of the ` facet_wrap() ` command to create a trellis plot.

The syntax will be:
```ggplot(norris.data, aes(log(GDP2000), Polity3)) +
geom_point(shape=20, size=3) +
facet_wrap(~Region8a, ncol=4, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_smooth(method="lm", color="red")
```

The result is:

One could remove the confidence intervals by including `se=FALSE` within the `geom_smooth()` function. One could also force the regression lines to extrapolate beyond the data by including `fullrange=TRUE` within the `geom_smooth()` function. Finally, to change the color of the confidence interval from the default gray, we can specify including `fill="blue"`, for example, within the `geom_smooth()` function. Note that in general, the `geom_smooth()` function is quite versatile and allows a variety of smoothing methods: you can specify `lm`, `glm`, `gam`, `loess`, or `rlm` (not exhaustive list) for the `method` argument.

An alternative, more flexible, way to do this is to first create a dataframe using the `plyr` package that contains the intercepts and slopes for the linear regression for each region. So instead of automatically passing down the aesthetics on to the `geom_smooth()` function as in the previous example, we can manually specify the aesthetics of the lines we want to draw using the `geom_abline()` function.

```library(plyr)
reg.df <- ddply(norris.data, .(Region8a), function(i)
lm(Polity3 ~ log(GDP2000), data = i)\$coefficients[1:2])
names(reg.df)[2] <- "intercept"
names(reg.df)[3] <- "logGDP2000"
```

Then we make the trellis plot as:
```ggplot(norris.data, aes(log(GDP2000), Polity3)) +
geom_point(shape=20, size=3) +
facet_wrap(~Region8a, ncol=4, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
geom_abline(aes(intercept = intercept, slope = logGDP2000), color = "red", data = reg.df)
```

A slightly different example of how to create trellis plots comes from the time series version of the Pippa Norris data. Suppose we want to look at the relationship between economic development and democracy over time just within easy country Central and Eastern Europe. One way to do so (as an exploratory tool at least) would be as follows:

```CEE.lattice <- ggplot(norris.ts.CEE.data[norris.ts.CEE.data\$Year>=1980,], aes(logGDP_WB, Polity1, color=Year)) +
geom_path(color="black") +
geom_point(shape=20, size=3) +
facet_wrap(~Nation, ncol=5, scales="free") +
scale_x_continuous("Logarithm of GDP (PPP annual, WB)") +
scale_y_continuous("Polity Score (0=Full Authoritarianism, 20=Full Democracy)") +
opts(title="Relationship between Econ. Development and Democracy in Central and Eastern Europe")
```

The result is:

Note that ` scale_colour_gradient(breaks=c(1980,1990,1995,2000), low="yellow", high="blue") ` colors the points with a gradient from yellow to blue based on the year. I manually set the breaks in the gradient. Also, note that I use ` geom_path() ` to connect the points using the order of the observations. This is different than the ` geom_line() ` command, which would connect the points in order of the x-variable, which is GDP in this case (and not time).

Finally, and this holds across all of `ggplot`, if we wanted to make the background layer of the plots white instead of gray, we could simply pass the following argument to `ggplot` as a layer:

```CEE.lattice + theme_bw()
```

The result is:

## 2/12/2012

### Better graphics in R: intro to ggplot2

I believe visualization of data is extremely important and not emphasized quite enough in the social sciences. At the very least, exploratory visualization of a data set should be a part of any thorough analysis, even if it doesn't make it into the final paper. On this note, I want to showcase the use of `ggplot2` as an amazing package for visualization in R. Although start-up costs are higher than using the base package for plots, the final product is better in my opinion (another common package for visualization of multivariate data is `lattice`).

Let's look at the "grammar" of `ggplot2`. The 2 basic types of input into `ggplot2` are:
• Aesthetics: x position, y position, shape / size / color of elements
• Elements / geometric shapes: points, lines, bars, text, etc.
A ggplot is invoked using the `ggplot2()` command. The aesthetics are typically passed to the plotting function using the `aes()` function. The logic behind `ggplot2` is that once the aesthetics are passed through, each type of element is plotted as a separate layer. The layering property of `ggplot2` will become apparent in the examples.

There is also a quick plotting function withing `ggplot2` known as `qplot()`. It is an easier place to start because it is meant to resemble the syntax of the `plot()` command within the base package in R. Let's start with one example of `qplot()` and then shift to `ggplot()` since the functionality is so much richer in the latter.

I decided to draw upon Pippa Norris' dataset "Democracy Crossnational Data" for this example. It has a variety of political, economic, and social variables for 191 countries. Specifically, given that I just re-read some arguments on the connection between democracy and economic development (see Przeworski et al., 2000), I will focus on GDP per capita in 2000 as the independent variable (transformed via natural logarithm) and the Polity score of democracy in 2000 as the dependent variable (0=most authoritarian, 20=most democratic). The variables are called GDP2000 and Polity3 in Norris' dataset, respectively.

We can create a simple scatterplot as follows:
```qplot(log(GDP2000), Polity3, data=norris.data,
ylab="Polity Score, 2000 (0=Authoritarian, 20=Democracy)", xlab="Log of Per Capita GDP (2000)",
pch=19, size=2, color=I("darkblue"), alpha=.75) + opts(legend.position="none")
```

This syntax should be quite similar to that of `plot()`, the only real difference being that we use `size` instead of `cex` to control point size, use an `I()` around the color name to indicate that it's a user-inputted constant (as opposed to a function of the data), and finally specify `alpha=.75` to create some transparency in the points. These are `ggplot` aesthetic names and they differ from those in `plot()`. Also note that we must add ` + opts(legend.position="none") ` to remove the legend that would otherwise show up to the right of the plot. The result is:

Now, we can make the exact same graph, but using `ggplot()`:

```plot2 <- ggplot(norris.data, aes(log(GDP2000), Polity3))

plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)")
```

In this case, note how we first define `plot2` as the data (norris.data) and the variables which provide the x location and y location (part of the aesthetics). Then, we call `plot2` and graph the points, y-axis, and x-axis, all as separate layers. Note that we can pass layer-specific aesthetics to the plotting tools (point shape, size, and color, in this case). The result is the same as above.

We can add a regression line to the previous plot as yet another layer as follows:
```plot2 + geom_point(shape=20, size=4, color="darkblue", alpha=.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)")
+ scale_x_continuous("Log of Per Capita GDP (2000)") +
geom_abline(color="red")
```

Suppose we want to demarcate the regions in the world the countries are located in. We can make the color aesthetic a function of the Region8a variable in the data (factor variable with 8 regions) as follows:
```ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +
geom_point(shape=20, size=3) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)") +
scale_color_discrete(name = "Regions")
```

Note that specifically we passed `color=Region8a` as an argument to the initial `aes()` function. `scale_color_discrete(name = "Regions"` was invoked to change the name of the legend.

We can add country names just below the points in the previous plot as follows (again, separate layer):
```ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a)) +geom_point(shape=20, size=3) +
geom_text(aes(label=Natmap), size=2.5, vjust=2.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)")+ scale_color_discrete(name = "Regions")```

If we just want country names and no points:
```ggplot(norris.data, aes(log(GDP2000), Polity3, color=Region8a))+
geom_text(aes(label=Natmap), size=2.5) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)")+
scale_color_discrete(name = "Regions")
```

Finally, another example of how we can make aesthetics of the plot a function of the data is by controlling the size of the plotted elements. Let's make the size of the points proportion to the amount of official development aid the countries received in 2002 as a proportion of their GDP (Aid2002 variable in Norris' data). We do so by specifying `size=Aid2002` in the original aesthetics:

```ggplot(norris.data, aes(log(GDP2000), Polity3, size=Aid2002)) +
scale_y_continuous("Polity Score, 2000 (0=Authoritarian, 20=Democracy)") +
scale_x_continuous("Log of Per Capita GDP (2000)") +
geom_text(aes(label=Natmap), color=I("black")) +
labs(size="2002 Aid as \n % of GDP")```

## 2/08/2012

### Python and GML/XML Parsing

For my inaugural post, I will show the power and versatility of Python to address a conceptually straightforward problem that was a pain to solve through other means. Specifically, I had geospatial data for Poland in GML format (essentially XML but for spatial coordinates). However, the GML schema was missing and an initial attempt to read the file using QGIS resulted in latitude and longitude being reversed (and Poland was consequently rotated and mapped onto the Arab peninsula).

I attempted to solve this a variety of ways: I tried to use the ogr2ogr function in the GDAL framework to convert the GML file to an ESRI shapefile as follows:

```ogr2ogr --config GML_INVERT_AXIS_ORDER_IF_LAT_LONG NO -f "ESRI Shapefile" -a_srs "EPSG:4326" gminy.shp admin_gminy.aspx.xml
```

Yet, it was still reversing latitude and longitude. An attempt to use FME - a proprietary software package for geospatial data conversion - also failed because it could not find the schema file.

In the end, the quickest solution was to write a quick Python script to parse the XML tree and exchange the order of the coordinates. I wish I had just started with this since it took all of 15 minutes to write and would have saved me a lot of time.

Note that the geographic coordinates in the xml format appear within relevant elements as:

```51.5174171815126 15.725049057718 51.5173922307042 15.7006014584274 51.5158010559858 15.6992569374652 ...
```

My goal was to parse these elements and exchange the order of the coordinates so that the example above would be:

```15.725049057718 51.5174171815126 15.7006014584274 51.5173922307042 15.6992569374652 51.5158010559858 ...
```

Here is the Python script, which touches upon how to parse XML using the Element Tree module:

```# import ElementTree module
from xml.etree import ElementTree as ET

# indicate xml file location

# open up file and parse xml tree using ElementTree module
tree = ET.parse(file_xml)

# if you want to iterate over all the elements of the tree and print them
# (warning: can be really long)
for t in tree.iter():
print t

# the elements of the tree that store geographic coordinates in this
#case have the tag '{http://www.opengis.net/gml}posList'
# this for loop will iterate over every element with relevant tag
for t in tree.iter('{http://www.opengis.net/gml}posList'):
# split string into list of coordinates
full_cord = t.text.split()

# create new list of NAs for reversed coordinates
new_cord = ['NA'] * len(full_cord)

# for each coordinate if it's odd we want to move it up by one position,
# if it's even we want to shift it back by one position
# this essentially reverses the order of latitude & longitude
for i in range(len(full_cord)):
if i%2==0:
new_cord[i+1] = full_cord[i]
if i%2==1:
new_cord[i-1] = full_cord[i]

# update xml tree element with reversed coordinates
t.text = " ".join(new_cord)

# export new tree