Here’s a quick post on how to enable JIT (“just-in-time”) compiling in R. It’s very simple, and detailed in much greater depth in this excellent post by Tal Galili.

So first: why do we want to use a JIT compiler? Put simply, it will almost certainly speed up your code if you’re doing anything fairly strenuous. JIT compiling essentially compiles your R code into byte code just in time to use it. This will slow down the processing of the code for the first iteration (because there’s another step between telling R to execute the code and when R actually executes it), but speed it up for all following iterations. When you consider that functions you write yourself (as well as third party packages) can all be compiled, this will generally result in fairly significant performance increases. It’s also very simple. If you want to see the sort of speed increases you should expect, write a function whose speed you would like to measure:

f <- function() {
cf <- cmpfun(f)

That is, I have some function that will be repeatedly executed n times. You’ll want to make n large enough that it takes some fairly significant chunk of time – like maybe 20 seconds or so. You will very likely see an improvement through compilation. The first replicate will be faster when uncompiled, but every subsequent iteration will tend to be faster through compilation. If you’re using a lot of external functions in this test function, don’t expect miraculous speed increases. This is something else that will need to be fiddled with. I believe that the compiler function is now a part of base R, but you may need to install it. I don’t remember, offhand. You can find examples detailed in the compiler documentation.

To enable JIT compilation, it suffices to stick a couple of variables in your R environment (it’s a file called .Renviron and it should be in your home directory – that is, the directory in which R starts up by default). If you don’t have this file, you should create it. Then add the following two lines:


The first line, logically enough, directs R to compile external packages. The second line directs R to enable to JIT compiler. The 3 indicates the level of JIT compiling. From the R documentation:

JIT is disabled if the argument is 0. If enable is 1 then closures are compiled before their first use. If enable is 2, then in addition closures are also compiled before they are duplicated (useful for some packages, like lattice, that store closures in lists). If enable is 3 then in addition all loops are compiled before they are executed.

Basically, 3 is the level that does full JIT compilation. It’s also possible to enable these options from within an active R session (or should be), but as I recall, this didn’t work as I expected, somehow. It may be that packages loaded prior to enabling will not be compiled (for instance, base R functions – you know, the ones you probably use the most). In any event, this was the solution I liked best.

You’ll need to restart R for these changes to take effect.

I have just published a new version of `rdd’ to CRAN.

The main change is that multiple bandwidths may be more naturally estimated by passing a vector to the argument bw in the function RDestimate. If only a single bandwidth is passed to RDestimate (or none, allowing for the default Imbens-Kalyanaraman bandwidth calculation) then both half and double the initial bandwidth will be used for estimation. That is, if the call is RDestimate(Y~X,bw=5), then the RD will be estimated at bandwidths 5, 2.5, and 10. Robustness to varying bandwidths (and particularly to smaller bandwidths) should be a hallmark of a strong RDD, so I’m trying to make this a little more central to the output from the estimation commands.

Likewise, this means that certain other outputs now work differently. Most outputs are named the same, but they now return a vector of values, one for each bandwidth. So in the previous example, the est value of the object returned would be a vector of three values – the estimated discontinuity when estimating in bandwidths of 5, 2.5 and 10, respectively. For the model objects (if requested), they now reside in a list with the number of elements equal to the number of bandwidths used in estimation.

The summary method has been simultaneously reworked to allow for multiple bandwidths. If you are estimating a very large number of bandwidths, you should probably avoid using the summary command, as it must load in the model objects for every bandwidth used. In a future version, I will abstract farther away from the ivreg and lm objects, and provide all the necessary summary and fit statistics directly from the RD object. This should speed up the execution of the summary command, which is rather slow at the moment.

The next version will include a better summary function which does not require the model objects to be present in the RD object, as well as improved plotting methods for dealing with multiple bandwidths. I will also improve the workings of summary by providing a print method for it, rather than printing manually from within the summary.RD function (you’ll notice some ugly unintended output if you try to save the summary output, as in s.rd<-summary(rd) or similar).

I am also working on integrating randomization inference based methods (as well as window selection procedures) courtesy of Rocio Titiunik.

I will update the examples to reflect the changes to the package within the next few days.

I have recently completed a working beta of an R package for the regression discontinuity design with the help of Cyrus Samii. Moreover, it has been accepted to CRAN, so you can download it and mess around. I’ll briefly introduce the theory, and then continue with the some examples (there are also examples available as part of the package documentation, along with more full explanations of the available options). A more rigorous introduction to this software is currently being prepared for publication.

Most simply, rdd is a package for non-parametric estimation of sharp or fuzzy regression discontinuity designs. It works nearly identically to Austin Nichols’ `rd’ function for Stata. There are a number of things which should be noted, however.

Specification of the formula is a bit idiosyncratic, so be sure to look at the documentation. It is essentially specified as Outcome ~ Running_variable + Treatment_variable | Covariate_1 + Covariate_2, omitting terms that are unnecessary as the case may be.

Bandwidth can either be specified or it will be calculated automatically using the Imbens-Kalyanaraman optimal bandwidth procedure. The kernel used automatically is the triangular kernel, although other options are available. The cutpoint is assumed to be zero if nothing else is specified.

Justin McCrary’s routine for testing the density of the running variable for a discontinuity is also included (including a plot thereof).

Covariates are handled differently than in the Stata routine. Unlike Nichols, I chose to interact the covariates with treatment (in particular, a logical indicating above or below the cutpoint). This follows from some of the latest work done on the RD design. Interacting treatment with the covariates is likely to improve efficiency of estimation, and it requires no additional assumptions that aren’t already being made regarding smoothness of covariates across the cutpoint. If there is an outcry on this point, I am fully willing to make this choice an optional parameter. Although I feel that its likely that a strong design should be robust to this sort of thing.

There are many options for standard errors beyond the default Huber-White heteroskedastic robust ones. Most importantly, cluster robust SEs are included as an option (such as for use with a discrete running variable). Subsetting is handled, as are missing values. Internal model frames may be returned if desired by an option.

A useful addition is the plotting routine available in this package, which should automate the rather cumbersome process of plotting the discontinuity, particular in the case of dichotomous variables.

The workhorse function for estimation is RDestimate. Density tests can be performed with DCdensity. Methods are provided for summary and plot (for the RD object returned from RDestimate).

Further development will look to add non-parametric covariate adjustment as per Markus Frölich. This is likely still several months away, however. I am also open to any other additional features or improvements that might be desired. It is also possible that there may exist some bugs. Please report those to me, as well (that’s why it’s still labelled a beta)! Also, feel free to ask for clarification if the documentation doesn’t make something sufficiently clear.

Example analysis is available below the fold.

While working on a panel matching project, I came upon a portion of my code that was taking inordinate amounts of time. This post is about a pretty simple change that resulted in, for me, a gigantic performance increase. Initially, my code took about 100 minutes to run this one small section of code. This one simple change resulted in a runtime of around 100 seconds. Here’s basically what the code looked like:

lagit<-function(x) {
  var<-ts(x$var, start=x$year[1])

On its face, this code looks like it makes sense. I want to create a bunch of lagged variables in panel data, so I create the appropriate data frame for each dyad’s observations, and then combine them. Unfortunately, despite the many strengths of dataframes, they have a lot of overhead. If we simply move the data.frame to the outside of the “loop” (such as it is), then we make the massive performance gains. In other words, we combine all the panels and then create the data frame. This sort of makes more sense in the first place, so I’m not sure why I initially wrote it the other way. The following code is all that I changed to get the large speed increase I discussed at the beginning of this post:

lagit<-function(x) {
  var<-ts(x$var, start=x$year[1])

It wasn’t surprising that this sped up the code, but the_ extent_ to which it sped up my code was very startling. So my advice is to always be sure you mess with data frames as little as possible. They create a lot of overhead if you’re constantly converting things back and forth.

As of late I have been putting together some packages in R. There are a lot of great tools out there that make this easier (some of which, like Hadley’s devtools, I haven’t learned yet). I will provide a brief introduction to what I’ve been using so far to create packages. All of this is subject to change as I discover better tools (and fix any mistakes that may exist in this rundown).

The first thing to note are some of the broad strokes of what we need accomplished. As we know, R commands have handy documentation accessible via ?.  We need to create this documentation formatted in the proper way so that it has the same style as other R commands. We also need to ensure that all the proper dependencies are loaded. Likewise, we want to make sure that all the functions are accessible to the user that should be (and all those that shouldn’t be aren’t). We would like as much of this to be automatic as possible, too, and to allow for easy updating in the case of bugs or additions.

I’ll get into some of the zanier elements later, but to begin with, let’s talk setup. To begin with, you should have each function in a separate R file which corresponds to the name of the function. All of the files for the functions which you want to put into your new package should be in a new directory (and for simplicity, this should probably be a folder specifically for your package). I will assume that this is the case, and that your working directory is set (setwd) to this folder. The next thing we want to do is create the documentation for each function.

So this is the long and boring part where you write out all the documentation for each and every function that is accessible to the user of your package. One does this by using roxygen2, which is a wonderful little package. Essentially, R documentation exists in files with an .Rd extension. Rather than edit these by hand, we just throw in some markup to the header of each of our files, and roxygen handles the rest. It’s very nifty. The markup is very simple. I used a couple different sources of documentation on this markup. Primarily, I looked at the function documentation available here, a vignette available here, and from Hadley’s really wonderful book that is still in progress, available here. I will copy an example header from Hadley’s page for reference:

#' Order a data frame by its columns.
#' This function completes the subsetting, transforming and ordering triad
#' with a function that works in a similar way to \code{\link{subset}} and 
#' \code{\link{transform}} but for reordering a data frame by its columns.
#' This saves a lot of typing!
#' @param df data frame to reorder
#' @param ... expressions evaluated in the context of \code{df} and 
#'   then fed to \code{\link{order}}
#' @keywords manip
#' @export
#' @examples
#' mtcars[with(mtcars, order(cyl, disp)), ]
#' arrange(mtcars, cyl, disp)
#' arrange(mtcars, cyl, desc(disp))
arrange <- function(df, ...) {
  ord <- eval(substitute(order(...)), df, parent.frame())
  unrowname(df[ord, ])

That is, the first line is the title and the next little paragraph is a longer description. The param lines document the individual parameters of the function. The examples are working examples of how to make the function work. The export line should be included if you want the user to be able to use the function. If you don’t export it, the function will only be available for internal use. Some other possible tags include return, seealso, author, and various other options. The other option of note is include which will show that there exists a dependency on a particular package (or function from a package) in order for your function to work. This is obviously pretty important. While one could rig up some code for dependency checking, that is not the preferred mode of operation.

Assuming that the documentation is all written up properly, the next step is to create the package! One could do this all from inside the current R session, but I’ve found that this has some weird behavior associated with it, so we’ll instead do this with a batch script. If you’re on Linux, it’s very easy to cook up shell scripts that will do the job, but Windows presents a bit more trouble. I don’t particularly enjoy mucking around on the command line in Windows, so I wanted to create a simple batch file that wouldn’t require me to do so. I eventually worked out the following solution.

First off, we need to create a new R file. I called it roxy.R (for roxygen). This script will update our package’s source and documentation. It looks like the following:


This for a package called “mypackage”. For some reason, it is apparently necessary to load the methods and utils packages which are present in base-R. When calling Rscript from the command line, they aren’t loaded properly for some reason. I should also point out that there is an additional file here called “mypackage-package.R” which you have not yet created. This is a dummy file to provide package-level documentation. Here is what you might want to put in this file:

#' Title of your package
#' Longer description of your package which might
#' even take multiple lines.
#' @name mypackage-package
#' @aliases mypackage
#' @docType package
#' @author Your Name \email{}

Note the NULL at the bottom of the file. This is because roxygen requires that there be something in the file that isn’t just a comment. This just creates a package level ?mypackage command (from the aliases line). You might want to include things like a listing of all the provided functions, and a general overview of the purpose of the package.

The next part is to actually create the batch script which will be building your package. First I will include the full script I’m using, and then I will explain what all the lines do.

set PKG_VER=0.1
set PKG_VER_NOTE=First version. Include as long a description as you like, although linebreaks do not do well.

set R_SCRIPT="C:\Program Files\R\R-2.15.0\bin\Rscript.exe"
set R_TERM="C:\Program Files\R\R-2.15.0\bin\R.exe"
rm -rf mypackage/R mypackage/man
%R_SCRIPT% roxy.R

sed -i 's/\(Title: \).*/\1 Package Title/g' mypackage/DESCRIPTION
sed -i 's/\(Version: \).*/\1 %PKG_VER%/g' mypackage/DESCRIPTION
sed -i 's/\(Description:\).*/\1 Package description with lines broken up by \\n after which you should include a space./g' mypackage/DESCRIPTION
sed -i 's/\(Author:\).*/\1 Your Name/g' mypackage/DESCRIPTION
sed -i 's/\(Maintainer:\).*/\1 Your Name ^<^>/g' mypackage/DESCRIPTION
sed -i 's/\(License:\).*/\1 License name (^>=2.0) (See LICENSE)/g' mypackage/DESCRIPTION
sed -i 's/\(Collate:\).*/Depends: R (^>= 2.15.0), dependency1, dependency2\\n\1/g' mypackage/DESCRIPTION

echo --------------- >> mypackage\VERSION
echo Version: %PKG_VER% >> mypackage\VERSION
echo ----- >> mypackage\VERSION
echo Changes: %PKG_VER_NOTE% >> mypackage\VERSION
echo --------------- >> mypackage\VERSION

%R_TERM% CMD INSTALL --build mypackage

So there it is. Save this to a file with the extension .bat. For all of this to work, you need Rtools installed, as well as cygwin. This gives us more power on the command line by making UNIX-like commands work (sed, for instance). You also need these to be in your PATH. I’m not sure if these installations will automatically do this. If not, here is a pretty good explanation of how to set these variables for different versions of Windows (it’s specifically for Java, but you just need to add the appropriate directories). You need cygwin in your PATH, as well as Rtools. So you would just add “;c:/Rtools;c:/Rtools/bin” or whatnot on to the end of the string.

The first four lines just set some variables for later use. the first two are version specific. Here you can edit what the new version number of your package is, along with a brief description of changes. This will automatically be written to a changelog stored in VERSION (as seen in the lines towards the bottom that start with echo.  The other two variables should be changed to reflect the location of your R installation. I think I installed R with the automatic location settings, so you may not need to change this. You will, however, need to update this when you update your R version.

The rm bit removes the old versions of the documentation and source from the directory that the package will be built from. This seems like it shouldn’t be necessary due to the “force” switch in package.skeleton (in roxy.R), but it seems to be needed in my case, at least. The line that begins with %R_SCRIPT% simply runs the R script we created earlier to create and roxygenize our package.

All of the lines that begin with sed edit the DESCRIPTION file. I won’t go into exactly how sed works, but essentially, each of these lines searches for a line with some pattern, and then replaces that line with the pattern plus the bit you add onto the end. Because we aren’t in a god-loving UNIX-like shell, we have to add some additional quirky things in. For instance, if we want to print a greater than or less than symbol, we need to include a caret (^) first to escape it. These lines aren’t strictly necessary if you just want to edit the DESCRIPTION file yourself. What I suggest you do is to eventually just write out a nice DESCRIPTION file, copy it into your working directory, and then add in a line to your batch file “cp DESCRIPTION mypackage/DESCRIPTION”. You can then remove all of the sed lines except for the version number update.

Most of the lines changed in the DESCRIPTION file are pretty self explanatory, but the License line might merit a little more expansion. Choosing a license is one of these things that a lot of people get really worked up about (RMS, for instance). I’m not really sure how to approach it in my own case. I think that when I eventually publish my package(s), I will probably either go with the ubiquitous GPL, or the Apache license. I’m rather partial to Apache licensing insomuch as it doesn’t include all of the copyleft provisions in GPL (and especially GPLv3). I’m not really interested in dealing with all of that, so something like Apache (which seems to be a very agnostic sort of open source licensing) seems to make a lot of sense. On the other hand, though, most R packages seem to use GPL, so there may not be much of a reason to buck the trend.

The line that begins with %R_TERM% packages up our package and saves it as The final line pauses the batch file so that the DOS prompt stays on the screen so that we can inspect the output to ensure that there were no errors.

And you’ve done it! Your package should have been generated correctly, with all the proper documentation and whatnot! You can then install it to test with install.packages(“”,repos=NULL) for whatever version number you are on.

So that’s the development workflow I have put together. Feel free to use whatever makes sense to you. I didn’t really see a very good centralized repository of an entire development workflow, so I’m putting my own out into the aether.

(EDIT: I’ve made vast changes in my workflow through the use of Hadley’s fantastic devtools package. It’s still not completely obvious how to manage the whole workflow, so I’ll be making a new post on the subject soon.)

I have recently been working on writing a package for R, and in the course of doing so wanted to run some fairly large simulations. I have a powerful multi-core processor on my desktop, but R doesn’t care about those extra cores. I wanted to run 5000 repetitions of my simulation function, but doing so takes a couple minutes, which is clearly unacceptable. The output of system.time on replicate(5000,sim(1000)) returns an elapsed time of around 400 seconds. This seems ridiculous. I try to write well behaved vectorized code (I mean, replicate should be trivial to multithread, right?), so there is no reason that we shouldn’t be able to massively parallelize this code.

This is indeed the case. The package we are interested in is “snow.” This is an acronym for “Simple Network of Workstations”. As the name implies, it was written for clustering, but we don’t have to use it like that. Let’s say I just want two “nodes” (which just correspond to threads, in this case). What do we need to do?

First, we should initiate the “cluster”, by simply creating two nodes on the local machine. We do that as follows (Windows may pop up a firewall warning. I’m not sure how this will work on other OSes):


There are other sorts of clusters, but SOCK clusters are easy to set up, so I’m just using that. Next, we need to set up the environment on each of these newly created nodes.  For my simulation, I need the sandwich and the lmtest packages attached, so I do that:

clusterEvalQ(cl,setwd("[Your Working Directory Path]"))
clusterEvalQ(cl,source("[Source File]"))

As you can see, I am calling a function of my own creation in the simulation, so I need to source that file to each of the new R environments I created. The final step is to just tell the cluster to get to work:


And its really that simple. There doesn’t seem to be a direct equivalent to sapply in the snow library, but this isn’t difficult to deal with. The parSapply function operates just like the normal sapply function in base R (and there are other similar functions which you can find in the snow documentation). Clearly, calling this function as I do will do exactly the same thing as replicate(5000,sim(1000)). stopCluster just stops the cluster and cleans up the connections.

Calling this through system.time calculates that the elapsed time is just a bit more than half (about 225 seconds) of the elapsed time of the single threaded version. This isn’t surprising, as the clustering method requires some amount of overhead, so we shouldn’t expect time to be exactly cut in half.

There may be a cleaner way to do this that feels like less of a hack, but I’m pretty pleased with this. It took me a very small amount of time to figure out and get working.

Some people seem to have trouble figuring out the best way to create high quality extensive game trees in LaTeX. I’ll admit that it took me a while to figure out, at first. There are lots of different packages that purport to do this sort of thing, but I have found TikZ to do the best job of any LaTeX based tool. The nice thing is that gaining a solid understanding of TikZ will let you create all sorts of fancy looking charts and pictures in your documents, as TikZ is a really versatile tool for creating all sorts of figures in LaTeX documents. There’s a great manual with everything you could ever want to know about TikZ, and also an excellent site with examples.

It might be a little intimidating if you aren’t used to this sort of thing, but its much easier to get a handle on than a package like pstricks, which is incredibly frustrating to get working with pdf output.

Anyway, an example follows. First, the output:

And the code to create this:

%Put this in your document preamble:
%And then the actual code for creating the figure:
% First, set the overall layout of the tree
% You might need to play with these sizes to ensure nothing overlaps.
\tikzstyle{level 1}=[level distance=1.5cm, sibling distance=2.5cm]
\tikzstyle{level 2}=[level distance=1.5cm, sibling distance=2.5cm]
\tikzstyle{level 3}=[level distance=1.5cm, sibling distance=1cm]
\tikzstyle{level 4}=[level distance=1.5cm, sibling distance=2cm]
%Start with the parent node, and slowly build out the tree
% with each "child" representing a new level of the diagram
% each "node" represents a labelled (or unlabeled if you 
% want) node in the diagram.
\node {1a}
                        %Put the name of the node in parenthesis for
                        % reference later. The label shown in the diagram
                        % goes in the brackets. This label can use math mode.
                                %This allows us to attach a label to the 
                                % edge between nodes. This label is just 
                                % another node, so we can also name it and
                                % attach things to it.
				edge from parent
				edge from parent
                  %Invisible branch to make things align properly.
		} child{edge from parent[draw=none] } 
	edge from parent
				edge from parent
				edge from parent
		edge from parent
				edge from parent
				edge from parent
		edge from parent
	edge from parent
%Now I create the information set. Note that I utilize the names
% that I had previously assigned to nodes in my graph
\draw [dashed](a)--(b);
\draw [dashed](b)--(c);

Here’s a neat little function to plot coefficients in ggplot2, courtesy of David B. Sparks. The function is below.

SmoothCoefficientPlot <- function(models, modelnames = "", removeintercept = FALSE){
  # models must be a list()

  Alphas <- seq(1, 99, 2) / 100

  Multiplier <- qnorm(1 - Alphas / 2)
  zzTransparency <<- 1/(length(Multiplier)/4)
  CoefficientTables <- lapply(models, function(x){summary(x)$coef})
  TableRows <- unlist(lapply(CoefficientTables, nrow))

  if(modelnames[1] == ""){
    ModelNameLabels <- rep(paste("Model", 1:length(TableRows)), TableRows)
    } else {
    ModelNameLabels <- rep(modelnames, TableRows)

  MatrixofModels <- cbind(, CoefficientTables), ModelNameLabels)
  if(removeintercept == TRUE){
    MatrixofModels <- MatrixofModels[!rownames(MatrixofModels) == "(Intercept)", ]
  MatrixofModels <- data.frame(cbind(rownames(MatrixofModels), MatrixofModels))

  MatrixofModels <- data.frame(cbind(MatrixofModels, rep(Multiplier, each = nrow(MatrixofModels))))

  colnames(MatrixofModels) <- c("IV", "Estimate", "StandardError", "TValue", "PValue", "ModelName", "Scalar")
  MatrixofModels$IV <- factor(MatrixofModels$IV, levels = MatrixofModels$IV)
  MatrixofModels[, -c(1, 6)] <- apply(MatrixofModels[, -c(1, 6)], 2, function(x){as.numeric(as.character(x))})
  MatrixofModels$Emphasis <- by(1 - seq(0, 1, length = length(Multiplier) + 1)[-1], as.character(round(Multiplier, 5)), mean)[as.character(round(MatrixofModels$Scalar, 5))]

  OutputPlot <- qplot(data = MatrixofModels, x = IV, y = Estimate,
   ymin = Estimate - Scalar * StandardError, ymax = Estimate + Scalar * StandardError,
   ylab = NULL, xlab = NULL, alpha = I(zzTransparency), colour = I(gray(0)), geom = "blank")
  OutputPlot <- OutputPlot + geom_hline(yintercept = 0, lwd = I(7/12), colour = I(hsv(0/12, 7/12, 7/12)), alpha = I(5/12))
  OutputPlot <- OutputPlot + geom_linerange(data = MatrixofModels, aes(size = 1/Emphasis), alpha = I(zzTransparency), colour = I(gray(0)))
  OutputPlot <- OutputPlot + scale_size_continuous(legend = FALSE)
  OutputPlot <- OutputPlot + facet_grid(~ ModelName) + coord_flip() + geom_point(aes(x = IV, y = Estimate), colour = I(gray(0))) + theme_bw()

It’s fairly easy to modify to work with robust SEs. Merely change the line which extracts the coefficients from:

#Original line
CoefficientTables <- lapply(models, function(x){summary(x)$coef})

#Modified for Robust SEs

This assuming that the variance covariance matrix is saved to the $se element of the model, as I suggested in the previous post. It also requires the lmtest package.

Other modifications are pretty easy. I may end up fixing it up to be a little more user friendly by adding some possible arguments to the function call (for things like robust SEs, modified variable names, etc). If I do, I will post it.

Anyway, it gives a nice looking plot of the coefficients without using an arbitrary cutoff for statistical significance.

So I think I’m going to move into more of an ad hoc format for these blogs. Much easier to manage betwixt classwork and other projects.

Anyway, the subject of this post is on various sorts of heteroskedasticity robust standard errors in R. In Stata, this is trivially easy: you simply add “, robust” to the end of your regression command (or , cluster(var) for cluster robust SEs). In R, there’s a bit more flexibility, but this comes at the cost of a little added complication. Simplest first.


This code generates the default heteroskedastic robust standard errors in R. The sandwich package is necessary for reestimating the variance–covariance matrix, and the lmtest package for easily performing summarizing tests on the coefficients using the corrected variance estimation. If you don’t need to retest the model’s output, you might not need lmtest. First, I should flash the appropriate matrix notation for the asymptotic distribution of our regressors. In particular, the variance may be written as follows:

With fixed regressors, we can rewrite this more simply as:

where the bit of interest is . If we can get away with assuming homoskedasticity, its trivial to see that . When we can’t assume homoskedasticity, we can’t simplify so easy, and we must come up with a smart way to estimate . Which is exactly what heteroskedastic robust SEs try to do!

By default, vcovHC will return HC0 Standard Errors. That is,


This may cause some confusion though, especially if you are surrounded by people using Stata. This is not the same robustness procedure Stata’s robust option uses. Instead, they use HC1 robust SEs, which include a degree of freedom correction:


To get this output in R is very simple. Simply add the option “type” to vcovHC with the argument “HC1”. There are several other ways to estimate robust SEs which I won’t address right now.

The main point of this post was actually to get at cluster robust SEs, however. This is when you believe that errors are correlated within certain groups, but not between groups. An example would be in panel data. You would likely expect errors to be correlated for a given state (and you would probably also use fixed or random effects in this situation. Fixed effects and cluster robust SEs go together like milk and Oreos). Another example would be if you are trying to estimate a treatment effect, but you weren’t able to randomize treatment at the individual level. Perhaps you were only able to randomize over certain groups (perhaps by neighborhood, county, or family).

In this case, we want to correct for the fact that certain groups are correlated. We do this using the following estimator:

Where the variables h represent clusters/groups, and H is the number of clusters. Stata also applies a degree of freedom correction, however, so it use the estimator:

But we don’t particularly care about how Stata does things, since we want to know how to do it in R. Mahmood Arai provides a function for doing so, which I have modified along the lines of J. Harden (although his version does not work directly as written). My version is as below (it uses the same degree of freedom calculation as does Stata): <- function(model, cluster){
 M <- length(unique(cluster))
 N <- length(cluster)
 K <- model$rank
 dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
 uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum));
 rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N) <- coeftest(model, rcse.cov)

To execute this function requires the existence (again) of both the sandwich and lmtest packages, available on CRAN. After running a model, simply call the function with the saved model output and the variable on which you wish to cluster, as in below.

#To save only the variance-covariance matrix
#To save the coeftest object,
# which is a table containing the results
# of a test of the coefficients for significance
# using the corrected SEs

I personally like attaching the variance covariance matrix to the model object in, for instance, the element $se. This allows for easy interoperability with the package apsrtable.

A word of warning on cluster robust standard error calculation, however. It is necessary that you are careful in the crafting of the cluster variable. The factor variable passed in should have levels which are the same as the names of the associated dummy variables in your data frame. For instance, I recently was working with a dataset which required state-level clustered SEs (but did not use state level fixed effects). I had first added the state level dummy variables using the dummies package, generating variables named in the format of “state.washington, state.georgia” and so forth. My factor variable on which I wanted to cluster had levels named “washington, georgia” and so on. This will return an error, as the estimation routine cannot find the appropriate dummies in the data frame (and the error isn’t particularly clear as to the specific problem). There are two ways to fix this issue directly (and possible an easier fix of which I am unaware.). You can rename all of the variable names in your data frame to match the factor levels, or you can just create a new variable with the appropriate variable names as factors. I chose the latter. This may easily be accomplished with the following code:

#Create the new variable with appropriate level names.
#Save the coefficient test output to an element in the model object

This was something that confused me for a while, but once you realize how the functions are operating under the hood is a no-brainer. I did want to get this fix up on the web, though, as it stumped me for a while, and I’d imagine other people have been in a similar situation. To be honest, I was nearly to the point of writing an entirely new estimation routine from the ground up, just so I knew exactly what was going on beneath the layers and layers of function calls in the function.

(Changing things around and not discussing average treatment effects yet. As I begin to get settled in with this, hopefully I should get a better outline of the order in which I wish to proceed. For now, I’m sure I will change my mind a lot.)

Causal inference is not always possible. We can’t make causal claims about all varieties of effects. In the last post I briefly introduced the concept of potential outcomes. And important element of the very concept of potential outcomes is that, prior to the moment of treatment, it was possible for either outcome to be realized. If we can’t supply a counterfactual, we can’t really talk about causal effects.

Consider race. Can we think about identification of a causal effect of sex on incomes? If we imagine “sex” to be our treatment, can we really conceptualize gender as an intervention? Rubin argues that we cannot. Even a procedure which could change the sex of the child at birth would not be sufficient to identify the actual effect of gender. How could we tell whether the effect was due to the procedure or due to the sex of the child? Thus, it is necessary that, for any unit under study, we are able to imagine a scenario wherein the intervention could have been different. For this reason, Rubin suggests, we should conceive of treatment as a set of actions performed on an individual. Then we can unambiguously discuss the causal effects of those actions.

My final note before actually moving on to a discussion of averages as the basis of treatment effects will be around what I find to be one of the most compact and profound equations in causal inference.

Consider the joint probability density function of $(X, Y, W, M)$ with unknown parameter $\pi$. $X$ is a collection of all the pre–treatment variables. Let’s consider an example of conflict. We wish to measure the causal effect of democracy on conflict propensity. Then $X$ would consist of such variables as economic growth, colonial legacy, resources, trade flows and so forth. $Y$ is our outcome variable, which is, of course, measured post–treatment (but we are in the potential outcomes framework, so there exists a potential outcome for both democracy and for non–democracy treatments). $W$ is an indicator which shows what treatment was received by the unit. Finally, $M$ is a matrix signifying the missingness of all the previous variables. We can think of the joint probability density function of these variables as showing the observed data on this subject. Rubin demonstrates a wonderful decomposition of this:

The first part of this expression is essentially what we care the most about. $f(X,Y \vert \pi)$ is the potentially observable data. In other words, the combination of covariates and potential outcomes are represented here. The rest of the expression is all about the design. $k(W \vert X,Y,\pi)$ is the probability distribution of the assignment to treatment. We need to understand this mechanism if we want to draw appropriate conclusions from our study. Likewise, the final element of the expression, $g(M \vert X,Y,W,\pi)$ is the assignment of missingness to the data. Like the treatment, we have to understand this mechanism if we want to be able to draw inferences about the underlying process of data generation.

For observational studies, the researcher does not have control over the latter two portions. That’s why it is very appealing to design an experiment of sorts, rather than rely on collecting existing observational data. In so doing, you can gain a much better appreciation of exactly how the design portion of this distribution behaves. It makes things worlds easier. To bring things back to my example, we need to have a clear understanding of how it is that states are “assigned” to democracy (not exactly a straightforward task), as well as a good understanding of where there is data and where there isn’t. For example, autocratic states may try to censor details about the oppression of their citizenry. This oppression likely also has an effect on the democratizing process. Clearly, this is not an easily identified causal effect.

Nevertheless, I really like this expression, as it neatly ties together many of the things that empirical researchers really have to worry a lot about: the data generating process, the assignment to treatment, and the missingness of the data. It’s a really profound expression to keep in mind.

I think my next post will discuss the role of randomization and the use of averages to help overcome the fundamental problem of causal inference.