We work with clients who are running poverty interventions in the field. These are not experiments in the classic sense, where the program is designed around the study; they are impact evaluations that must be designed around a development program that has already been designed or is in progress. As such, the sample size is often fixed: the number of beneficiaries, and where they are located, has already been determined. One specific challenge arises when the program is offered at the community level (or has a high likelihood of spillovers). These studies need randomization into the program at the community level, so a community, rather than an individual, is assigned to treatment.
A few terms before we move into tools:
- I’m going to call program beneficiaries “Treatment” and those who don’t get the program “Control” (called “Comparison” when not randomly assigned)
- “The power (or statistical power) of an impact evaluation design is the likelihood that it will detect a difference between the treatment and comparison groups, when in fact one exists.” (3ie; more on this here)
- “n” is the shorthand used in statistics to reference a sample. The entire group, or population, that the sample was drawn from is referenced with a capital “N.” In an experiment (this can get confusing) the treatment group size and the control group size may each be referenced with “n,” while the entire study’s sample size may be called “N.”
- Cluster-Randomization is random assignment to treatment (or control) at a level other than the individual. This is common when spillovers (a non-beneficiary receiving a benefit from the program) are likely. An example could be a study in which an educational method is tested in primary schools. We cannot randomize students into the program because a teacher’s work will impact all of her students, not just the one student in the room assigned to treatment. In this case we would randomly assign schools (not students) to the program. Note that our unit of observation would still be the student (maybe we use their test scores).
In cluster-randomized studies, power can be a challenge. The ability to detect a program effect is driven more by the number of clusters (schools) in the study than by the number of observations (students). This means that if you had 10 schools in your experiment (5 treatment; 5 control) and observed 25 students in each school, you would gain a lot more power by adding 2 more schools (a 20% increase in clusters, and in the resulting observations) than by including 5 more students from each school (also a 20% increase in observations).
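A rough way to see why clusters dominate is the standard design-effect approximation. The sketch below is illustrative only: the ICC of 0.15 and the effect size of 0.5 are assumed values, and the normal approximation is optimistic when clusters are very few (which is exactly the small-cluster problem discussed later in this post).

```python
from math import sqrt
from statistics import NormalDist

def cluster_power(n_clusters, m_per_cluster, icc, effect_size, alpha=0.05):
    """Approximate power of a two-arm cluster RCT (clusters split evenly).

    Deflates the total sample by the design effect 1 + (m - 1) * ICC,
    then applies a two-sided normal approximation for a standardized
    effect size. Illustrative only; not a small-sample-exact calculation.
    """
    deff = 1 + (m_per_cluster - 1) * icc
    n_eff = n_clusters * m_per_cluster / deff         # effective sample size
    se = 2 / sqrt(n_eff)                              # SE of the standardized difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size / se - z_crit)

# 10 schools x 25 students, assumed ICC = 0.15, effect size = 0.5
base = cluster_power(10, 25, 0.15, 0.5)
more_schools = cluster_power(12, 25, 0.15, 0.5)    # +20% clusters
more_students = cluster_power(10, 30, 0.15, 0.5)   # +20% students per school
```

Under these assumed numbers, the two extra schools buy noticeably more power than the five extra students per school, even though both additions contribute the same 50 observations.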
The graph below shows, in blue, how small the marginal increases to power are from adding additional students. The slope of the red curve is much steeper through the small numbers, signaling greater marginal increases in power.
If there were only 10 schools, a study would need hundreds of students (or more) to hit the 0.80 power level that social scientists traditionally target. However, with 20 schools we reach that target by observing about 25 students per school. The key takeaway from this chart is that when you go from 10 to 20 clusters, the blue power curve shifts upwards dramatically. However, when we double the number of observations per cluster, the red power curve is not shifted nearly as much.
Ok, so that’s just the reality check. When the randomization is clustered, we need to add clusters, not observations, to increase our power. But what do we do when we still have a small number of clusters and operations won’t allow for more? What counts as small? For this conversation, let’s say fewer than 20 (10 Treatment; 10 Control). We are at the threshold of having enough power, but we are concerned and want to squeeze a little more out of the study. How can we increase our bang for the buck, so to speak?
In 2016 John Deke from Mathematica published a great paper looking into this: “Design and Analysis Considerations for Cluster Randomized Controlled Trials That Have a Small Number of Clusters.” Perfect! He demonstrates that it’s possible, in specific scenarios, to detect effects with as few as six clusters. Here are the actionable takeaways: “Pairwise matching is always recommended” and “When feasible, a ‘gains’ analysis can be the most statistically powerful way to adjust for a baseline measure of the outcome.” There are a lot of other great findings, but another one of interest is that you should control for the matching by including the variable(s) used to create the pairs rather than the dummies for the pairs themselves. This saves degrees of freedom (assuming you have fewer matching variables than you do matched-pair dummies).
In our school example, if we were interested in the treatment effect on test scores, we would create a gains variable where Gains = (Score_Endline) - (Score_Baseline). A positive score would mean that the student increased her score from baseline to endline; a negative score would mean she was backsliding. Normally, we would have put the baseline score in the regression as a covariate. This adjusts for students’ pre-treatment scores (or initial state). Subtracting it out does the same thing, but reduces the number of variables in the model.
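Here is a minimal sketch of the gains construction on simulated data. Everything here is made up for illustration: the scores, the 6-school design, and the roughly 5-point program effect. Note that regressing gains on a treatment dummy reduces to a simple difference in mean gains between the two arms.

```python
import random
from statistics import mean

random.seed(0)

def simulate_school(treated, n_students=25):
    """One toy school: baseline scores ~ N(50, 10); the (made-up)
    program adds roughly 5 points between baseline and endline."""
    students = []
    for _ in range(n_students):
        baseline = random.gauss(50, 10)
        change = random.gauss(5 if treated else 0, 5)
        students.append({"treated": treated,
                         "baseline": baseline,
                         "endline": baseline + change})
    return students

# 3 treatment schools and 3 control schools
data = []
for _ in range(3):
    data += simulate_school(True)
for _ in range(3):
    data += simulate_school(False)

# Gains = endline - baseline; comparing mean gains across arms is the
# same as regressing gains on a treatment dummy.
for student in data:
    student["gains"] = student["endline"] - student["baseline"]

effect = (mean(s["gains"] for s in data if s["treated"])
          - mean(s["gains"] for s in data if not s["treated"]))
```

In a real analysis you would still cluster the standard errors at the school level and, per Deke, control for the variables used to form any matched pairs.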
This is a good time to make a note about ANCOVA. If you’re not struggling with a small number of clusters, and worried about degrees of freedom, the standard practice now should be to use an ANCOVA model for estimating your program’s impact. In 2012, World Bank Economist David McKenzie published a paper exploring adding more rounds of follow-up data. The paper also clearly demonstrated the value of using ANCOVA estimates rather than a diff-in-diff. This is a great tool to add free power to your study! It’s also very simple: just include the baseline measure of your outcome in the regression as a covariate. Done.
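As a sketch, an ANCOVA estimate is just a regression of the endline outcome on a treatment dummy plus the baseline outcome. The data below are simulated (the 0.8 baseline persistence and 5-point effect are invented numbers), and the tiny OLS helper solves the normal equations in pure Python so the example stays self-contained; in practice you would use a regression package and clustered standard errors.

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X b = X'y),
    solved by Gauss-Jordan elimination. Fine for a handful of regressors."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    A = [XtX[i] + [Xty[i]] for i in range(k)]        # augmented matrix
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]              # partial pivoting
        for row in range(k):
            if row != col:
                f = A[row][col] / A[col][col]
                A[row] = [a - f * b for a, b in zip(A[row], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(1)
# Toy data: endline depends on baseline, plus a made-up 5-point effect.
rows = []
for i in range(200):
    treated = 1 if i < 100 else 0
    baseline = random.gauss(50, 10)
    endline = 0.8 * baseline + 5 * treated + random.gauss(0, 5)
    rows.append((treated, baseline, endline))

# ANCOVA: regress the endline outcome on treatment and its baseline measure.
X = [[1.0, t, b] for t, b, _ in rows]
y = [e for _, _, e in rows]
intercept, treatment_effect, baseline_coef = ols(X, y)
```

The coefficient on the treatment dummy is the impact estimate; the baseline covariate soaks up pre-existing differences, which is where the free power comes from.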
Don’t take my word for it: Berk Özler blogged about this in 2015 too. Then, McKenzie pointed out another benefit of the method: it allows for different measurements of the outcome at baseline vs. endline (maybe you improve the survey tool, or change a recall period).
McKenzie also provides a lot of insight on matching. Bruhn and McKenzie discussed some ways to improve the way we design and analyze RCTs in their 2009 “In Pursuit of Balance” paper. They look at a number of ways to randomize, and conclude that when your sample is 300 (individually randomized) your method of randomization is less important. However, in smaller samples they push for pair-wise matching ex ante (beforehand) on variables that have “good predictive power for the future outcomes.” This is mainly to help ensure balance, but that in turn has benefits for our power. The added value of this paper is the digital annex (ungated). It includes Stata code to create the pairs!
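The Bruhn–McKenzie annex provides the real Stata implementation; as a simplified illustration, here is the single-variable version of pairwise matching in Python: sort clusters on a matching variable, pair neighbors, and flip a coin within each pair. The school data and the `baseline_mean` variable are hypothetical.

```python
import random

random.seed(42)

# Hypothetical school-level data: one matching variable with good
# predictive power for the outcome (say, mean baseline test score).
schools = [{"id": i, "baseline_mean": random.gauss(50, 10)} for i in range(12)]

# Pairwise matching on a single variable: sort schools on the matching
# variable, pair adjacent schools, then randomize within each pair.
ordered = sorted(schools, key=lambda s: s["baseline_mean"])
assignment = {}
for pair_id in range(len(ordered) // 2):
    a, b = ordered[2 * pair_id], ordered[2 * pair_id + 1]
    first_treated = random.random() < 0.5
    assignment[a["id"]] = {"pair": pair_id, "treated": first_treated}
    assignment[b["id"]] = {"pair": pair_id, "treated": not first_treated}
```

Each pair contributes exactly one treatment and one control school, so the arms are balanced on the matching variable by construction; per Deke, the analysis would then control for `baseline_mean` itself rather than the pair dummies.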
If you find yourself trying to add a bit of power to a less-than-perfect randomized trial, consider these steps:
- Add More Clusters
- Use ANCOVA (or Gains if you’re worried about degrees of freedom)
- Use Pairwise Matching