Interaction Terms in Stata
Tommie Thompson: Georgetown MPP 2018
In regression analysis, it is often useful to include an interaction term between different variables. For instance, when testing how education and race affect wage, we might want to know if educating minorities leads to a better wage boost than educating Caucasians. It’s possible that minority wages rise more for every additional “unit” of education than white wages do. To model this, we create a new variable that multiplies the two terms together and add it to the regression:
Wage = β0 + β1Education + β2Minority + β3Education*Minority + ε
β3 tells us how the effect of education on hourly wage differs by race. If β3 > 0, then minorities gain more in hourly wage than Caucasians for every additional unit of education they receive, controlling for the other predictors. This doesn’t mean that minorities have higher wages than whites (β2 captures the wage gap between the groups at zero education), but that minorities derive more wage-generating value from education than whites.
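To see why, it can help to write the model out separately for each group, treating Minority as a 0/1 indicator (an assumption the equation above implies but does not state):
Whites (Minority = 0): Wage = β0 + β1Education + ε
Minorities (Minority = 1): Wage = (β0 + β2) + (β1 + β3)Education + ε
The slope on education is β1 for whites and β1 + β3 for minorities, so β3 is the difference between the two slopes, and β2 is the gap between the two groups at zero units of education.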
Conducting analysis with interaction terms is straightforward in Stata. The most intuitive way to do so is to generate the interaction term as a new variable (the product matches the equation above only if race is coded as a 0/1 indicator):
. gen RacexEduc = race*grade
(2 missing values generated)
. reg wage grade i.race RacexEduc
The output suggests that minorities gain 15 cents more per hour than whites for every additional year of education they receive, ceteris paribus, even though the coefficient on race puts minorities $2.47 per hour below whites at zero years of education. Taken at face value, these estimates imply a gap of roughly -2.47 + 0.15 × 12 = -0.67 dollars per hour at 12 years of education. Although the coding for this output is relatively painless, Stata offers a quicker way to run models with interaction terms using hashtags:
. reg wage i.race#c.grade
As the figure shows, if one hashtag is used, Stata runs a model with only the interaction term and no main effects. That is:
Wage = β0 + β1Education*Minority + ε
Running a model like this, however, is generally ill-advised. If we include only the interaction term without the main effects, the coefficient on the interaction term may absorb part of the effect of one of the main predictors. In other words, some of the effect we see from the interaction term may come from a main predictor “hiding” in the interaction term. If we include the main effects, we can see the pure relationship between wages and the interaction of education and minority status, since the model holds the main effects constant in calculating the interaction coefficient. To include the main effects using hashtags, we can write them out as -reg wage grade i.race i.race#c.grade-. However, a simpler way is to use two hashtags:
. reg wage i.race##c.grade
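As a rough sketch of how the pieces fit together, the commands below build a 0/1 minority indicator and fit the same model both by hand and with hashtags. They assume Stata's bundled nlsw88 example data, which has variables named wage, grade and race; the text above does not say which dataset it uses, and the variable names minority and MinxEduc are made up for illustration.
. sysuse nlsw88, clear
. * hypothetical 0/1 indicator: 1 = non-white, 0 = white (race is coded 1 = white in nlsw88)
. gen minority = (race != 1) if !missing(race)
. gen MinxEduc = minority*grade
. * the hand-built interaction and the hashtag notation should give matching coefficients
. reg wage grade minority MinxEduc
. reg wage i.minority##c.grade
. * education slope for minorities: main effect of grade plus the interaction coefficient
. lincom grade + 1.minority#c.grade
. * or let margins report the education slope separately for each group
. margins minority, dydx(grade)
The -lincom- and -margins- lines are optional, but they report the group-specific education slopes directly, with standard errors, rather than leaving the reader to add coefficients by hand.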
While using hashtags is simpler than generating the interaction term as a new variable, there is one rule to remember: use the variable prefixes. In Stata, -i.[variable]- indicates that the variable is categorical, and -c.[variable]- indicates that it is continuous. Because the hashtag operators assume the variables in an interaction term are categorical, continuous variables must be marked explicitly with the -c.- prefix. The code above does this with the education variable. This may seem counterintuitive, since outside of interaction terms Stata’s -regress- command assumes variables are continuous. If you forget the -c.- prefix on a continuous variable, however, you will either produce an unnecessarily long output (one coefficient for each distinct value of the variable) or, if the variable has non-integer values, an error:
. reg hours wage##i.race
wage: factor variables may not contain noninteger values
r(452);
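A minimal fix, following the prefix rule, is to mark wage as continuous with -c.- so that Stata no longer tries to treat it as a factor variable:
. reg hours c.wage##i.race
With the prefix in place, Stata estimates a single wage slope plus a slope shift for each non-base race category, rather than a separate indicator for every distinct wage value.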