Introduction
In impact evaluation we often need to analyze binary, qualitative variables such as savings behavior (saves vs. does not save), voting behavior (votes vs. does not vote), or gender (male vs. female). In these cases we are interested in whether a condition exists or not, rather than in estimating some interval measure. One method for analyzing qualitative, binary variables is the Linear Probability Model (LPM). An LPM is a special case of Ordinary Least Squares (OLS) regression, one of the most popular models used in economics. OLS regression estimates an unknown dependent variable by minimizing the squared differences between the observed data points and the best linear approximation to them. Certain assumptions are required for this model to be valid. *WARNING*: This blog post contains some mathematical language which may be intimidating to some readers. While this language is necessary to properly define the relevant topics, I will attempt to provide a plain-language explanation where possible.
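In symbols, OLS chooses the coefficient vector $\hat{\beta}$ that minimizes the sum of squared differences between the observed outcomes and the fitted line:

$$\hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{n} \left(y_i - x_i^{\prime}\beta\right)^{2} = (X^{\prime}X)^{-1}X^{\prime}y$$

where $x_i$ is the vector of regressors for observation $i$ (including a constant).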
Assumptions
Assumption #1. The dependent variable $y$ is a linear combination of the regression coefficients $\beta$, the independent variables $X$ and an error term $\epsilon$.
Assumption #2. The matrix of explanatory variables $X$ has full column rank. This means that the regressors in $X$ are linearly independent and the number of observations is greater than the number of explanatory variables.
Assumption #3. Explanatory variables $X$ have strict exogeneity. No regressor in our model may explain our error term $\epsilon$. Mathematically, we say $E[\epsilon_i|X]=0$. In real terms, this is achieved by randomization of treatment assignment.
Assumption #4. Error terms are independent with identical finite variance $\sigma^{2}$. This condition of identical finite variance is called homoscedasticity.
Assumption #5. Error terms are normally distributed conditional on the explanatory variables $X$. This assumption is technically not required for OLS regression to be valid. However, normally distributed error terms are convenient for small samples and allow for exact hypothesis testing. The five assumptions are restated compactly in matrix notation below.
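For readers comfortable with matrix notation, one compact way to write assumptions 1 through 5, with $n$ observations and $p$ regressors (including the constant), is:

$$
\begin{aligned}
&\text{A1: } && y = X\beta + \epsilon \\
&\text{A2: } && \operatorname{rank}(X) = p, \quad n > p \\
&\text{A3: } && E[\epsilon \mid X] = 0 \\
&\text{A4: } && \operatorname{Var}(\epsilon \mid X) = \sigma^{2} I_n \\
&\text{A5: } && \epsilon \mid X \sim N(0,\, \sigma^{2} I_n)
\end{aligned}
$$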
The Probability in “Linear Probability Model”
As was stated earlier, an LPM is a special case of OLS regression. For our purposes here, we are interested in estimating a qualitative outcome variable $Y$, which can take on two possible values, usually $0$ and $1$. Because $Y_i$ takes only these two values, its expectation equals the probability that the condition exists. Thus we have

$$E[Y_i] = 1 \cdot Pr(Y_i = 1) + 0 \cdot Pr(Y_i = 0) = Pr(Y_i = 1) \tag{1}$$
Also recall from OLS that if we have $p-1$ independent variables $X_1, X_2, \ldots, X_{p-1}$, then

$$E[Y_i] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} \tag{2}$$
By combining equations $(1)$ and $(2)$, we have

$$Pr(Y_i = 1) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} \tag{3}$$
Thus, the left side of equation $(3)$ forces the right side to be interpreted as a probability. Remember that probabilities range between 0 and 1, inclusive. Suppose our regression produces a value of 0.85 on the right side. Then we could say that 85% of individuals displaying that particular combination of regressors would fall into category 1, and the remaining 15% into category 0. At this point, an example is useful to visualize how LPMs work with real data and to identify some of their drawbacks.
Example
For this example, we’ll use the auto.dta dataset which comes packaged with STATA 14.
. sysuse auto.dta
(1978 Automobile Data)

. drop make headroom trunk length-gear_ratio

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |        74    6165.257    2949.496       3291      15906
         mpg |        74     21.2973    5.785503         12         41
       rep78 |        69    3.405797    .9899323          1          5
      weight |        74    3019.459    777.1936       1760       4840
     foreign |        74    .2972973    .4601885          0          1
The summary shows price, mpg, and weight, which are continuous variables, while rep78 is discrete, taking integer values from 1 to 5. The variable rep78 represents the number of times that a particular vehicle had been repaired in 1978. foreign is a binary variable, with 1 indicating that the vehicle is foreign. We will use an LPM to estimate the probability that a vehicle is foreign based on a linear combination of price, fuel consumption, repair record, and weight.
. reg foreign price mpg weight rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =     31.90
       Model |   9.7285922         4  2.43214805   Prob > F        =    0.0000
    Residual |  4.88010345        64  .076251616   R-squared       =    0.6659
-------------+----------------------------------   Adj R-squared   =    0.6451
       Total |  14.6086957        68   .21483376   Root MSE        =    .27614

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       price |   .0000592   .0000144     4.11   0.000     .0000304     .000088
         mpg |  -.0195018   .0097679    -2.00   0.050    -.0390155    .0000118
      weight |   -.000538   .0000782    -6.88   0.000    -.0006942   -.0003819
       rep78 |   .1501357   .0391534     3.83   0.000     .0719178    .2283536
       _cons |   1.475485   .4222043     3.49   0.001     .6320355    2.318935
------------------------------------------------------------------------------
Our regression results yield the following formula for predicting the probability that a particular vehicle is foreign:

$$\widehat{Pr}(\text{foreign} = 1) = 1.475485 + 0.0000592\,\text{price} - 0.0195018\,\text{mpg} - 0.000538\,\text{weight} + 0.1501357\,\text{rep78}$$
Suppose we knew that a vehicle cost $8,000.00, averaged 24 mpg, weighed 2,500 lbs, and was repaired once. Then our results indicate that:

$$\widehat{Pr}(\text{foreign} = 1) = 1.475485 + 0.0000592(8000) - 0.0195018(24) - 0.000538(2500) + 0.1501357(1) \approx 0.29$$
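If you would like to verify this calculation yourself, Stata stores the estimated coefficients after the regression, so a quick check (just one way of doing it) is:

. display _b[_cons] + _b[price]*8000 + _b[mpg]*24 + _b[weight]*2500 + _b[rep78]*1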
In other words, a vehicle with these particular traits in 1978 has roughly a 29% chance of being foreign (and thus about a 71% chance of being domestic). Now we'll look at a plot of the residuals against the fitted values (the estimated probabilities) for all observations in our dataset.
. rvfplot
Violation of Assumptions
This plot reveals a few problems with OLS regression for binary outcome variables. First, recall that residuals ($y-\hat{y}$) should be normally distributed. Above, we see that, for each $\hat{y}$, the residuals are free to take on only two values. Thus, we can conclude that the error terms $\epsilon$ are not normally distributed in the population, violating assumption 5. Residuals should also have identical finite variance, meaning that the amount of error is the same no matter what the values of the predictors $X$ are. However, the variance of a binary variable is $pq$, where $p$ is the probability that the condition exists, and $q$ is the probability that the condition does not exist. Because $Y$ can only be 0 or 1, the variance of the error term for observation $i$ is

$$Var(\epsilon_i) = Var(Y_i) = E[Y_i^2] - \left(E[Y_i]\right)^2 = p_i - p_i^2 = p_i(1 - p_i) = p_i q_i$$
Thus, the variance of our errors changes as $p = Pr(Y_i = 1)$ changes, since $p$ depends on our predictors $X$, violating the homoscedasticity property of assumption 4. Notice from the plot that there are fitted values $\hat{y}$ that fall outside the range $[0,1]$. This is because OLS regression places no constraints on the range of $\hat{y}$, allowing predictions anywhere between negative infinity and infinity. Since we have predicted probabilities less than 0, this indicates a problem with the credibility of our model. Constraints could be placed on the estimated coefficients in order to keep predictions within the range $[0,1]$, but then we would not necessarily have a linear relationship, violating assumption 1.
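One simple way to see how many observations are affected is to save the fitted values and count how many fall outside $[0,1]$ (the variable name phat below is arbitrary):

. predict phat, xb
. count if (phat < 0 | phat > 1) & !missing(phat)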
Summary
Because of the pitfalls above, great care must be taken to ensure that the “true” relationship between $Y$ and $X$ is linear. If our model specification is correct, then a Weighted Least Squares estimator could alleviate the heteroscedasticity problem. If a linear relationship cannot be assumed with reasonable certainty, then an alternative model such as logit or probit would be preferable.
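If you want to try one of these alternatives on the same example, a logit specification is a one-line change in Stata, and the margins command converts the results back to the probability scale (output omitted here):

. logit foreign price mpg weight rep78
. margins, dydx(*)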
Special thanks to Michelle Norris, PhD at California State University – Sacramento for editing assistance.