Looking at LASSO - parameter estimation in a contrived example
16 Jul 2016

I’ve been thinking about LASSO a lot over the last few months. I first heard about LASSO over at Gelman’s blog a few years ago (I can’t remember the exact post), but didn’t follow most of the discussion or spend much time trying to. I didn’t really understand what the fuss was about until last semester, when my econometrics professor showed me some papers in spatial econometrics using LASSO (this one by Elena Manresa and this one by Cliff Lam and Pedro Souza). After going through those posts again, regularized regressions are now the coolest thing to me since the HydroFlask.
My dad and I visited some relatives in northwest Karnataka last week where I had limited internet/distractions, so I finally threw together a LASSO simulation I’d been thinking about since that econometrics class. My goal was to see how LASSO’s parameter estimation compares to OLS’s. I don’t have any new results; everything here has been derived or discussed in greater detail somewhere else. This post is to convince myself of some of LASSO’s properties relative to OLS without doing derivations, to get familiar with the `glmnet` package, and to procrastinate on other work I should be doing.
What is the LASSO?
The idea with OLS is to estimate the model \(Y = X \beta + \epsilon\) by solving
\[\min_{\beta} \, \| Y - X \beta \|^2\]
This results in an estimator for \(\beta\) which is unbiased, consistent, and even efficient under certain conditions. It’s easy to compute and interpret, and its assumptions and inferential properties are well-studied… it’s pretty much the first model taught in a lot of econometrics classes and the first tool applied economists reach for to estimate things.
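For reference, this minimization has the familiar closed-form solution (assuming \(X\) has full column rank):
\[\hat{\beta}_{OLS} = (X'X)^{-1} X' Y\]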
The idea with LASSO is to add a penalty to the problem that is increasing in the size (the \(L_1\) norm) of the coefficient vector \(\beta\), i.e. to solve
\[\min_{\beta} \, \| Y - X \beta \|^2 + \lambda \| \beta \|_1\]
where \(\| \beta \|_1\) is the \(L_1\) norm of the vector \(\beta\). \(\lambda\) is a parameter that controls how severely a large \(\beta\) is penalized: bigger \(\lambda\) means more coefficients shrunk to exactly 0, i.e. a sparser model. It is typically tuned through cross-validation. This procedure results in a biased but consistent estimator of \(\beta\).
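To make the role of \(\lambda\) concrete, here’s a toy glmnet example (my own simulated data, not the DGP used later) showing how a bigger penalty zeroes out more coefficients:

```r
library(glmnet)

# Toy data: 10 candidate predictors, only the first two matter
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)

fit <- glmnet(x, y)  # fits the whole lambda path
coef(fit, s = 0.01)  # small penalty: most coefficients nonzero
coef(fit, s = 1)     # large penalty: only the strongest predictors survive
```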
Note that the consistency described in the Zhao paper at the “consistent” link is not the same consistency that econometricians usually talk about; the former is model selection consistency, the latter is parameter estimation consistency. The Zhao paper has some discussion of this distinction, and refers to Knight and Fu (2000) for proofs of estimation consistency and asymptotic properties of a class of regularized estimators which include LASSO as a special case.
That’s all I’ll say about LASSO here. There’s a lot of discussion of the LASSO online. Rob Tibshirani has a good introduction with references here, the Wikipedia page is also decent, and googling will turn up a lot more resources.
Simulations
I used R’s `glmnet` package to run the LASSO estimations. My main reference was this vignette by Trevor Hastie and Junyang Qian, supplemented by some searches that I can’t remember now.
DGP and code
This is how I made the data generating process (a sketch in R follows the list):

1. draw 100 “seeds” \(\omega_i\) from a uniform distribution over \([0,1]\)
2. sample `n_covs` of these seeds to generate sines \(X_i = \sin(\pi \omega_i t)\), where \(t\) runs from 1 to `n_obs`
3. put these sines together in a matrix \(X\), and generate a dependent variable \(Y = X_1 + 3 X_2 - 5 X_3 + 7 X_4 - 9 X_5 + 11 X_6 + R\), where \(R \sim N(0,1)\) is a standard normal random variable to add a little noise
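A minimal sketch of this DGP in R (the names are mine and may not match `simulation.r`):

```r
make_data <- function(n_covs, n_obs) {
  omega <- runif(100)             # 100 uniform "seeds" on [0, 1]
  w     <- sample(omega, n_covs)  # sample n_covs of them
  t     <- 1:n_obs
  X     <- sapply(w, function(wi) sin(pi * wi * t))  # n_obs x n_covs matrix of sines
  beta  <- c(1, 3, -5, 7, -9, 11, rep(0, n_covs - 6))
  Y     <- as.numeric(X %*% beta + rnorm(n_obs))     # noisy combination of the first 6
  list(X = X, Y = Y)
}
```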
This gives me a bunch of randomly initialized deterministic regressors and a noisy linear combination of a few of them. Since OLS is unbiased and consistent, it should estimate the coefficients on the \(X_i\)s correctly. Since there are only 100 observations it probably won’t be super precise, but on average it should still be close.
For each number of covariates, I estimated OLS and LASSO regressions 1000 times, repeating steps 2 and 3 each time and storing the coefficients in a matrix. My R code (linked below) might make this process clearer. The coefficient storage could have been done more elegantly and I don’t like those `rbind()` calls… but effort is costly and this is produced under scarcity.
Anyway, the function `lasso_ols_mc()` takes three arguments: `n_covs`, the number of covariates to generate; `n_obs`, the number of observations to generate; and `n_iter`, the number of iterations to run. It returns a list with two dataframes of dimension `n_iter` by `n_covs`: the first holds the OLS coefficients and the second the LASSO coefficients. The code is in the file `simulation.r` here.
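Here’s a rough sketch of the shape of that function, using the `make_data()` helper sketched above; the actual `simulation.r` differs (it builds the coefficient matrices with `rbind()`, for one):

```r
library(glmnet)

lasso_ols_mc <- function(n_covs, n_obs, n_iter) {
  ols_coefs   <- matrix(NA, n_iter, n_covs)
  lasso_coefs <- matrix(NA, n_iter, n_covs)
  for (i in 1:n_iter) {
    d <- make_data(n_covs, n_obs)  # fresh draw from the DGP
    # OLS without an intercept, since the DGP has none
    ols_coefs[i, ] <- coef(lm(d$Y ~ d$X - 1))
    # LASSO at the cross-validated lambda; coef() returns an intercept row, so drop it
    cvfit <- cv.glmnet(d$X, d$Y, intercept = FALSE)
    lasso_coefs[i, ] <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]
  }
  list(ols = as.data.frame(ols_coefs), lasso = as.data.frame(lasso_coefs))
}

# e.g. the first simulation below: results <- lasso_ols_mc(25, 100, 1000)
```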
I’m using `cv.glmnet()` instead of `glmnet()` to get the estimated coefficients at the penalization parameter \(\lambda\) which minimizes the cross-validated error. This Stack Exchange post has a good discussion of the use of `cv.glmnet()` versus `glmnet()` and how that relates to \(\lambda\).
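As a quick sketch of the two common \(\lambda\) choices `cv.glmnet()` reports (glmnet loaded, `X` and `Y` as in the DGP above):

```r
cvfit <- cv.glmnet(X, Y)
coef(cvfit, s = "lambda.min")  # lambda minimizing mean cross-validated error (what I use here)
coef(cvfit, s = "lambda.1se")  # largest lambda within 1 SE of that minimum: a sparser model
```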
Simulation 1: 25 covariates
The first 6 coefficients should be +1, +3, -5, +7, -9, +11, and all the rest should be 0. OLS and LASSO are both close on the first 6, with LASSO a little closer on the rest.
Simulation 2: 50 covariates
The first 6 coefficients should be +1, +3, -5, +7, -9, +11, and all the rest should be 0. OLS is being a little silly now… LASSO is close on the first 6, and solid on the rest.
Simulation 3: 75 covariates
The first 6 coefficients should be +1, +3, -5, +7, -9, +11, and all the rest should be 0. Go home OLS, you’re drunk.
Discussion
From the results above, it looks like OLS and LASSO are both reasonable choices when the number of covariates \(p\) is small relative to the sample size \(n\). LASSO does a much better job of estimating the coefficients on irrelevant covariates as exactly 0 - I don’t see how OLS could match it there. My understanding is that LASSO sets some coefficients to exactly 0 because of the kink that the \(L_1\) penalty adds to the objective function at 0. The OLS objective function is smooth through 0, so it should keep every predictor, even ones estimated as really tiny values with big standard errors.
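One way to see the effect of the kink: if the columns of \(X\) are orthonormal, the LASSO problem above separates across coordinates and has the closed-form “soft-thresholding” solution
\[\hat{\beta}_j^{LASSO} = \operatorname{sign}\left(\hat{\beta}_j^{OLS}\right) \max\left( \left|\hat{\beta}_j^{OLS}\right| - \frac{\lambda}{2}, \; 0 \right)\]
so any OLS coefficient smaller than \(\lambda/2\) in absolute value gets set exactly to 0, while OLS itself never produces exact zeros.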
Where LASSO really shines is when \(p\) starts getting larger relative to \(n\). In the second and third simulations, OLS produces some pretty ridiculous parameter estimates for both the relevant and irrelevant predictors, while LASSO stays on point - the estimates always have the correct sign and the correct order of magnitude. This Stack Exchange post has a good discussion of when to use LASSO vs OLS for variable selection. To be fair the sines I generated as covariates are probably fairly correlated on average, so there’s most likely some collinearity messing with the OLS estimates (I think this is what’s going on with all the NAs).
So where and how is LASSO used in economics? I hear it’s not uncommon in financial economics, and I know it’s being applied in spatial econometrics. But I haven’t seen it in the mainstream of applied fields like labor, development, or environmental economics, where reduced-form methods seem to be preferred and unregularized approaches like OLS continue to dominate. Why might that be?
Spatial econometrics is more of a tool-building area than a specific-topic area like finance or labor, so simulations can be directly relevant to its questions of interest, but that also makes it a less apt comparison to fields like labor or development. Let’s ignore spatial for the rest of this discussion. If you’re interested in the two spatial econometrics papers using LASSO I mentioned at the beginning of this post, I wrote a short review for that econometrics class which might be useful. You can find it here.
Financial economics has lots of really good data, e.g. high-frequency stock price data, so it makes sense to me that there would be some scope for “fancier” econometrics there. My limited understanding of the questions studied in finance is that they tend to focus more on prediction than policy evaluation, which reduces the “interpretability advantage” of OLS over other methods; the reverse emphasis elsewhere might be one factor limiting LASSO adoption outside of finance. Jim Savage has some interesting comments in this regard - apparently he uses LASSO as a variable selection tool to decide which interactions to include as controls in a causal model. That seems like a good use of LASSO’s model selection consistency, and is something I’ll probably start doing as well. (As an aside, Jim Savage’s random forest proximity weighting paper mentioned in that comment got me stoked about random forests all over again. I think it’s worth a read if you’re interested in machine learning and econometrics.)
Data quality in other applied fields is a mixed bag. Labor has lots of high-quality data, but development folks I know tell me development doesn’t. In environmental it seems to depend on the question. I don’t know much about industrial organization and trade, but I’ve been told they tend to use more “fancy” econometrics than other fields. Structural vs. reduced-form preferences might play a role in this, and in general I think many researchers in applied fields care more about policy evaluation than prediction (see Jim Savage’s comment above).
In some applied fields, I know there’s a strong preference for unbiased estimators, and I think this might limit the use of regularized regression methods there. By accepting some bias in exchange for lower variance, LASSO estimates tend to be better behaved than OLS estimates, though that bias is a downside to folks who really value unbiasedness.
However, I think this could actually work in LASSO’s favor; since the estimates are biased toward 0 (you can see this in the simulation tables, and it’s mentioned in the Stack Exchange post a few paragraphs up), the LASSO estimates are lower bounds in magnitude on the effect size for a given policy evaluation question! I think this could be a selling point for using LASSO in applied/policy-centric questions, since it lets the researcher say something like “look, the effect of this policy is at least 5 whatevers, and possibly a little bigger”.
Models can easily blow up when you’re controlling for a lot of things - individual fixed effects or interactions, for example, can add a lot of predictors to a model. The applied folks I know tend to say they care more/entirely about the parameter estimate of the policy variable and not the control variables, but just having a lot of controls in the model can reduce the precision of the estimates of interest. LASSO might be a good approach for such settings. That said, for the fixed effects case I don’t think the estimates could still be interpreted as “controlling for” the unobservables being captured by the fixed effects, since the fixed effects might be set to 0 if they aren’t sufficiently correlated with the outcome. Does this mean those effects didn’t need to be controlled for? I think the answer is yes… this seems relevant to the “LASSO as variable selector” approach to choosing control variables.
Significance testing is an advantage for OLS over LASSO in applied economics. There are a lot of results out there about significance testing for OLS, and folks understand them (or at least think they do, which amounts to the same thing for adoption purposes). Inference under LASSO is an active area of research, and apparently Tibshirani recently came out with a significance test for the LASSO (I really like the discussion of this over at Gelman’s blog).
At some point I’d like to try a “LASSO difference-in-differences with fixed effects” simulation to see what happens.
UPDATE: These slides have a good discussion of LASSO in economics and econometrics, as well as some properties of the “LASSO to select, OLS to estimate” approach (“post-LASSO estimator”). Apparently post-LASSO can remove much of the shrinkage bias present in LASSO, and perform nearly as well as an OLS regression on just the relevant covariates.
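For concreteness, here’s a minimal sketch of that post-LASSO idea, assuming `X` and `Y` as in the simulations above:

```r
library(glmnet)

# Step 1: LASSO to select - keep the covariates with nonzero coefficients
cvfit    <- cv.glmnet(X, Y, intercept = FALSE)
b        <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]  # drop the intercept row
selected <- which(b != 0)

# Step 2: OLS to estimate - unpenalized regression on the selected covariates,
# which removes much of the shrinkage bias from step 1
post_fit <- lm(Y ~ X[, selected] - 1)
coef(post_fit)
```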