Distribution regression is a cool technique I saw someone talking about on Twitter. The idea is pretty straightforward: given an outcome \(Y\) which depends on some \(X\), we recover the conditional distribution of \(Y\) given \(X\) by running regressions of the form \(I(Y \leq y_i) = f_i(X) + \epsilon\), where \(I(Y \leq y_i)\) is an indicator variable for whether \(Y\) is less than or equal to some value \(y_i\). By running this for a grid of \(y_i\) values covering the range of observed \(Y\), we trace out the distribution of \(Y\) given values of \(X\), from which distributional treatment effects follow. Chernozhukov et al. (2012) discuss this and other ideas in more detail (ungated version here).
The advice I’ve seen on choosing the \(y_i\) is to use the quantiles of \(Y\), which gives a grid that is uniform on the probability scale rather than in \(y\). \(f_i\) could be a different function for each \(y_i\), or just the same linear model with a different coefficient vector for each cutoff.
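For concreteness, a quantile-based grid might be built like this (a numpy-only sketch of mine; the sample size, grid size, and trimming at the 5% tails are all arbitrary choices, not from the papers):

```python
import numpy as np

# Fake outcome data, just to have something to take quantiles of.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)

# Cutoffs at the quantiles of Y: the grid is uniform in probability,
# so it is denser wherever the observed Y values are denser.
probs = np.linspace(0.05, 0.95, 19)
y_grid = np.quantile(y, probs)
```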
A simple example with a homogeneous treatment effect
Let’s illustrate this with a simple example. Suppose we have a binary \(X\) which has a constant treatment effect on a continuous \(Y\). The model is
\[\begin{equation}
Y = 1 + 2X + \epsilon,
\end{equation}\]
where \(\epsilon \sim N(0,1)\).
Now let’s run the regressions:
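Here is my own numpy-only sketch of what those regressions could look like: a linear probability model of \(I(Y \leq y_i)\) on \(X\) at each cutoff. All names and sizes below are mine, not from the original code:

```python
import numpy as np

# Simulate the model: Y = 1 + 2 X + eps, eps ~ N(0, 1), binary X.
rng = np.random.default_rng(42)
n = 5000
x = rng.integers(0, 2, size=n)
y = 1 + 2 * x + rng.normal(size=n)

# Cutoff grid at the quantiles of Y.
y_grid = np.quantile(y, np.linspace(0.05, 0.95, 19))

# For each cutoff y_i, regress I(Y <= y_i) on [1, X]. The fitted values
# at X = 0 and X = 1 trace out the two conditional CDFs.
design = np.column_stack([np.ones(n), x])
cdf_x0, cdf_x1 = [], []
for yi in y_grid:
    indicator = (y <= yi).astype(float)
    (a, b), *_ = np.linalg.lstsq(design, indicator, rcond=None)
    cdf_x0.append(a)      # estimated P(Y <= y_i | X = 0)
    cdf_x1.append(a + b)  # estimated P(Y <= y_i | X = 1)
```

Plotting `cdf_x0` and `cdf_x1` against `y_grid` gives estimated conditional CDF curves like the ones in the plots.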
In the left plot, the blue line is the distribution of \(Y\) given \(X=0\) and the red line is the distribution of \(Y\) given \(X=1\). The right plot shows the corresponding conditional quantiles of \(Y\). The average horizontal gap between the blue and red curves is about 2, almost exactly the treatment effect of \(X\) on \(Y\), and the gap is roughly constant across quantiles, indicating that the effect is constant. The conditional distribution of \(Y \vert X=0\) reaches \(1\) faster than the distribution of \(Y \vert X=1\), confirming that \(X=1\) shifts \(Y\) upward.
It’s a bit awkward that the confidence intervals go above 1 and below 0, but that’s fixable. Chernozhukov et al. (2012) also discuss a bootstrap procedure for generating “correct” confidence intervals. I don’t actually know whether the “usual” way I computed them here is appropriate, or what assumptions it implies.
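For a single cutoff, a generic nonparametric bootstrap (resample \((X, Y)\) pairs, re-estimate, take percentile intervals) might look like the sketch below. To be clear, this is my own plain-vanilla bootstrap, not necessarily the procedure Chernozhukov et al. (2012) describe:

```python
import numpy as np

# Same simulated model as before: Y = 1 + 2 X + eps, binary X.
rng = np.random.default_rng(7)
n = 2000
x = rng.integers(0, 2, size=n)
y = 1 + 2 * x + rng.normal(size=n)
cutoff = np.quantile(y, 0.5)  # a single y_i, here the median of Y

def x_coef(xs, ys):
    # Coefficient on X in a regression of I(Y <= cutoff) on [1, X].
    design = np.column_stack([np.ones(len(xs)), xs])
    beta, *_ = np.linalg.lstsq(design, (ys <= cutoff).astype(float), rcond=None)
    return beta[1]

# Resample rows with replacement and re-estimate 500 times; the 2.5% and
# 97.5% quantiles of the draws give a percentile confidence interval.
draws = np.array([x_coef(x[idx], y[idx])
                  for idx in (rng.integers(0, n, size=n) for _ in range(500))])
lo, hi = np.quantile(draws, [0.025, 0.975])
```

Since the resampled coefficient is always a difference of two sample proportions, intervals built this way stay inside \([-1, 1]\) by construction.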
Multiple regression
Distribution regression can scale to multiple RHS variables, too. Suppose we take the model from before (binary \(X\), continuous \(Y\)) and introduce a variable \(Z\). The model is