Distribution regression is a cool technique I saw someone talking about on Twitter. The idea is pretty straightforward: given an outcome \(Y\) which depends on some \(X\), we recover the conditional distribution of \(Y\) given \(X\) by running regressions of the form \(I(Y \leq y_i) = f_i(X) + \epsilon\), where \(I(Y \leq y_i)\) is an indicator variable for whether \(Y\) is less than or equal to some value \(y_i\). By running this for a grid of \(y_i\) values covering the range of observed \(Y\), we trace out the distribution of \(Y\) given values of \(X\), from which distributional treatment effects follow. Chernozhukov et al. (2012) discuss this and other ideas in more detail (ungated version here).
The advice I’ve seen on choosing the \(y_i\) is to use the quantiles of \(Y\), which gives a grid that is uniform on the probability scale rather than in \(y\). \(f_i\) could be a different function for each \(y_i\), or just the same linear model with a different coefficient vector for each cutoff.
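For concreteness, a quantile-based grid might be built like this (a numpy-only sketch of mine; the sample size, grid size, and trimming at the 5% tails are all arbitrary choices, not from the papers):

```python
import numpy as np

# Fake outcome data, just to have something to take quantiles of.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)

# Cutoffs at the quantiles of Y: the grid is uniform in probability,
# so it is denser wherever the observed Y values are denser.
probs = np.linspace(0.05, 0.95, 19)
y_grid = np.quantile(y, probs)
```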
A simple example with a homogeneous treatment effect
Let’s illustrate this with a simple example. Suppose we have a binary \(X\) which has a constant treatment effect on a continuous \(Y\). The model is
\[\begin{equation}
Y = 1 + 2X + \epsilon,
\end{equation}\]
where \(\epsilon \sim N(0,1)\).
Now let’s run the regressions:
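Here is my own numpy-only sketch of what those regressions could look like: a linear probability model of \(I(Y \leq y_i)\) on \(X\) at each cutoff. All names and sizes below are mine, not from the original code:

```python
import numpy as np

# Simulate the model: Y = 1 + 2 X + eps, eps ~ N(0, 1), binary X.
rng = np.random.default_rng(42)
n = 5000
x = rng.integers(0, 2, size=n)
y = 1 + 2 * x + rng.normal(size=n)

# Cutoff grid at the quantiles of Y.
y_grid = np.quantile(y, np.linspace(0.05, 0.95, 19))

# For each cutoff y_i, regress I(Y <= y_i) on [1, X]. The fitted values
# at X = 0 and X = 1 trace out the two conditional CDFs.
design = np.column_stack([np.ones(n), x])
cdf_x0, cdf_x1 = [], []
for yi in y_grid:
    indicator = (y <= yi).astype(float)
    (a, b), *_ = np.linalg.lstsq(design, indicator, rcond=None)
    cdf_x0.append(a)      # estimated P(Y <= y_i | X = 0)
    cdf_x1.append(a + b)  # estimated P(Y <= y_i | X = 1)
```

Plotting `cdf_x0` and `cdf_x1` against `y_grid` gives estimated conditional CDF curves like the ones in the plots.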
In the left plot, the blue line is the distribution of \(Y\) given \(X=0\) and the red line is the distribution of \(Y\) given \(X=1\). The right plot shows the corresponding conditional quantiles of \(Y\). The average horizontal gap between the blue and red curves is about 2, almost exactly the treatment effect of \(X\) on \(Y\), and the gap is roughly constant across quantiles, indicating that the effect is constant. The conditional distribution of \(Y \vert X=0\) reaches \(1\) faster than the distribution of \(Y \vert X=1\), confirming that \(X=1\) shifts \(Y\) upward.
It’s a bit awkward that the confidence intervals go above 1 and below 0, but that’s fixable. Chernozhukov et al. (2012) also discuss a bootstrap procedure for generating “correct” confidence intervals. I don’t actually know whether the “usual” way I computed them here is appropriate, or what assumptions it implies.
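For a single cutoff, a generic nonparametric bootstrap (resample \((X, Y)\) pairs, re-estimate, take percentile intervals) might look like the sketch below. To be clear, this is my own plain-vanilla bootstrap, not necessarily the procedure Chernozhukov et al. (2012) describe:

```python
import numpy as np

# Same simulated model as before: Y = 1 + 2 X + eps, binary X.
rng = np.random.default_rng(7)
n = 2000
x = rng.integers(0, 2, size=n)
y = 1 + 2 * x + rng.normal(size=n)
cutoff = np.quantile(y, 0.5)  # a single y_i, here the median of Y

def x_coef(xs, ys):
    # Coefficient on X in a regression of I(Y <= cutoff) on [1, X].
    design = np.column_stack([np.ones(len(xs)), xs])
    beta, *_ = np.linalg.lstsq(design, (ys <= cutoff).astype(float), rcond=None)
    return beta[1]

# Resample rows with replacement and re-estimate 500 times; the 2.5% and
# 97.5% quantiles of the draws give a percentile confidence interval.
draws = np.array([x_coef(x[idx], y[idx])
                  for idx in (rng.integers(0, n, size=n) for _ in range(500))])
lo, hi = np.quantile(draws, [0.025, 0.975])
```

Since the resampled coefficient is always a difference of two sample proportions, intervals built this way stay inside \([-1, 1]\) by construction.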
Multiple regression
Distribution regression can scale to multiple RHS variables, too. Suppose we take the model from before (binary \(X\), continuous \(Y\)) and introduce a variable \(Z\). The model is