Paper review - "Allocation of Orbit and Spectrum Resources for Regional Communications - What's at Stake?"

Here I review Molly Macauley’s 1998 paper, “Allocation of Orbit and Spectrum Resources for Regional Communications: What’s at Stake?”, published in The Journal of Law and Economics.

Molly Macauley recently died in Baltimore. It seems like Dr. Macauley wrote something insightful and policy-relevant on most, if not all, of the important issues in space economics today. I have found her work to be tremendously inspirational and a useful practical guide for thinking about questions in space economics. I didn’t know her personally, and I am sad that I never will. Here is a tribute page, Remembering Molly.

What’s this paper about?

This paper is about the values of slots in geostationary orbits (GEO) and the potential gains from improving GEO slot use regulations. The paper has two main goals:

  1. Develop and estimate an economic model of orbits with locational attributes. The point is to calculate the savings from using satellites to meet telecom demand rather than using the “best terrestrial alternative” to meet the same telecom demand. Caveat: the values derived here are not equilibrium values; rather, they are upper bounds on countries’ maximum willingness-to-pay for having satellites in preferred orbital locations.

  2. Model and estimate the effects of technological substitution possibilities which would allow orbit users to “economize on the amount of orbit used at any given location”, for example by giving satellites more complex antennae to use allocated spectrum bandwidths more intensively at “prime” orbital locations. Macauley’s argument is that allocating orbits administratively shields users from facing the true costs of that orbit (the value to other users), which reduces the incentive to use orbits more intensively.

In writing the model, Macauley is drawing on urban economics models of spatial location. A quote from the paper illustrates the argument:

“…just as land values increase with proximity to a central business district as commuting and other transportation costs fall, orbit values might be expected to increase at locations that afford the best communications-demand-related coverage…just as an increase in land values leads to substitution toward capital and away from land…so too might an increase in orbit values lead to more capital intensive use of the orbit at preferred locations.”

So in the same way that urban land users in downtown areas use land more intensively (taller buildings) than users in the suburbs (flatter buildings and more sprawl), orbit users have incentives to use prime orbits more intensively (more spectrum use per unit of orbit) than users in non-prime locations.

If spectrum congestion weren’t an issue, GEO spacing could be the minimum physically safe separation (40 miles or so). To guard against spectrum congestion, regulators instead mandate separations of 800-1,600+ miles. If satellites were spaced closer together, they could

  1. communicate at different frequencies, or
  2. be situated at noninterfering distances from other satellites communicating at the same frequency, or
  3. coordinate the timing of their signal emissions to not interfere (this is apparently practiced by terrestrial spectrum users)

Regulators are concerned with spectrum congestion, but the regulations in place blunt orbit users’ incentives to use the orbit more intensively through these kinds of spectrum-conserving techniques.

It’s worth noting that there aren’t well-defined property rights for electromagnetic spectrum in orbit. Macauley cites a number of studies on the problems this causes, including Jackson (1978), Levin (1981), Levin (1985), Levin (1990). Some googling turns up other helpful writings on the issue of property rights in space, like this AEI article by Macauley and Paul R. Portney and this ITU report on economic aspects of spectrum management. Macauley (1986) also discusses the issue of spectrum regulation as it relates to communications satellites.

There is a substantial economic literature on spectrum allocation in general which I won’t discuss here. This paper review seems like a good place to start thinking about potential contributions. In general GEO and spectrum use for telecom purposes had been studied quite a bit by economists when Macauley wrote this paper.

GEO slot use is tightly linked to electromagnetic spectrum use in orbit. Satellite use thus suffers from at least two market failures: one caused by misallocation of orbital slots, and one caused by misallocation of spectrum. These issues are fundamentally tied to the common-pool resource nature of orbital slots and spectrum.

Models and data

The market for satellite-related services ultimately determines the derived demand for a location in the orbit.

Macauley builds two related models in this paper, the second a generalization of the first. The first assumes that orbits and spectrum are used in fixed proportions. I understand this to mean “spectrum use must be the same at any location in GEO”. The second relaxes this assumption by allowing orbit and spectrum to be substitutable inputs in a Cobb-Douglas production function for satellite services. The first case corresponds to a Cobb-Douglas production function with equal exponents for orbit and spectrum.

In both models, the telecom service provider has a choice between using satellite or terrestrial technologies to meet the exogenous telecom service demand between country pairs \(jk\). The advantage of using satellites is that they can cheaply connect any two countries \(j\) and \(k\) within the satellite’s Field of View (FOV). Terrestrial technologies are affected by the distance between \(j\) and \(k\) and require signal repeaters.

Macauley describes how FOV is determined at each location in the beginning of the paper. I will not discuss it here.

Fixed proportions

This model provides the foundation for the analysis in this paper. It is a simulation model of a cost-minimizing telecom service producer’s choice of service provision technology. The equations which define the model are

Terrestrial technologies

\[\begin{align} Q_i &= Q(H_i) \cr C_{ijk} &= C_i(Q_{jk},Z_{jk}, \delta_i(\bar{Q}_i)) \end{align}\]

where \(Q()\) represents the output of communications services, and \(C()\) represents the cost of the technology in use.

\(i = 1, \dots, m\) indexes the potential terrestrial technologies which could connect country pairs \(jk\); \(H_i\) represents the hardware which forms the connection and associated costs (e.g. coastal landing rights for undersea cable, overland rights of way for terrestrial cable); \(Z_{jk}\) represents the distance between countries \(j\) and \(k\).

Satellites

\[\begin{align} Q_s &= Q(A, S, H_s) \cr C_{sjk} &= C_s(Q_{jk}, \delta_s(\bar{Q}_s); L_1, \dots, L_{\tau}) \text{ for } (j,k) \in FOV \end{align}\]

where \(A\) represents an orbital slot, \(S\) represents spectrum, and \(H_s\) represents the satellite hardware and associated costs (including launch costs); \(L_1, \dots, L_{\tau}\) represent the exogenously-given possible orbital locations indexed \(1, \dots, \tau\).

Demand and indivisibilities

\[\begin{align} \delta(\bar{Q}) &= \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} Q_{jk}, ~~0< \delta \leq 1 \cr Q_{jk}^D &\leq \bar{Q}_{jk} \end{align}\]

\(\bar{Q}\) represents the exogenously-given total quantity of communications demand for the set of \(n\) countries that want to be connected. \(\delta(\bar{Q})\) reflects indivisibilities in the various technologies when they operate as a system; I interpret this as meaning that these capacity investments are discrete rather than continuous, so the system’s installed capacity will typically exceed demand on a single country-pair route. I am not sure what “\(0< \delta \leq 1\)” refers to - maybe there’s a typo where a \(\delta\) exponent is missing? Or maybe there’s a normalization so that \(\delta(\bar{Q}) \in (0,1]\)?

\(Q_{jk}^D\) measures the “busy hour” traffic between \(j\) and \(k\), and implies a constraint on the facilities - they must be able to serve this demand at an acceptable degree of service (i.e. low probability of receiving a busy signal). These values are taken from the engineering literature.

This setup lets Macauley calculate the cost savings from using satellites over terrestrial technology as

\[\begin{align} \min TC(L_i) &= \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} C_{jk} (Q; \delta, z, L_i) \cr R(L_i) &= TC(L_i) - TC(L_x) \end{align}\]

where \(TC()\) represents the total cost of serving the \(n\) countries that want to be connected using a satellite at location \(L_i\). \(L_x\) represents the next-best available orbital slot that could connect those \(n\) countries. \(R(L_i)\) represents the cost savings of using the most-preferred orbital slot for a set of countries in a region. The outside option of not using satellite technology to connect the \(n\) countries is normalized to a value of zero, i.e. \(R = 0\).
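To make the \(R(L_i)\) calculation concrete, here is a toy numeric sketch in R. The costs and location labels are made up (they are not from the paper); I’m reading \(R(L_i)\) as the cost saved by serving the region from the preferred slot rather than the next-best alternative.

total_cost <- c(L1 = 120, L2 = 150, L3 = 200) # hypothetical TC(L_i) for each candidate slot, arbitrary units
best_slot <- names(which.min(total_cost))     # most-preferred slot for this region ("L1" here)
tc_best <- min(total_cost)                    # TC at the preferred slot
tc_next <- sort(total_cost)[2]                # TC at the next-best slot, L_x
savings <- as.numeric(tc_next - tc_best)      # how I read R(L_i): 30 in this toy example
savings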

The phrase “prime orbit location” refers to the orbit location with maximum value for a specific nation. I think of it as the GEO location that puts the most important countries a nation communicates with inside the satellite’s FOV, i.e. a satellite in the prime location for country \(j\) maximizes the value of communications between \(j\) and the countries it wishes to communicate with, \(\{k\}\). In South America, this definition maps “prime locations” to “locations which cover the most other countries \(k\)”. In North America, population density and communications patterns break this mapping: for example, site values in the Pacific Rim under the fixed proportions model are high when just Canada and the US are in view, even though many other PacRim countries are excluded.

Variable proportions

The supply and demand equations for the variable proportions model are

\[\begin{align} X_s(u) &= DA(u)^{\alpha}S(u)^{1-\alpha} \cr X_d(u) &= \beta y^{\theta_1} p(u)^{\theta_2} \end{align}\]

where \(D\) is a scale parameter, \(A\) and \(S\) are quantities of orbit and spectrum, \(\alpha \in (0,1)\) is the Cobb-Douglas exponent; \(\beta\) is a scale parameter, \(y\) is income, \(p\) is price, and \(\theta_1\) and \(\theta_2\) are the income and price elasticities respectively. \(u\) measures the satellite location relative to prime locations. I didn’t see this explicitly stated, but I’m assuming \(u = 0\) corresponds to the prime location.

This model also has an equation for locational equilibrium, as

\[p'(u) X_d(u) = -t(u)\]

where \(t(u)\) represents the marginal cost of connecting cities outside the view of a satellite at location \(u\), i.e. the extra costs incurred to reach the edge of the satellite’s FOV. Macauley estimates \(t(u)\) using Mathematica to fit a logistic curve of the form

\[t(u) = \gamma \frac{e^{(u - \bar{u})/\sigma}}{1 + e^{(u - \bar{u})/\sigma}}\]

where \(u - \bar{u}\) measures distance in longitudinal degrees at \(u\) from the prime orbit location, \(\bar{u}\), and \(\sigma\) is a curve-fitting parameter. \(\gamma\) is not explicitly described, but I’m assuming it’s just a scale parameter for the curve-fitting.
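Macauley did this fitting in Mathematica. As a sketch of the same idea in R, here is how one might fit that logistic form with nls() on made-up \((u, t)\) data; the fake data, parameter values, and starting values below are purely illustrative.

u_grid <- seq(0, 60, by = 5)                                          # degrees of longitude away from the prime location
t_obs <- 10*plogis((u_grid - 30)/8) + rnorm(length(u_grid), sd = 0.3) # fake "marginal cost" observations
fit <- nls(t_obs ~ gamma*plogis((u_grid - ubar)/sigma),               # plogis(x) = exp(x)/(1 + exp(x))
           start = list(gamma = 8, ubar = 25, sigma = 5))             # starting guesses for gamma, ubar, sigma
coef(fit)                                                             # fitted gamma, ubar, sigma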

These equations allow Macauley to derive an equation for orbital slot value (derivation in Appendix B). It’s a complicated equation which I won’t describe here.

Data

The models are estimated using data obtained from engineering estimates (for the technology) and trade publications (for volume and value of calls). Macauley states that demand estimates are an order-of-magnitude proxy only. The data are described in Appendix A.

Discussion

Macauley’s stated purpose in writing this paper was to figure out the potential gains from allowing more flexible orbit/spectrum use patterns. This is not what my current space economics project is about; I am looking at a general model of orbit choice and debris production. Macauley’s model is specifically designed for telecommunications satellites (and therefore focused on GEO), while the model I am writing is valid for any type of satellite in any not-too-elliptical orbital regime (my implicit focus is on LEO, but it can be reparameterized for GEO).

However, I had not thought seriously about spectrum use prior to reading this paper. Now it seems obvious to me that most satellite applications require not only an orbital slot (or path) but also some electromagnetic spectrum. An imaging satellite which requires images to be physically retrieved from it is not very useful. I suppose tourist shuttles that take people to LEO and back down have less need for spectrum, but they would still need to be able to coordinate with ground-based observers and other satellites to ensure a safe trip.

The question before me now is, how relevant is spectrum use to my applications? I think it’s probably of second-order importance to modeling debris production from satellite operations outside GEO (say, in LEO), but might be of first-order importance in GEO. I was on the fence about incorporating locational attributes of orbital slots/paths in a general sort of way, but now I think some general locational attribute(s) would be first-order important even in LEO.

Back to this paper: Macauley points out that the Cobb-Douglas form is restrictive, and defends the use of this functional form in two ways:

  1. by appealing to some anecdotal engineering evidence about the use of orbit and spectrum. The Cobb-Douglas form implies that the ratio of input use is proportional to the ratio of factor prices (see the derivation just after this list). On the basis of published operating parameters, satellites in prime locations use a unit of orbit 7-10 times more intensively than satellites in non-prime locations, through techniques like spectrum re-use. The orbit values estimated with the first model imply that prime locations are worth approximately 10 times more than non-prime locations. So the Cobb-Douglas form may not be too bad an approximation in this setting.
  2. by pointing out that data are unavailable to estimate all the parameters of a more flexible functional form.
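For reference, the restriction mentioned in item 1 comes straight out of cost minimization. With the Cobb-Douglas technology \(X = DA^{\alpha}S^{1-\alpha}\) and factor prices \(p_A\) for orbit and \(p_S\) for spectrum (my notation - the paper’s may differ), equating the marginal rate of technical substitution to the factor price ratio gives

\[\begin{align} \frac{\partial X / \partial A}{\partial X / \partial S} &= \frac{\alpha}{1-\alpha}\frac{S}{A} = \frac{p_A}{p_S} \cr \implies \frac{S}{A} &= \frac{1-\alpha}{\alpha} \frac{p_A}{p_S} \end{align}\]

so the spectrum-to-orbit ratio moves one-for-one with the relative price of orbit. A location where orbit is roughly 10 times more valuable should then see roughly 10 times more spectrum use per unit of orbit, which is the comparison Macauley makes against the engineering figures.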

I may be missing it, but the functional form for \(Q()\) in the “fixed proportions” case seems to be unspecified. In the “variable proportions” case, I believe it is the Cobb-Douglas function given for \(X_s(u)\). I didn’t see this explicitly stated, but I’m assuming that \(Q()\) in the “fixed proportions” case is a function with the same properties as \(X_s(u)\) with \(\alpha=0.5\).

I didn’t see any assumptions on \(p(u)\) explicitly stated, but I’m assuming \(p(u) : p'(u) <0, p''(u) \leq 0\) based on the locational equilibrium condition. In words: the closer the satellite is to the prime location (i.e. as \(u \to 0\)), the higher the price of its services. Since price elasticities of demand are negative for normal goods (i.e. \(\theta_2 < 0\)), this higher price scales demand down. I can’t think of any reason why satellite or telecom services would be Giffen goods, so this logic seems reasonable.

I’m not sure I understand how the locational equilibrium was derived, but it might be in some of the urban economics papers Macauley referenced. I’m interpreting it as a first-order condition from a location-choosing problem.

The satellite spacing policy seems pretty bad in an “unintended consequences” way: policy intended to minimize spectrum congestion ends up obstructing the development of policies and technologies that could mitigate congestion. It’s possible this spacing leads to more congestion in this equilibrium than there would be if the spacing regulations were relaxed, but I don’t know if that’s the case. Regardless, Macauley’s argument for deadweight loss from the current intensity-limiting policy seems solid.

I’d like to write a paper about GEO use specifically; estimating this model with more data, or extending this model, might be good places to start. I liked the “calibrated simulation” approach Macauley used, and would like to do something similar. I think this was possible here because all of the variables were closely tied to things in the real world - this paper had math, but it was not mathy.

Overall I really liked this paper. I thought it was insightful, well researched and thought through, and very helpful for someone getting started in this area.


Looking at LASSO - parameter estimation in a contrived example

I’ve been thinking about LASSO a lot over the last few months. I first heard about LASSO over at Gelman’s blog a few years ago (I can’t remember the exact post), but didn’t follow most of the discussion or spend much time trying. I didn’t really understand what the fuss was about until last semester, when my econometrics professor showed me some papers in spatial econometrics using LASSO (this one by Elena Manresa and this one by Cliff Lam and Pedro Souza). After going through those posts again, I think regularized regressions are the coolest thing since the HydroFlask.

My dad and I visited some relatives in northwest Karnataka last week where I had limited internet/distractions, so I finally threw together a LASSO simulation I’ve been thinking about since that econometrics class. My goal was to see how LASSO’s parameter estimation compares to OLS’s. I don’t have any new results; everything here has been derived or discussed in greater detail somewhere else. This post is to convince myself of some of LASSO’s properties relative to OLS without doing derivations, to get familiar with the glmnet package, and to procrastinate on other work I should be doing.

What is the LASSO?

The idea with OLS is to estimate the model \(Y = X \beta + \epsilon\) by solving

\[\min_{\beta} \| Y - X \beta \|_2^2\]

This results in an estimator for \(\beta\) which is unbiased, consistent, and even efficient under certain conditions. It’s easy to compute and interpret, its assumptions and inferential properties are well-studied… it’s pretty much the first model taught in a lot of econometrics classes and the first tool applied economists reach for to estimate things.

The idea with LASSO is to add a penalty to the problem that is increasing in the size (the \(L1\) norm) of the coefficient vector \(\beta\), i.e. to solve

\[\min_{\beta} \| Y - X \beta \|_2^2 + \lambda \| \beta \|_1\]

where \(\| \beta \|_1\) is the \(L1\) norm of the vector \(\beta\). \(\lambda\) is a tuning parameter that controls how severely larger coefficients are penalized - bigger \(\lambda\) means more shrinkage and a sparser model. It is typically chosen through cross-validation. This procedure results in a biased but consistent estimator of \(\beta\).
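A quick way to see the \(\lambda\)-sparsity relationship is to plot glmnet’s coefficient path on some toy data: as \(\lambda\) grows, coefficients are pulled to exactly zero one by one. This is just a sketch; the simulated data here are arbitrary.

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100*20), nrow=100)      # 100 observations, 20 arbitrary covariates
Y <- X[,1] - 2*X[,2] + rnorm(100)         # only the first two covariates matter
path <- glmnet(X, Y, alpha=1)             # alpha=1 gives the LASSO penalty
plot(path, xvar="lambda", label=TRUE)     # coefficients hit exactly 0 as log(lambda) increases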

Note that the consistency described in the Zhao paper at the “consistent” link is not the same consistency that econometricians usually talk about; the former is model selection consistency, the latter is parameter estimation consistency. The Zhao paper has some discussion of this distinction, and refers to Knight and Fu (2000) for proofs of estimation consistency and asymptotic properties of a class of regularized estimators which include LASSO as a special case.

That’s all I’ll say about LASSO here. There’s a lot of discussion of the LASSO online. Rob Tibshirani has a good introduction with references here, the wikipedia page is also decent, and googling will turn up a lot more resources.

Simulations

I used R’s glmnet package to run the LASSO estimations. My main reference was this vignette by Trevor Hastie and Junyang Qian, supplemented by some searches that I can’t remember now.

DGP and code

This is how I made the data generating process:

  1. draw 100 “seeds” \(\omega_i\) from a uniform distribution over \([0,1]\)
  2. sample n_covs of these seeds to generate sines \(X_i = \sin(\pi \omega_i t)\), where \(t\) runs from 1 to n_obs
  3. put these sines together in a matrix \(X\), and generate a dependent variable \(Y = X_1 + 3 X_2 - 5 X_3 + 7 X_4 - 9 X_5 + 11 X_6 + R\), where \(R \sim N(0,1)\) is a standard normal random variable to add a little noise

This gives me a bunch of randomly initialized deterministic regressors and a noisy linear combination of a few of them. Since OLS is unbiased and consistent, it should estimate the coefficients of the \(X_i\)s correctly. Since there are only 100 observations it probably won’t be super precise but on average it should still be close.

For each number of covariates, I estimated OLS and LASSO regressions 1000 times from steps 2 and 3, storing the coefficients in a matrix. My R code below might make this process clearer.

library(glmnet)

lasso_ols_mc <- function(n_covs, n_obs, n_iter) {
seeds <- runif(100,min=0,max=1) #uniform draws for numbers to plug into sines in X
### initialize coefficient storage
ols_coefs <- rep(0,length=n_covs)
lasso_min <- rep(0,length=n_covs)

for(j in 1:n_iter) { #monte carlo loop  
#### The DGP  
  seeds_sample <- sample(seeds, n_covs, replace=FALSE) #random sample from the seeds... not super reproducible :(
  step <- seq(from=1,to=n_obs,by=1) #time step 
  X <- matrix(0,nrow=n_obs,ncol=n_covs) #matrix to store sines in columns
  for (i in 1:n_covs){ #loop over columns to create n_covs many deterministic regressors
    X[,i] <- sinpi(seeds_sample[i]*step) #sines - deterministic regressors
  }
  
#### The Y variable
  Y <- X[,1] + 3*X[,2] - 5*X[,3] + 7*X[,4] - 9*X[,5] + 11*X[,6] + rnorm(n_obs) #linear combination of sines + gaussian noise
  
#### OLS fit
  olsfit <- lm(Y ~ X)
  summary(olsfit)
  ols_coefs <- rbind(ols_coefs,as.matrix(coef(olsfit))[2:(n_covs+1),1]) #store coefficients from this run
  
#### LASSO fit
  lassofit <- cv.glmnet(x=X,y=Y,alpha=1,nlambda=100)
  lasso_min <- rbind(lasso_min,as.numeric(coef(lassofit, s = "lambda.min", exact = FALSE)[2:(n_covs+1),1])) #store coefficients from this run

}

#### Cleaning up a bit
ols_coefs <- as.data.frame(ols_coefs[-1,], row.names=as.character(seq(1:dim(ols_coefs)[1])))
lasso_min <- as.data.frame(lasso_min[-1,], row.names=as.character(seq(1:dim(lasso_min)[1])))

#### Return list of coefficients
coefs <- list()
coefs[[1]] <- ols_coefs
coefs[[2]] <- lasso_min

return(coefs)
}

The coefficient storage could have been done more elegantly and I don’t like those rbind() calls… but effort is costly and this is produced under scarcity.

Anyway, the function lasso_ols_mc() takes three arguments: n_covs, the number of covariates to generate; n_obs, the number of observations to generate; n_iter, the number of iterations to run. It returns a list with two dataframes of dimension n_iter by n_covs: the first is the OLS coefficients, and the second is the LASSO coefficients. The code is in the file simulation.r here.

I’m using cv.glmnet() instead of glmnet() to get the estimated coefficients from the penalization parameter \(\lambda\) which minimizes the cross-validated errors. This Stack Exchange post has a good discussion of the use of cv.glmnet() versus glmnet() and how that relates to \(\lambda\).
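For reference, cv.glmnet() also stores a more conservative choice, lambda.1se (the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum). Pulling coefficients at either value looks like this - a sketch only, assuming X and Y are already defined as in the simulation function:

cvfit <- cv.glmnet(x=X, y=Y, alpha=1, nlambda=100)
coef(cvfit, s="lambda.min")   # coefficients at the CV-error-minimizing lambda (what the simulations use)
coef(cvfit, s="lambda.1se")   # coefficients at the sparser, more conservative lambda
plot(cvfit)                   # CV error curve with both lambdas marked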

Simulation 1: 25 covariates

library(glmnet)
load("C:/Users/Akhil/Documents/akhilrao.github.io/public/code/lassoOlsSim/simulation.Rdata")

sim_25 <- lasso_ols_mc(25,100,1000)

ols_coefs <- sim_25[[1]]
lasso_min <- sim_25[[2]]
options(scipen=100,digits=4) #I like decimal notation
colMeans(ols_coefs)
##        X1        X2        X3        X4        X5        X6        X7 
##   0.84575   2.79689  -5.39146   6.91616 -10.39828  11.14851   0.23271 
##        X8        X9       X10       X11       X12       X13       X14 
##  -0.16579   0.25008   0.06273   0.60091  -0.50771   0.14841   0.18313 
##       X15       X16       X17       X18       X19       X20       X21 
##  -0.48205   0.06443   0.22578   0.33808   0.80858  -1.80170  -1.01977 
##       X22       X23       X24       X25 
##   0.98723   0.30708   1.78747   0.07162
colMeans(lasso_min)
##         V1         V2         V3         V4         V5         V6 
##  0.8551006  2.8313675 -4.7997349  6.7857505 -8.7650122 10.7999512 
##         V7         V8         V9        V10        V11        V12 
##  0.0112456  0.0081712  0.0058198 -0.0007110 -0.0009387  0.0024224 
##        V13        V14        V15        V16        V17        V18 
##  0.0081600 -0.0045956 -0.0002659  0.0052694  0.0081007  0.0034213 
##        V19        V20        V21        V22        V23        V24 
##  0.0003531  0.0076995  0.0004855  0.0120161  0.0125994  0.0056955 
##        V25 
##  0.0058837

The first 6 numbers should be +1, +3, -5, +7, -9, +11, and all the rest should be 0. OLS and LASSO are both close on the first 6, with LASSO a little closer on the rest.

Simulation 2: 50 covariates

library(glmnet)
load("C:/Users/Akhil/Documents/akhilrao.github.io/public/code/lassoOlsSim/simulation.Rdata")

sim_50 <- lasso_ols_mc(50,100,1000)

ols_coefs <- sim_50[[1]]
lasso_min <- sim_50[[2]]
options(scipen=100,digits=4) #I like decimal notation
colMeans(ols_coefs)
##        X1        X2        X3        X4        X5        X6        X7 
##    45.621   -39.780    -7.825  -109.538 -1566.738   -19.541   118.284 
##        X8        X9       X10       X11       X12       X13       X14 
##    -4.013   -67.543   -59.812  -147.680    30.056  -227.749   189.434 
##       X15       X16       X17       X18       X19       X20       X21 
##    -8.550    35.778  -173.203   -44.160    27.836   -32.681    56.185 
##       X22       X23       X24       X25       X26       X27       X28 
##  -112.021   -48.255     1.190    48.058   -57.421   -67.878   279.999 
##       X29       X30       X31       X32       X33       X34       X35 
##     9.445    66.104     5.706   108.343     5.123     7.835  -168.868 
##       X36       X37       X38       X39       X40       X41       X42 
##    14.550    34.462    58.293   185.393   -31.833    18.288  1964.648 
##       X43       X44       X45       X46       X47       X48       X49 
##   -10.704   -72.467   177.646  -294.309    42.899        NA  -340.436 
##       X50 
##   175.847
colMeans(lasso_min)
##         V1         V2         V3         V4         V5         V6 
##  0.8015310  2.7677204 -4.7082946  6.6515331 -8.6152915 10.6241727 
##         V7         V8         V9        V10        V11        V12 
##  0.0130475  0.0008572  0.0124773  0.0002354  0.0095255  0.0112302 
##        V13        V14        V15        V16        V17        V18 
## -0.0021430  0.0076620  0.0008248  0.0020195  0.0016123  0.0021281 
##        V19        V20        V21        V22        V23        V24 
##  0.0021809  0.0028006  0.0122234  0.0115460  0.0021777  0.0034436 
##        V25        V26        V27        V28        V29        V30 
## -0.0031613  0.0086550  0.0061313  0.0149392  0.0016341  0.0014897 
##        V31        V32        V33        V34        V35        V36 
##  0.0089855  0.0115552  0.0050156  0.0148793  0.0016785 -0.0067179 
##        V37        V38        V39        V40        V41        V42 
##  0.0063458  0.0048662  0.0099795  0.0131269  0.0038804 -0.0015681 
##        V43        V44        V45        V46        V47        V48 
## -0.0009201  0.0020819  0.0059797  0.0012738  0.0069779  0.0040313 
##        V49        V50 
##  0.0086625  0.0026937

The first 6 numbers should be +1, +3, -5, +7, -9, +11, and all the rest should be 0. OLS is being a little silly now… LASSO is close on the first 6, and solid on the rest.

Simulation 3: 75 covariates

library(glmnet)
load("C:/Users/Akhil/Documents/akhilrao.github.io/public/code/lassoOlsSim/simulation.Rdata")

sim_75 <- lasso_ols_mc(75,100,1000)

ols_coefs <- sim_75[[1]]
lasso_min <- sim_75[[2]]
options(scipen=100,digits=4) #I like decimal notation
colMeans(ols_coefs)
##        X1        X2        X3        X4        X5        X6        X7 
##   34430.4  -63493.0  -87764.5    8139.2   -3973.4   77824.9    7663.7 
##        X8        X9       X10       X11       X12       X13       X14 
##  229655.9   -9961.2   53668.5    1506.1   84036.2  -97414.8 -200738.6 
##       X15       X16       X17       X18       X19       X20       X21 
##  -94069.3    7208.0  -17470.4    3016.7  -33336.0  112707.8  -74305.2 
##       X22       X23       X24       X25       X26       X27       X28 
##  -57560.4  -12128.7 -212583.1  196011.4   14389.8  145308.5  -21893.7 
##       X29       X30       X31       X32       X33       X34       X35 
##   16574.2  -33854.0   53555.7     715.7  -31256.4  -59375.7  -50588.8 
##       X36       X37       X38       X39       X40       X41       X42 
##  147967.5  -10886.2   22279.1  -62503.2  -30260.0  -11028.0  -63481.6 
##       X43       X44       X45       X46       X47       X48       X49 
##    -487.8    8418.9  -37660.0    4863.7  197359.9   22782.4  -48267.1 
##       X50       X51       X52       X53       X54       X55       X56 
##    5194.4   70943.0   13752.5     328.8   10425.9  -15728.8    1447.4 
##       X57       X58       X59       X60       X61       X62       X63 
##        NA        NA   -5695.4        NA        NA        NA        NA 
##       X64       X65       X66       X67       X68       X69       X70 
##        NA        NA        NA        NA        NA        NA        NA 
##       X71       X72       X73       X74       X75 
##        NA        NA        NA        NA        NA
colMeans(lasso_min)
##          V1          V2          V3          V4          V5          V6 
##  0.74985507  2.67795119 -4.57457239  6.57015556 -8.51695515 10.53733526 
##          V7          V8          V9         V10         V11         V12 
##  0.01334905  0.00757007  0.00078230 -0.00277604  0.01200334 -0.00109266 
##         V13         V14         V15         V16         V17         V18 
##  0.00281474 -0.00030673  0.00028960  0.00372971  0.00086233 -0.00947773 
##         V19         V20         V21         V22         V23         V24 
## -0.00504350  0.00167887  0.00731918  0.01227472  0.00333442  0.00653620 
##         V25         V26         V27         V28         V29         V30 
##  0.00472042  0.00918587  0.00877708 -0.00272863 -0.00129686  0.00784940 
##         V31         V32         V33         V34         V35         V36 
## -0.00596610  0.00628460  0.00388768  0.01356444  0.00592269 -0.00004759 
##         V37         V38         V39         V40         V41         V42 
##  0.00371287  0.00170037  0.00625622  0.00320188  0.00325778  0.00603352 
##         V43         V44         V45         V46         V47         V48 
##  0.00319677  0.00579496  0.00106068  0.01256782  0.01092158  0.01242250 
##         V49         V50         V51         V52         V53         V54 
##  0.00192769 -0.00797062  0.00446120  0.00478225  0.00569254  0.01142434 
##         V55         V56         V57         V58         V59         V60 
##  0.01072754  0.00495753  0.00329634  0.00429706  0.00331320  0.00041993 
##         V61         V62         V63         V64         V65         V66 
## -0.00057534  0.00516571  0.00664865  0.00994866  0.00271078  0.00925352 
##         V67         V68         V69         V70         V71         V72 
##  0.00785356 -0.00154720  0.00700347 -0.00048195 -0.00261776  0.00690537 
##         V73         V74         V75 
##  0.00469327  0.00113490 -0.00432958

The first 6 numbers should be +1, +3, -5, +7, -9, +11, and all the rest should be 0. Go home OLS, you’re drunk.

Discussion

From the results above, it looks like OLS and LASSO are both reasonable choices when the number of covariates \(p\) is small relative to the sample size \(n\). LASSO does a better job of estimating the coefficients on irrelevant covariates as exactly 0 - I don’t see how OLS could do as well there. My understanding is that LASSO sets some coefficients to 0 because of the kink the \(L1\) penalty adds to the objective function at 0. The OLS objective function is smooth through 0, so it keeps every predictor, even if some are estimated as really tiny values with big standard errors.
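One way to see where the exact zeros come from: under the simplifying assumption of an orthonormal design (\(X^{\top}X = I\) - not true in my simulations, but it makes the algebra clean), the LASSO solution is just a soft-thresholded version of the OLS estimate,

\[\hat{\beta}_j^{LASSO} = \text{sign}(\hat{\beta}_j^{OLS}) \max \left( |\hat{\beta}_j^{OLS}| - \frac{\lambda}{2}, 0 \right)\]

so with the objective written as above, any OLS coefficient smaller in magnitude than \(\lambda/2\) gets set exactly to zero, while OLS itself never thresholds anything.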

Where LASSO really shines is when \(p\) starts getting larger relative to \(n\). In the second and third simulations, OLS produces some pretty ridiculous parameter estimates for both the relevant and irrelevant predictors, while LASSO stays on point - the estimates always have the correct sign and the correct order of magnitude. This Stack Exchange post has a good discussion of when to use LASSO vs OLS for variable selection. To be fair the sines I generated as covariates are probably fairly correlated on average, so there’s most likely some collinearity messing with the OLS estimates (I think this is what’s going on with all the NAs).
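A quick way to check the collinearity story would be to look at the correlations and the condition number of one draw of the design matrix - a sketch, using the X generated inside the simulation loop:

cors <- cor(X)                       # pairwise correlations between the sine regressors
summary(abs(cors[upper.tri(cors)]))  # distribution of absolute pairwise correlations
kappa(X, exact=TRUE)                 # condition number; very large values signal near-collinearity
qr(X)$rank                           # rank < n_covs would explain the NA coefficients from lm()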

So where and how is LASSO used in economics? I hear it’s not uncommon in financial economics, and I know it’s being applied in spatial econometrics. But I haven’t seen it in the mainstream of applied fields like labor, development, or environmental economics, where reduced-form methods and unregularized estimators like OLS continue to dominate. Why might this be the case?

Spatial econometrics is more of a tool-building area than a specific-topic area like finance or labor, so simulations can be directly relevant to its questions of interest - but that also makes spatial a less apt comparison to fields like labor or development. Let’s ignore spatial for the rest of this discussion. If you’re interested in the two spatial econometrics papers using LASSO I mentioned at the beginning of this post, I wrote a short review for that econometrics class which might be useful. You can find it here.

Financial economics has lots of really good data, e.g. high-frequency stock price data, so it makes sense to me that there would be some scope for “fancier” econometrics there. My limited understanding is that questions in finance tend to focus more on prediction than on policy evaluation, which reduces the “interpretability advantage” of OLS over other methods; that advantage might be a factor limiting LASSO adoption outside of finance. Jim Savage has some interesting comments in this regard - apparently he uses LASSO as a variable selection tool for choosing which interactions to include as controls in a causal model. That seems like a good use of LASSO’s model selection consistency, and is something I’ll probably start doing as well. (As an aside, Jim Savage’s random forest proximity weighting paper mentioned in that comment got me stoked about random forests all over again. I think it’s worth a read if you’re interested in machine learning and econometrics.)

Data quality in other applied fields is a mixed bag. Labor has lots of high-quality data, but development folks I know tell me development doesn’t. In environmental it seems to depend on the question. I don’t know much about industrial organization and trade, but I’ve been told they tend to use more “fancy” econometrics than other fields. Structural vs. reduced-form preferences might play a role in this, and in general I think many researchers in applied fields care more about policy evaluation than prediction (see Jim Savage’s comment above).

In some applied fields, I know that there’s a strong preference for unbiased estimators, and I think this might be another factor limiting the use of regularized regression methods there. By accepting some bias in exchange for lower variance, LASSO estimates tend to be better behaved than OLS estimates, but that bias is a downside for folks who really want unbiased estimates.

However, I think this could actually work in LASSO’s favor; since the estimates are biased toward 0 (you can see this in the simulation tables, and it’s mentioned in the Stack Exchange post a few paragraphs up), the LASSO estimates tend to understate the effect size for a given policy evaluation question. I think this could be a selling point for using LASSO in applied/policy-centric work, since it lets the researcher say something like “look, the effect of this policy is around 5 whatevers, and if anything a bit bigger”.

Models can easily blow up when you’re controlling for a lot of things - individual fixed effects or interactions, for example, can add a lot of predictors to a model. The applied folks I know tend to say they care more/entirely about the parameter estimate of the policy variable and not the control variables, but just having a lot of controls in the model can reduce the precision of the estimates of interest. LASSO might be a good approach for such settings. That said, for the fixed effects case I don’t think the estimates could still be interpreted as “controlling for” the unobservables being captured by the fixed effects, since the fixed effects might be set to 0 if they aren’t sufficiently correlated with the outcome. Does this mean those effects didn’t need to be controlled for? I think the answer is yes… this seems relevant to the “LASSO as variable selector” approach to choosing control variables.

Significance testing is an advantage for OLS over LASSO in applied economics. There are a lot of results out there about significance testing for OLS, and folks understand them (or at least, think they do, which is the same for adoption purposes). Inference under LASSO is a current area of research, and apparently Tibshirani recently came out with a significance test for the LASSO (I really like the discussion of this over at Gelman’s blog).

At some point I’d like to try a “LASSO difference-in-differences with fixed effects” simulation to see what happens.

UPDATE: These slides have a good discussion of LASSO in economics and econometrics, as well as some properties of the “LASSO to select, OLS to estimate” approach (“post-LASSO estimator”). Apparently post-LASSO can remove much of the shrinkage bias present in LASSO, and perform nearly as well as an OLS regression on just the relevant covariates.
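Here is a minimal sketch of that post-LASSO idea using the objects from my simulation function: select with cv.glmnet() at lambda.min, then refit OLS on just the selected columns. This is my reading of the approach, not code from the slides.

cvfit <- cv.glmnet(x=X, y=Y, alpha=1)
b_lasso <- as.numeric(coef(cvfit, s="lambda.min"))[-1]  # drop the intercept
keep <- which(b_lasso != 0)                             # covariates LASSO kept
post_lasso <- lm(Y ~ X[, keep, drop=FALSE])             # OLS refit on the selected covariates only
summary(post_lasso)                                     # should show less shrinkage bias on the kept coefficients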


Australian water trading data, part 2 - zero price transactions

In a previous post I looked at some water trading data from the Australian Bureau of Meteorology. In this post I focus on the zero price transactions.

As mentioned last time, trades can have a zero price for three reasons:

  1. As a rule, water trades between environmental water holders are listed at $0
  2. A single private user may transfer water they own from one area to another, recording a price of $0
  3. Some transactions between two private users may be listed as occurring at $0 because entering the price isn’t/wasn’t always mandatory (if I recall correctly, the person said this was more likely in more remote areas)

Why do I care about zero price transactions?

It has to do with reason 1. Towards the end of the Millennium Drought, most states had implemented unbundling schemes to separate water access entitlements from land titles. At face value I thought this would be a good thing, since it would lower transactions costs. But there is some discussion that it led to higher prices for environmental water, resulting in more government expenditure than would have occurred without unbundling.

This “world of the second-best” story seems really interesting to me from a policy perspective. Intuitively, it makes sense to me if water is used to produce public goods whose marginal benefits don’t enter into the market - basically a market incompleteness story. I don’t know whether this is actually what happened or not, but it seems at least possible in theory.

Doing stuff with the data

I used the same scripts as last time, prep-trading-data.r and water-eda-stuff.r, with some small modifications. I uncommented four lines in prep-trading-data.r to create an indicator variable zerop in ents and alls. zerop is 1 if the transaction had a price of $0, and 0 if the price was positive. For example,

ent_data$zerop <- rep(0,length=dim(ent_data)[1])
ent_data$zerop[which(ent_data$netPrice==0)] <- 1

and similarly for the allocation trades.
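(The allocation-side lines would presumably look the same; I’m guessing at the dataframe and price column names here since they’re set in prep-trading-data.r:)

all_data$zerop <- rep(0,length=dim(all_data)[1])  # "all_data" is my guess at the allocations dataframe name
all_data$zerop[which(all_data$netPrice==0)] <- 1  # assumes the same netPrice column as the entitlements data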

In water-eda-stuff.r, I updated the ddply(), join(), and melt() calls to include zerop as a variable to sort and merge on.

## monthly trading
ents_m <- ddply(ents,.(year,MonthOfTrade,mdb,from_state,interstate,zerop),summarize,v_ents=sum(quantityTraded),price_ents=mean(PricePerML[PricePerML!=0]),n_ents=length(quantityTraded),np_ents=length(PricePerML[PricePerML!=0]))

alls_m <- ddply(alls,.(year,MonthOfTrade,mdb,from_state,interstate,zerop),summarize,v_alls=sum(quantityTraded),price_alls=mean(PricePerML[PricePerML!=0]),n_alls=length(quantityTraded),np_alls=length(PricePerML[PricePerML!=0]))

### merge alls and ents
trades_m <- join(alls_m,ents_m,by=c("MonthOfTrade","mdb","from_state","interstate","zerop"),type="full",match="first")
trades_m[(is.na(trades_m))] <- 0 # replace NaN and NA with 0

### melt trades_m
trades_m_melted <- melt(trades_m[,c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)],id=c("year","MonthOfTrade","mdb","from_state","interstate","zerop")) # useful for plots later

From here the dataset is ready to make the plots I want.

Pictures

Let’s look at the priced/unpriced transactions by state.

library(ggplot2)
load("C:/Users/Akhil/Documents/akhilrao.github.io/public/code/au-water-trading/water-eda-stuff.RData") # load workspace

## drop everything before 2008
trades_m <- subset(trades_m,year>2008)
trades_m_melted <- subset(trades_m_melted,year>2008)

### facet by state, color by zero price/not
#### volumes over time - allocations
trade_vols_state <- ggplot(trades_m_melted[which(trades_m_melted$variable=="v_alls"),], aes(x=MonthOfTrade,y=value,color=factor(zerop)))

trade_vols_state + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~from_state) + ggtitle("Allocation trading by state") + ylab("Volume (ML) traded per month") + xlab("time") + scale_colour_discrete(name="Price",labels=c("positive","zero"))

[Plot: Allocation trading by state - volume (ML) traded per month, colored by zero vs. positive price]

#### volumes over time - entitlements
trade_vols_state <- ggplot(trades_m_melted[which(trades_m_melted$variable=="v_ents"),], aes(x=MonthOfTrade,y=value,color=factor(zerop)))

trade_vols_state + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~from_state) + ggtitle("Entitlement trading by state") + ylab("Volume (ML) traded per month") + xlab("time") + scale_colour_discrete(name="Price",labels=c("positive","zero"))

[Plot: Entitlement trading by state - volume (ML) traded per month, colored by zero vs. positive price]

#### counts over time - allocations
trade_counts_state <- ggplot(trades_m_melted[which(trades_m_melted$variable=="n_alls"),], aes(x=MonthOfTrade,y=value,color=factor(zerop)))

trade_counts_state + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~from_state) + ggtitle("Allocation trading by state") + ylab("Number of trades per month") + xlab("time") + scale_colour_discrete(name="Price",labels=c("positive","zero"))

[Plot: Allocation trading by state - number of trades per month, colored by zero vs. positive price]

#### counts over time - entitlements
trade_counts_state <- ggplot(trades_m_melted[which(trades_m_melted$variable=="n_ents"),], aes(x=MonthOfTrade,y=value,color=factor(zerop)))

trade_counts_state + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~from_state) + ggtitle("Entitlement trading by state") + ylab("Number of trades per month") + xlab("time") + scale_colour_discrete(name="Price",labels=c("positive","zero"))

[Plot: Entitlement trading by state - number of trades per month, colored by zero vs. positive price]

(Still no Northern Territory or Australian Capital Territory. I haven’t audited my code yet to see where I might be dropping them, or looked to see if they’re in the data at all. It’s on my to-do list.)

Takeaways:

  1. There are very few transactions with a positive price outside of New South Wales and Victoria, and few high-volume transactions. One story might be that these states are more arid, with fewer private players and less liquid water markets. NSW and Victoria account for most of the Murray-Darling Basin, so it makes sense to me that they’d have more positive price transactions. Queensland and Tasmania in particular don’t seem to have very liquid water markets, at least as measured by trades with a positive price.

  2. Not a lot of difference between positive price and zero price entitlement transactions overall. The exception is Victoria, where the number of positive price entitlement trades seems to have fallen over time while the number of zero price transactions has increased.

  3. New South Wales seems to have seen the biggest increase in zero price allocation trades over this period, both in terms of volume traded and number of trades per month. Victoria also saw similar increases, but they seem to be of smaller magnitude.

  4. There’s more activity in allocation trades than entitlement trades.

At a first approximation it seems like these data may offer some support for the second-best story of unbundling.

Unfortunately, zero price transactions are not a perfect indicator of environmental water trades - reasons 2 and 3 are still candidate explanations. What I need is a way to check the annual levels of environmental water trading implied by zero price trades against some “ground truth” measure of environmental water trades. The Australian Water Markets Reports may offer a way to do that.

I have a copy of the AWMR tables for 2013-14 hosted here, or you can get them from the Department of Agriculture and Water Resources here. Table 21 lists “Environmental allocation trade among southern Murray-Darling Basin water systems, 2013-14”, so I can check the zero price trades in those water systems* against these. I’ll do that in the next post on this subject. If it checks out, that’s a good sign and I can look at the AWMRs for other years to see if I can do something similar.

*(“water systems” is a finer level of aggregation than “state”, so I’ll have to modify water-eda-stuff.r. prep-trading-data.r should be fine.)
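Here’s a rough sketch of the check I have in mind: aggregate zero price allocation trades in the southern MDB water systems over the 2013-14 water year and line them up against Table 21. The column names waterYear and water_system are placeholders; the real ones come from prep-trading-data.r and the AWMR tables.

alls_env <- subset(alls, zerop==1 & mdb==1 & waterYear=="2013-14")  # zero price MDB allocation trades; waterYear is a placeholder
by_system <- ddply(alls_env, .(water_system), summarize, v_zero=sum(quantityTraded), n_zero=length(quantityTraded))
# awmr21 <- read.csv("awmr-2013-14-table21.csv")                    # hand-entered copy of AWMR Table 21
# check <- join(by_system, awmr21, by="water_system", type="left")  # compare v_zero against the reported environmental volumes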

Ideally I’d like to use the monthly data to run some regressions to see the effect of unbundling on environmental transactions, but this may not be possible. One problem is the difficulty of defining environmental transactions at the monthly level.

Another problem is the variation in policy implementation - I believe that South Australia implemented unbundling along the River Murray Prescribed Water Course starting July 1st, 2009, and that Victoria implemented unbundling on July 1st, 2007 (Northern Victoria) and July 1st, 2008 (Southern Victoria). I don’t yet know when New South Wales implemented unbundling, but my guess is around when Victoria did or sooner.

These dates are all really early in my sample, or for regions that may not have good counterfactuals. In the case of Victorian unbundling, I don’t have any entitlement trade data before 2009. In the case of South Australia, the RMPWC is where most of the trading activity occurs, so any control region would probably have to be located in Victoria or NSW. This makes a credibly identified setup to get at the effects of unbundling a bit more challenging, but possibly still doable. In a future post I’ll describe what one such setup might look like.


I love Notepad++

A while back Github Pages upgraded to Jekyll 3.0. As a result, only kramdown is supported for Jekyll sites hosted on Github Pages.

Seeing as I didn’t pay close attention to what was going on, I spent a while very confused and frustrated trying to figure out why all my MathJax wasn’t displaying properly. Once I figured out what the upgrade was doing, I spent some more time confused and frustrated trying to fix the problem.

This blog post by Toban Wiebe explained what was going on in words I can grok. Unfortunately, I am one of those people with a lot of old posts to convert.

I’ve been reading through “Automate the Boring Stuff with Python” this summer, so I figured this was a perfect project - write a Python script to convert redcarpet-acceptable delimiters like \\( and \\{ to kramdown-acceptable delimiters like $$ and \{. I started down that path, and even got as far as figuring out the appropriate regular expressions and writing a first draft of a script.

Then I remembered that Notepad++ (my preferred text editor) has a “Replace All in All Opened Documents” function. Like any good myopic optimizer, I stopped writing that script immediately and pulled up Notepad++.

Problem solved. Hooray for Notepad++!

(It seems like MathJax doesn’t always load with the page anymore, but a couple refreshes solves the problem. Not sure what’s up with that.)


A look at some Australian water trading data

This is my first attempt at using knitr and ggplot2. The scripts to reproduce and extend everything discussed here are here. I converted the knitted Rmarkdown file to markdown with the help of this post by Andrew Brooks.

I’m currently working on two projects: one is about orbital debris and satellite launches (what my previous post was about), and the other is about trading in Australian water markets (what this post is about). In this post I’m going to describe some of the water data I have.

Why Australian water markets?

I’ve liked the idea of water markets since I learned about California’s messed up water rights as a Coro Fellow (FPPA LA ‘14, fun times). Australia actually has pretty well-defined water markets with some historical daily trading data, so I started there.

Plus, water markets lend themselves to some great awful puns about market liquidity and trade volumes.

A little bit about the markets

Australian water trading can broadly be classified into two categories: allocation trades and entitlement trades. Allocation trades are short-term transfers of an amount of water, lasting for the remainder of the water year (water years run from July 1 of one year to June 30 of the next). Entitlement trades are transfers of the rights to water from a given source. Entitlement transfers can further be divided into permanent/ongoing transfers and leases.

I say “markets” instead of “market” because each state has its own laws governing water trading within its borders; the broad allocations/entitlements distinction holds everywhere, but the implementation details vary. (I’m saying “entitlement” instead of “access entitlement” for brevity.)

A given water resource will be divided into consumptive and non-consumptive use. Consumptive use is for private purposes like irrigation, urban use, etc., while non-consumptive use is for things like generating hydroelectricity or to guarantee the stability of the environment in that region. Tradeable water rights are defined as shares of a consumptive pool from a given water source. Water resource plans define the amount of water available to a consumptive pool. The national water market’s resources page has more information on how this works.

There’s a lot of state-level heterogeneity in how this all looks in practice, but my understanding is that these broad statements are true across the country.

What are the data?

The data used for this post are publicly available at the Bureau of Meteorology’s Water Market Information page, specifically the “trade history” dataset.

The dataset describes all of the water transactions that occurred between 2009-2014. A lot of the trades have a zero price listed. I spoke to one of the people at the Bureau of Meteorology about this some time back, and he told me that this can be for three reasons:

  1. As a rule, water trades between environmental water holders are listed at $0
  2. A single private user may transfer water they own from one area to another, recording a price of $0
  3. Some transactions between two private users may be listed as occurring at $0 because entering the price isn’t/wasn’t always mandatory (if I recall correctly, the person said this was more likely in more remote areas)

One of the things I want to check eventually is how the distributions of zero price trades compares to the distribution of positive price trades, to see if there’s anything interesting going on there.

One of my goals for the future is to learn to use mapping tools to get data out of maps like these. Rainfall information would be really useful for this project. Along the way, it would help to find a shapefile for a map of Australia with the Murray-Darling Basin and the different water systems marked out.

Doing stuff with the data

I’m going to skip all the data cleaning and prep work (wrangling). It’s in the script prep-trading-data.r.

prep-trading-data.r outputs two csv files, one each for allocation and entitlement trades. I use rm(list=ls()) liberally throughout prep-trading-data.r, so consider yourself warned: don’t run the script if you have other things you want to keep in your workspace.

prep-trading-data.r does a bunch of useful things for me, like creating an indicator variable for whether a trade was in the Murray-Darling Basin, creating a categorical variable describing the origin and destination states of a trade, and dropping 14 observations with years like 2103 (at which point we’ll have no water, so they’re clearly wrong). The most tedious part of writing it was going through the Australian Water Markets Report 2013-14 to figure out whether a given water system was in the Murray-Darling Basin or not.
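The flavor of those steps, heavily simplified - the real column names and the full list of MDB water systems live in prep-trading-data.r, so treat the names below as placeholders:

mdb_systems <- c("NSW Murray", "Victorian Murray", "Goulburn")            # ...and so on, hand-coded from the AWMR 2013-14
ent_data$mdb <- as.numeric(ent_data$waterSystem %in% mdb_systems)         # Murray-Darling Basin indicator; waterSystem is a placeholder name
ent_data$route <- paste(ent_data$from_state, ent_data$to_state, sep="-")  # origin-destination state pair
ent_data <- ent_data[ent_data$year <= 2015, ]                             # drop obviously wrong years like 2103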

ents and alls are the dataframes built from those csv files of prepped daily water trading data. From there I used [water-eda-stuff.r](https://github.com/akhilrao/akhilrao.github.io/tree/master/public/code/au-water-trading/water-eda-stuff.r) to do a little more processing and make some tables and pictures. The data processing in water-eda-stuff.r is below:

## load libraries
library(readxl)
library(plyr)
library(reshape2)

## run prep script
rm(list=ls()) # I use these liberally. Caveat emptor.
source("prep-trading-data.r") # does some stuff to make the data nicer to use

## monthly trading
ents_m <- ddply(ents,.(year,MonthOfTrade,mdb,from_state,interstate),summarize,v_ents=sum(quantityTraded),price_ents=mean(PricePerML[PricePerML!=0]),n_ents=length(quantityTraded),np_ents=length(PricePerML[PricePerML!=0]))
alls_m <- ddply(alls,.(year,MonthOfTrade,mdb,from_state,interstate),summarize,v_alls=sum(quantityTraded),price_alls=mean(PricePerML[PricePerML!=0]),n_alls=length(quantityTraded),np_alls=length(PricePerML[PricePerML!=0]))

### merge alls and ents
trades_m <- join(alls_m,ents_m,by=c("MonthOfTrade","mdb","from_state","interstate"),type="full",match="first")
trades_m[(is.na(trades_m))] <- 0 # replace NaN and NA with 0

### melt trades_m
trades_m_melted <- melt(trades_m[,c(1,2,3,4,5,6,7,8,9,10,11,12,13)],id=c("year","MonthOfTrade","mdb","from_state","interstate")) # useful for plots later

## yearly trading
ents_y <- ddply(ents,.(year,mdb,from_state,interstate),summarize,v_ents=sum(quantityTraded),price_ents=mean(PricePerML[PricePerML!=0]),n_ents=length(quantityTraded),np_ents=length(PricePerML[PricePerML!=0]))
alls_y <- ddply(alls,.(year,mdb,from_state,interstate),summarize,v_alls=sum(quantityTraded),price_alls=mean(PricePerML[PricePerML!=0]),n_alls=length(quantityTraded),np_alls=length(PricePerML[PricePerML!=0]))

### merge alls and ents
trades_y <- join(alls_y,ents_y,by=c("year","mdb","from_state","interstate"),type="full",match="first")
trades_y[(is.na(trades_y))] <- 0 # replace NaN and NA with 0

### yearly summary
yearly_summary <- ddply(trades_y,.(year),summarize,v_alls=sum(v_alls),v_ents=sum(v_ents),n_alls=sum(n_alls),n_ents=sum(n_ents)) # for a summary table

Just some ddply(), join(), and melt() to make pictures and tables easier later. n_ents and n_alls are the number of entitlement and allocation trades respectively, while v_ents and v_alls are the total monthly volume of entitlements and allocations traded.

(I’m still a noob at this whole “generating pretty things in markdown” business, so I used this handy online markdown table generator to make this table.)

| year | Allocation trades (ML) | Entitlement trades (ML) | Number of allocation trades | Number of entitlement trades |
|------|------------------------|-------------------------|-----------------------------|------------------------------|
| 2004 | 1898 | 0 | 4 | 0 |
| 2005 | 980 | 0 | 1 | 0 |
| 2006 | 1256 | 0 | 38 | 0 |
| 2007 | 4279 | 0 | 76 | 0 |
| 2008 | 1090 | 0 | 17 | 0 |
| 2009 | 873878.133 | 449766.741 | 8119 | 2895 |
| 2010 | 3373516.237 | 1385914.681 | 18434 | 6666 |
| 2011 | 5527076.803 | 2261956.205 | 10070 | 7509 |
| 2012 | 5565003.714 | 1301803.735 | 15099 | 7804 |
| 2013 | 7288640.034 | 1915709.35 | 30058 | 7636 |
| 2014 | 2975364.894 | 827862.547 | 15160 | 2316 |

Note that I’m aggregating these by calendar year, not water year. “ML” stands for megaliter (1 million liters). The 2014 data only run through September, so the apparent dropoff in trading is probably mostly an artifact of truncation.
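If I wanted water year aggregates instead, the mapping is simple since water years run July-June. Something like the sketch below would do it, assuming MonthOfTrade parses as a date:

month_num <- as.numeric(format(as.Date(trades_m$MonthOfTrade), "%m"))  # assumes MonthOfTrade is date-like
cal_year <- as.numeric(format(as.Date(trades_m$MonthOfTrade), "%Y"))
trades_m$water_year <- ifelse(month_num >= 7,                          # July-December belongs to the water year starting that July
                              paste0(cal_year, "-", cal_year + 1),
                              paste0(cal_year - 1, "-", cal_year))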

Pictures!

This is the real reason I wrote this post: pictures!

To get these to load in knitr code chunks, the script water-eda-stuff.r saves the cleaned data in a file called water-eda-stuff.RData. I’m going to focus on three quantities: volume traded (megaliters per month), trade volume (number of trades per month), and price per megaliter (Australian dollars).

load("C:/Users/Akhil/Documents/akhilrao.github.io/public/code/au-water-trading/water-eda-stuff.RData")
library(ggplot2)
## drop everything before 2008
trades_m <- subset(trades_m,year>2008)
trades_m_melted <- subset(trades_m_melted,year>2008)

## make plots
### volume over time
trade_vols <- ggplot(trades_m_melted[c(which(trades_m_melted$variable=="v_ents"),which(trades_m_melted$variable=="v_alls")),], aes(x=MonthOfTrade,y=value,color=factor(mdb)))

trade_vols + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~variable) + scale_colour_discrete(name="In MDB",labels=c("No","Yes")) + ggtitle("Water volume traded in and out of MDB") + ylab("Volume (ML) traded per month") + xlab("time")

[Plot: Water volume traded in and out of MDB - volume (ML) per month, entitlements and allocations]

### trades over time
trade_counts <- ggplot(trades_m_melted[c(which(trades_m_melted$variable=="n_ents"),which(trades_m_melted$variable=="n_alls")),], aes(x=MonthOfTrade,y=value,color=factor(mdb)))

trade_counts + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~variable) + scale_colour_discrete(name="In MDB",labels=c("No","Yes")) + ggtitle("Trades in and out of MDB") + ylab("Number of trades per month") + xlab("time")

[Plot: Trades in and out of MDB - number of trades per month, entitlements and allocations]

### prices over time
trade_prices <- ggplot(trades_m_melted[c(intersect(which(trades_m_melted$variable=="price_ents"),which(trades_m_melted$value!=0)),intersect(which(trades_m_melted$variable=="price_alls"),which(trades_m_melted$value!=0))),], aes(x=MonthOfTrade,y=value,color=factor(mdb)))  # only looking at observations with positive prices

trade_prices + geom_point() + geom_smooth(method="lm",se=TRUE,fill=NA) + facet_wrap(~variable) + scale_colour_discrete(name="In MDB",labels=c("No","Yes")) + ggtitle("Monthly average trade prices in and out of MDB") + ylab("AUD") + xlab("time")

[Plot: Monthly average trade prices in and out of MDB (AUD), entitlements and allocations]

Three things I see here:

  1. More trading of entitlements and allocations inside the MDB than outside, both in terms of volumes of water traded as well as number of trades.
  2. Allocation trading inside the MDB has been increasing faster than allocation trading outside the MDB, while entitlement trading in both regions seem to follow similar trends.
  3. Average monthly prices per megaliter seem about the same inside or outside the MDB. Entitlement prices seem to have been going up over time.

This makes sense; the Murray-Darling Basin accounts for most of Australia’s water use, so I expected to see more activity inside the MDB than outside. Four states sit inside the MDB: New South Wales, Victoria, Queensland, and South Australia. New South Wales and Victoria have the most area inside the Basin, so we might expect the action in the Basin to be driven by those two. Let’s take a look:

load("C:/Users/Akhil/Documents/akhilrao.github.io/public/code/au-water-trading/water-eda-stuff.RData")
library(ggplot2)
## drop everything before 2008
trades_m <- subset(trades_m,year>2008)
trades_m_melted <- subset(trades_m_melted,year>2008)

## make plots
### volumes over time
trade_vols_state <- ggplot(trades_m_melted[c(which(trades_m_melted$variable=="v_ents"),which(trades_m_melted$variable=="v_alls")),], aes(x=MonthOfTrade,y=value,color=factor(variable)))

trade_vols_state + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~from_state) + scale_colour_discrete(name="Product type",labels=c("Allocations","Entitlements")) + ggtitle("Water volume traded in each state") + ylab("Volume (ML) traded per month") + xlab("time")

[Plot: Water volume traded in each state - volume (ML) per month, allocations vs entitlements]

### counts over time
trade_counts_state <- ggplot(trades_m_melted[c(which(trades_m_melted$variable=="n_ents"),which(trades_m_melted$variable=="n_alls")),], aes(x=MonthOfTrade,y=value,color=factor(variable)))

trade_counts_state + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~from_state) + scale_colour_discrete(name="Product type",labels=c("Allocations","Entitlements")) + ggtitle("Trades in each state") + ylab("Number of trades per month") + xlab("time")

[Plot: Trades in each state - number of trades per month, allocations vs entitlements]

### prices over time
trade_prices_state <- ggplot(trades_m_melted[c(intersect(which(trades_m_melted$variable=="price_ents"),which(trades_m_melted$value!=0)),intersect(which(trades_m_melted$variable=="price_alls"),which(trades_m_melted$value!=0))),], aes(x=MonthOfTrade,y=value,color=factor(variable))) 

trade_prices_state + geom_point() + geom_smooth(method="lm",fill=NA) + facet_wrap(~from_state) + scale_colour_discrete(name="Product type",labels=c("Allocations","Entitlements")) + ggtitle("Monthly average trade prices in each state") + ylab("AUD") + xlab("time")

[Plot: Monthly average trade prices in each state (AUD), allocations vs entitlements]

(Northern Territory seems to be missing from my data. Might be a coding error on my part.)

  1. It looks like the increase in volume traded and trade volume for the allocations is driven by Victoria and New South Wales, with New South Wales having some bigger months in terms of both amount traded and frequency of trades.
  2. Overall prices per megaliter seem fairly flat, with the exception of entitlements in New South Wales, which seem to be increasing.
  3. There are not a lot of trades with positive prices in Queensland, and I don’t think there are any in Tasmania.

I’ll look at more cuts of this data in a future post.
