A build-stencil-cut approach to constructing satellite histories

I have a dataset with every satellite ever on orbit, with fields for launch date and decay date. There are many possible sources for such a dataset; this one is from CSpOC and UCS. The UCS data is a nice complement to the CSpOC data because it tracks the operators over time since its inception, which is useful in accounting for ownership changes. The dataset was constructed by two excellent summer RAs, Gordon Lewis (class of 21) and Ethan Berner (class of 22).

The final product, soy_panel, is a panel dataset at the satellite-operator-year level which records the history of every satellite ever on orbit. The strategy is to do this in two steps:

  1. Create a big table of all objects in all possible time periods (“build the cardboard”).

  2. Filter the table to keep only actual histories (“stencil and cut”).

Packages

This uses data.table to build the cardboard, and tidyverse functions to stencil and cut.

library(data.table)
library(tidyverse)

Building the cardboard

We read in the data. Some satellites are duplicated, so we remove those. Then we create the cardboard: a big table of all satellites (uniquely identified by COSPAR_number) in all years from 1967 through 2020.

input <- read.csv("../public/data/JSpOC_UCS_data.csv")
# duplicated() marks the row with the smallest row index as the original (FALSE),
# and the subsequent ones as duplicates (TRUE)
dups_idx <- duplicated(input[, c("COSPAR_number")])
input_unique <- input[!dups_idx, ] # removes all rows where dups_idx is TRUE
# builds the cardboard
soy_panel_base <- setDT(input_unique)[, .(
				Year = seq(1967, 2020, by = 1)
				), by = COSPAR_number]

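The final command does two things: it converts input_unique to a data.table, and it builds the cardboard with a grouped column creation.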

First, setDT() makes input_unique a data.table instead of a data.frame. This allows it to be manipulated with copy-in-place (data.table's documentation calls this modifying by reference), and enables some nice features from the data.table package. A data.table is a kind of data.frame with some extensions. For this task copy-in-place is a nice side effect of using a data.table, but not the main point. Here’s an explanation of copy-in-place with comparison to Python’s pandas.

Specifically, here’s what we get:

  1. Making input_unique a data.table allows use of [] to reference/add columns without re-specifying input_unique$ inside the []. The new column is inside the .(). The .() creates a new column, Year, which is a sequence from 1967 to 2020 in increments of 1. This is the domain over which a satellite can have a history.

  2. data.tables have a by argument which can be used to do grouped operations. This is the by = COSPAR_number, which in this context specifies that the new column creation specified earlier should be done once for each unique COSPAR_number. This gives each satellite a copy of the domain (entries in Year for the sequence of years 1967 to 2020), “building the cardboard”.

  3. Copy-in-place speeds things up by avoiding unnecessary copying.
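To see the grouped expansion on something small, here is a minimal sketch with a made-up three-satellite table. The COSPAR numbers, launch years, and the 2000 to 2003 window are invented for illustration and are not taken from the dataset; toy and toy_base are hypothetical names.

```r
library(data.table)

# Toy stand-in for input_unique; these values are invented for illustration.
toy <- data.frame(
  COSPAR_number = c("1998-067A", "2009-005A", "2013-066B"),
  LAUNCH_YEAR   = c(1998, 2009, 2013)
)

# setDT() converts toy to a data.table by reference; .() builds the Year
# column, and by = COSPAR_number repeats it once per satellite.
toy_base <- setDT(toy)[, .(Year = seq(2000, 2003, by = 1)), by = COSPAR_number]

nrow(toy_base)  # 3 satellites x 4 years = 12 rows
```

Because setDT() works by reference, toy itself is a data.table after this call, with no reassignment needed.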

The same could be extended to daily histories by specifying the new column to be a sequence of dates rather than just years, but yearly histories are a good starting point for satellites.
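As a rough sketch of that daily extension, the only change is the sequence inside .(). The identifiers and the one-week window below are invented for illustration; daily_base is a hypothetical name.

```r
library(data.table)

# Invented identifiers and a short invented date window, for illustration only.
toy <- data.table(COSPAR_number = c("1998-067A", "2009-005A"))

# Same grouped construction as the yearly cardboard, but over calendar days.
daily_base <- toy[, .(
  Date = seq(as.Date("2020-01-01"), as.Date("2020-01-07"), by = "day")
), by = COSPAR_number]

nrow(daily_base)  # 2 satellites x 7 days = 14 rows
```

The cardboard grows quickly at this resolution: a daily domain covering 1967 through 2020 has roughly 20,000 rows per satellite instead of 54.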

Once we have the histories, we (re-)attach details from input_unique.

soy_panel <- inner_join(soy_panel_base, input_unique, by = "COSPAR_number")

One of the details is the operator. We’ll ignore mergers and acquisitions here. (The underlying data, input, already accounts for this with unique satellite-operator entries.)

Stencil and cut

Now we mark the years in which each satellite was on orbit (“stencil”) and remove the other years from its history (“cut”). To mark the stencil, we check whether the satellite has been launched yet in each year and whether it’s decayed yet in each year. Some satellites have blank decay fields because they’re still on orbit, which R represents as NA. The decay dates also need to be trimmed to just the year.

# Mark the satellite's pre-history
soy_panel <- soy_panel %>% mutate(launched_yet = sign(Year - LAUNCH_YEAR)) # mutate creates a new column: -1 if Year<LAUNCH_YEAR, 0 if Year==LAUNCH_YEAR, +1 if Year>LAUNCH_YEAR
# Remove the day and month part of the DECAY date, then coerce from char back to num
soy_panel$DECAY <- as.character(soy_panel$DECAY)
soy_panel$DECAY <- substr(soy_panel$DECAY,1,nchar(soy_panel$DECAY)-6)
soy_panel$DECAY <- as.numeric(soy_panel$DECAY)
# Mark whether the decay field is blank
soy_panel <- soy_panel %>% mutate(decay_na = is.na(DECAY))
# Mark the satellite's post-history; if DECAY is NA, apply -1
soy_panel <- soy_panel %>% mutate(decayed_yet = sign(Year - DECAY)) # -1 if Year<DECAY, 0 if Year==DECAY, +1 if Year>DECAY
soy_panel[is.na(soy_panel$decayed_yet),"decayed_yet"] <- -1
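The stencil leans on two small behaviors that are easy to check in isolation: sign() returns 0, not +1, when its argument is zero, and an empty decay string turns into NA once the year is extracted. Here is a minimal sketch with invented values, assuming decay dates are stored as year-first strings such as "2001-03-15" (the dataset's actual format is not shown here).

```r
# Invented example values; the real DECAY format is an assumption here.
decay <- c("2001-03-15", "", "1999-12-31")

# Keep the first four characters (the year); "" becomes NA.
decay_year <- as.numeric(substr(decay, 1, 4))

# sign() of (Year - reference year): -1 before, 0 in the same year, +1 after.
launched_yet <- sign(c(1997, 1998, 1999) - 1998)
```

Here decay_year holds 2001, NA, and 1999, and launched_yet holds -1, 0, and 1, so the launch year itself is marked with 0.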

A quick cut:

soy_panel <- soy_panel[soy_panel$launched_yet >= 0,] # keep the launch year and later (launched_yet is 0 in the launch year)
soy_panel <- soy_panel[soy_panel$decayed_yet < 0,] # keep only years before the decay year (decayed_yet is 0 in the decay year)

We kept the year that the satellite was launched, but removed the year it was marked as decayed. This treats satellites as reaching orbit at the start of the launch year and as decaying at the start of the decay year.
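The convention in the paragraph above (launch year kept, decay year dropped, blank decay meaning still on orbit) can also be written as a single filter on the year columns, without the intermediate sign columns. This is a minimal sketch on an invented two-satellite panel, assuming DECAY has already been reduced to a numeric year; toy_panel and toy_cut are hypothetical names.

```r
library(dplyr)

# Invented toy panel: satellite A launched 2001 and decayed 2003;
# satellite B launched 2002 and is still on orbit (DECAY is NA).
toy_panel <- expand.grid(
  COSPAR_number = c("A", "B"),
  Year = 2000:2004,
  stringsAsFactors = FALSE
) %>%
  mutate(
    LAUNCH_YEAR = ifelse(COSPAR_number == "A", 2001, 2002),
    DECAY       = ifelse(COSPAR_number == "A", 2003, NA)
  )

# Stencil and cut in one step: on or after launch, and strictly before decay
# (or no decay recorded).
toy_cut <- toy_panel %>%
  filter(Year >= LAUNCH_YEAR, is.na(DECAY) | Year < DECAY)
```

Satellite A keeps 2001 and 2002, and satellite B keeps 2002 through 2004. The explicit is.na() check matters because a comparison against NA yields NA, and filter() drops rows where the condition is NA.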

And that’s it

We started with a dataset where each row was a unique satellite-operator combination (ignoring the duplicate rows) that had date attributes marking when they reached orbit and when they left it. We used a “build-stencil-cut” approach to end with a satellite-operator-year panel, with one row for each year in which a satellite was on orbit.

This approach can be generalized to other kinds of entity histories, including those with multiple start/stop dates. If the number of start/stop dates is the same for all objects, first define a start/stop pair as a start date and the nearest stop date. The stenciling can be repeated manually or in a loop over start/stop pairs (presumably with a suitably-fast loop). If the number of start/stop dates is variable, a nested loop can be used. In this approach the outer loop is over objects and the inner loop is over start/stop pairs, allowing for a different number of start/stop pairs per object. The explicit nested loop is probably the slowest and arguably least-expressive way to do this in R.
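One possible alternative to the explicit loops, sketched here under assumptions rather than taken from the original workflow, is to store the start/stop pairs in long format (one row per pair) and join them to the cardboard. The table and column names below (spells, start_year, stop_year, cardboard, history) and all values are hypothetical.

```r
library(data.table)

# Hypothetical spells table (names and values invented): one row per
# start/stop pair, so objects can have different numbers of pairs.
spells <- data.table(
  id         = c("A", "A", "B"),
  start_year = c(2000, 2004, 2001),
  stop_year  = c(2002, 2006, NA)  # NA means the spell has not ended
)

# Build the cardboard: each object gets one copy of the year domain.
cardboard <- spells[, .(Year = 2000:2006), by = id]

# Stencil: pair every object-year with every spell of that object.
paired <- merge(cardboard, spells, by = "id", allow.cartesian = TRUE)

# Cut: keep object-years that fall inside at least one spell
# (start year kept, stop year dropped).
history <- unique(
  paired[Year >= start_year & (is.na(stop_year) | Year < stop_year), .(id, Year)]
)
```

The join makes the intermediate table larger (rows multiply by the number of pairs per object), so this trades memory for the convenience of handling a variable number of pairs without a nested loop.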
