Three stories about the rise of data science programs
22 May 2022
Data science programs of various types seem to be proliferating, and everyone seems to have a different notion of what, precisely, “data science” is. I’ve been watching this trend for years, mostly as someone looking to maybe enter that world. Recent efforts to create a data science minor at Middlebury got me thinking about it again as a quasi-disinterested observer of college/university decision-making. I’ve come up with three stories on why data science programs are emerging at colleges and universities. These stories aren’t exhaustive or mutually exclusive, but I think they capture the essence of what I’m seeing.
1. Data science as an expression of the needs/fruits of capital/technology
I think this is what most people point to when making the case for a new data science program. Moore’s Law, the proliferation of open-source software, massively scalable cloud + distributed edge computing, and the explosion of data associated with these forces all create strong demand for a class of workers with a (somewhat) unique blend of skills. They need to be able to process data and do some analysis, create digestible summaries (often for executives or other decision-makers), and ideally do it all in a web-ready format using some amount of cloud capacity. But they don’t need to be statisticians or computational experts—those roles (still) often demand enough specialized knowledge to require more-advanced training. These pressures lead to data science programs that tend to have a lower level of statistical sophistication than pre-existing data-centric disciplines, like an econometrics focus or a stats minor. In exchange they can deliver more data acquisition and analytics communication skills. I think this view is pretty compatible with a “data science” major/minor, albeit one which may be mostly a convex combination of existing programs. There’s a non-trivial overlap between this type of data science and “business analytics” programs.
I think existing quantitative (i.e. not “stocks for jocks”) economics and business programs are probably the closest thing to this type of data science program. They tend to feature a statistical sequence emphasizing a blend of theory, applications, and interpretation (mostly the latter two); a theory sequence focused on the relevant modeling foundations (e.g. logics for considering certain variables in a model, domain-specific patterns to look for to ensure model validity); and applied courses really focused on topic-specific issues (e.g. environmental issues in an environmental econ course). They lack a focus on developing digestible analytics products for managers or investors (more true of economics programs than business programs), on programmatically grabbing data from the web or specific servers (I think true for both, though maybe more for economics programs), and on working as part of a larger data generation/collection-storage-analysis pipeline (maybe true for both). To the extent that this story is leading to data science programs, I would expect them not to emerge where there are already business analytics programs. In those places I might instead expect to see an existing business analytics program morph into a data science program (at least partly for the branding). Where there is a strong economics department and/or business school, I would predict involvement from those faculty and courses.
2. Data science as a category for grouping scientific activities
This story holds that the rise of data science reflects the need for a semantic grouping of a class of activities that aren’t necessarily new, but are perhaps becoming more common. The terms “natural” or “social” science describe topics that a scientist studies, and the term “lab science” describes a methodology for conducting science. “Data science”, then, is a term for describing a different class of scientific methodologies. Loosely, in this story, “natural science:social science::lab science:data science”. These pressures lead to data science programs that codify and legitimize existing practices and activities, bringing them under a common umbrella. This can be valuable, especially if you believe that cool things happen when people with compatible skills/interests in different disciplines share a roof, e.g. cool new interdisciplinary work. But it’s not a set of pressures that really tends to motivate fundamental changes in how disciplines work—the change has already happened and the program’s rise reflects that, and/or the new “data science-y” variant of the field finds itself a bit on the fringes of its own discipline. In a sense, the emergence of this type of data science is the inverse of the process that led to the formation of political science: a grouping of disparate work under a common umbrella without a clear agenda (DS), rather than a fracturing of an established field into a new one focused on a specific agenda (PS). I’ve seen some research, like Jessica Hullman’s work, that I think could be classified as “data science first” rather than just “data science by way of another discipline”. But such research seems more the exception than the rule. I don’t think there’s much of an agreed-upon “data science” research agenda or many field-specific institutions guiding agendas yet.
This story came to me from reflecting on a colleague’s statement, “don’t all sciences use data?” Yes, surely, and how is that reflected in our language? We call some things “lab sciences”. Google tells me Oxford defines “laboratory” as “a room or building equipped for scientific experiments, research, or teaching, or for the manufacture of drugs or chemicals.” The drugs and chemicals bit can’t be essential to the definition, else we wouldn’t really consider many “labs” in physics or biology as “labs”. The bit about “equipped for scientific experiments, research, or teaching” could apply to nearly any natural or social science. Even economists run experiments! Even sociologists teach! Do we then consider economics or sociology as “lab sciences”? Not usually. We reserve the term mostly for “natural sciences”, where we view experiments as a key part of methodology. “Data science” provides a term for fields or subfields built around data analysis rather than data generation. In this view, a “data scientist” is like a “lab technician”: an individual trained in any of a set of fields, skilled in particular methodologies, and not necessarily wedded to a single topic. In this view, a program offering a “data science” major/minor is a bit odd—would we consider a “lab science” major/minor, training students in how to use laboratory equipment at large?—but a “data science” job classification or working group is perfectly logical and even useful.
3. Data science as an expression of shifting power relations between factions
This story views the emergence of “data science” programs as an expression of some programs growing/evolving/colonizing/dying/being colonized by data-centric philosophies. From what I’ve seen, math, stats, and computer science seem to be typical initiators of these processes, e.g. the rise of computational linguistics. This doesn’t have to be an active takeover; there are often real synergies to these collaborations. But fundamentally, in this story, the emergence of a data science program is about specific factions within a college or university accumulating institutional power.
This take came to me from something Miles Kimball once said, something along the lines of, “you can tell how much a field feels it’s exhausted the low-hanging fruit in its core problem area by how much it tries to branch out into other areas.” Clearly there are many fascinating problems left in math/stats/CS. But even if there are low-hanging fruits, a program unable to attract the talent and resources necessary to chase those fruits may find itself forced to branch out, e.g. a CS department that cannot attract/retain leading scholars but is nevertheless active and seeking to do exciting work. I’ll call this “Kimball’s necessary condition for field branching”. (If it were a sufficient condition I would expect to see more small/under-resourced physics departments initiating data science programs.)
I think the other necessary condition here is a pursuit of institutional power. Programs which train relatively many students and can obtain consistent and unrestricted funding streams (independent of existing streams controlled by an advancements or alumni relations office) have the potential to accumulate institutional power—primarily resources and influence within/beyond the college or university. In places with programs subject to Kimball’s necessary condition, this story can generate data science programs where the first two stories may not—particularly if other conditions for power accumulation (e.g. rising relative enrollments, greater external prestige) are met.
How does one identify a data science program primarily oriented around gathering institutional power? One marker, I think, is rejection of existing power centers within the institution, particularly those which are data-centric but not subject to Kimball’s necessary condition. Such groups pose a threat to a power-oriented data science program. If let into the fold, the existing power centers can use their existing power to claim a large share of the rents generated by a data science program. Another is a lack of clear connection to story #1. A data science program which doesn’t emphasize specific skill packages and outcomes (or lacks a credible plan to deliver them) seems to be less based in story #1 than one which does. Note that power accumulation is completely compatible with story #2. Indeed, groups which already “do data science” but are not very powerful in their institutions may face a strong incentive to agglomerate and form a program that gives them an umbrella and some clout.
Reflections
In general I think stories #1 and #3 can be sufficient conditions for a data science program to emerge. I don’t think story #2 can be. It might suffice for a data science working group or something, but not for a full-fledged program, at least not without surplus resources that another faction within a college or university hasn’t claimed. Story #2 just doesn’t generate the potential for rents that (I think) is necessary to motivate and strengthen costly efforts to claim additional resources. Story #1 generates rents to college or university stakeholders primarily from outside entities, e.g. greater absolute enrollments or wealthier alumni who donate more. Story #3 generates rents, but a non-trivial share may be transfers from existing data-centric fields towards the data science program initiators. New rents at the college/university level are generated if and only if the data science program is able to induce otherwise non-data-oriented students to enroll and improve their future earnings or social standing. If there are already data-centric power centers in the college or university, then most such students may already be claimed. A minor, then, may make more sense than a major: it is a low-commitment way to get students in seats and signal strength to other factions.
Story #1 is arguably consistent with any college or university’s mission, though it may be better received at some (e.g. technical colleges with a compatible focus, or universities focusing on expanding educational access to historically-underserved populations) than others (e.g. elite liberal arts colleges with high post-grad employment rates). Story #2 may or may not be mission-consistent, but provided the resource utilization is relatively low, it’s unlikely to be harmful. Story #3 is a bit trickier. It’s not obvious to me when or whether internecine power struggles can support a college or university mission. Perhaps a useful test is, “what new capacities does this program enable?” If it enables sufficiently-valuable new capacities, perhaps the existing power balance was in need of disruption. Else, perhaps the struggle is simply a zero- or even negative-sum game.