Please Stop Touting the "Data Scientist"

  • March 26, 2020

  • Colin Temple

Instead, let's tout what makes everyone's own skills uniquely valuable.

This article was originally posted here on LinkedIn.

Tim Wilson, of Digital Analytics Power Hour fame, recently wrote a great piece on LinkedIn about the precarious notion of a 'Citizen Data Scientist'. This is the idea that people who work in areas outside of data science or analytical roles can or will soon be able to take advantage of powerful data science technologies to glean the same kinds of benefits that the pros do without writing code, or studying statistics, or even having a basic idea of how any of it works.

I am not going to re-hash Tim's analysis of this idea. I recommend you read it, because it helps clarify some things about why that is not likely to succeed any time soon. Instead, in what follows I highlight some issues with the way we refer to 'data science' and 'data scientist'.

I don't like the way we tend to use the term 'data science', generally. I have noticed that people use it in (at least) two importantly distinct ways, and that our sloppy use of the term sometimes masks what we want in hiring data scientists and what they can do when they do their thing.

The title of this article is a reference to the title of Tim's, and so I do want to clarify that I am not against the term in general, but that I think that it's often inadequate or too general to be useful in conveying what people often mean when they use it. (This is a philosophical, something-to-think-about article, not a complaint, condemnation or an immediate call to action.)

Science Just Is Data Science

All science studies data. Data in one form or another is always a proxy for what we are actually interested in, so when you undertake any empirical endeavor you do it through data. The scientific method itself involves the collection of data that is meant to evaluate whether or not a proposition, or set of them, can serve as a reliable generalization in our experience.

We can talk about subjects like computer science and information science, which can be thought of as being done a priori, that is, purely in the mind or without experience outside of it. But really, these are subsets of logic and mathematics, and not empirical sciences in the same sense.

Now, we can quickly start splitting hairs there, but what I most dislike about the term 'data science' in practice is that it obscures what is actually being studied, at least the way it is frequently used.

Physics studies the physical world. Chemistry studies chemical structures and their interactions. Biology studies biological entities. Cosmology studies the origin of the cosmos. You get the idea. When we refer to sciences, we tend to refer to them by their domain of discourse; the thing they are about.

If 'data science' were a term like this, it would be about studying data itself. It would be about what data is, its nature, what forms it can take, how it is supposed to represent something else in the world, and so forth. It would not be concerned with the content of any data set; it would be a study marrying pure information theory with semantic concepts of logic and language.

It is my experience that data scientists, as clever as they are, don't tend to be worried about those kinds of philosophical concerns when they do their jobs. And that is because they aren't people engaged in a science of data, but something else altogether.

The Kind of Term 'Data Science' Is

So what do so-called data scientists study?

In marketing contexts it's usually some subset of human behaviours. Marketing contexts are common because, let's face it, the people who are suddenly interested in and paying for research in artificial intelligence and automating advanced statistics on large data sets are doing it in the context of marketing in one way or another. There's money there.

In other applications it is more fundamental physical processes, and in others, it is more abstract organizational movements. Data science is done in medicine and in economics and in political campaigning and traffic management and in supply chains and, yes, in trying to get you to buy some sneakers.

What is supposed to unify 'data science' is the emergence of these common methods to achieve reliable analyses using large data sets while being able to curate, maintain and secure that data in ways that make such analyses more reliable.

As a practice, data science is a term more akin to 'statistics' or 'engineer' or 'artist'. It's a very broad term that has more to do with the kind of activity being done than what is being studied. There is nothing wrong with that in principle; you can be an engineer or a statistician working in a particular field, and people roughly know what the general term means before you explain more. So, can't we just say that someone is a data scientist in a general sense?

Well, of course we can. We can say whatever we want. But what I think it loses is a clear communication of what is being done. Data science represents a set of things, but they're a diverse set of methods and technical skills that already exist. Machine learning is just statistical analysis done really fast on more data than people can typically keep track of, and often in ways that form some abstractions away from the underlying process to escape the limits of domain-specific, human-written algorithms. Data engineering is the same thing it has always been; it's just gotten better.

Applied physics and chemistry and pharmacology and psychology and economics and all of the domains under the vast umbrella of science have long used statistics across the board. There are people who are statisticians, as there are data scientists, but almost always they are people who are trained in and focused on such methods of inquiry who become practitioners, and maybe experts, in some domain of applying them.

The Context Matters

I am not saying that data scientists are not scientists in their respective domains per se, or that the thing that conceptually unifies them is invalid. But it's not a science and it's not the science of data. It's method. When it's done right, it is careful and highly effective method.

But that's very rarely enough. To analyze data you need some domain knowledge. Almost always, these things are part of some normative endeavor. We apply this method so that we can then use that knowledge for something. Maybe that's finding a vaccine to fight a global pandemic, maybe that's boosting customer lifetime value. Whatever it is, though, our ends are normative in the sense that we are not just establishing truth, we are trying to change the world to be something else. We are acting, or intend to act, in direct response to this data and whatever we infer from its rearrangement.

And so if you want to analyze data, your context matters. You need to understand your domain. You will need to know why the data you get is the data you get, which means understanding its technical origins, its known biases, its potential but unknown biases, what its structure and modelling represents in the world—good analysis has to account for those things. It's the same reason that looking at a dashboard full of KPIs, or whatever metrics you care about, is useless if you don't have any clue of why the numbers are what they are and how to move them in the direction you want them to go.

So, becoming an expert in method isn't enough to be useful as someone who can glean insights and make predictions that matter from the data. Everyone who works as a 'data scientist' does so in some field. They can change fields, of course, but some new knowledge is needed to be effective in a new one.

Tim Wilson also makes this point in his article, when he talks about feature engineering. Without conceptual understanding of the normative goal and how the data maps to the factors influencing outcomes there, no data scientist, nor any machine, can tell you anything useful about how to make predictions and make changes to affect the end result. Being aware of the context of the data and what you're trying to do with it is pretty essential to getting any value from it.

It's How You Use It

Ultimately, though, I am not trying to pick on the term 'data science' because it doesn't literally mean the thing it sounds like outside of cultural context. (I have a whole philosophical rant about how criticisms of word use are silly; you can hear it if you'd like to grab a beer sometime.) Rather, the issue I have is that conceptually lumping all of data science together can obscure, even to data scientists themselves, what matters about what they do.

If you are not aware of the context in which you use these methods of inquiry—of operating on data to make inferences and estimate outcomes—you will misinterpret the data. It happens to statisticians who don't think about psychological factors in surveys every day. It happens to digital analysts who are not aware of a JavaScript bug every day. It happens to working scientists who are hoping that some specific outcome justifies the grant that is funding them every single day.

At the scale we are growing into, those mistakes scale up as well. And so data scientists cannot afford to simply apply the methods of working with high volumes of data the same way anywhere they go; nor can a technology for statistical modelling truly be universal. They need to become knowledgeable in their domain, and work in close collaboration with domain experts who have the knowledge of context, and theory, and what has been proven to fail before.

Hiring Data Experts

The term 'data science' is widely used. It is sometimes conflated with its practice in specific domains, or used as if it were a specific science in itself.

When a marketing agency, for example, posts a job for a data scientist, rather than a marketing analyst or something domain-specific, what I often see as the motive for doing so is the desire for sophisticated methods to be applied (sometimes inappropriately for the task at hand), or some sense of "modernizing" or "gaining insight" generally. The expectation varies, though. Sometimes, it seems that folks are after a generalist, someone who will wear many hats and just kind of do something with the data. In this sense, they might want a data scientist for what it means: someone with a bank of technical skills and methodologies at hand, who will ultimately need to learn the domain (but that's usually easier to teach than the skills of thinking analytically and understanding technology).

Sometimes this could be narrowed down to looking for a data engineer, an analyst, or someone to answer complex questions. In those cases, the role is not for a scientist at all, but a builder, or a journalist. That's not to devalue those roles in any way, but they don't deserve to be lumped into something that confuses what they do well.

At any rate, what you get when you seek a generalist is someone who can do the work, perhaps quite well, but is probably some ways away from thinking critically about the data they get or how to best wield it in service of the organization's goals.

This is the reason, as Tim rightly points out, that general data science all-in-one software solutions are typically ineffective. They don't have the conceptual knowledge required to understand and question the incoming data, and to be able to choose a goal, like identifying a question that's worth answering, or a change that would hypothetically improve outcomes.

When the expectation is for an expert in domain-specific analysis, things are different. This kind of person may have some or all of the data science techniques mastered, but they are well-versed in a particular kind of data, what it means, where it is biased, and how to work with it. But posting a job opening for a 'Data Scientist' isn't really going to attract that kind of talent, and a person in that role with the title 'data scientist' is obscuring the most valuable aspects of their fit in that role. That's where I most object to the use of the term, not because it's wrong but because it's not particularly clear or helpful.

I am not advocating for eliminating the term altogether, but amidst the hype of new technologies and the promises made by the companies offering them, it's important to keep in mind that there's a difference between someone who can practice a specific science, and someone who is ready to learn one. There is value in both. There's also often a price difference between them. But it's not fair to either to confuse the two.

Tim wrapped up his article saying that we should not "kid ourselves into thinking that . . . we are mere months away from truly 'doing' data science simply by asking Siri to build us a model." A point I want to add is that we shouldn't kid ourselves into thinking that we are getting the right output, or reliable answers to questions, if we ask just any human who is good with data.