The Data Scientist Dilemma

Steve Taitoko
Jul 18, 2022

As a term, ‘data science’ is a dominant feature within our information systems lexicon. It is referred to often, but do we truly understand what it is and how best to unleash its power as a force for good?

The concept was first used in the early 1970s; however, data science as a term wasn’t codified until William S. Cleveland’s 2001 paper. Since then, it has been used interchangeably with emerging areas such as big data, artificial intelligence, and IoT, with little regard for how to apply it and extract the greatest value from it. As a result, significant barriers exist to the mainstream adoption and scalability of data science.

At its core, data science encompasses data strategy, engineering, analytics and visualisation. All of this must then be wrapped within the context of particular domain expertise to be useful. Our typical response has been to wrap responsibility for all of this into one tidy role description.

Finding even one individual with the skill, experience, and wherewithal to meet the expectations of that job description, let alone the hundreds of thousands required, is like finding the proverbial unicorn. Welcome to the world of the data scientist, and thus the pinch point of our data science supply chain.

In a Kaggle State of Data Science and Machine Learning survey of over 16,000 data professionals, a number of recurring themes were identified as challenges faced by data scientists, including data curation, management support and budget approval.

Given the vast scope of responsibility we are asking our data scientists to take on and the structural barriers in their way, we have a mountain to climb to reach the full potential of advanced analytics.

To resolve the issue of the vast scope of a data scientist role, we must look at ways to treat data science as a collaborative workflow and not a single data scientist role. Breaking out the core functions of strategy, engineering, analytics and visualisation into distinct roles will be a good start towards building cross-functional teams with domain expertise in each function.

To resolve the second challenge of breaking down barriers to better data management, we must address some of the following core issues:

1. The 80/20 Problem

It is estimated that a data scientist spends 80% of their time finding and cleaning suitable data before they can even use it. This creates an even wider gap between the application of their domain expertise and the scale of problems they are solving.

2. Structure of Data

The way in which we are interacting with data is changing rapidly. For instance, it is estimated that 75% of data will be generated and processed on the edge by 2035, a big shift from the current 10%. Combine this with how data silos are allowed to exist, and we have constructed the perfect barriers to full data flow and collaboration.

3. Data Governance

Data privacy, sovereignty, and shareability are significant barriers to access and collaboration. Without coherent governance and policy management built into data management tools, interoperability within and between organisations is almost impossible.

4. Scale of Growth

The rate at which new data arrives, often referred to as data velocity, is growing exponentially. It is estimated that we currently have over 44 zettabytes of stored data. To put that in context, if you took an average smartphone’s storage capacity, 44ZB would fill 188 billion phones, which if laid end-to-end would circumnavigate the earth about 670 times.

By 2025, total data in storage is expected to be 124ZB. A threefold increase in data storage at the scale of data that already exists presents a big problem given the way in which our data is currently architected and governed.
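
Comparisons like these are easy to sanity-check with back-of-envelope arithmetic. The sketch below is one way to do it in Python. The per-phone capacity (256 GB, decimal units), the phone length (15 cm) and the equatorial circumference (about 40,075 km) are illustrative assumptions, not figures from the sources above, and the function names are hypothetical.

```python
# Back-of-envelope check of the storage comparisons in this section.
# All inputs are illustrative assumptions; change them to test other scenarios.

ZETTABYTE = 10**21                   # bytes, decimal (SI) definition
GIGABYTE = 10**9                     # bytes, decimal (SI) definition
EARTH_CIRCUMFERENCE_M = 40_075_000   # approximate equatorial circumference in metres


def phones_needed(total_bytes: int, phone_bytes: int) -> int:
    """Number of phones required to hold total_bytes (rounded up)."""
    return -(-total_bytes // phone_bytes)  # ceiling division on integers


def laps_around_earth(n_phones: int, phone_length_m: float) -> float:
    """How many times a line of phones laid end-to-end would circle the equator."""
    return n_phones * phone_length_m / EARTH_CIRCUMFERENCE_M


def growth_factor(current_zb: float, projected_zb: float) -> float:
    """Ratio of projected to current stored data."""
    return projected_zb / current_zb


if __name__ == "__main__":
    # Assumed phone: 256 GB of storage, 0.15 m long (hypothetical values).
    phones = phones_needed(44 * ZETTABYTE, 256 * GIGABYTE)
    laps = laps_around_earth(phones, 0.15)
    growth = growth_factor(44, 124)

    print(f"Phones needed: {phones:,}")
    print(f"Laps around the earth: {laps:.0f}")
    print(f"Growth from 44ZB to 124ZB: {growth:.2f}x")
```

Under these assumptions the count comes out at roughly 172 billion phones, and the growth from 44ZB to 124ZB at a little under threefold. The results are sensitive to the assumed capacity and phone length, so the exact figures will shift with different inputs.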

Conclusion

These hardwired and structural barriers have resulted in the burden of ingesting and processing data at scale falling on the few. To resolve this, we must shift ownership to the highest levels of our institutions. This will involve redesigning our systems, building strong governance settings, establishing data teams, creating better tools, and providing pathways to training.

Ultimately, to truly become AI-ready we must build a data culture capable of allowing science to flourish, unlocking new value, and solving the wicked problems faced by humanity.

Cover Image	NASA