The Future of Data Science Lies in Automation
Parts of data science can be automated today, and more may be automated soon.
- By Julius Černiauskas
- March 20, 2023
Data science is a wide-ranging field that has been successfully applied in both scientific and business domains. Companies have been heavily investing in all things data in their quest to become data-driven.
With every business-minded investment comes the idea of optimization, and data science is no different in that regard. Although companies are pouring in money, they are also thinking of ways to make the most out of those resources. Automation is an inevitable part of optimization and often the first course of action.
Data science may seem like a field that’s nearly impossible to automate due to its inherent complexity. There are so many steps, from data extraction to modeling, all of which seem to require human input. We have thought the same about many other processes, however, and still found ways to automate them.
Breaking Down the Parts of Data Science
Data science can be separated into several distinct parts, which together define the field. These are data exploration, data engineering, model building, and interpretation.
Data exploration largely revolves around discovering the needs, goals, and requirements of a particular task. For example, an e-commerce business might need all pricing data for a specific category across a variety of regions. Each needed data set has to come from some source (or a multitude of them); however, it’s not always clear how to find the right data.
Additionally, exploration will often involve working with some data sets to discover goal-driven questions, the potential for visualization, etc. These aspects require quite extensive human judgment and are domain- and goal-specific. As a result, automation for data exploration is likely somewhat far away.
Data engineering -- the process of actually acquiring, labeling, wrangling, and transforming data -- is often the most time-consuming aspect. Unfortunately, we have had little success in automating these tasks. Automation is feasible mostly when a functioning and accurate model already exists; labeling novel data sets remains challenging.
The other two parts, however, have much more potential. Data interpretation, to some surprise, has been shown to have the potential for automation. In 2014, a group of researchers created a natural language model that could interpret basic regression models (and even draft a full report with explanations) with an impressive degree of veracity.
Since then, various business implementations have aimed to do the same thing for more actionable, less academic insights. Numerous products, such as Microsoft’s Power BI, have integrated automated insight generation, albeit in a somewhat limited capacity. Soon enough, I believe we’ll get complete overviews from business intelligence systems.
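To make the idea concrete, here is a minimal sketch of automated interpretation: a function that renders fitted regression coefficients as a plain-language report. The model, feature names, and numbers below are invented for illustration; real insight-generation systems work from far richer model metadata.

```python
# Hypothetical sketch: turn fitted linear-regression coefficients into a
# short natural-language summary, in the spirit of automated insight
# generation. The toy coefficients below are invented for illustration.

def describe_regression(target, coefficients, r_squared):
    """Render a fitted linear model as a short plain-language report."""
    lines = [f"Model fit for '{target}' (R^2 = {r_squared:.2f}):"]
    # Report the strongest effects first.
    for feature, coef in sorted(coefficients.items(),
                                key=lambda kv: -abs(kv[1])):
        direction = "increases" if coef > 0 else "decreases"
        lines.append(
            f"- A one-unit rise in '{feature}' {direction} "
            f"'{target}' by {abs(coef):.2f} on average."
        )
    return "\n".join(lines)

report = describe_regression(
    target="weekly_sales",
    coefficients={"ad_spend": 3.2, "price": -1.75, "holiday": 0.4},
    r_squared=0.81,
)
print(report)
```

Even a template this simple turns model output into something a non-technical stakeholder can read, which is essentially what the early research systems demonstrated.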
Model building -- the practice of selecting algorithms, tuning parameters, evaluating performance, and creating machine learning models -- has already seen a decent degree of successful automation through AutoML.
The Role of AutoML
Much data science work is done through machine learning (ML). Proper employment of ML can ease the predictive work that is most often the end goal for data science projects, at least in the business world.
AutoML has been making the rounds as the next step in data science. Part of machine learning, outside of getting all the data ready for modeling, is picking the correct algorithm and fine-tuning (hyper)parameters.
After data quality, the algorithm and its parameters have the greatest influence on predictive power. Although in many cases there is no perfect solution, there’s plenty of wiggle room for optimization. Additionally, there’s always some theoretical near-optimal solution that can be arrived at mostly through calculation and decision making.
Yet, arriving at these theoretical optimizations is exceedingly difficult. In most cases, the decisions will be heuristic, and any errors will be corrected through experimentation. Even with extensive industry experience and professionalism, there is just too much room for error.
AutoML systems, such as Python libraries (e.g., Auto-sklearn), use advancements in mathematics and computer science to automatically select algorithms and fine-tune parameters. Research and experimentation have shown that various AutoML systems can often optimize pipelines and deliver accurate results with surprising consistency.
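The select-and-tune loop these systems automate can be sketched in a few lines: enumerate candidate (algorithm, hyperparameter) pairs, score each on held-out data, and keep the best. Real systems such as Auto-sklearn use far smarter search strategies (Bayesian optimization, meta-learning); the toy predictors and data below are invented for illustration.

```python
# Minimal sketch of the AutoML select-and-tune loop: exhaustively score
# candidate (algorithm, hyperparameter) pairs on held-out data and keep
# the best. The predictors and data points are invented for illustration.

held_out = [(5, 10.1), (6, 11.8)]  # (x, y) validation pairs

def linear(slope):
    return lambda x: slope * x

def constant(value):
    return lambda x: value

# Search space: each algorithm comes with a small hyperparameter grid.
search_space = {
    "linear": (linear, [1.5, 2.0, 2.5]),
    "constant": (constant, [5.0, 8.0]),
}

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

best = None
for name, (factory, grid) in search_space.items():
    for param in grid:
        score = mse(factory(param), held_out)
        if best is None or score < best[2]:
            best = (name, param, score)

print(best)  # -> ('linear', 2.0, 0.025)
```

The point of AutoML is that this search, trivial here, explodes combinatorially with realistic algorithms and parameter grids, which is exactly where automated, principled search beats manual heuristics.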
Although AutoML does not and will not completely automate data science, it has the potential to take a significant portion of manual work off the shoulders of humans. Its potential lies in simplifying a usually difficult part of machine learning.
Making Machine Learning Easier
Automation is not only about optimizing resource costs; it also removes the barrier to entry for some activities. Machine learning has two major hurdles to its accessibility.
Data acquisition and engineering is the first obstacle. However, data acquisition has been made easier through the emergence of web scraping, public data sets, and other developments. Labeling and wrangling still remain largely unchanged, but finding the necessary data has often been the primary challenge in data science.
AutoML, however, makes machine learning more accessible by reducing the requirements for creating an optimized model. Currently, the technology can still run into issues when high-quality data is not available, so it’s definitely not a cure-all, and general machine learning knowledge is required.
In the near future, however, AutoML has the most potential to completely automate a part of data science and provide easier access to the field for less experienced practitioners. Additionally, large language models and natural language processing will aid data scientists in producing easy-to-read interpretations.
Finally, I expect that data engineering will be next in line for automation. Data integration, normalization, and extraction can already be automated, and all that is needed is to find solutions that can be scaled.
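As a small illustration of what automated data integration looks like in practice, here is a hypothetical sketch: records arriving from two sources with different field names and formats are mapped onto one canonical schema by an alias table. The field names and records are invented for illustration.

```python
# Hypothetical sketch of automated data integration: a mapping-driven
# normalizer merges records with differing schemas into one canonical
# form. Field names and sample records are invented for illustration.

FIELD_ALIASES = {
    "price": ["price", "unit_price", "cost_usd"],
    "sku": ["sku", "product_id", "item"],
}

def normalize(record):
    """Map a raw record onto the canonical schema, coercing price to float."""
    out = {}
    for canonical, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if alias in record:
                out[canonical] = record[alias]
                break
    if "price" in out:
        # Strip a currency symbol if present, then coerce to float.
        out["price"] = float(str(out["price"]).lstrip("$"))
    return out

source_a = {"unit_price": "$19.99", "product_id": "A-100"}
source_b = {"cost_usd": 24.5, "item": "B-200"}

merged = [normalize(r) for r in (source_a, source_b)]
print(merged)
```

Rules like these already run in production pipelines; the open problem the article points to is scaling them to new, unseen schemas without hand-written alias tables.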
About the Author
Julius Černiauskas is the CEO of Oxylabs -- once a small startup in the public data collection industry, now employing over 400 specialists. Since joining the company in 2015, Černiauskas has successfully transformed the original business idea of Oxylabs by applying his knowledge of big data and information technology trends. He implemented a brand-new company structure that led to the development of a sophisticated public web data gathering service. Today, he leads Oxylabs as a global provider of premium proxies and data scraping solutions, helping companies and entrepreneurs realize their full potential by harnessing the power of external data. You can reach him on LinkedIn.