How to Prevent Data Pipeline Engineering Burnout
There are talented data engineers out there, but they may be spending up to 90 percent of their time on manual DataOps tasks. These suggestions will help them avoid burnout and be more productive.
- By Ori Rafael
- May 16, 2022
I lived the life of a data engineer. The majority of my time as a data engineer was dedicated to writing or maintaining data pipelines. Every time I thought I was on top of things, something new would set me back. There was always a new request for faster analytics, new data in the pipeline, or scaling to an impossible level.
My day-in-the-life-of-a-data-engineer story went like this:
- Fielding around five requests a day to create new tables, update schemas, and change transformations. (Of course, from the point of view of my internal customer, each of these requests was urgent.)
- Starting work at 2 a.m. because operations wouldn’t allow me to change data pipelines during work hours.
- Responding to calls from the network operations center about production data pipelines not completing, leading me to have to profile the problem, restart servers, increase server sizes, and clean temporary data that wasn't purged -- all under extreme time pressure.
I was burnt out. This drove me to change my life -- not only to change careers but to build a product and company that would address the insanity of data pipeline engineering. I joined forces with Yoni Eini, a CTO in the Israel Defense Forces, to solve this problem for us (and the world).
Data Engineers Feel Burnt Out
It turns out that I was not the only one suffering burnout due to data pipeline engineering. We have been hearing a great deal lately about data engineers feeling exhausted and ready to leave their jobs. One aspect that is rarely talked about in this discussion is the role that manual operations for data pipelines play in the mounting frustration levels for data engineers.
According to an October 2021 study, 97 percent of data engineers reported experiencing burnout in their daily jobs. Nearly 80 percent said they were considering switching careers, and 78 percent wished they had a therapist available to help them cope with work stress.
This is troubling for a number of reasons. First, as tech leaders, we want professionals to be in jobs they enjoy and where they feel they are making important contributions without feeling overwhelmed. Second, as the U.S. deals with the Great Resignation and a large number of professionals are leaving their jobs and careers, the tech industry is feeling the pain.
Right now we need more -- not fewer -- data engineers. Data engineering already suffers from an intense talent gap. To get a sense of how big the shortage is, I recently looked on LinkedIn and found 217,000 open data engineer positions in the U.S. against only 33,000 employed data engineers, roughly a 6.5-to-1 ratio of jobs to people. This ratio eclipses that of data science, a profession that has for years been the poster child for skills deficits. I also found roughly 398,000 open data science positions and 78,000 employed data scientists, a 5-to-1 ratio -- bad, but still a better situation than data engineering faces.
Finally, the advent of big, complex, and streaming data, combined with demand for real-time analytics and machine learning, has made data engineering much more difficult than it was when I experienced this pain. Data engineers used to be productive with SQL and Oracle under their belts. Today they must manage several data platforms, write production code on complex distributed systems, and perform more manual operations (orchestration, file system management, and state management, for example) than ever before.
There are talented data engineers out there, but if 90 percent of what they do is manual DataOps, they can't be productive. Technology has the potential to make data engineers many times more productive by automating what's manual today. That would free data engineers to focus on delivering platforms instead of developing and maintaining pipelines, resulting in happier data engineers, faster analytics cycles, better data quality, and (usually) lower costs, because automated pipelines can be optimized for efficiency.
Many Data Pipelines Have a Problem
Each data pipeline is composed of two parts: transformations (business logic), which take 10 percent of the time, and manual pipeline operations (aka pipeline ops), which take 90 percent of the time.
In other words, a mere 10 percent of a data engineer’s time is spent defining the business function of the data pipeline -- which is the value delivered -- while 90 percent is spent on production engineering. Why are data pipelines so engineering-intensive? Here’s a look at the fundamental pieces of work that data engineers perform as part of code-heavy pipeline ops:
- Orchestration (directed acyclic graphs, or DAGs; see the sketch after this list)
- File system management on object storage
- Large state management
- Detecting schema and statistics from raw data
- Integration between data platforms
- Performance tuning for scale
- DevOps for managing and scaling computing clusters
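To make the orchestration item concrete, here is a minimal sketch of the kind of boilerplate involved, using Apache Airflow as one common choice. The DAG name, tasks, and schedule are illustrative assumptions, not from any real pipeline; the point is how little of the code is business logic.

```python
# A minimal sketch of orchestration boilerplate (hypothetical pipeline).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Pull raw events from object storage (placeholder logic).
    print("extracting raw events")


def transform(**context):
    # Apply the business logic -- the 10 percent that delivers value.
    print("transforming events")


def cleanup(**context):
    # Purge temporary data that would otherwise pile up -- the kind of
    # housekeeping that pages an engineer at 2 a.m. when it is skipped.
    print("purging temp data")


with DAG(
    dag_id="example_events_pipeline",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Everything below is pure pipeline ops: wiring, retries, ordering.
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    cleanup_task = PythonOperator(task_id="cleanup", python_callable=cleanup)

    extract_task >> transform_task >> cleanup_task
```

Only the transform step carries value; the rest is scaffolding that must be written, tested, and babysat for every pipeline.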
If 90 percent of a data engineer's time is spent on these types of manual tasks as well as a never-ending break-fix cycle, it’s no wonder they are experiencing burnout.
Businesses Shouldn’t Overlook the Importance of Pipeline Ops
The rise of big data and modern data management systems has made data pipeline ops -- the manual production engineering for data pipelines -- more challenging.
A breakdown in data pipeline ops has several important implications for businesses. First, they must deal with prolonged time to value for analytics projects, sometimes months or quarters from request to delivery. In addition, complex and detailed pipeline engineering is error-prone, leading to unreliable pipelines, poor data quality, and misinformed decisions. Finally, unoptimized data pipelines can generate substantial unforeseen costs to the business and its customers, especially on metered cloud services.
The good news is that there are ways to make modern data pipelines simple even for today's complex requirements. It is now possible to implement a single integrated system for pipeline development and operations that enables data engineers to create pipelines at scale in days and manage them with much less effort.
Businesses can create an easier and faster path to production-grade data pipelines by focusing on three key areas:
- Eliminating code-intensive development of data pipelines by using well-known languages such as SQL. Using SQL expands the data pipeline developer audience by two orders of magnitude (all data engineers, plus some data consumers); see the sketch after this list
- Abstracting pipeline ops into a product that automates all manual operations related to production engineering: orchestration, state management, file system management, and data lake best practices, among other tasks
- Leveraging the cloud data lake as a processing backend because it’s the most reliable, scalable, and affordable infrastructure available
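As a rough illustration of the first point, here is a minimal sketch of a pipeline whose entire business logic is one SQL statement over a cloud data lake, using PySpark as a stand-in engine. The bucket paths, table name, and query are hypothetical, and a purpose-built pipeline platform would also hide the remaining Python scaffolding.

```python
# A minimal sketch: the whole transformation expressed as SQL on a data lake.
# Paths, table, and query are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_only_pipeline").getOrCreate()

# Register raw events stored as Parquet on object storage (hypothetical path).
spark.read.parquet("s3a://example-bucket/raw/events/") \
    .createOrReplaceTempView("raw_events")

# The business logic -- the valuable 10 percent -- is one declarative
# statement, with no orchestration or state-management code around it.
daily_revenue = spark.sql("""
    SELECT CAST(event_time AS DATE) AS day,
           SUM(amount)              AS revenue
    FROM raw_events
    WHERE event_type = 'purchase'
    GROUP BY CAST(event_time AS DATE)
""")

# Write analytics-ready output back to the lake (hypothetical path).
daily_revenue.write.mode("overwrite") \
    .parquet("s3a://example-bucket/curated/daily_revenue/")
```

Even in this sketch, everything outside the SQL string is scaffolding; the goal of the approach described above is to automate that away entirely.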
Recently, tools have emerged that provide an automated, SQL-based approach to pipelines on a cloud data lake. They can act as a force multiplier for data engineers by speeding the delivery of analytics-ready data. Digital-native companies such as IronSource and Sisense are using these new platforms at petabyte scale today to give data engineers a SQL-only experience for building pipelines on an AWS or Azure cloud data lake. You can, too.
About the Author
Ori Rafael is the CEO and co-founder of Upsolver, a no-code data lake engineering platform for agile cloud analytics. Before founding Upsolver, he held a variety of technology management roles in the IDF's technology intelligence unit, followed by corporate roles. Rafael has a BA in computer science and an MBA. You can contact the author via email, LinkedIn, or Twitter.