Data Preparation: Advanced Analytics to the Rescue
Demand is growing for faster, more dependable data preparation. According to a recent survey by TDWI, professionals believe there is plenty of room for improvement.
By David Stodder
June 3, 2016
Data preparation, once humbly known as data plumbing, is one of the hottest topics in our industry. Demand is growing for faster, more dependable, and more repeatable processes that can take data drawn from multiple sources all the way from ingestion to its ultimate digestion by users of business intelligence (BI), visual data discovery, and analytics applications.
Vendors are addressing this demand by offering products that innovate, using machine learning and other analytics to automate preparation steps. Self-service data preparation is also a strong trend, focused on enabling nontechnical users to work with automated routines instead of losing time to dull, error-prone manual processes.
At the end of June, TDWI will publish a new Best Practices Report, Improving Data Preparation for Business Analytics. Working on this report was fascinating, and I look forward to sharing it with the TDWI community.
As part of the research, we extensively surveyed business and IT professionals. The full report will analyze the results of this survey in the context of critical issues and objectives in data preparation, including data quality, data integration, and data cataloging. What follows is a preview of our findings, focused on participants’ satisfaction with their current data preparation.
Attributes of Effective Data Preparation
The bottom line for data preparation is how well the resulting data meets users’ requirements for data quality, completeness, and relevance. In the survey, TDWI asked participants how satisfied their organizations are with some important attributes of their data.
In other words, are their current data preparation processes effective? We can look at four attributes of good data and examine what the survey results say about satisfaction with data preparation processes for achieving them.
Accuracy, Quality, and Validity
These attributes are the heart of data preparation processes and become critical when multiple sources are integrated and blended for BI and analytics. To be accurate, data values must be correct and in the right form; otherwise, reporting and analytics could be wrong or invalid. Data validation constraints and processes help control what goes into data sets so that the data remains clean and correct.
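To make this concrete, here is a minimal sketch of rule-based validation in Python; the field names and constraints are hypothetical, not drawn from any specific product. Records that fail a check can be rejected or quarantined before they enter the data set.

```python
import re

# Hypothetical constraints for an incoming customer record; field names
# and rules are illustrative only.
CONSTRAINTS = {
    "customer_id": lambda v: isinstance(v, str) and v.isdigit(),
    "email": lambda v: isinstance(v, str)
        and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "order_total": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(record):
    """Return a list of constraint violations; an empty list means clean."""
    errors = []
    for field, check in CONSTRAINTS.items():
        if field not in record:
            errors.append("missing field: " + field)
        elif not check(record[field]):
            errors.append(f"invalid value for {field}: {record[field]!r}")
    return errors

bad = {"customer_id": "1042", "email": "ana@example.com", "order_total": -5}
print(validate(bad))  # ['invalid value for order_total: -5']
```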
Some data preparation tools are able to move data through a “pipeline” -- an integrated workflow that improves quality over a series of steps. Some of these steps help organizations spot problems in the data and address the causes as the data is being loaded into data warehouses or other systems.
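A pipeline of this kind is often just a composition of small cleaning steps. The sketch below is illustrative, with hypothetical steps (whitespace trimming, deduplication, value standardization) standing in for the richer transformations a real tool would provide.

```python
# Each pipeline step takes a list of record dicts and returns a new list.
# Step names and logic are illustrative placeholders.
def trim_whitespace(records):
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in records]

def drop_duplicates(records):
    seen, unique = set(), []
    for r in records:
        key = r.get("customer_id")           # hypothetical natural key
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def standardize_state(records):
    lookup = {"calif.": "CA", "california": "CA"}   # illustrative mapping
    for r in records:
        raw = str(r.get("state", "")).lower()
        r["state"] = lookup.get(raw, r.get("state"))
    return records

PIPELINE = [trim_whitespace, drop_duplicates, standardize_state]

def run_pipeline(records):
    for step in PIPELINE:
        records = step(records)   # each stage could also log or quarantine rows
    return records

rows = [{"customer_id": "7", "state": " California "},
        {"customer_id": "7", "state": "CA"}]
print(run_pipeline(rows))  # [{'customer_id': '7', 'state': 'CA'}]
```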
Although it’s not a ringing endorsement, research participants are most satisfied with processes for achieving these attributes; 14 percent are very satisfied and 46 percent are somewhat satisfied.
Frequency of Data Refresh
Organizations need fresher data today to satisfy requirements for operational BI and near-real-time analytics, possibly through continuous, incremental data refreshes. Sometimes, however, the fresher the data, the lower its quality; there may not have been time to run proper data quality processes.
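One widely used pattern for incremental refresh is the high-water mark: remember the newest change timestamp already loaded and pull only rows modified since then. The sketch below assumes a hypothetical orders table with a last_updated column; it is a pattern illustration, not any product's implementation.

```python
import sqlite3

# Hypothetical source table with a change timestamp, built in memory
# so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_updated TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2016-06-01T08:00:00"),
     (2, 25.5, "2016-06-02T09:30:00"),
     (3, 14.0, "2016-06-03T11:15:00")],
)

def incremental_refresh(conn, high_water_mark):
    """Pull only rows changed since the last load, then advance the mark."""
    rows = conn.execute(
        "SELECT order_id, amount, last_updated FROM orders "
        "WHERE last_updated > ? ORDER BY last_updated",
        (high_water_mark,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else high_water_mark
    return rows, new_mark

rows, mark = incremental_refresh(conn, "2016-06-01T23:59:59")
print(rows)  # only the two rows updated after June 1
print(mark)  # the next run starts from '2016-06-03T11:15:00'
```

The trade-off this pattern exposes is exactly the one noted above: the more often you run the refresh, the less time there is to apply full quality checks to each batch.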
Satisfaction with processes for providing the right frequency of data refresh is similar to what we saw in the previous category; 16 percent are very satisfied and 41 percent are somewhat satisfied.
Consistency Across Data Sets
This attribute is critical to building trust in data, especially as requirements for business analytics push organizations to touch multiple sources to test variables and answer questions. Traditionally, organizations use profiling tools to spot inconsistencies in the data such as spelling variations in names and addresses.
To support BI reporting, data must be provisioned consistently so that trends and comparisons are valid; this is one reason why putting the data in a single, prepared store such as a data warehouse makes sense for BI reporting. For visual data discovery and other types of analytics, perfect consistency may not be as critical. Users and analysts may actually be interested in some inconsistencies because they may be meaningful.
Organizations are increasingly using techniques such as machine learning and fuzzy matching for entity resolution, data matching, and identity resolution to locate inconsistencies and determine what to do about them.
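As a simple illustration of fuzzy matching, the sketch below scores string similarity with Python's standard-library difflib. The records and the matching threshold are hypothetical; real entity resolution combines many such signals, often with trained models.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score string similarity in [0, 1] after simple normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Illustrative record pairs with spelling variations.
candidates = [
    ("Jon Smyth, 12 Oak St.", "John Smith, 12 Oak Street"),
    ("Acme Corp", "ACME Corporation"),
    ("Jane Doe", "Mark Rivera"),
]

THRESHOLD = 0.7  # illustrative cutoff; tuned per data set in practice
for a, b in candidates:
    score = similarity(a, b)
    verdict = "likely match" if score >= THRESHOLD else "no match"
    print(f"{score:.2f}  {verdict}:  {a!r} vs {b!r}")
```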
Research participants report that their organizations are less than satisfied with current processes for delivering consistency across data sets; just 9 percent are very satisfied, 30 percent are somewhat satisfied, and 36 percent are either somewhat dissatisfied or not satisfied.
Flexibility to Change Data for Ad Hoc Needs
Flexibility and agility are critical to supporting self-service BI and visual analytics users. In many organizations, however, the BI, ETL, and data warehousing infrastructure was built for standardized query and reporting. As analytics become more prevalent, users will want to access new data sources or see changes made to data in their existing sources more frequently.
Self-service data preparation may take some of the load off IT in meeting demands for flexibility. In our survey, research participants indicate a fairly low level of satisfaction with their data preparation processes for achieving this attribute; only 7 percent are very satisfied and 29 percent are somewhat satisfied, while 40 percent are either somewhat dissatisfied or not satisfied.
Data Preparation: Getting Smarter
The technology community is excited about the potential for improvement in data preparation through automated use of data mining, machine learning, and other advanced analytics techniques. Data could be prepared faster and with greater user satisfaction and flexibility. Organizations should evaluate these technologies to determine whether they address current and future needs better than existing technologies and processes.
Our research finds that the appetite for improvement is already strong. As organizations grow to rely on analytics to support data-driven decisions, that appetite will undoubtedly increase.
About the Author
David Stodder is director of TDWI Research for business intelligence. He focuses on providing research-based insight and best practices for organizations implementing BI, analytics, performance management, data discovery, data visualization, and related technologies and methods. He is the author of TDWI Best Practices Reports on mobile BI and customer analytics in the age of social media, as well as TDWI Checklist Reports on data discovery and information management. He has chaired TDWI conferences on BI agility and big data analytics. Stodder has provided thought leadership on BI, information management, and IT management for over two decades. He has served as vice president and research director with Ventana Research, and he was the founding chief editor of Intelligent Enterprise, where he served as editorial director for nine years.