TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing


TDWI Articles

Data Quality in the Age of Big Data

Traditional data quality best practices and tool functions still apply to big data, but success depends on making the right adjustments and optimizations.

By Philip Russom
April 19, 2019

Whether data is big or small, old or new, traditional or modern, on premises or in the cloud, the need for data quality doesn’t change. Data professionals under pressure to get business value from big data and other new data assets can leverage existing skills, teams, and tools to ensure quality for big data. Even so, just because you can leverage existing techniques doesn’t mean that’s all you should do. We must adapt existing techniques to the requirements of the current times.

For Further Reading:

Data Quality Predictions for 2019

CEO Q&A: Data Quality Problems Will Still Haunt Your Analytics Future

Data Quality Evolution with Big Data and Machine Learning

Data professionals must protect the quality of traditional enterprise data as they adjust, optimize, and extend data quality and related data management best practices to fit the business and technical requirements of big data and similar modern data sets. Unless an organization does both, it may fail to deliver the kind of trusted analytics, operational reporting, self-service functionality, business monitoring, and governance that are expected of all data assets.

Adjustments and Optimizations Make Data Quality Tasks Relevant to Big Data

The good news is that organizations can apply current data quality and other data management competencies to big data. The slightly bad news is that organizations need to understand and make certain adjustments and optimizations. Luckily, familiar data quality tasks and tool functions are highly relevant to big data and other valuable new data assets -- from Web applications, social media, the digital supply chain, SaaS apps, and the Internet of Things -- as seen in the following examples.

Standardization. A wide range of users expect to explore and work with big data, often in a self-service fashion that depends on SQL-based tools. Data quality’s standardization makes big data more conducive to ad hoc browsing, visualizing, and querying.
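As a rough illustration of the standardization described above, the following sketch converts a few fields to one consistent representation so that SQL-style filters and joins behave predictably. The field names (`customer`, `signup_date`, `phone`) and the list of accepted date formats are assumptions made for this example, not taken from any particular tool.

```python
import re
from datetime import datetime

# Hypothetical list of formats seen in incoming feeds; a real source needs its own.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def standardize_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_phone(raw):
    """Keep digits only so the same number always has one representation."""
    return re.sub(r"\D", "", raw)


def standardize_record(record):
    """Apply the field-level standardizers to one record (assumed field names)."""
    return {
        "customer": " ".join(record["customer"].split()).title(),
        "signup_date": standardize_date(record["signup_date"]),
        "phone": standardize_phone(record["phone"]),
    }
```

Returning `None` for an unrecognized date, rather than guessing, leaves the decision about that value to a later step or a human reviewer.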

Deduplication. Big data platforms invariably end up with the same data loaded multiple times. This skews analytics outcomes, makes metric calculations inaccurate, and wreaks havoc with operational processes. Data quality’s multiple approaches to matching and deduplication can remediate data redundancy.
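One simple way to catch data that has been loaded more than once is to fingerprint each record on its identifying fields and keep only the first occurrence. The sketch below assumes that the caller knows which fields identify a record (`key_fields` is a hypothetical parameter); it covers exact and near-exact reloads only, not the fuzzier matching approaches that data quality tools also offer.

```python
import hashlib
import json


def record_fingerprint(record, key_fields):
    """Hash the normalized key fields so repeated loads of a row collide."""
    normalized = [str(record.get(f, "")).strip().lower() for f in key_fields]
    payload = json.dumps(normalized).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deduplicate(records, key_fields):
    """Return records in their original order, keeping the first of each fingerprint."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record_fingerprint(record, key_fields)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique
```

With duplicates removed this way, a metric such as total order value is computed once per order rather than once per load.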

Matching. Links between data sets can be hard to spot, especially when the data comes from a variety of source systems, both traditional and modern. Data quality’s data matching capabilities help validate diverse data and identify dependencies among data sets.
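To make the idea of matching concrete, here is a minimal sketch that links records from two data sets by the similarity of one text field. It uses the string comparison in Python's standard `difflib` module as a stand-in for the more sophisticated matching that commercial tools provide; the `threshold` value of 0.85 is an arbitrary assumption that would need tuning against real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings between 0.0 and 1.0, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_records(left, right, field, threshold=0.85):
    """Pair each left record with its best-scoring right record, if good enough."""
    matches = []
    for left_row in left:
        best_row, best_score = None, 0.0
        for right_row in right:
            score = similarity(left_row[field], right_row[field])
            if score > best_score:
                best_row, best_score = right_row, score
        if best_row is not None and best_score >= threshold:
            matches.append((left_row, best_row, best_score))
    return matches
```

This compares every left record with every right record, which is fine for a small sample but would need blocking or indexing before it could run at big data scale.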

Profiling and monitoring. Many big data sources -- such as e-commerce, Web applications, and the Internet of Things (IoT) -- lack consistent standards and evolve their schema unpredictably without notification. Whether profiling big data in development or monitoring it in production, a data quality solution can reveal new schema and anomalies as they emerge. Data quality’s business rule engines and new smart algorithms can remediate these automatically at scale.
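The kind of schema monitoring described above can be sketched as a comparison between an expected schema and what a new batch actually contains. In this example the expected schema is a hypothetical dictionary of field names to Python type names; a production profiler would track far more (value ranges, null rates, patterns), so treat this as an outline of the idea only.

```python
def profile(records):
    """Map each field name to the set of Python type names observed for it."""
    observed = {}
    for record in records:
        for field, value in record.items():
            observed.setdefault(field, set()).add(type(value).__name__)
    return observed


def schema_drift(expected, records):
    """Report fields that appeared, disappeared, or changed type in a batch."""
    observed = profile(records)
    return {
        "new_fields": sorted(set(observed) - set(expected)),
        "missing_fields": sorted(set(expected) - set(observed)),
        "type_changes": {
            field: sorted(types - {expected[field]})
            for field, types in observed.items()
            if field in expected and types - {expected[field]}
        },
    }
```

Run against each incoming batch, a report like this surfaces an unannounced new field or a number that has started arriving as text before it reaches downstream users.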

Customer data. As if maintaining the quality of traditional enterprise data about customers isn’t challenging enough, many organizations are now capturing customer data from smartphone apps, website visits, third-party data providers, social media, and a growing list of customer channels and touchpoints. For these organizations, customer data is the new big data. All mature data quality tools have functions designed for the customer domain. Most of these tools have been updated recently to support big data platforms and clouds to leverage their speed and scale.

Tool automation. Big data is so big -- in size, complexity, origins, and uses -- that data professionals and analysts have trouble scaling their work to big data accurately and efficiently. Furthermore, some business users want to explore and profile data, spot quality problems and opportunities, and even remediate data on their own, at scale and in a self-service manner. Both scenarios demand tool automation.

Tools for data quality have long supported business rules to automatically make some development and remediation decisions. Business rules are not going away -- multiple types of users still find them useful, and many have a large library of rules they cannot abandon.
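A business rule in this sense can be pictured as a named condition paired with a remediation. The sketch below is a deliberately small, hypothetical rule library (the rule names, fields, and fixes are invented for illustration) showing how such rules can be applied automatically while recording which ones fired.

```python
# Each rule is (name, test that detects a problem, function that remediates it).
# These two rules are illustrative assumptions, not rules from any product.
RULES = [
    (
        "untidy_email",
        lambda r: (r.get("email") or "") != (r.get("email") or "").strip().lower(),
        lambda r: {**r, "email": r["email"].strip().lower()},
    ),
    (
        "missing_country",
        lambda r: not r.get("country"),
        lambda r: {**r, "country": "UNKNOWN"},
    ),
]


def apply_rules(record, rules):
    """Run every rule in order; return the remediated record and the rules that fired."""
    fired = []
    for name, violates, fix in rules:
        if violates(record):
            record = fix(record)
            fired.append(name)
    return record, fired
```

Keeping the list of fired rules alongside the result gives developers and users an audit trail of what was changed automatically.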

Business rules are being joined by new approaches to automation that have recently arrived for a variety of data management tools, including those for data quality. These usually take the form of smart algorithms that apply predictive functions, based on artificial intelligence and machine learning, to automatically determine what the state of data is, which quality function to apply, and how to coordinate these actions with developers and users.

Data Quality Must Adopt the New Paradigms of Modern Data Management

Practices for data quality (and related practices for data integration, metadata management, and customer views) must be altered to follow different paradigms. Note that in the following examples most of the paradigm shifts are necessary to meet new requirements in big data analytics.

Ingest big data sooner, improve it later. One of the strongest trends in data management is to store incoming data far sooner so that big data is accessible as early as possible for time-sensitive processes such as operational reporting and real-time analytics. In these scenarios, persisting data takes priority over improving data’s quality. To accelerate the persistence of data to storage, up-front transformations or aggregations of data are minimal or omitted under the assumption that users and processes can make those improvements later when big data is accessed or repurposed.

Big data quality on the fly. The ramification of these paradigm shifts is that data aggregation and quality improvements are increasingly done on the fly -- at read time or analysis time. This pushes data quality execution closer to real time. Furthermore, on-the-fly big data quality functions are sometimes embedded in other solutions, especially those for data integration, reporting, and analytics. To enable embedding and achieve real-time performance, modern tools offer most data quality functions as services. Luckily, today’s fast CPUs, in-memory processing, data pipelining, and MPP data architectures provide the high performance required to execute data quality on the fly at big data scale.
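As a small-scale picture of quality applied at read time, the following sketch wraps a raw data source in a generator that cleans each row as it is read, leaving the stored data as it was ingested. The step functions and the `id` field are assumptions for the example; real on-the-fly quality would typically run as a service inside an integration or analytics pipeline.

```python
def trim_strings(row):
    """Strip surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def require_id(row):
    """Drop rows that have no usable identifier (assumed field name: id)."""
    return row if row.get("id") else None


def read_with_quality(raw_rows, steps):
    """Lazily yield cleaned copies of rows; the stored rows are never modified."""
    for row in raw_rows:
        cleaned = dict(row)
        for step in steps:
            cleaned = step(cleaned)
            if cleaned is None:  # a step rejected this row
                break
        if cleaned is not None:
            yield cleaned
```

Because the generator does its work only when a consumer asks for the next row, the cost of cleaning is paid at read time rather than at ingestion.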

Preserve big data’s arrival (original) state for future repurposing. A newly established best practice with big data is to preserve all the detailed content, structures, conditions, and even anomalies that it has when it arrives from a source. Storing and protecting big data’s arrival state provides a massive data store -- usually a data lake -- for use cases that demand detailed source information. Use cases include data exploration, data discovery, and discovery-oriented analytics based on mining, clustering, machine learning, artificial intelligence, and predictive algorithms or models.

Furthermore, the store of detailed source data can be repurposed repeatedly for future analytics applications whose data requirements are impossible to know in advance. Data that is aggregated, standardized, and fully cleansed cannot be repurposed as flexibly or broadly as data in its arrival state.

Data quality in parallel. The best practice today with Hadoop, data lakes, and other big data environments is to maintain a massive store of detailed raw data as a kind of source archive. Instead of transforming the source, users make copies of data subsets needing quality improvements and apply data quality functions to the subsets. Similarly, data scientists and analysts create so-called data labs and sandboxes where they improve data for analytics. This “data quality in parallel” is necessary to retain the original value of big data while creating a different kind of value through mature data quality functions.
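The practice of improving copies rather than the source can be illustrated in a few lines. In this sketch the raw store is simply a list of dictionaries and the cleansing step is a hypothetical email cleanup; the point is only that the subset is copied before any quality function touches it.

```python
import copy


def extract_subset(raw_store, predicate):
    """Deep-copy matching rows so later cleansing cannot touch the raw store."""
    return [copy.deepcopy(row) for row in raw_store if predicate(row)]


def cleanse(rows):
    """Example quality function (assumed): tidy the email field in place."""
    for row in rows:
        row["email"] = row["email"].strip().lower()
    return rows


raw_store = [
    {"region": "west", "email": " Ann@Example.com "},
    {"region": "east", "email": "BOB@example.com"},
]

# The sandbox holds improved copies; raw_store keeps its arrival state.
sandbox = cleanse(extract_subset(raw_store, lambda r: r["region"] == "west"))
```

After this runs, the sandbox row has a tidy email address while the corresponding row in the raw store still holds the value exactly as it arrived.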

Context-appropriate data quality. Analytics users today tend to alter big data subsets as little as they can get away with because most approaches to modern analytics tend to work well with original detailed source data, and analytics often depends on anomalies for discoveries. For example, nonstandard data can be a sign of fraud, and outliers may be harbingers of a new customer segment. As another example, detailed source data may be required for the accurate quantification of customer profiles, complete views, and performance metrics.
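One way to respect anomalies instead of cleansing them away is to flag them and keep them. The sketch below marks outliers with a modified z-score based on the median absolute deviation; the 0.6745 constant and the 3.5 cutoff are a commonly used convention adopted here as an assumption, and the choice of method is ours rather than something the article prescribes.

```python
from statistics import median


def flag_outliers(rows, field, threshold=3.5):
    """Return copies of all rows with an is_outlier flag; no row is removed."""
    if not rows:
        return []
    values = [row[field] for row in rows]
    center = median(values)
    mad = median([abs(v - center) for v in values])
    flagged = []
    for row in rows:
        # If mad is 0, the values are too uniform to score, so nothing is flagged.
        score = 0.6745 * abs(row[field] - center) / mad if mad else 0.0
        flagged.append({**row, "is_outlier": score > threshold})
    return flagged
```

An analyst hunting for fraud or a new customer segment can then filter on the flag, while a reporting process that wants typical values can exclude the same rows.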

For More Information

For an in-depth discussion of data quality, read the 2018 TDWI Checklist Report: Optimizing Data Quality for Big Data. Many of the key points discussed in this article are drawn from that report.

About the Author

Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at prussom@tdwi.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.
