Mastering AI Quality: Strategies for CDOs and Tech Leaders
AI/ML’s opaque processes challenge data quality and decision-making, emphasizing the need for understanding and control.
- By Anshuman Kanwar
- November 9, 2023
Most chief data officers (CDOs) work hard to make their data operations into “glass boxes” -- transparent, explainable, explorable, trustworthy resources for their companies. Then comes artificial intelligence and machine learning (AI/ML), with their allure of using that data for ever more impressive strategic leaps, efficiencies, and growth potential.
However, there’s a problem. Nearly all AI/ML tools are “black boxes.” They are so inscrutable that even their creators are concerned about how they produce their results.
The speed and depth at which these tools can process data without human intervention or input present a danger to technology leaders who seek control of their data and want to ensure and verify the quality of the analytics that use it. Combine this with a push to remove humans from the decision loop and you have a potent recipe for decisions going off the rails.
For example, I recently encountered a large language model (LLM) tasked with creating a seating chart for employees at a company event. The LLM hallucinated and inserted a nonexistent name into the seating arrangement. The mistake was caught only after the invites were printed, even though a data quality check (e.g., checking that all employees in the model's output were valid) could have caught this issue much earlier. This is an innocuous example, but one can easily imagine a higher-stakes application with more severe damage.
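The kind of data quality check mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the check used in the incident: the function name `find_unknown_names`, the roster, and the sample model output are all hypothetical, and the sketch assumes names can be compared after trimming whitespace and ignoring case.

```python
def find_unknown_names(seating_chart, employee_roster):
    """Return (table, name) pairs from the model's output that are not on the roster."""
    # Normalize names so stray spaces or capitalization do not cause false alarms.
    known = {name.strip().casefold() for name in employee_roster}
    unknown = []
    for table, names in seating_chart.items():
        for name in names:
            if name.strip().casefold() not in known:
                unknown.append((table, name))
    return unknown


# Hypothetical data for illustration only.
roster = ["Ana Ruiz", "Ben Okafor", "Chloe Tan"]
llm_output = {
    "Table 1": ["Ana Ruiz", "Ben Okafor"],
    "Table 2": ["Chloe Tan", "Dana Whitfield"],  # this name is not on the roster
}

problems = find_unknown_names(llm_output, roster)
if problems:
    print("Hold the print run; unrecognized names:", problems)
```

Running a check like this before anything is printed or sent turns a hallucinated name into a blocked job rather than a reprint.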
Does this mean organizations should heavily restrict their use of AI/ML? No. So far, the benefits outweigh the dangers. However, there are ways to maximize the former while minimizing the latter. The first step is understanding the weaknesses of AI tools.
Reproducibility
There is a reproducibility crisis in science. Researchers have been unable to replicate research results across several fields for years. For example, a Princeton survey found errors in more than 300 studies that applied ML algorithms. Evaluators of AI tools must first ensure each evaluation's results are reproducible. If different iterations of a test run on the same data produce different results each time -- or if slight changes to input result in wildly divergent outputs -- these unreliable results indicate problems with the product. Good tools -- whether AI algorithms or hand tools -- must yield reproducible, consistent, predictable results.
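One way to test for this during an evaluation is to run the tool repeatedly on identical inputs and flag any input whose output changes. The sketch below assumes the tool can be wrapped as a Python function that returns a hashable value (such as a label or a string); `check_repeatability`, `stable_model`, and `flaky_model` are hypothetical stand-ins, not part of any product.

```python
import itertools
from collections import Counter


def check_repeatability(model, inputs, runs=5):
    """Call `model` several times per input; report inputs whose outputs differ."""
    unstable = {}
    for item in inputs:
        seen = Counter(model(item) for _ in range(runs))
        if len(seen) > 1:  # more than one distinct output for the same input
            unstable[item] = dict(seen)
    return unstable


def stable_model(text):
    # Deterministic stand-in: the same input always yields the same label.
    return "long" if len(text) > 10 else "short"


# Stand-in for a tool whose answer changes between identical calls.
_flip = itertools.cycle(["approve", "reject"])


def flaky_model(text):
    return next(_flip)


samples = ["short text", "a considerably longer input"]
print(check_repeatability(stable_model, samples))  # empty dict: no instability found
print(check_repeatability(flaky_model, samples))   # both inputs are flagged
```

The same harness could be extended to feed in lightly perturbed inputs to probe for wildly divergent outputs, though what counts as a "slight" change depends on the use case.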
Transparency
With a human collaborator or a human-designed algorithm, it is generally easy to elicit a meaningful response to the question, “Why is this result what it is?” With AI -- and generative AI in particular -- that may not be the case.
The most impressive AI tools are pattern-matching machines; often, the patterns AI tools see are subtle or convoluted. These algorithms do not lend themselves to simple explanations; the field of explainable AI (XAI) aims to provide insight into AI outputs, but it is still a developing area.
Although it is easy to dismiss or forgive the lack of explainability and transparency in AI outputs, the danger is that these tools can extend and amplify bias, especially bias that has crept into the training data set.
An infamous example of AI bias occurred in a large tech company’s job candidate selection process, where AI tools were trained on desirable traits from its existing workforce. Because the list of desired traits was based mainly on the company’s predominantly male workforce, most women applying were automatically eliminated. Similar cases involving people of color trying to secure mortgages and insurance offer further cautionary tales.
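Even when a model cannot explain itself, its outcomes can be audited. The sketch below compares selection rates across groups and computes the ratio of the lowest rate to the highest. The function names and the applicant data are hypothetical, and the 0.8 cutoff reflects the commonly cited "four-fifths" rule of thumb; it is used here only as an illustrative alert threshold, not as a legal test.

```python
from collections import defaultdict


def selection_rates(decisions):
    """decisions: iterable of (group, was_selected). Returns {group: selection rate}."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in decisions:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {group: selected[group] / totals[group] for group in totals}


def impact_ratio(rates):
    """Lowest group selection rate divided by the highest (1.0 means parity)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Hypothetical screening outcomes: 6 of 10 selected in group A, 2 of 10 in group B.
decisions = (
    [("A", True)] * 6 + [("A", False)] * 4
    + [("B", True)] * 2 + [("B", False)] * 8
)

rates = selection_rates(decisions)
ratio = impact_ratio(rates)
if ratio < 0.8:  # assumed alert threshold for this sketch
    print(f"Selection rates {rates} give a ratio of {ratio:.2f}; review for bias")
```

A low ratio does not prove the model is biased, but it is a cheap signal that the outputs deserve human scrutiny before they drive decisions.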
Compliance
AI is also a greedy consumer of data. Modern LLMs are built on vast libraries of content, much of it lifted from the internet, and often include content that was not designed or licensed to be used for this purpose. Unrestricted AI tools can also absorb user input -- including confidential materials -- to improve their models over time.
Controlling compliance with privacy and security laws should not be difficult. Organizations should establish up front how models use and retain input and what their training data includes.
However, another challenge is monitoring AI tools to guarantee they are making good on their compliance claims. Many companies are adopting strict usage guidelines for AI to restrict the potential leakage of protected information into third-party AI systems.
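Usage guidelines can be backed by a technical guardrail that scrubs obvious identifiers before text leaves the organization. The following is a minimal sketch under stated assumptions: the `redact` function is hypothetical, and the two regular expressions (email addresses and U.S. Social Security number formats) are deliberately simple examples that would miss many kinds of protected information in practice.

```python
import re

# Illustrative patterns only; real deployments need far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace matches with placeholders; return the clean text and match counts."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, counts[label] = pattern.subn(f"[{label}]", text)
    return text, counts


prompt = "Summarize the complaint from jane.doe@example.com, SSN 123-45-6789."
clean_prompt, findings = redact(prompt)
print(clean_prompt)
print(findings)
```

Logging the counts alongside each outbound request also gives compliance teams a record of how often protected data nearly reached a third-party system.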
The Solution: You Need Interoperable Data to Unleash AI
Effective and safe use of AI requires a robust data strategy that addresses these factors. AI does not solve many organizations' most crucial problem -- untrustworthy, siloed data. This is especially important as organizations regularly add new data sources or apps. Salesforce research, for example, reported that the average number of apps across an organization in 2023 was 1,061, up from 843 just two years earlier. Data resides in over 800 enterprise applications, on average, while only one-third are connected. If even one app feeds conflicting, incomplete, or erroneous data into an AI analytics tool, minor errors could eventually compound into large deviations from accurate results.
Enterprises need interoperable data to fuel a winning AI/ML strategy. Interoperable data refers to trusted information that has been harmonized, purified, and augmented across the many silos and data sources within an organization. It seamlessly adapts to the fluid requirements of a business, expediting decision-making processes and energizing operational workflows. Interoperable data is fully mobilized and readily available wherever, whenever, and in whatever format is required. In the realm of digital transformation, interoperable data is the linchpin. It’s impossible to achieve true transformation without it.
How does an enterprise achieve data interoperability? A robust data unification and management solution applies entity resolution, multi-domain SaaS master data management (MDM), and 360 data product capabilities to help enterprises transform poor-quality data from disparate sources into unified, trusted, and interoperable data. Modern approaches are built in the cloud and rely on application programming interfaces (APIs) to automate, scale, integrate, and deliver real-time interoperable data, among other things, to the downstream data consumers, including AI models.
In my experience, data stewards can manually resolve 75 entities per day. This tedious, time-consuming task entails looking at two or more records and determining whether they are the same entity (person, place, product, among others). I have seen novel data unification and management approaches improve this to 10 times that figure, or about 750 per day, within a modern MDM system. Automation alleviates the burden of manually matching most, but not all, ambiguous data records. By entrusting these ambiguous matches to advanced models, we can significantly reduce human involvement and free data stewards for higher-impact work. For the residual ambiguous matches that persist, the industry continues to explore solutions to expedite their resolution.
Conversational or consultative user interfaces could be pivotal in streamlining this process, potentially raising the resolution of ambiguous matches to 100 times the manual benchmark of 75 per day, or about 7,500. This is critical as enterprises continue adding more data sources and compounding a massive challenge -- data sprawl.
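The division of labor described above, in which automation settles the clear matches and leaves only ambiguous pairs for data stewards, can be illustrated with a toy entity-resolution pass. This sketch is not how any particular MDM product works: it uses the standard library's `difflib.SequenceMatcher` as a crude similarity score, and the record fields, the `triage` function, and both thresholds are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Crude string similarity in [0, 1] over normalized name and city."""
    def key(rec):
        return f"{rec['name']} {rec['city']}".casefold().strip()
    return SequenceMatcher(None, key(a), key(b)).ratio()


def triage(records, auto_merge=0.9, review=0.6):
    """Split candidate pairs into automatic merges and a steward review queue."""
    merged, needs_review = [], []
    for a, b in combinations(records, 2):
        score = similarity(a, b)
        pair = (a["id"], b["id"])
        if score >= auto_merge:
            merged.append(pair)        # confident match: no human needed
        elif score >= review:
            needs_review.append(pair)  # ambiguous: route to a data steward
        # below the review threshold, treat the records as distinct entities
    return merged, needs_review


# Hypothetical records from different source systems.
records = [
    {"id": 1, "name": "Acme Corporation", "city": "Boston"},
    {"id": 2, "name": "ACME Corporation", "city": "Boston"},
    {"id": 3, "name": "Acme Corp", "city": "Boston"},
    {"id": 4, "name": "Zenith Labs", "city": "Denver"},
]

merged, needs_review = triage(records)
print("auto-merged:", merged)
print("for steward review:", needs_review)
```

Comparing every pair grows quadratically with the number of records, so a production system would need a smarter candidate-selection step; the point here is only the routing of confident versus ambiguous matches.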
The Foundation of AI Success: Building Trust through Interoperable Data
To unlock the transformative potential of AI and ML outcomes, CDOs should prioritize interoperable data by adopting business-responsive technology platforms. These platforms are engineered with three critical features at their core:
- They facilitate real-time, bi-directional data flows between applications, transcending format or source limitations. This seamless data exchange is vital for feeding AI and ML algorithms with the diverse and up-to-the-minute data they need to generate actionable insights.
- These platforms offer flexible data models that effortlessly adapt to shifting business requirements. In AI and ML, where the data landscape is in constant flux, this adaptability ensures that data remains a valuable and relevant resource for training and refining algorithms.
- A modern design philosophy underpins these platforms, emphasizing cloud-native architecture and an API-first approach. This approach isn't just about staying technologically up to date; it is integral to AI and ML integration. It enables companies to scale their AI and ML initiatives, seamlessly integrate new applications, and harness data-driven insights with agility and efficiency.
Interoperable data unlocks the full potential of AI and ML outcomes for enterprises pursuing digital transformation. By investing in business-responsive platforms with real-time data flows, adaptable data models, and modern design principles, organizations can harness AI and ML to drive innovation, enhance decision-making, and gain a competitive edge in today's data-driven landscape.