
How to Judge a Training Data Set

AI practitioners need to be aware of best practices in training data preparation, as well as the many ways to avoid or reduce bias in their data sets.

What makes a good training data set? In artificial intelligence (AI), this is one of the most important questions for practitioners to answer. A foundation of good training data sets you on a path to success in developing accurate, unbiased models. Actively sourcing representative data, labeling it correctly, and monitoring it for bias are essential steps in launching AI that works well for all of your end users.

As an AI practitioner, you need to know the best practices for preparing training data and the many ways to avoid or reduce bias in your data set.


Data Best Practices

A good data set is both representative and balanced. Let's explore what that should look like in key facets of the data preparation process:

Data sourcing. If possible, collect data from multiple sources to maximize data diversity. Research your end users in advance. Are they adequately represented in your data set? Be aware of all the potential use cases your end users will need (including outliers) and ensure the data you've collected matches those scenarios. As you collect data, continually analyze your data set from the perspective of your end users; you may be surprised to find you must acquire additional data to fill in the gaps.

Data labeling. Create a gold standard for your data labeling. If you need assistance, consider leveraging data collection and data labeling domain expertise from a third-party data provider. They can review your data management guidelines and offer additional improvements and best practices based on their knowledge and experience. In any case, provide your data annotators with clear guidelines so they are aware of what's expected of them. If needed, continue to adjust these guidelines based on feedback from annotators.

Data monitoring. The data your model encounters in the real world will often shift over time because models typically don't operate in static environments. Even after deploying your model, monitor and analyze your data routinely to catch potential model drift. Establish a plan for retraining your model with new training data when model drift does occur.

Bias Patterns

Reducing bias is one of the top concerns for AI practitioners and a crucial factor in determining model performance. A biased model won't perform well for certain user groups and will require retraining on data that's more representative of those groups. To avoid this outcome, be aware of the steps your team can take to mitigate bias and build more responsible AI. First, let's review common bias patterns to watch out for:

Sample bias or selection bias occurs when a data set doesn't reflect the realities of the environment in which a model will run. For example, certain facial recognition systems trained primarily on images of one gender will have lower levels of accuracy for any other gender. A per-group accuracy check, sketched at the end of this list, is one way to surface this pattern.

Exclusion bias most commonly occurs at the data preprocessing stage and is often a case of deleting valuable data thought to be unimportant.

Measurement bias occurs when the data collected for training differs from that collected in the real world or when faulty measurements result in data distortion. For example, measurement bias can occur in image recognition data sets, where the training data is collected with one type of camera but the production data is collected with another. This type of bias can also occur due to inconsistent annotation during the labeling stage of a project.

Recall bias is a kind of measurement bias that commonly occurs at the data labeling stage. It arises when similar types of data are labeled inconsistently, resulting in lower accuracy. Say, for instance, you have a team labeling images of phones as damaged, partially damaged, or undamaged. If an annotator labels one image as damaged but a similar image as partially damaged, your data labels will be inconsistent.

Association bias occurs when the data for a machine learning model reinforces or expands a cultural bias. A data set that includes only male doctors and female nurses, for example, doesn't mean that only men can be doctors and only women can be nurses -- but your model will operate under the assumption that women can't be doctors and that men can't be nurses. Association bias often leads to gender bias.

Data drift/model drift, mentioned in the previous section, occurs when your end users or your model's environment changes over time or develops new patterns.

How to Reduce Bias

There are many types of bias to monitor, and applying the best practices described in the previous section of this article will go a long way toward reducing them. Your team should also consider the following actions to reduce bias:

  • Understand how your data was generated. Once you have mapped the data generation process, you can anticipate the types of bias that may appear and design interventions to either preprocess data or obtain additional data.

  • Perform comprehensive exploratory data analysis. This approach involves analyzing data sets to capture their main characteristics (usually in the form of statistical graphs or other data visualization methods). This analysis provides key insight into areas of bias in your data; a minimal example is sketched after this list.

  • Make bias testing a part of your development cycle and a key performance indicator. If you're working with a third-party data provider, ask if they have bias detection tools you can leverage.

Conclusion

Creating a training data set that is representative of your end users and balanced across your use cases is a proactive process. Your team will likely want to incorporate these and other best practices into a data governance framework to ensure consistency across all your projects and alignment among the people involved in building your AI application.

An ideal data governance framework will set expectations about data collection, the labeling process, data monitoring, and bias mitigation. As much as you can, create rules and processes up front that address common data concerns, but always be open to incorporating team feedback along the way.

About the Author

MingKuan Liu is the senior director of data science at Appen. He has worked in automatic speech recognition, natural language processing, and search relevance ranking for two decades. MingKuan has led multiple teams of researchers and engineers to bring cutting-edge algorithms into real-world AI and ML solutions running at a large scale in companies including eBay, Microsoft, and Garmin. In the past few years, he has been leading the Appen data science team to develop ML-based automation solutions that combine both human and machine advantages to improve crowd workers' quality and efficiency with reduced bias.

