TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing


TDWI Articles

Data Requirements for Machine Learning

Machine learning can enable new forms of predictive analytics and embed algorithm-driven intelligence into many software applications. However, none of that is possible without the right data, captured and processed the right way.

By Philip Russom
September 14, 2018

Machine learning algorithms consume and process large volumes of data to learn complex patterns about people, business processes, transactions, events, and so on. This intelligence is then incorporated into a predictive model. Comparisons to the model can reveal whether an entity is operating within acceptable parameters or is exhibiting an anomaly.
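As a rough illustration of comparing an entity to a model, the sketch below learns a deliberately simple "model" (a mean and standard deviation) from past values and flags new values that fall far outside it. The function names, the threshold of three standard deviations, and the sample figures are all assumptions made for this example; real predictive models are far richer.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Learn a very simple model: the mean and standard deviation of past values."""
    return mean(history), stdev(history)

def is_anomaly(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # No variation was observed, so any different value counts as anomalous.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical daily transaction totals for one account (illustrative data only).
history = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 103.1, 100.0]
baseline = fit_baseline(history)

print(is_anomaly(101.0, baseline))  # a value close to the learned mean
print(is_anomaly(250.0, baseline))  # a value far outside the learned range
```

The threshold is the "acceptable parameters" knob in this sketch: a lower value flags more entities, and a higher value flags fewer.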

For Further Reading:

Data Integration and Machine Learning: A Natural Synergy

4 Proven Ways Newbie Analysts Can Become Machine Learning Pros

Humans in the Loop for Machine Learning

Today, machine learning is used to solve well-bounded tasks such as classification and clustering. Note that a machine learning algorithm learns from so-called training data during development; it can also continue to learn from real-world data during deployment so the algorithm can improve its model with experience.
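The two phases of learning can be sketched with a tiny perceptron classifier, chosen here only because it is easy to update one example at a time. The feature names (is_large_order, is_new_customer), the labels, and the data are hypothetical, and this is a minimal illustration of incremental learning, not a description of any particular product.

```python
def predict(weights, bias, features):
    """Classify as 1 when the weighted sum of the features is non-negative."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0

def update(weights, bias, features, label, lr=1.0):
    """Perceptron rule: adjust the model only when the prediction was wrong."""
    error = label - predict(weights, bias, features)
    new_weights = [w + lr * error * x for w, x in zip(weights, features)]
    return new_weights, bias + lr * error

def train(examples, n_features, epochs=20):
    """Development phase: repeated passes over a fixed training set."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for features, label in examples:
            weights, bias = update(weights, bias, features, label)
    return weights, bias

# Hypothetical training data: [is_large_order, is_new_customer] -> flag for review.
training_data = [
    ([0.0, 0.0], 0),
    ([0.0, 1.0], 0),
    ([1.0, 0.0], 0),
    ([1.0, 1.0], 1),
]
weights, bias = train(training_data, n_features=2)
before = predict(weights, bias, [1.0, 0.0])

# Deployment phase: labeled feedback arrives one event at a time. In this
# made-up stream, large orders now warrant review on their own.
production_stream = [([1.0, 0.0], 1), ([1.0, 0.0], 1), ([0.0, 0.0], 0)]
for features, label in production_stream:
    weights, bias = update(weights, bias, features, label)
after = predict(weights, bias, [1.0, 0.0])

print(before, after)
```

The same update function serves both phases, which is the point of the sketch: the model built from training data keeps adjusting as production data arrives.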

Machine learning has a voracious appetite for data during both development and production, making unique demands of an organization's infrastructure for data management.

Data Requirements for Successful Machine Learning

#1: Large, diverse data sets

The development of a machine learning algorithm depends on large volumes of data, from which the learning process draws many entities, relationships, and clusters. To broaden and enrich the correlations made by the algorithm, machine learning needs data from diverse sources, in diverse formats, about diverse business processes.

For the most comprehensive learning experience, you should provide diverse training data -- integrated from multiple sources and concerning various business entities, collected across multiple time frames -- to make algorithmic assessments more real-world, accurate, and successful in production. Once in production, a machine learning algorithm continues to read large, diverse data sets to keep its model up to date and growing.
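One way to picture this integration is a small routine that merges records from several sources into one training row per business entity and time frame. The source names ("crm", "billing"), the field names, and the records below are invented for illustration; a real pipeline would typically rely on a data integration tool or a dataframe library.

```python
from collections import defaultdict

KEY_FIELDS = ("customer_id", "period")

def integrate(sources):
    """Merge records from several sources into one row per (entity, period)."""
    rows = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            key = (record["customer_id"], record["period"])
            for field, value in record.items():
                if field not in KEY_FIELDS:
                    # Prefix each field with its source so columns never collide.
                    rows[key][f"{source_name}.{field}"] = value
    return dict(rows)

# Hypothetical records from two sources covering two time frames.
sources = {
    "crm": [
        {"customer_id": 1, "period": "2018-Q1", "segment": "retail"},
        {"customer_id": 2, "period": "2018-Q1", "segment": "wholesale"},
    ],
    "billing": [
        {"customer_id": 1, "period": "2018-Q1", "invoiced": 1200.0},
        {"customer_id": 1, "period": "2018-Q2", "invoiced": 950.0},
    ],
}

training_rows = integrate(sources)
for key in sorted(training_rows):
    print(key, training_rows[key])
```

Rows that appear in only one source are kept rather than dropped, so gaps between sources stay visible to whoever prepares the training set.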

Savvy organizations are deploying tools for multiple types of analytics (not just machine learning), because each type tells them something unique and valuable. Each of these analytics approaches needs data that is prepared and presented in a certain way that is optimal for the analytics tool or the user practice involved. Machine learning algorithms are almost always optimized for raw, detailed source data. Thus, the data environment must provision large quantities of raw data for discovery-oriented analytics practices such as data exploration, data mining, statistics, and machine learning.

#2: Large, diverse infrastructure for data management

Infrastructure for training data for machine learning typically involves multiple data platforms, tools, and processing engines, ranging from traditional (relational and columnar databases) to modern (Hadoop, Spark, and cloud storage). Multiple technologies are required to cope with training data's extreme size, multiple data structures, and (in some cases) multiple latencies. Tools for machine learning are obviously important, but data management infrastructure is just as important.

There are many ways to provision training and production data for machine learning. This data can come from multiple platforms in the extended data infrastructure, but the trend is toward consolidating as much data as possible into a data lake designed for machine learning and other forms of advanced analytics. In a related trend, data lakes are moving toward elastic clouds for reasons of automation, optimization, and economics.

Data management infrastructure can be vast. It can include platforms and tools for data warehousing, data lakes, data integration, data preparation, multiple forms of analytics, and big data. New data platforms are emerging as well, dominated by clouds, open source engines, open source libraries and languages, and self-service tools. That is a long list of platforms, technologies, and processing engines. However, it is all required for modern organizations that want to operate and compete on analytics and intelligence.

Finally, when organizations already have big data infrastructure in place, adding machine learning extends the life cycle and business value of the infrastructure.

To Go In-Depth

Portions of this article were adapted from the 2018 TDWI Checklist Report "The Automation and Optimization of Advanced Analytics Based on Machine Learning." Read the complete report for more information about machine learning and its data requirements.

About the Author

Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at prussom@tdwi.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.
