TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing


TDWI Articles

Great Machine Learning Needs Careful Data Engineering

A new TDWI Checklist Report examines best practices for data engineering and management to support machine learning with a focus on collecting, cleansing, transforming, and governing new and big data for analysis.

By James E. Powell
December 5, 2018

In a new TDWI Checklist Report, "Five Data Engineering Requirements for Enabling Machine Learning," Fern Halper, vice president and senior director of TDWI Research for advanced analytics, notes how a new generation of data is reinvigorating interest in AI and machine learning -- and posing new challenges to enterprises of all sizes.

For Further Reading:

Machine Learning that Automates Data Management Tasks and Processes

Data Requirements for Machine Learning

Minimizing the Complexities of Machine Learning with Data Virtualization

Machine learning does what its name implies -- it is a system that learns to identify patterns by examining data. There are two approaches: supervised (where the system is given the desired target and learns to predict that outcome based on attributes) and unsupervised (where there are no predefined outcomes, and the system looks for structure, such as groupings, in the data). Once trained, a model is tested against additional data to make sure it is valid.
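To make the distinction concrete, here is a minimal sketch in plain Python. It is not taken from the report: the sample values, the labels "low" and "high", and the function names are all assumptions chosen for illustration. The supervised part learns one average per known label and predicts the nearest one; the unsupervised part groups the same values into two clusters without seeing any labels.

```python
from statistics import mean

# Supervised: each training example carries a known target label (illustrative data).
labeled = [(1.0, "low"), (1.2, "low"), (0.8, "low"),
           (5.0, "high"), (5.5, "high"), (4.8, "high")]

def fit_centroids(examples):
    """Learn one mean value per label from labeled examples."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, value):
    """Predict the label whose learned mean is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

# Unsupervised: no labels; split the values into two groups (1-D k-means, k=2).
def two_means(values, iterations=10):
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

centroids = fit_centroids(labeled)
clusters = two_means([value for value, _ in labeled])
print(predict(centroids, 1.1), predict(centroids, 5.2), clusters)
```

The supervised model can name its prediction because it was given the names; the unsupervised one only reports that two groups exist and where their centers are.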

Although still in the early mainstream phase of adoption, machine learning is being deployed in a wide range of use cases, including recommendation engines, fraud detection, churn analysis, and cybersecurity. The technology isn't new -- it has been around for decades. As Halper points out, "the advent of big data has, in several important ways, both revitalized machine learning and increased the complexity of using these models to drive insight and action."

The challenge is moving from the model-building ("training") phase to full production. "Data engineers must create robust production data pipelines to feed machine learning models the increasing amounts of disparate data they require," Halper explains.

In the report, Halper discusses best practices for data engineering and management to support machine learning, focusing on collecting, cleansing, transforming, and governing "new" and big data for analysis. Although organizations may have used rules-based AI systems based on heuristics in the past, they are now moving to automated discovery against vast volumes of disparate data.

Best Practices Lead to Better Results

For machine learning, more is better -- having more data brings more accurate results, and having widely diverse data is better still. Whether rich, new data sources are internal or external to the organization, two popular platforms are proving their worth when it comes to managing data for model building: data lakes and the cloud. Data management platforms also need to handle a new set of sourcing strategies to deal with different ingestion patterns (such as streaming data) and enable data enrichment (such as including metadata or geocoding).
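As a rough illustration of handling different ingestion patterns with one enrichment step, the sketch below applies the same function to a batch (a list) and to a stream (a generator). Everything named here is hypothetical: the field names, the source labels, and the small postcode table that stands in for a real geocoding service.

```python
from datetime import datetime, timezone

# Hypothetical lookup table standing in for a real geocoding service.
REGION_BY_POSTCODE = {"92101": "San Diego, CA", "32801": "Orlando, FL"}

def enrich(records, source, ingested_at):
    """Lazily add ingestion metadata and a region to each record."""
    for record in records:
        yield {
            **record,
            "_source": source,
            "_ingested_at": ingested_at.isoformat(),
            "region": REGION_BY_POSTCODE.get(record.get("postcode"), "unknown"),
        }

def event_stream():
    """Stand-in for a streaming source that yields records one at a time."""
    yield {"id": 3, "postcode": "32801"}
    yield {"id": 4, "postcode": "00000"}

now = datetime(2018, 12, 5, tzinfo=timezone.utc)
batch = list(enrich([{"id": 1, "postcode": "92101"}], "nightly_batch", now))
stream = list(enrich(event_stream(), "clickstream", now))
```

Because `enrich` is a generator, it never needs the whole input in memory, which is what lets one code path serve both patterns.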

Of course, low-quality data leads to low-quality machine learning results. To that end, Halper suggests seeking out tools that can ensure standardization and accuracy. "The good news is that more vendor solutions are now using advanced technologies such as artificial intelligence to identify (and often correct) data problems."
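A simple version of the standardization and accuracy checks mentioned above might look like the following sketch. The record layout and the three rules are assumptions made for the example, not recommendations from the report, and commercial tools go well beyond this.

```python
def standardize(record):
    """Return a cleaned copy: trimmed names, upper-case country code, numeric amount."""
    cleaned = {
        "customer": record.get("customer", "").strip().title(),
        "country": record.get("country", "").strip().upper(),
    }
    try:
        # Accept amounts written with thousands separators, such as "1,250.50".
        cleaned["amount"] = float(str(record.get("amount", "")).replace(",", ""))
    except ValueError:
        cleaned["amount"] = None
    return cleaned

def problems(record):
    """List the quality rules a standardized record fails."""
    issues = []
    if not record["customer"]:
        issues.append("missing customer")
    if len(record["country"]) != 2:
        issues.append("country is not a 2-letter code")
    if record["amount"] is None:
        issues.append("amount is not numeric")
    return issues

good = standardize({"customer": "  ada lovelace ", "country": "gb ", "amount": "1,250.50"})
bad = standardize({"customer": "", "country": "United Kingdom", "amount": "n/a"})
print(good, problems(good))
print(bad, problems(bad))
```

Separating the cleaning step from the checking step keeps a record of what was wrong with the incoming data even after the fixable problems have been corrected.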

Data for model building must also be up-to-date, Halper warns. Currency is important when building the initial model and to ensure that the model doesn't become stale -- like automobiles, models occasionally need to be tuned up.
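One way to act on this is an automated currency check, sketched below. The 30-day threshold and the function name are assumptions for illustration; the right limit depends on how quickly the underlying data changes.

```python
from datetime import datetime, timedelta

def needs_refresh(newest_training_record, now, max_age=timedelta(days=30)):
    """True when the newest data the model was trained on is older than max_age."""
    return (now - newest_training_record) > max_age

today = datetime(2018, 12, 5)
stale = needs_refresh(datetime(2018, 10, 1), today)   # trained on data 65 days old
fresh = needs_refresh(datetime(2018, 11, 20), today)  # trained on data 15 days old
```

A scheduled job could run a check like this and flag the model for retraining when it returns true.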

Data engineers and data scientists must be able to engineer the right features for the model, which often requires access to disparate data sources. Halper says that newly derived features need to be stored and persisted to whatever data store the organization is using for analysis, and the calculations necessary to re-create the features must be tracked.
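The idea of persisting derived features together with the calculations behind them can be sketched with a small registry. This is only an illustration: the feature names, the input columns, and the registry itself are hypothetical, and a production setup would write both the values and the catalog to the organization's data store.

```python
import json

FEATURES = {}

def feature(name, description):
    """Register a feature function together with a description of its calculation."""
    def register(func):
        FEATURES[name] = {"description": description, "compute": func}
        return func
    return register

@feature("avg_order_value", "total_spend / order_count, 0 when there are no orders")
def avg_order_value(row):
    return row["total_spend"] / row["order_count"] if row["order_count"] else 0.0

@feature("is_recent", "1 if days_since_last_order <= 30 else 0")
def is_recent(row):
    return 1 if row["days_since_last_order"] <= 30 else 0

def build_features(row):
    """Compute every registered feature for one input row."""
    return {name: spec["compute"](row) for name, spec in FEATURES.items()}

def feature_catalog():
    """JSON document of feature names and calculations, to store next to the values."""
    return json.dumps({n: s["description"] for n, s in FEATURES.items()}, sort_keys=True)

row = {"total_spend": 200.0, "order_count": 4, "days_since_last_order": 12}
values = build_features(row)
catalog = feature_catalog()
```

Storing `catalog` alongside `values` means anyone who later needs to re-create a feature can see exactly how it was derived.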

Finally, great machine learning models also require data governance. For governance to work, your enterprise will need to invest in processes as well as tools. Two key tooling areas Halper recommends are management of metadata (data descriptions including data types and structures) and attention to data lineage (which describes where data originated and how it has been changed and transformed).
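To show what metadata and lineage can look like in practice, the sketch below attaches both to a small dataset object. The class, its fields, and the file name `crm_export.csv` are hypothetical; dedicated governance tools capture far more detail than this.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    rows: list
    schema: dict   # metadata: column name -> type name
    source: str    # lineage: where the data originated
    lineage: list = field(default_factory=list)  # lineage: ordered transformation steps

    def transform(self, step, func):
        """Apply func to a copy of every row; return a new Dataset with the step recorded."""
        new_rows = [func(dict(r)) for r in self.rows]
        if new_rows:
            schema = {k: type(v).__name__ for k, v in new_rows[0].items()}
        else:
            schema = dict(self.schema)
        return Dataset(self.name, new_rows, schema, self.source, self.lineage + [step])

raw = Dataset("orders", [{"amount": "10.5"}, {"amount": "3"}],
              {"amount": "str"}, "crm_export.csv")
typed = raw.transform("cast amount to float",
                      lambda r: {**r, "amount": float(r["amount"])})
print(typed.schema, typed.source, typed.lineage)
```

Each transformation returns a new object rather than changing the old one, so the original data and every intermediate step remain available for audit.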

Halper's report includes dozens of concrete recommendations that will help any enterprise, large or small, start off on the right foot with machine learning. You can read the full report here. Visitors new to TDWI must complete a short, one-time registration for access.

About the Author

James E. Powell is the editorial director of TDWI, including research reports, the Business Intelligence Journal, and Upside newsletter. You can contact him via email here.
