TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

4 Reasons to Use Graphs to Optimize Machine Learning Data Engineering

Semantic knowledge graphs accelerate data engineering for machine learning, helping you maximize results.

By Sean Martin
November 9, 2018

Throughout the data ecosystem, organizations are beginning to realize the worth of an enterprise information fabric that uses semantic technology for business understanding to provide uniform access to all data assets -- regardless of how dispersed they are. Forrester Research's recent "Big Data Fabric Wave" report identifies data fabrics as a viable means of dealing with the distributed complexities of big data.

One of the innate benefits of using a semantic approach to construct a data fabric is the ability to harmonize all data into an enterprise knowledge graph (predicated on business meaning) that's perfect for machine learning data engineering. Traditionally, data preparation was a data science and machine learning bottleneck; it was so time-consuming it limited the impact of this valuable technology.

However, knowledge graphs accelerate this process in four key ways to maximize machine learning data engineering results:

Training data. Graph algorithms provide a richer source of training data than other analytics approaches.

Unstructured data. Graphs are difficult to surpass for harmonizing unstructured and structured data.

Feature engineering. Semantic graphs drastically decrease the time and effort of feature engineering, partly due to automated query generation.

Traceability. When operationalizing machine learning models, graphs provide an immutable provenance chain to retrace data's journey from its initial model testing phase, making it easier to recreate that journey in production.

These four characteristics effectively unclog the data science bottleneck hampering machine learning throughout the enterprise, enabling organizations to focus on benefiting from this technology instead of preparing for it.

Optimal Model Input Data

Because graph algorithms are so robust at determining the number and nature of relationships between data elements, they deliver new, richer sources of input data for machine learning models than are available using traditional means. Graph analytics (such as clustering) or simply issuing a query asking for the relationship between data objects (such as people, places, or products) exploits this granular understanding of data relationships.

In relational settings, users must determine the relationships between data elements and issue queries for confirmation; with graphs, you simply ask what the relationships are. Graphs provide additional, more comprehensive sources for input data, and this broader data set significantly improves model training. This difference is indispensable for relationship-dependent sources such as a patient's pharmaceuticals, symptoms, related disease research, and charting information from wearable devices. In graph settings, you simply ask for the relationships between these factors and others; the answers themselves could function as predictors for machine learning models.
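To make the idea of "simply asking what the relationships are" concrete, here is a minimal sketch in plain Python. It assumes a toy set of subject-predicate-object triples; the identifiers (`patient:1`, `takes`, `has_side_effect`) and the helper names are hypothetical and stand in for whatever a real graph store and its query language would provide.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not data from the article.
TRIPLES = [
    ("patient:1", "takes", "drug:warfarin"),
    ("patient:1", "reports", "symptom:bruising"),
    ("drug:warfarin", "has_side_effect", "symptom:bruising"),
    ("patient:2", "takes", "drug:aspirin"),
    ("patient:2", "reports", "symptom:headache"),
]


def build_index(triples):
    """Index outgoing edges by subject so traversal is a dictionary lookup."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def paths_between(index, start, goal, max_hops=2):
    """Return every predicate path from start to goal within max_hops edges."""
    found = []
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        if node == goal and path:
            found.append(tuple(path))
            continue
        if len(path) < max_hops:
            for predicate, obj in index.get(node, []):
                stack.append((obj, path + [predicate]))
    return sorted(found)


def relationship_features(index, patient, symptom):
    """Turn the discovered relationships into candidate model inputs."""
    paths = paths_between(index, patient, symptom)
    return {
        "n_paths": len(paths),
        "symptom_is_known_side_effect": ("takes", "has_side_effect") in paths,
    }


index = build_index(TRIPLES)
print(paths_between(index, "patient:1", "symptom:bruising"))
print(relationship_features(index, "patient:1", "symptom:bruising"))
```

The caller never states which relationship to look for; the traversal reports whatever connects the two nodes, and those answers become features. A production graph database would do this with its own query language and indexes rather than a Python loop.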

Harmonizing Unstructured and Structured Data

Graphs are peerless at aligning unstructured, structured, and semistructured data. Given how much of today's data is unstructured and unmined, this advantage is particularly valuable. For example, semantic graphs easily harmonize the unstructured data gleaned from text analytics with that from traditional tabular databases. The linked-data approach of semantic graphs aligns data sets seamlessly, allowing additional data sources to be simultaneously considered when looking for the best variables to help make predictions.

Largely due to the flexible nature of graph technology, it's easy to start with virtually any data set and readily add others when preparing machine learning models.

Best of all, this harmonization is based on the business meaning of data -- another consequence of the linked-data approach. Semantic graphs are predicated on standards-based data models to which all data (structured or otherwise) adheres. Those ontologies provide a common business meaning for data regardless of originating source or format. When analyzing unstructured sources such as text, there's no telling what organizations might uncover. Semantic graphs ensure that whatever the results are, they'll be harmonized with structured data and the business meaning underpinning their value to the enterprise. Existing data sets described using open standard graph descriptions are also much easier to reuse in any combination.
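As a rough illustration of harmonizing on business meaning, the sketch below maps a tabular record and the output of a text-analytics step onto one shared vocabulary. Everything here is an assumption for illustration: the `ONTOLOGY` mapping, the field names, the entity labels, and the premise that both sources have already been resolved to the same entity identifier.

```python
# Hypothetical mapping from source-specific names to shared ontology predicates.
ONTOLOGY = {
    "crm": {"cust_nm": "hasName", "cust_city": "locatedIn"},
    "text": {"PERSON": "hasName", "GPE": "locatedIn", "PRODUCT": "mentionsProduct"},
}


def to_triples(source, entity_id, fields):
    """Translate (name, value) pairs from one source into ontology triples.

    Fields with no ontology mapping are skipped rather than guessed at.
    """
    mapping = ONTOLOGY[source]
    triples = set()
    for name, value in fields:
        predicate = mapping.get(name)
        if predicate is not None:
            triples.add((entity_id, predicate, value))
    return triples


# Structured input: one row from a tabular system.
crm_row = {"cust_nm": "Ada Lovelace", "cust_city": "London", "internal_flag": "Y"}

# Unstructured input: (label, text) pairs as a text-analytics step might emit them.
text_entities = [("PERSON", "Ada Lovelace"), ("PRODUCT", "Widget X")]

# Both sources land in one graph; identical facts collapse into a single triple.
graph = to_triples("crm", "customer:42", crm_row.items()) | to_triples(
    "text", "customer:42", text_entities
)

for triple in sorted(graph):
    print(triple)
```

Because both sources are expressed with the same predicates, the name found in the text and the name in the table become one fact, and the product mention from the text sits alongside the structured attributes as another candidate variable.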

Feature Engineering

Perhaps the biggest differentiator of the proper application of knowledge graphs for data preparation is the acceleration -- and automation -- of feature engineering. Feature engineering is the process whereby data scientists identify the relevant data attributes that predict the desired outcome of machine learning models; it's essential for model accuracy. Oftentimes, time-consuming data preparation goes hand in hand with inefficient feature engineering, and together they slow the production of machine learning models. It is little wonder that nearly three-fourths of data scientists view data prep and feature engineering as the least enjoyable part of their jobs.

Graphs can expedite feature engineering and feature selection partly because of automatic query generation and transformation capabilities. Accelerating this part of engineering machine learning models allows for increased numbers of features, which positively impacts model accuracy. By assisting data scientists and engineers with the transformations necessary for feature engineering, graphs shorten the process from days and weeks to hours.
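The article does not describe how the queries are generated, so the following is only one plausible sketch: given an entity class and a list of candidate feature properties, it assembles a SPARQL SELECT query as text. The function name, the `ex` prefix, and the `http://example.org/` namespace are all hypothetical.

```python
def build_feature_query(entity_class, properties, prefix="ex",
                        namespace="http://example.org/"):
    """Build a SPARQL SELECT that returns one column per candidate feature.

    Each property is wrapped in OPTIONAL so entities missing a value
    still appear in the result, with that column left unbound.
    """
    if not properties:
        raise ValueError("at least one property is required")
    for name in [entity_class, *properties]:
        if not name.isidentifier():
            raise ValueError(f"not a simple name: {name!r}")

    variables = " ".join(f"?{p}" for p in properties)
    optionals = "\n".join(
        f"  OPTIONAL {{ ?entity {prefix}:{p} ?{p} . }}" for p in properties
    )
    return (
        f"PREFIX {prefix}: <{namespace}>\n"
        f"SELECT ?entity {variables}\n"
        "WHERE {\n"
        f"  ?entity a {prefix}:{entity_class} .\n"
        f"{optionals}\n"
        "}"
    )


print(build_feature_query("Patient", ["age", "dosage", "diagnosis"]))
```

Adding a candidate feature becomes a one-word change to the property list instead of a hand-written join, which is the kind of saving that lets data scientists try more features in less time.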

Traceability

Traceability, also known as data lineage or data provenance, is pivotal for ensuring production-level accuracy and consistency commensurate with that of the training period for machine learning models. Models are trained with specific input data that delivers equally specific outputs. As such, most initial models are brittle and require data as similar as possible to that used during their training. The provenance recorded in graph databases illustrates the flow of data used to train models. This lineage provides a road map for recreating data's journey once models are put into production. Traceability shows how to reconstruct the data flow to leverage models without having to rebuild or substantially tweak them.

When building a machine learning model to predict patient outcomes for a specific medication or prescription, for example, a host of information about that specific patient -- potentially contained in scores of tables and documents -- must be encapsulated within that model. Provenance demonstrates just how it was captured and what processes took place, which is invaluable when operationalizing models.
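A provenance chain can itself be stored as graph edges. The sketch below assumes lineage is recorded as triples using a `wasDerivedFrom` predicate (the name echoes the W3C PROV vocabulary, but nothing here uses a PROV library), and the artifact names are hypothetical. Walking the edges upstream from the training set lists everything that would have to be reproduced in production.

```python
# Hypothetical lineage triples: (artifact, "wasDerivedFrom", source).
PROVENANCE = [
    ("training_set", "wasDerivedFrom", "joined_records"),
    ("joined_records", "wasDerivedFrom", "prescriptions_table"),
    ("joined_records", "wasDerivedFrom", "clinical_notes"),
    ("clinical_notes", "wasDerivedFrom", "raw_documents"),
]


def parents_of(triples):
    """Map each artifact to the sources it was directly derived from."""
    parents = {}
    for artifact, predicate, source in triples:
        if predicate == "wasDerivedFrom":
            parents.setdefault(artifact, []).append(source)
    return parents


def upstream(triples, artifact):
    """Return every direct and indirect source of an artifact."""
    parents = parents_of(triples)
    seen, order, stack = set(), [], [artifact]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                stack.append(parent)
    return order


def raw_sources(triples, artifact):
    """Return the upstream artifacts that have no recorded source of their own."""
    parents = parents_of(triples)
    return sorted(n for n in upstream(triples, artifact) if n not in parents)


print(upstream(PROVENANCE, "training_set"))
print(raw_sources(PROVENANCE, "training_set"))
```

The raw sources are the inputs a production pipeline must be wired to, and the intermediate artifacts are the steps it must repeat so the model sees data shaped like its training data.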

A Final Word

The graph approach expedites machine learning data engineering and yields more effective models than are otherwise possible. It accelerates the process by rapidly harmonizing unstructured data alongside semistructured and structured data, and its automated query generation considerably reduces the time required for feature engineering and feature selection.

Moreover, graphs make the preparation process more effective by offering a new, relationship-savvy source of training data and by supplying a provenance chain that keeps paying off when models are operationalized. This combination enables organizations to optimize data engineering so they can concentrate on machine learning's value.

About the Author

Sean's experience covers multiple aspects of starting and growing a software company, including holding various titles from president through co-lead dishwasher. He continues in a leadership role as CTO and serves on the board.
