TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

The Data Lake Manifesto: 10 Best Practices

You need these best practices to define the data lake and its methods.

By Philip Russom
October 16, 2017

The data lake has come on strong in recent years as a modern design pattern that fits today's data and the way many users want to organize and use their data. For example, many users want to ingest data into the lake quickly so it's immediately available for operations and analytics. They want to store data in its original raw state so they can process it in many different ways as their requirements for business analytics and operations evolve.

They need to capture -- in a single pool -- big data, unstructured data, and data from new sources such as the Internet of Things (IoT), social media, customer channels, and external sources such as partners and data aggregators. Furthermore, users are under pressure to develop business value and organizational advantage from all these data collections, often via discovery-oriented analytics.

A data lake, especially when deployed atop Hadoop, can assist with all of these trends and requirements -- if users can get past the lake's challenges. In particular, the data lake is still very new, so its best practices and design patterns are just now coalescing. Most data lakes are on Hadoop, which itself is immature; a data lake can bring much-needed methodology to Hadoop. To the uninitiated, data lakes appear to have no methods or rules, yet that's not true. In fact, best practices for the data lake exist, and you'll fail without them.

To help data management professionals and their business counterparts get past these challenges and get the most from data lakes, the remainder of this article explains "The Data Lake Manifesto," a list of the top 10 best practices for data lake design and use, each stated as an actionable recommendation.

The Data Lake Manifesto

1. Onboard and ingest data quickly with little or no up-front improvement.

One of the innovations of the data lake is early ingestion and late processing. This is similar to extract, load, transform (ELT), but the transform step comes far later in time and is sometimes defined on the fly as data is read. Adopting the practice of early ingestion and late processing makes integrated data available ASAP for operations, reporting, and analytics. This demands diverse ingestion methods to handle diverse data structures, interfaces, and container types; to scale to large data volumes and real-time latencies; and to simplify the onboarding of new data sources and data sets.
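As a rough illustration of early ingestion and late processing, the sketch below lands a payload unchanged and applies a schema only when the data is read. It assumes newline-delimited JSON and a local directory standing in for the lake; the function names `ingest_raw` and `read_with_schema` and the sample sensor fields are hypothetical, not part of any particular platform.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def ingest_raw(lake_dir: Path, source_name: str, payload: bytes) -> Path:
    """Land the payload byte-for-byte; no parsing or cleansing at ingest time."""
    target = lake_dir / "raw" / source_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_with_schema(path: Path, fields: dict) -> list:
    """Apply a schema at read time: keep and cast only the requested fields."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append(
            {name: cast(record[name]) for name, cast in fields.items() if name in record}
        )
    return rows


def demo() -> tuple:
    """Ingest two hypothetical sensor records, then read them with a late-bound schema."""
    payload = (
        b'{"device": "d1", "temp": "21.5", "extra": "x"}\n'
        b'{"device": "d2", "temp": "19"}\n'
    )
    with TemporaryDirectory() as tmp:
        path = ingest_raw(Path(tmp), "sensors.jsonl", payload)
        stored = path.read_bytes()  # the raw copy is untouched by reads
        rows = read_with_schema(path, {"device": str, "temp": float})
    return stored == payload, rows
```

Because the schema is an argument to the read, a different consumer could pass a different field map against the same raw file without any re-ingestion.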

2. Control who loads which data into the lake and when or how it is loaded.

Without this control, a data lake can easily turn into a data swamp, which is a disorganized and undocumented data set that's difficult to navigate, govern, and leverage. Establish control via policy-based data governance. A data steward or curator should enforce a data lake's anti-dumping policies. Even so, the policies should allow exceptions -- as when a data analyst or data scientist dumps data into analytics sandboxes.

Document data as it enters the lake using metadata, an information catalog, business glossary, or other semantics so users can find data, optimize queries, govern data, and reduce data redundancy.
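One way to picture policy-based control plus documentation at load time is the minimal sketch below. It assumes a simple in-memory policy that maps users to the zones they may load into, rejects undocumented loads outside a sandbox zone, and records a catalog entry for every accepted load. The class names, zone names, and fields are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    dataset: str
    owner: str
    zone: str
    description: str
    loaded_at: str


@dataclass
class LakeGovernor:
    # Hypothetical policy: user name -> set of zones that user may load into.
    allowed: dict
    catalog: dict = field(default_factory=dict)

    def load(self, user: str, zone: str, dataset: str, description: str) -> CatalogEntry:
        """Check the policy, then document the data set as it enters the lake."""
        if zone not in self.allowed.get(user, set()):
            raise PermissionError(f"{user} may not load into zone '{zone}'")
        # Anti-dumping rule with one exception: sandboxes accept undocumented data.
        if zone != "sandbox" and not description.strip():
            raise ValueError("undocumented data is rejected outside the sandbox")
        entry = CatalogEntry(
            dataset=dataset,
            owner=user,
            zone=zone,
            description=description,
            loaded_at=datetime.now(timezone.utc).isoformat(),
        )
        self.catalog[dataset] = entry
        return entry
```

In practice the policy and catalog would live in a governance or metadata tool rather than in application code; the point here is only that the access check and the documentation step happen together at the moment of loading.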

3. Persist data in a raw state to preserve its original details and schema.

Detailed source data is preserved in storage so it can be repurposed repeatedly as new business requirements emerge for the lake's data. Furthermore, raw data is great for exploration and discovery-oriented analytics (e.g., mining, clustering, and segmentation), which work well with large samples, detailed data, and data anomalies (outliers, nonstandard data).

As users work with lake data over time, they sometimes break this rule to apply light data standardization when required for reporting, complete customer views, recurring queries, and general data exploration.

4. Improve data at read time as lake data is accessed and processed.

This is common with self-service user practices, namely data exploration and discovery, coupled with data prep and visualization. Data is modeled and standardized as it is queried iteratively, and metadata may also be developed during exploration. Note that these data improvements should be applied to copies of data so that the raw detailed source remains intact. As an alternative, some users improve lake data on the fly with virtualization, metadata management, and other semantics.
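The point about improving copies rather than the raw source can be sketched as follows. This is a minimal example under assumed data: the customer records, the `COUNTRY_CODES` mapping, and the `standardized_view` function are hypothetical and stand in for whatever data prep tool or query layer a team actually uses.

```python
import copy

# Hypothetical raw records, kept exactly as they arrived from the source.
RAW_CUSTOMERS = (
    {"name": "  alice SMITH ", "country": "us"},
    {"name": "Bob Jones", "country": "USA"},
)

# Hypothetical standardization rule applied only at read time.
COUNTRY_CODES = {"us": "US", "usa": "US"}


def standardized_view(raw_rows) -> list:
    """Return an improved copy of the rows; the raw rows are never modified."""
    improved = copy.deepcopy(list(raw_rows))
    for row in improved:
        # Collapse stray whitespace and normalize capitalization.
        row["name"] = " ".join(row["name"].split()).title()
        # Map known country spellings to one code; leave unknown values as-is.
        key = row["country"].strip().lower()
        row["country"] = COUNTRY_CODES.get(key, row["country"])
    return improved
```

Calling `standardized_view(RAW_CUSTOMERS)` yields cleaned rows for reporting or exploration, while `RAW_CUSTOMERS` still holds the original values for any later analysis that needs them.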

5. Capture big data and other new data sources in the data lake.

TDWI survey data shows that over half of data lakes are deployed exclusively on Hadoop, with another quarter deployed partially on Hadoop and partially on traditional systems. Many data lakes are deployed to handle big data (i.e., large volumes of Web data), and so Hadoop is a good fit. Hadoop-based data lakes are increasingly capturing large data collections from new sources, especially the IoT (machines, sensors, devices, vehicles), social media, and marketing channels.

6. Integrate data of diverse sources, structures, and vintages.

Data lakes aren't just for IoT and big data. Many users blend traditional enterprise data and modern big data on a Hadoop-based lake to enable advanced analytics, extend customer views with big data, enlarge data samples of existing fraud and risk analytics, and enrich cross-source correlations for more insightful clusters and segments. In addition, TDWI has seen blended lake data enable logistics optimization, sentiment analysis, near-time business monitoring, patient outcome analytics in healthcare, and predictive maintenance.
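As a small, hedged illustration of extending a customer view with big data, the sketch below blends traditional customer records with event data by counting events per customer. The record layouts, the `customer_id` key, and the `extend_customer_view` function are assumptions made for this example only.

```python
from collections import Counter


def extend_customer_view(customers: list, events: list) -> list:
    """Add an event count from the big data source to each enterprise customer record."""
    counts = Counter(event["customer_id"] for event in events)
    # Build new dictionaries so the source records stay unchanged.
    return [{**c, "event_count": counts.get(c["customer_id"], 0)} for c in customers]


# Hypothetical enterprise data (e.g., from a CRM or ERP system).
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]

# Hypothetical big data (e.g., Web clicks or device readings).
events = [
    {"customer_id": 1, "type": "click"},
    {"customer_id": 1, "type": "purchase"},
    {"customer_id": 3, "type": "click"},
]

extended = extend_customer_view(customers, events)
```

Customers with no matching events get a count of zero, and events for unknown customers are simply ignored here; a real pipeline would need an explicit rule for both cases.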

7. Extend and improve enterprise data architectures, both old and new.

Data lakes are rarely siloed. Most are integral parts of a larger data architecture or multiplatform data ecosystem -- common examples being the multiplatform data warehouse environment, omnichannel marketing, and the digital supply chain. A lake can also extend traditional applications -- such as those for multimodule ERP, financials, content management, and data or document archiving. Hence, a data lake can be a modernization strategy that extends the useful life and functionality of an existing application or data environment.

8. Make each data lake serve multiple technical and architectural purposes.

A single lake typically fulfills multiple architectural purposes, such as data landing and staging, archiving for detailed source data, sandboxing for analytics data sets, and managing operational data sets (especially complete views and data masters). Even so, when a single data lake plays this many architectural roles, it may need to be distributed over multiple data platforms, each with unique storage or processing characteristics. For example, TDWI surveys show that a quarter of data lakes are on both Hadoop and multiple instances of relational databases.

9. Enable new self-service data-driven business best practices.

These include data exploration, prep, visualization, and some kinds of analytics. Nowadays, savvy users (both business and technical) expect self-service access to lake data, and they will consider the lake a failure without it. Note that self-service functionality depends on two key components: tools built for the ease of use that business users need, and business metadata and other specialized semantics.

10. Select data management platforms that satisfy data lake requirements.

Hadoop is the preferred data platform for most lakes due to its low price, linear scalability, and powerful in situ processing for analytics. However, some users implement a massively parallel processing (MPP) relational database when the lake's data is relational and/or requires relational processing (complex SQL, OLAP, materialized views).

Hybrid platforms are on the rise with data lakes; they may combine Hadoop and relational systems or on-premises and on-cloud systems. With many data collections (data lakes, warehouses, big data, analytics, etc.), TDWI sees an increase in cloud storage, whether file/folder, object, or block.

About the Author

Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at prussom@tdwi.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.
