November 2, 2015
Think of a data lake as a repository of “raw” data that offers developers and business users a broad set of capabilities, from basic reporting to machine learning to non-relational graph analytics. It is a data management innovation that promises to reduce data processing and storage costs and to enable new forms of business-analytic agility. Its value proposition is a cost-effective environment for data exploration and for hosting and processing the new kinds of analytics many organizations now need to run on big data.
Data lakes are often (but not always) implemented in Hadoop because this file-based platform can accommodate data of different shapes and sizes. That range runs from event messages and logs generated by applications, devices, or sensors, through semi-structured data (such as text), to multi-structured data, a category that includes objects of any conceivable kind, such as JSON and XML files or voice, video, and image files.
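To make the "different shapes and sizes" point concrete, here is a minimal sketch of a file-based raw zone, using a local temporary directory as a stand-in for Hadoop or any other file store. The directory layout (`raw/<source>/<file>`), the helper names `land_raw`, `read_events`, and `build_demo_lake`, and the sample payloads are all assumptions made for illustration; they are not part of any Hadoop API. The idea it shows is that data lands unchanged and structure is applied only when a consumer reads it.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload unchanged under <lake_root>/raw/<source>/<name> (hypothetical layout)."""
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing or schema enforcement on write
    return target


def read_events(path: Path) -> list:
    """Apply structure at read time: parse one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def build_demo_lake():
    """Land three differently shaped payloads, then read back only the log events."""
    lake = Path(tempfile.mkdtemp(prefix="lake_"))
    # Application log events (one JSON object per line).
    land_raw(
        lake, "app_logs", "events.jsonl",
        b'{"user": "a", "action": "login"}\n{"user": "b", "action": "view"}\n',
    )
    # An XML document from an illustrative partner feed.
    land_raw(lake, "partners", "feed.xml", b"<feed><item id='1'/></feed>")
    # Opaque binary content standing in for an image or video frame.
    land_raw(lake, "cameras", "frame.bin", bytes([0, 255, 16, 32]))
    events = read_events(lake / "raw" / "app_logs" / "events.jsonl")
    return lake, events
```

Calling `build_demo_lake()` returns the lake's root path and the parsed log events. The XML and binary files sit next to the logs untouched until some consumer decides how to interpret them.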
This Checklist Report discusses what your enterprise should consider before diving into a data lake project, whether it is your first, second, or even third major one. Presumably, adherence to these principles will become second nature to the data lake team, which may in time improve upon them.