Reshaping Data Management for a Hybrid, Multicloud World
Three trends that were at the forefront of both the Cloudera Analysts conference and Strata Data.
- By David Stodder
- November 25, 2019
As a longtime participant in tech conferences at the Jacob K. Javits Convention Center on the west side of Manhattan in New York City, I found this September's walk to the O'Reilly Strata Data Conference startling. In the old days, the walk took me through a ramshackle district of taxi cab garages and repair shops, vacant lots, hole-in-the-wall bars, and an assortment of humble workshops and warehouses. Now it takes me through the brand-new, not-quite-finished Hudson Yards, a land of gleaming, towering, mirrored skyscrapers and ultra-modern shops and restaurants.
This grand cityscape outside the Javits Center fit with the changes going on inside. Strata used to be focused on Apache Hadoop, its ecosystem of technologies, and the data science that drove leading-edge big data use cases. You would leave with a headful of code and a pocketful of stuffed yellow elephants -- the mascot of Hadoop. Now, Strata Data's scope encompasses cloud computing, artificial intelligence, and diverse data architectures, with Hadoop among them but hardly mentioned.
Yet, Hadoop is not forgotten. At the Cloudera Analysts conference that preceded Strata Data, Cloudera chief product officer (and Hortonworks co-founder) Arun C. Murthy described four areas where "the Hadoop philosophy" still guides the development of data platforms:
- Disaggregated software stack (of storage, compute, security, governance, and more)
- Extremely large scale, using distributed systems and commodity infrastructure, hardware, and the cloud
- Open source and open data standards
- An evolving ecosystem that can include diverse technologies and enable independent innovation at every layer
The Hadoop philosophy (not to mention Cloudera's and Hortonworks' Hadoop runtime distributions) still lives inside the Cloudera Data Platform (CDP), introduced in June but more fully described and fleshed out with services at Strata. Since the merger with Hortonworks was announced in October 2018, the two companies' customers, the technology industry, and concerned financial investors have all been watching to see how well the combined entity would articulate and execute its strategy. Cloudera has had to adjust as the industry landscape has shifted rapidly to the cloud, where dominant providers such as Amazon, Google, and Microsoft offer their own Hadoop and Apache Spark services.
The comprehensive CDP and its services, including the newly announced cloud-native Cloudera Data Warehouse, have repositioned Cloudera as "the enterprise data cloud company," to use its own description. Cloudera has shifted its center of gravity to the cloud, but with customers still invested in on-premises systems, it is taking a hybrid, multicloud approach that offers unified management across on-premises and cloud-based systems. CDP works with open-source Kubernetes container management and orchestration to enable easier integration and portability. CDP aims to support five secure and governed self-service experiences: flow and streaming, data engineering, operational database, machine learning, and data warehouse.
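Kubernetes's own client APIs make the portability point concrete. The following is a minimal sketch, not Cloudera's tooling: using the official Kubernetes Python client, the same code inspects an on-premises cluster and a cloud cluster simply by switching kubeconfig contexts (the context names here are assumed for illustration).

```python
# A minimal sketch: the same Kubernetes API works across on-prem and
# cloud clusters, which is what makes containerized services portable.
# Context names ("onprem", "aws-east") are hypothetical; substitute
# whatever contexts your kubeconfig actually defines.
from kubernetes import client, config

def list_pods(context_name: str) -> None:
    # Load credentials for the chosen cluster from ~/.kube/config
    config.load_kube_config(context=context_name)
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace="default")
    print(f"[{context_name}] {len(pods.items)} pods in 'default':")
    for pod in pods.items:
        print(f"  {pod.metadata.name}: {pod.status.phase}")

# Identical code path against two very different environments
for ctx in ("onprem", "aws-east"):
    list_pods(ctx)
```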
Cloudera Data Warehouse (CDW) applies containers to enable faster and easier creation of virtual data warehouses. Instead of requiring data engineers to work at a lower level to set up Impala clusters, Cloudera aims to elevate the user experience toward simply declaring a "T-shirt size": that is, preset requests that the system can interpret to adaptively scale, auto-provision, and optimize resources for the workloads. At the Analysts conference, Cloudera described optimization capabilities for "bursting" on-premises workloads, data, metadata, and more to the cloud to make the transition faster and more in tune with the elasticity organizations are seeking from cloud deployments.
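Cloudera did not detail an API for this at the event, so here is a purely hypothetical sketch of the idea in Python: the user declares a size, and the platform translates it into a concrete resource specification. All names, sizes, and mappings below are invented for illustration.

```python
# Hypothetical illustration of "T-shirt size" provisioning: the user
# declares intent; the platform translates it into cluster resources.
# None of these names come from Cloudera's actual API.

SIZES = {
    # size: (executor nodes, vCPUs per node, memory GB per node)
    "S":  (2,  4,  16),
    "M":  (4,  8,  32),
    "L":  (8, 16,  64),
    "XL": (16, 32, 128),
}

def provision_virtual_warehouse(name: str, size: str, autoscale: bool = True) -> dict:
    """Translate a declarative size request into a concrete resource spec."""
    nodes, vcpus, mem_gb = SIZES[size]
    return {
        "name": name,
        "nodes": nodes,
        "vcpus_per_node": vcpus,
        "memory_gb_per_node": mem_gb,
        # With autoscaling on, the platform (not the engineer) decides
        # when to grow or shrink within these bounds.
        "autoscale": {"min_nodes": nodes, "max_nodes": nodes * 2} if autoscale else None,
    }

print(provision_virtual_warehouse("sales-reporting", "M"))
```

The point of the declarative style is that the mapping table belongs to the platform, so it can be retuned (or learned) without users changing how they ask for capacity.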
Three Important Trends
There's more to talk about regarding Cloudera's announcements, but I would like to focus the rest of this article on three trends that were top of mind at both the Cloudera Analysts conference and Strata Data.
Separating computation from data storage. The growing use of containers is important to the trend toward flexible systems that use layered architectures and allow independent selection of data storage services and computation resources, such as the number and type of processing units working in clusters. Over the history of database systems, the pendulum has swung back and forth between tightly coupled computation and storage and more loosely coupled systems. The Hadoop Distributed File System (HDFS), which largely uses direct-attached storage (DAS), has been tightly coupled to avoid the latency that grows when big data must move from storage to computation resources. However, as they shift to cloud data architectures, organizations need greater flexibility to choose which type and how much computation they need to handle a particular workload. They also need flexibility on the data storage side so they can switch to newer technologies and position their data to meet hot, medium, and cold levels of access demand.
Today, faster networks are facilitating looser coupling, where it matters less where the data is stored if it can be moved, replicated, or accessed quickly. Along with faster networks, the other factor driving separation is the use of scalable object storage such as Amazon S3, Google Cloud Storage, and Microsoft Azure Storage, which enables organizations to replicate data across locations more easily. Many organizations are moving their data lakes, for example, to object storage in the cloud.
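The separation shows up in everyday code. In this minimal sketch using the standard boto3 client for Amazon S3, any machine with credentials can fetch the object, so the compute tier, whether a laptop, a Spark cluster, or a fleet of containers, is chosen independently of where the data lives (the bucket and key names are placeholders):

```python
# Minimal sketch of compute/storage separation: any machine with
# credentials can pull this object; the compute tier is chosen
# independently of where the data lives. Bucket/key are placeholders.
import boto3

s3 = boto3.client("s3")  # credentials come from the environment

obj = s3.get_object(Bucket="example-data-lake", Key="events/2019/09/part-0000.csv")
body = obj["Body"].read().decode("utf-8")

# The "compute" here is trivial, but it could just as well be a
# Spark cluster or a fleet of containers reading the same object.
rows = body.splitlines()
print(f"Fetched {len(rows)} rows from object storage")
```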
However, looser coupling across hybrid, multicloud environments can expose data systems to performance and accessibility problems, as well as issues such as excessive redundancy. Data must be managed differently to ensure performance efficiency and quality.
This is likely to drive demand for data orchestration, which "brings speed and agility to big data and AI workloads and reduces costs by eliminating data duplication and enables users to move to newer storage solutions such as object stores," according to Alluxio, a solution provider I met with at Strata Data. "Data orchestration is to data what container orchestration is to containers." Alluxio, based on the research project "Tachyon," is an open-source virtual distributed file system that offers a data abstraction layer and interface between computation frameworks and storage.
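Setting aside Alluxio's actual interfaces, the concept of a data abstraction layer can be sketched generically: computation addresses a single logical namespace, and the orchestration layer decides which physical store and cache tier serves each read. Everything in this sketch, classes and mount paths included, is hypothetical.

```python
# Hypothetical sketch of a data-orchestration layer: one logical
# namespace in front of several physical stores, with a hot cache
# to avoid repeatedly pulling the same bytes over the network.
# This illustrates the concept, not Alluxio's actual implementation.

class OrchestrationLayer:
    def __init__(self):
        self.mounts = {}   # logical prefix -> backing store
        self.cache = {}    # logical path -> cached bytes ("hot" tier)

    def mount(self, prefix: str, store) -> None:
        self.mounts[prefix] = store

    def read(self, path: str) -> bytes:
        if path in self.cache:                # hot: serve from cache
            return self.cache[path]
        for prefix, store in self.mounts.items():
            if path.startswith(prefix):       # cold: fetch from backing store
                data = store.read(path[len(prefix):])
                self.cache[path] = data       # promote to hot tier
                return data
        raise FileNotFoundError(path)

# Compute frameworks see one namespace, regardless of where data lives:
#   fs = OrchestrationLayer()
#   fs.mount("/warehouse/", S3Store("my-bucket"))      # cloud object store
#   fs.mount("/archive/", HdfsStore("namenode:8020"))  # on-prem HDFS
#   fs.read("/warehouse/orders/part-0000.parquet")
```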
As hybrid, multicloud environments grow, we will see other data management and integration solutions introduced to address how organizations can avoid swamping networks with massive lift-and-shift data migration to the cloud. Orchestration can also help organizations adhere to regulations that require them to be highly selective about what data gets migrated to the cloud and for which workloads.
Metadata and data catalogs. Knowledge about how data is defined, its lineage, and how it is related to other data is crucial to getting value from data, whether through visual data exploration, advanced analytics, or AI and machine learning. Virtualized, loosely coupled, and distributed systems, precisely because they are not tightly integrated, especially need access to good, centralized metadata to coordinate data meaning and collaborative understanding across users and applications. Though their offerings differ, solution providers such as Alation, Cambridge Semantics, Collibra, and Waterline Data are gaining prominence by variously providing smart, AI-augmented data catalog development and management, faster data discovery, and more self-service, business-driven examination of data relationships. This is a key area for modern, diverse data architectures.
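To make the idea tangible, here is an illustrative sketch of the kind of knowledge a catalog entry centralizes: definition, lineage, and relationships. The structure and field names are invented, not any vendor's schema.

```python
# Illustrative sketch of a minimal data catalog entry; field names
# are invented for this example, not any vendor's actual schema.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str                      # logical dataset name
    definition: str                # business meaning of the data
    lineage: list = field(default_factory=list)   # upstream sources
    related: list = field(default_factory=list)   # associated datasets

orders = CatalogEntry(
    name="analytics.orders_daily",
    definition="One row per order, aggregated to day grain in UTC.",
    lineage=["raw.orders_stream", "raw.currency_rates"],
    related=["analytics.customers", "analytics.returns_daily"],
)

# A catalog lets users (and AI-driven discovery tools) answer questions
# like "where did this come from?" without reading pipeline code:
print(f"{orders.name} derives from: {', '.join(orders.lineage)}")
```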
Data quality and governance. These two areas are "mom and apple pie" in that they are important to the success of every kind of data management system and BI or analytics application, whether strictly on premises or in a hybrid multicloud environment. Yet, obviously, the latter scenario brings new potential exposures to data inconsistency, redundancy, and poor governance. Organizations need data preparation, governance, and integration solutions that enable them to control what data is migrated where. I met with Trifacta at Strata Data, where the company announced new data quality assessment, remediation, and monitoring solutions. Trifacta and other vendors are using AI techniques such as machine learning to enable data profiling and cleansing for data of higher volume, speed, and variety.
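The profiling these tools automate can be approximated by hand. This minimal pandas sketch computes the basic signals (null rates, cardinality, and simple rule checks) that a quality-monitoring job would track at far larger scale; the sample records are invented.

```python
# Minimal data-profiling sketch with pandas: the basic signals that
# quality tools compute (and then learn thresholds for) at much
# larger scale. The sample records are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 103, None],
    "country":     ["US", "us", "DE", "DE", "FR"],
    "order_total": [25.0, 310.5, None, 42.0, 18.75],
})

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate": df.isna().mean(),   # missing-value exposure
    "distinct": df.nunique(),        # cardinality (spots dupes, casing drift)
})
print(profile)

# Simple rule-based checks a monitoring job might run on every load:
assert df["order_total"].dropna().ge(0).all(), "negative order totals"
duplicates = df["customer_id"].dropna().duplicated().sum()
print(f"{duplicates} duplicate customer_id values")
```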
The Cloud: Reshaping Views of Data
Just as Hudson Yards' mirrored skyscrapers are changing the look and feel of New York City's West Side, cloud platforms are reshaping how organizations need to view data management and architecture. Older ways of preparing, integrating, and governing data are proving inadequate in hybrid, multicloud environments that are expanding in data volume, speed, and variety. Before leaping into the cloud with both feet, organizations need to assess their readiness and evaluate solutions that might better fit the new data landscape.