Laying the Foundation to Unlock AI Value at Scale
To provide robust data logistics, your data fabric will need these four traits.
- By Jack Norris
- April 26, 2019
McKinsey predicts that artificial intelligence has the potential to begin creating between $3.5 trillion and $5.8 trillion in value annually across nine business functions in 19 industries. Forrester Research predicts that by 2020, those organizations leading the adoption of machine learning, deep learning, and AI will be poised to take away more than $1.2 trillion in value from their wait-and-see peers.
Although the term AI is overused and often abused, the ability of software to learn from data and gain insights from that knowledge holds enormous potential. Indeed, AI is an inevitable step in the evolution of big data analytics, which began with descriptive results, progressed to offering predictive results in real time, and will ultimately be able to deliver prescriptive results.
The evolution of big data has always been -- and will continue to be -- fraught with pitfalls. This article examines the potential impediments to achieving value via AI technologies over the long term and explains why data logistics will be essential to doing so at scale.
Potential Roadblocks on the Journey to Long-Term Value
The operative word here is journey, and the data analytics journey has a reputation for covering a lot of ground quickly thanks to relentless advances on multiple fronts. As workloads have evolved from data warehouses to data lakes to machine learning and AI, deployment models have evolved from on premises to the cloud to multicloud and, most recently, to containerization. Constant change also characterizes the myriad algorithms and tools used.
Underlying the journey is the constant tension created by conflicting objectives within an organization. Users (business units with their managers, analysts, and data scientists) want all data to be freely available to foster flexibility and facilitate innovation. The IT department needs to protect the data and enforce applicable security and personal privacy regulations.
One place where this tension plays out is containers. Containers have many desirable traits in data analytics applications -- including transience, minimal overhead, and rapid spin-up -- that combine to simplify developing and testing new applications. The problem is that containers do not hold their own data, making it challenging to implement stateful applications.
Another challenge involves the tools and algorithms, which are certain to evolve. In the Google white paper "Machine Learning: The High-Interest Credit Card of Technical Debt," the authors note that the entanglement created by ML models makes it difficult or impossible to implement isolated improvements. The cause is what the authors dubbed the CACE principle: changing anything changes everything.
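The effect is easy to see even in a toy model. The following sketch is a hypothetical illustration (not taken from the paper): it trains a linear model on three correlated features, then makes a seemingly isolated change by dropping one of them, and the weights learned for the remaining features shift as well.

```python
# Toy illustration of the CACE principle: removing one input feature changes
# the learned weights of every remaining feature, not just the one removed.
# The data and model here are hypothetical and exist only for this sketch.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# Make the third feature partly redundant with the first two (entanglement).
X[:, 2] = 0.7 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=0.3, size=500)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

full_model = LinearRegression().fit(X, y)
reduced_model = LinearRegression().fit(X[:, :2], y)  # "isolated" change: drop one feature

print("weights with all three features:", full_model.coef_)
print("weights after dropping one:     ", reduced_model.coef_)
# Both remaining weights move (roughly 2.0 -> 2.35 and -1.0 -> -0.8) because the
# model had entangled the dropped feature with them: changing anything changes everything.
```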
With the only constant being change, it is important to find anything that might endure well into the foreseeable future of the data analytics journey. It is becoming increasingly clear that the best hope for long-term agility with stability involves data logistics. The e-book "Machine Learning Logistics" quantifies the importance of logistics with its claim that "90% of the effort in successful machine learning is not about the algorithm or the model or the learning. It's about logistics." Indeed, being a card-carrying member of the Tool-of-the-Month Club while ignoring data logistics virtually ensures that a data lake will turn into a swamp over time.
Data logistics puts the focus on the data. This may seem obvious and inconsequential, but the change represents a paradigm shift from previous application-centric architectures. Such data-centricity creates an enduring foundation that will facilitate making incremental and isolated improvements regardless of how the data, tools, or algorithms change.
Data Fabric as the Foundation
Fabric means different things to different people, but at a minimum a data fabric must provide robust data logistics across three dimensions: (1) different data types stored at and streaming from (2) different sources/locations for (3) authorized access by different groups of users.
Accommodating multiple data types is now mostly straightforward, but AI and ML will need even more (and potentially more types of) data to produce the best results. The locations where data is stored and generated are increasingly spread across multiple clouds, the Internet of Things, and the ever-expanding edge. The different groups that need to be able to collaborate must include the IT staff (responsible for data logistics in the data fabric), the data scientists (creating the algorithms), and the business analysts and decision makers (charged with creating value for the organization).
Providing robust data logistics across all three dimensions requires the data fabric to have four basic traits. First and foremost, the fabric's architecture must be extensible and scalable, enabling it to evolve and grow over time. Although it is difficult to determine with any certainty whether an architecture has (and will continue to have) such enduring versatility, it is possible to find both good and bad characteristics in the existing design. Being able to distribute all functions, for example, affords linear scalability. There should not be a single metadata repository or any dependency on governance that resides in its own database because these could limit scalability and add to the administrative burden as the fabric grows.
The second trait is open access with support for a wide range of open standards and open source software. The fabric will need to serve a broad spectrum of applications, and that will require it to have a comprehensive set of APIs. For Hadoop and Spark, for example, an API for the Hadoop Distributed File System (HDFS) is needed. A POSIX API gives Spark and many other applications read/write access to the distributed data store as if the data were mounted locally, which greatly expands the possibilities for future applications. Other popular APIs include HBase, JSON, ODBC, and OJAI (Open JSON Application Interface).
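To make that concrete, here is a minimal PySpark sketch of the same fabric-resident dataset being read through two of those interfaces. The cluster address and the /mnt/datafabric mount point are placeholder assumptions, not details from the article; substitute the endpoints your own fabric exposes.

```python
# Minimal PySpark sketch: the same fabric-resident data reached two ways.
# The cluster address and the /mnt/datafabric mount point are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fabric-access-sketch").getOrCreate()

# Via the HDFS API, as a classic Hadoop/Spark job would see it.
events_hdfs = spark.read.parquet("hdfs://fabric-cluster:8020/analytics/events")

# Via a POSIX mount, as any local process (Python, R, shell tools) would see it.
events_posix = spark.read.parquet("file:///mnt/datafabric/analytics/events")

print(events_hdfs.count(), events_posix.count())  # same data, two access paths
spark.stop()
```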
One of the most important APIs for AI is the Container Storage Interface. CSI defines the interface between the container orchestration layer and the data fabric, thereby providing the data persistence needed to effectively make containerized applications stateful. Support for these and other APIs simplifies achieving a publish/subscribe architecture, potentially in containerized microservices, and this alone removes a potential roadblock on the data analytics journey.
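As a sketch of what CSI-backed persistence can look like in practice, the snippet below uses the official Kubernetes Python client to request a persistent volume claim from a CSI-provisioned StorageClass; the StorageClass name datafabric-csi and the analytics namespace are hypothetical placeholders. Any pod that mounts the claim keeps its data across restarts and rescheduling, so the container itself can remain transient.

```python
# Hedged sketch using the Kubernetes Python client: request a persistent volume
# from a CSI-backed StorageClass so containerized applications can keep state.
# The StorageClass name "datafabric-csi" stands in for your fabric's CSI driver.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

claim = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="model-feature-store"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],          # shared by several microservices
        storage_class_name="datafabric-csi",     # provisioned through the CSI driver
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="analytics", body=claim
)
# Pods that mount this claim read and write durable fabric storage, so the
# containers stay disposable while the data they depend on persists.
```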
The third trait is that the fabric must support the distribution of data across multiple locations, including on premises, at the edge, and in the cloud. This requires the fabric to have a global namespace for viewing and managing the data across all physical locations. The fabric should also permit metadata to be distributed across all data stores, which can help eliminate application-specific dependencies. Finally, it should allow you to control the location of all data and processing to help enforce policies for optimizing performance, containing costs, and complying with applicable regulations.
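Such placement policies are often easiest to reason about as simple rules over a dataset's tags. The sketch below is purely illustrative -- the tag names and placement targets are hypothetical, not drawn from any particular product -- but it shows the kind of logic a location-aware fabric lets you enforce.

```python
# Illustrative-only sketch of a data placement policy: given a dataset's tags,
# decide where the fabric should keep the primary copy. Tag names and targets
# below are hypothetical examples of performance, cost, and compliance rules.
def placement_target(tags: dict) -> str:
    # Regulated personal data stays in the region where it was collected.
    if tags.get("contains_pii") and tags.get("region") == "eu":
        return "on_prem_eu"
    # Hot, latency-sensitive data lives near the edge devices producing it.
    if tags.get("latency_sla_ms", 1000) < 50:
        return "edge_site"
    # Everything else defaults to lower-cost cloud object storage.
    return "cloud_object_store"

print(placement_target({"contains_pii": True, "region": "eu"}))  # on_prem_eu
print(placement_target({"latency_sla_ms": 20}))                  # edge_site
print(placement_target({"contains_pii": False}))                 # cloud_object_store
```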
The fourth and final trait is that the fabric's security must ensure the data is accessed only by authorized users. This is especially important for the personally identifiable information that is often subject to strict government regulations with heavy fines for violations. Data security also requires the fabric to be compatible with existing security provisions, especially at the perimeter, that protect against hacking and malware. Finally, the fabric must provide the means for the data itself to be protected from loss or corruption via existing backup/restore and disaster recovery provisions.
Getting Started
Data fabrics are easy to create in a development environment, so start there with an existing use case and pay close attention to the data logistics. Assume that the tools and techniques being used will be replaced or improved over time, and experiment with how the data fabric can facilitate those changes. Continuous learning is the key to the journey, so expand with additional use cases and data sets.
Rolling out the data fabric in a production environment is also relatively easy because it can be performed one application at a time. Existing applications can also be modified by integrating their data sets into the growing, distributed data fabric, and that will likely create new opportunities for gaining additional insights.
A journey of a thousand miles begins with a single step. Continuing the data analytics journey will involve taking many steps, and one of the best steps organizations can take now is to invest in the data logistics needed to establish a solid foundation for unlocking AI value at scale well into the future.