TDWI Blog

Philip Russom, Ph.D., is senior director of TDWI Research for data management and a well-known figure in data warehousing, integration, and quality, having published over 550 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and served as a product manager at database vendors. His Ph.D. is from Yale. You can reach him by email (prussom@tdwi.org), on Twitter (twitter.com/prussom), and on LinkedIn (linkedin.com/in/philiprussom).


Top Twelve Priorities for Data Warehouse Modernization


By Philip Russom, Senior Research Director for Data Management, TDWI

No matter the vintage or sophistication of your organization’s data warehouse (DW) and the environment around it, it probably needs to be modernized in one or more ways. That’s because DWs and requirements for them continue to evolve. Many users need to get caught up by realigning the DW environment with new business requirements and technology challenges. Once caught up, they need a strategy for continuous modernization.

To help you organize your modernization efforts, here’s a list of the top twelve priorities for data warehouse modernization, with a few comments about why each matters. Think of the priorities as recommendations, requirements, or rules that can guide user organizations toward a successful modernization strategy.

1. Embrace change. Data warehouse modernization is real; a recent TDWI survey says that 76% of DWs are evolving moderately or dramatically. Given the rapid pace of change in markets and individual businesses, it’s unlikely the status quo will serve you and your organization much longer. Besides, change is an opportunity for improvement, as long as you manage it with specific directions in mind.

2. Make realignment with business goals your top priority. This is the leading driver according to a recent TDWI survey. Learn the goals of the business and collaborate with business and technical people to determine how business goals map to technology and data. Then base your modernizations on the requirements thus defined. If alignment is achieved, the whole business will modernize, not just the warehouse. And that’s the real point.

3. Make DW capacity a high priority on the technology side. The second most pressing driver is greater capacity for growing data volumes, user counts, and reports. This is no surprise given the explosive growth of traditional enterprise data and new big data. 3-10TB is today’s norm for DW data volume in the average-size organization; however, the norm will soon become 10-100TB as DW programs graduate to greater data volumes. These ranges are realistic capacity targets for successful DWs, so keep them in mind when planning capacity modernization.

4. Make analytics a priority, too. One third of DW professionals modernize for better and newer analytics. That’s a technology challenge for the warehouse, since diverse analytic techniques have diverse data preparation requirements, and they don’t all fit the traditional warehouse. Therefore, additional data platforms and tools that complement older ones may be in order. Keep in mind that analytics is what business users want; your pristine data and elegant architecture won’t mean much if modernization fails to deliver relevant analytics.

5. Don’t forget the related systems and disciplines that also need modernization. Top priorities are analytics, reporting, and data integration, followed by development methods and team characteristics. Align the DW’s modernization with these disciplines, so it can ably provision data in the manner their success requires.

6. Don’t be seduced by new, shiny objects. There are lots of new and cool technologies and tools available today, and many get evaluated for DW modernization. Before adopting one, be sure it goes beyond the bling to satisfy real-world requirements in a performant and cost-effective manner.

7. Assume that you’ll need multiple manifestations of modernization. To get the desired results, you should consider multiple modernization strategies, but try not to execute them all at once, in a big bang.

8. Be familiar with today’s tools and techniques for the modern data warehouse environment (DWE). Extending the number and type of standalone platforms within a DWE is one of the strongest trends in data warehouse modernization, because it adds value in the form of additional platforms, without ripping out or replacing established platforms.

9. Adjust the large-scale architecture of your DWE. The rise of the multi-platform DWE is forcing the modernization of system architectures. For most situations, you will keep and improve your centralized, relational DW. But you should expect to complement it with other platforms, then migrate data and balance workloads among platforms. This requires you to rework the large-scale architecture, which determines how diverse platforms integrate and interoperate, plus which data goes where and how data should flow among platforms.

10. Reevaluate your DW platform. The condition of your data is important, but it’s all for naught if the platform can’t capture, manage, and deliver data with speed, scale, and broad functionality at a reasonable cost. Replacing a DW platform is disruptive and expensive for a business. Therefore, consider leaving your existing DW platform in place, but update it and complement it with other systems. Even so, grossly deficient or outmoded platforms should be replaced.

11. Consider Hadoop for various roles in the DWE. Hadoop’s massive and cheap storage offloads older systems by taking responsibility for data staging, ELT push-down, and the archiving of detailed source data (retained for advanced analytics). Hadoop also serves as a massively parallel execution engine for a wide variety of set-based and algorithmic analytic methods. Conventional wisdom says Hadoop usually complements a DW without replacing it. That’s what early adopters do with Hadoop in DWEs today, and the number of organizations integrating Hadoop with a DW continues to increase. (For a concrete sense of the offloading involved, see the sketch after this list.)

12. Develop plans and recurring cycles for DW modernization. Most DW teams have settled on a quarterly schedule for updating DWs. This applies to tasks of many sizes; well-contained phases of some modernization projects may fit this scheme, as well. However, large-scale modernizations typically need their own plan. The more disruptive a modernization (such as rip-and-replace), the more critical to success is the multi-phase plan (sometimes the multi-year plan). Modernization affects business users and their processes; for minimal disruption, business managers should be involved in developing and executing modernization plans.
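
To make the Hadoop roles in priority 11 concrete, here is a minimal PySpark sketch of the offloading pattern described there: raw extracts staged in HDFS are transformed on the Hadoop cluster, the detail is archived in columnar form, and only the refined result is handed to the warehouse. The paths, table name, and columns are hypothetical, and the sketch assumes the Spark 1.x SQLContext API rather than any particular vendor tool.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="elt-offload-sketch")
sqlContext = SQLContext(sc)

# Stage: read raw source extracts already landed in HDFS
# (hypothetical path and schema).
raw = sqlContext.read.json("hdfs:///landing/orders/2016/03/")

# Transform on the Hadoop cluster instead of inside the DW engine.
raw.registerTempTable("raw_orders")
refined = sqlContext.sql("""
    SELECT customer_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM raw_orders
    WHERE status = 'COMPLETE'
    GROUP BY customer_id
""")

# Archive the detailed data in columnar form for later analytics,
# and write the small refined result for loading into the warehouse.
raw.write.parquet("hdfs:///archive/orders/2016/03/")
refined.write.parquet("hdfs:///staging/dw_load/customer_orders/")
```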

ANNOUNCEMENT

To learn more about modernizing data warehouses and related IT systems, attend my TDWI webinar Data Warehouse Modernization in the Age of Big Data Analytics, coming up on April 14, 2016. Register online for the webinar: http://bit.ly/DWMod16

This webinar will quantify trends in data warehouse modernization and catalog technologies that are relevant. It will also document strategies and user best practices for organizing modernization projects. The goal is to help DW professionals and their business counterparts plan the next generation of their data warehouse, in alignment with business goals.

Posted on March 24, 2016


An Introduction to Data Warehouse Modernization

By Philip Russom, Senior Research Director for Data Management, TDWI

As any data warehouse professional can tell you, the average data warehouse (DW) is today evolving, extending, and modernizing to support new technology and business requirements, as well as to prove its continued relevance in the age of big data and analytics. This process has become known as data warehouse modernization; synonyms include DW augmentation, automation, and optimization. Every user organization and its DW is a unique scenario, so every modernization program is, too. Even so, a few common situations, drivers, and outcomes have arisen.

DW modernization takes many forms.

For example, common scenarios range from software and hardware server upgrades to the periodic addition of new data subjects, sources, tables, and dimensions. However, data types and data velocities are diversifying aggressively, so DW modernization increasingly involves diversifying software portfolios to include tools and data platforms built for big data from new sources. As portfolios swell, most data warehouses (DWs) are evolving – or modernizing – into complex and hybrid multi-platform data warehouse environments (DWEs). Though surrounded by complementary systems and tools, the traditional data warehouse is still the primary core of the modern DWE. Even so, a few organizations are decommissioning current data warehouse platforms to replace them with modern ones optimized for today’s requirements in big data, analytics, real-time operation, high performance, and cost control. No matter which modernization strategy is in play, all require significant adjustments to the logical layers and systems architectures of the extended DWE.

Looking inside the average data warehouse, we see many opportunities for DW professionals to initiate or expand the use of recent technology advancements, such as in-memory processing, in-database analytics, massively parallel processing (MPP), multi-platform federated queries, and Hadoop. Furthermore, there are many new database management systems purpose-built for analytics, based on columns, appliances, graph, MapReduce, NoSQL, and other innovations. Best practices can likewise be modernized by adopting agile, lean, logical, and virtual methods, or by moving to modern team structures, such as the competency center or center of excellence.

Systems outside the DW need modernization, too.

Looking outside the warehouse, multiple disciplines have their own modern innovations that need support from a more modern DW. For example, new business practices need bigger, newer, and fresher data, so the business can compete on analytics, get actionable business value from new big data, and monitor the business in real time. As another example, business intelligence (BI) is experiencing its own modernization right now, and BI needs the DW to provision data for modern BI practices, such as visualization, data exploration, and self-service. Likewise, many organizations are complementing their mature investments in online analytic processing (OLAP) with an exploding array of techniques for advanced analytics.


Posted on March 15, 2016


Seven Recommendations for Becoming Big Data Ready

New big data sources and data types – and the need to get business value from new data – are forcing organizations to evolve their data management practices.

By Philip Russom, TDWI Research Director for Data Management

I recently participated as a core speaker in the Informatica Big Data Ready Virtual Summit, sharing a session with Amit Walia, the Chief Product Officer at Informatica Corporation. Amit and I had an interactive conversation where we discussed one of the most pressing questions in data management today, namely: How should an organization get ready to capture and leverage big data? This is an important question, because many organizations in many industries are facing big data, with its new data sources, data types, large volumes, and fast generation rates. Organizations need to modernize their data integration (DI) infrastructure, so they can capture and leverage the new data for new business insights and analytics.

Amit Walia and I boiled down this complex issue to seven recommendations, which I will now summarize:

Achieve agility and autonomy, as big data and analytics require. The creation of data management solutions must keep up with the pace of business by adopting agile and lean development methods. New tool functions that assist with agility and autonomy include those for data exploration and profiling, self-service data access, and rapid dataset prototyping (or “data prep”).

Govern big data, as you would any enterprise data asset. Big data has a bit of a “hall pass” today, because it’s new and exotic. But eventually, it will be assimilated as yet another category of enterprise data. Prepare for that day by assuming that new data demands governance, stewardship, privacy, security, quality, and standards.

Include Hadoop in your data integration infrastructure. Hadoop can replace some of the database management systems and file systems you’re using today, while scaling at a reasonable cost and handling new data types. Modern users’ DI architectures already include Hadoop for landing, staging, push-down processing, archiving, hubs, and lakes.

Integrate fit-for-purpose data to enable data exploration and profiling. The trend is to ingest big data, in its raw and original state, into a big data platform, such as Hadoop or a large relational MPP implementation. That way, users can explore and profile new big data to determine its business value. Later, users can repurpose discovered data in many ways, sometimes at runtime, as new requirements arise for analytics or operations. (A sketch of that explore-and-profile step follows.)
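
As a minimal illustration of profiling raw data on a big data platform, here is a PySpark sketch that loads raw JSON from a hypothetical landing zone and checks schema, completeness, and cardinality. The paths and column names are assumptions, and the sketch uses the Spark 1.x DataFrame API.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, count, countDistinct

sc = SparkContext(appName="profile-raw-data")
sqlContext = SQLContext(sc)

# Read big data in its raw, original state from a hypothetical landing zone.
raw = sqlContext.read.json("hdfs:///landing/clickstream/")

# Basic profile: schema, row count, and summary statistics.
raw.printSchema()
print(raw.count())
raw.describe().show()

# Completeness and cardinality of a hypothetical column.
raw.agg(
    count(col("user_id")).alias("user_id_non_null"),
    countDistinct(col("user_id")).alias("user_id_distinct")
).show()
```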

Embrace real-time data ingestion, as required by some forms of big data and analytics. A modern DI infrastructure supports many speeds and frequencies of data ingestion, because diverse data sources and business processes have diverse requirements relative to time. A new challenge for DI is to capture and process streaming data in real time, to enable near-time analytics and business operations. (A minimal streaming sketch follows.)
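
Here is a minimal sketch of real-time ingestion using Spark Streaming’s DStream API. The socket source and the five-second micro-batch interval are illustrative assumptions; a production DI pipeline would more likely ingest from a message broker.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-ingest-sketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Illustrative source: a raw text socket. Production DI would more
# likely read from a message broker such as Kafka.
events = ssc.socketTextStream("localhost", 9999)

# Count events per micro-batch as a stand-in for near-time analytics.
events.count().pprint()

ssc.start()
ssc.awaitTermination()
```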

Prepare to integrate big data by upgrading skills and team structures. TDWI surveys say that a lack of skill is the biggest barrier to success with new big data. Data management professionals need training for Hadoop, NoSQL, natural language processing, and new data types (e.g., JSON, social media, streams). These competencies should be added to those of existing DI competency centers.

Modernize data management solution development by combining agile, stewardship, and collaborative methods. Both agile and stewardship methods recommend the use of a pair of specialists, working together closely: a data specialist and a business representative (or steward). This “dynamic duo” accelerates requirements gathering, ensures data-to-business alignment, and delivers solutions faster than ever.

If you’d like to hear more of my discussion with Informatica’s Amit Walia (and hear other expert speakers in the Informatica Big Data Ready Virtual Summit, too), please replay the Informatica Webinar by clicking here.

Posted on January 6, 2016


Igniting the Analytic Spark

An Introduction to Apache Spark and its uses in Business Intelligence (BI), Data Warehousing (DW), and Advanced Analytics

Blog by Philip Russom
Research Director for Data Management, TDWI

At TDWI, we’re hearing a lot of interest in Apache Spark, although it’s still new and most users are unfamiliar with it. So, please allow me to define Spark for you, explain its potential benefits, and describe actual use cases.

Apache Spark is a parallel processing engine. It specializes in big data, and works well with Hadoop environments. However, Spark is not just for Hadoop; it provides parallel processing for other environments, too. Spark is known for high speed and low latency, which it achieves by leveraging in-memory computing and cyclic data flows.

Spark is fast. Very fast. Benchmarks show Spark to be up to one hundred times faster than Hadoop MapReduce with in-memory operations. Spark is ten times faster than MapReduce with disk-bound operations. The point is that Spark has the low latency required of new data-driven practices, like data exploration, discovery, streaming analytics, and SQL-based analytics.

Spark functions apply directly to applications in BI, DW, DI, & analytics. Spark today includes four libraries of functionality, and each is of interest to professionals in BI, DW, and analytics. The libraries support ANSI-standard SQL, streaming data, machine learning, and graph analytics.

A Spark library provides native support for ANSI and ISO standard SQL. In a recent TDWI survey, 69% of users surveyed said that ANSI- and ISO-standard SQL on Hadoop is required for broad enterprise use. That’s because a modern enterprise wants to leverage pre-existing SQL skills and SQL-based tools. Furthermore, users want fast queries on Hadoop, to enable data exploration, analytics, and other interactive, data-driven practices. Spark and its SQL support promise to enable these – in both batch and interactive sessions, for Hadoop and other environments – which in turn will spark big data analytics for users in BI, DW, and analytics.
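
As a small taste of that SQL support, here is a sketch of an interactive query in the pyspark shell (where sc and sqlContext are pre-created in Spark 1.x). The path, table, and columns are hypothetical.

```python
# In a Spark 1.x pyspark shell, sc and sqlContext already exist.
df = sqlContext.read.parquet("hdfs:///warehouse/sales/")  # hypothetical path
df.registerTempTable("sales")

# Standard SQL over big data, leveraging existing SQL skills.
top = sqlContext.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
    LIMIT 10
""")
top.show()
```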

Spark offers broad compatibility. Spark SQL reuses the Hive front end and metastore to provide compatibility with existing Hive data, queries, and UDFs. Spark SQL’s server mode extends interoperability via industry-standard ODBC/JDBC. Spark can process data in S3, HDFS, HBase, Hive, Cassandra, and any Hadoop InputFormat.
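
That Hive compatibility can be exercised directly. A minimal sketch, assuming a Hive-enabled Spark 1.x build, a configured metastore, and a hypothetical web_logs table:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-compat-sketch")
hc = HiveContext(sc)  # reuses the Hive metastore where configured

# Query an existing Hive table (hypothetical name) with HiveQL,
# including any UDFs already registered in Hive.
hc.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page").show()
```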

Spark can be deployed many ways. Spark requires some kind of shared file system (NFS compliant), so its deployment options are diverse. Spark runs on its standalone cluster, Hadoop YARN, Apache Mesos, and Amazon EC2, on premises or in the cloud. A single job, query, or stream process can be run in either batch or interactive mode via Scala, Python, and R shells.

Spark has one console for the seamless development of diverse functionality. Apache Spark includes libraries for four high-level applications: SQL, streaming data, machine learning, and graph analytics. These are integrated tightly, so users can create applications that mix SQL queries and stream processing alongside complex analytic algorithms.

Spark and its libraries enable several application types for BI, DW, and analytics (a sketch of a mixed application follows this list):
  • SQL analytics and related set-based applications – e.g., data exploration and discovery, customer-base segmentation, financial analyses, dimensional modeling and analysis, reporting, and ETL pushdown that requires SQL
  • Stream capture and analysis – e.g., monitoring facilities (utilities, factories), tracking social sentiment, predicting machine maintenance, rerouting vehicle traffic, managing mobile assets, and other time-sensitive processes
  • Graph analytics – e.g., anomaly detection for fraud or risk, behavioral analysis, entity clustering, and patient outcome optimization
  • Mixtures of the above – a trend among users is to mix multiple analytic methods in a single application, because each reveals different insights
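
To illustrate such a mixture, here is a minimal sketch that combines Spark SQL feature preparation with MLlib clustering in one application. The schema, paths, and the choice of k=5 are hypothetical, and the sketch assumes the Spark 1.x APIs.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="mixed-analytics-sketch")
sqlContext = SQLContext(sc)

# Step 1: SQL - aggregate per-customer features (hypothetical schema).
sqlContext.read.parquet("hdfs:///warehouse/transactions/") \
    .registerTempTable("transactions")
features = sqlContext.sql("""
    SELECT customer_id, SUM(amount) AS spend, COUNT(*) AS frequency
    FROM transactions
    GROUP BY customer_id
""")

# Step 2: machine learning - cluster customers on those features
# with MLlib's k-means (k=5 is an arbitrary illustration).
points = features.rdd.map(lambda r: [float(r.spend), float(r.frequency)])
model = KMeans.train(points, k=5)

# Step 3: label each customer with a cluster-based segment.
labels = model.predict(points)  # RDD of cluster ids, aligned with points
segments = features.rdd.map(lambda r: r.customer_id).zip(labels)
print(segments.take(10))
```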

Want to learn more about Spark? Click here to replay my recent TDWI Webinar, where I go into more detail about Spark and its uses in BI, DW, and analytics.

Posted by Philip Russom, Ph.D. on December 7, 2015


Emerging Technologies and Methods: An Overview in 25 Tweets

Blog by Philip Russom
Research Director for Data Management, TDWI

To help you better understand what today’s emerging technologies and methods (ETMs) are – especially those related to business intelligence, analytics, and data warehousing – I’d like to share with you the series of 25 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of ETMs in a form that’s compact, yet amazingly comprehensive.

Each tweet below is a short sound bite or stat bite drawn from the recent TDWI report “Emerging Technologies for Business Intelligence, Analytics, and Data Warehousing,” which I researched and wrote with my colleagues David Stodder and Fern Halper. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Examples of Emerging Technologies and Methods (ETMs)

1. Most Emerging Techs & Methods (#ETMs) fall into 3 layers of BI, #analytics & #EDW tech stack.

2. #ETMs for #BI include #DataViz, #DataExploration, #DataPrep, #Dashboards, #MashUps, #MobileBI.

3. #ETMs for #Analytics operate on data from #SocialMedia, #IoT, streams, #MachineData.

4. #ETMs for #DataMgt include #Hadoop, #ApacheSpark, #NoSQL, in-DBMS #analytics, in-mem DBMS, columnar.

Examples of Emerging Methods and Platforms

5. #EmergingMethods include agile & lean dev methods applied to whole BI/DW/#analytics tech stack.

6. Other #EmergingMethods include #CompetencyCenters, #CollaborativeBI, #StoryTelling, #DataGovernance.

7. Emerging platforms include many types of clouds, #SaaS, #OpenSource, appliances...

The Importance of ETMs

8. #TDWI SURVEY SEZ: Emerging Techs & Methods (#ETMs) are very important (53%) or somewhat (39%).

9. #TDWI SURVEY SEZ: Emerging Techs & Methods (#ETMs) are opportunity to compete, evolve, perform (79%).

10. #TDWI SURVEY SEZ: Two-thirds of respondents (64%) already have #ETMs in production.

Benefits and Barriers for ETMs

11. #TDWI SURVEY SEZ: Top benefits of #ETMs = competitiveness, decision support, biz performance, innovation.

12. #TDWI SURVEY SEZ: Top barriers to #ETMs = LACK of skills, budgets, biz value, innovation.

13. #TDWI SURVEY SEZ: Other barriers to #ETMs = poor state of IT infrastructure & poor #DataGovernance.

User Satisfaction with Current State of ETMs

14. #TDWI SURVEY SEZ: 41% dissatisfied with their enterprise adoption of Emerging Techs & Methods (#ETMs).

15. Adoption of agile development methods is one of strongest trends in BI, #analytics, #EDW today.

16. #TDWI SURVEY SEZ: 55% dissatisfied with time required of development for BI, #analytics, #DataMgt.

User Success with Current State of ETMs

17. #TDWI SURVEY SEZ: Users successful with #ETMs for #SelfServiceBI (54%) & #DataPrep (50%).

18. #Hadoop & #NoSQL #ETMs are challenging for tools & apps built for relational data.

Emerging Data Types for Analytics

19. #TDWI SURVEY SEZ: 84% analyze structured data today. Surprising that 16% are not; maybe text analytics?

20. #TDWI SURVEY SEZ: #IoT data used by <20% of respondents today, but 40% more will use within 3 years.

21. Other data sources poised for growth = Machine data (sensors, devices) & #RealTime #EventStreaming.

22. #TDWI SURVEY SEZ: In clouds, users already do #EDW (35%), #Analytics (31%), sandbox (29%), DataInt (24%).

23. #TDWI SURVEY SEZ: 49% have production #PredictiveAnalytics today; another 39% will in 3 yrs.

ETMs for Data Warehousing & Data Management

24. #TDWI SURVEY SEZ: 3-yr hi growth in #DataMgt #ETMs = #RealTime, streams, #DataPrep, #Hadoop, #CloudDW.

25. Top security #ETMs for #DataMgt = #DataProtection (encrypt, mask, token), not just user name/pswd.

Want to learn more about Emerging Technologies and Methods (ETMs)?

For a more detailed discussion – in a traditional publication! – get the TDWI Best Practices Report, titled “Emerging Technologies for Business Intelligence, Analytics, and Data Warehousing,” which is available as a PDF file via a free download.

You can also register for and replay the TDWI Webinar, where David Stodder, Fern Halper, and I discuss the findings of the TDWI report.

Posted by Philip Russom, Ph.D. on November 9, 2015


Trip Report: What I Learned at Informatica World 2015

Inspirational User Case Studies and Educational Product Demonstrations

By Philip Russom, TDWI Research Director for Data Management

When I attend a user group meeting or a vendor’s conference, my top two priorities are (1) to hear case studies from successful users and (2) to see practical demonstrations of the vendor’s products. I got both of those in spades last week, when I spent three days attending Informatica World 2015 in Las Vegas.

It was a huge conference, with about 2,500 people attending and five or more tracks running simultaneously. I couldn’t attend all these sessions, so I decided to focus on the keynotes and the Data Integration Track. To give you a taste of the conference, allow me to share highlights from what I was able to attend, with a stress on case studies and demos.

User Case Studies

An enterprise architect at MasterCard discussed their implementation of an enterprise data hub. The hub gives data analysts the data they need in a timely fashion, provides self-service data access for a variety of users, and serves as a unified platform for both internal and external data exchange.

Tom Tshontikidis explained why and how Kaiser Permanente migrated its large collection of data integration solutions from a legacy product (heavily extended via hand coding) to PowerCenter and other Informatica tools.

Two representatives from Cleveland Clinic spoke of their journey from quantity-based metrics for performance management (which mostly laid blame on employees for missed targets) to quality-based predictive analytics (which now sets realistic goals for helping their patients).

Dr. John Frenzel is the chief medical information officer at the MD Anderson Cancer Center. At Informatica World, he discussed how big data analytics is accelerating clinical research. Among the many great tips he shared, Frenzel described how data scientists at MD Anderson work like consultants, traveling among multiple teams, to share their expertise.

An IT systems architect at a major telecommunications company told the story of how the company needed to simplify operations so it could transform into a better-integrated – and hence more nimble – global organization. In support of those business goals, IT replaced hundreds of systems, mostly with six primary ones. This gargantuan consolidation project was mostly powered by Informatica tools.

Tom Kato of Mototak Consulting spoke in a few sessions. In one, he described how to manage data from cradle to grave, using best practices and leading tools for Information Lifecycle Management (ILM). In another, he explained his use of the Informatica Data Validation Option (DVO) in an early phase of the merger between American Airlines and US Airways.

John Racer from Discount Tire explained why validating data is important to ensuring that data arrives where it’s supposed to be and in the condition intended. He discussed practical applications in cross-platform data flows, application migrations, and data migrations, involving tools from Informatica and other providers.

Product Demonstrations

Some of the coolest demos were presented by users. For example, I saw a management dashboard built by folks at a major energy company, using a visualization tool and data from PowerCenter. The dashboard enables business users to do pipeline capacity management and related operational tasks, many with near-time data.

The Informatica Data Validation Option (DVO) kept coming up in presentations by both Informatica employees and customers. I was glad to see this, because I’ve long felt that data integration users do not validate data as often as they should. For example, validation should be part of most ETL testing and all data migration projects.

For a variety of reasons, I was glad to see Secure@Source demoed. The demo clarified that this is not a security tool, per se, although it can guide your security and other efforts. Instead, Secure@Source provides analytics for assessing data-oriented risks relevant to security, privacy, compliance, governance, and so on. Essentially, you create policies and other business rules (typically inspired by your compliance and governance policies), and Secure@Source helps you identify risks and quantify compliance.

Informatica’s Krupa Natarajan spent most of a session demonstrating Informatica Cloud. This product has been in production since 2006, so there’s a lot of robust functionality to look at. Long story short, Informatica Cloud comes across as a full-featured integration tool, not some afterthought hastily ported to a cloud (as too many cloud-based products are). Although Krupa didn’t say it explicitly, the demo brought home to me the point that data integration with a cloud-based tool is pretty much the same as with traditional tools. That good news should help users get more comfortable with clouds in general, as well as the potential use of cloud-based data management tools.

Further Learning

If you go to www.YouTube.com and search for “Informatica World 2015” you’ll find many useful speeches and sessions that you can replay. Here are a couple of links to get you started:

Keynote by Informatica’s CEO, Sohaib Abbasi. This is a “must see,” if you care about Informatica’s vision for the future, especially in the context of the proposed acquisition of Informatica.

Interviews filmed on site by theCUBE. All the interviews are good. But I especially like the interviews with my analyst friends: John Myers and Mark Smith.

Posted by Philip Russom, Ph.D. on May 18, 2015