Your guide to getting started with a data catalog
There’s a lot of buzz around data catalogs right now, and a growing number of solutions from more and more vendors. What exactly is a data catalog? And how do you avoid getting lost while selecting the right catalog for your needs? This guide walks through the basics of what a catalog is and how it works, what business challenges it can help solve, and how to avoid common pitfalls when choosing one.
There are a lot of misconceptions about what a data catalog is and what it can do. So, what is it? In a nutshell, a data catalog is a place that shows what data assets you have and where they are located. You might be asking, what is a data asset? It is any entity that contains data (e.g., reports, databases, websites). How does a data catalog work? How does a catalog help organizations get a handle on their data and, more importantly, use it to make decisions and drive business value? The next page shows a simple graph that outlines how a data catalog solution can work to deliver business outcomes.
Here are the 5 stages showing how a data catalog can deliver on the business outcome "I want to delight my customers":
With tremendous growth in the volume of data, increased access to multiple data sources, and new compliance regulations, organizations are working to “get a handle” on their enterprise-wide data. They must be able to answer the questions:
As a result, data catalog solutions have gone from being a “nice to have” to a “must have” in the arsenal of data governance capabilities. In a recent research report, Data Catalogs Are the New Black in Data Management and Analytics,¹ Gartner reports that demand for data catalogs is soaring as organizations struggle to inventory their distributed data assets to facilitate data monetization and conform to regulations.
If you find yourself saying the following, you may need a data catalog (or data catalog + governance) solution:
Many organizations are asking how they can get more value from analytics and have better visibility into their data. IoT and digital transformation have resulted in an abundance of data. Now organizations need to find the available data and confirm it can be trusted so it can be used for decision making.
There has been a surge in investment in B.I. software. Locating the right data for analysis and reporting is a challenge that must be solved when implementing B.I. Some organizations are able to locate their data but cannot identify its source to confirm that it is valid. Others are finding conflicting results between two different reports.
Your data lake seemed to be the answer to all of your problems. Now, however, business stakeholders cannot get the information they need from it. No one is certain what data exists in the lake or how to access it.
There’s a lot of concern around GDPR and growing scrutiny around consumer privacy. If a customer asks to exercise a data subject right, such as the right to be forgotten, can you quickly locate all of their personal data and fulfill the request?
As A.I. moves into the mainstream, organizations are finding that identifying the right data to inform the algorithm is critical. This applies to the input data along with the features of the data itself, including tags, metadata, user data, etc. The first step in this process is therefore to discover and catalog the data.
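The scenarios above all come down to the same basic capability: an inventory that records what each data asset is, where it lives, and how it is tagged, and that can be searched. As a rough illustration only, here is a minimal Python sketch of that idea. The class names (`DataAsset`, `DataCatalog`), the fields, and the sample assets and tags are all hypothetical and are not taken from any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    """One catalog entry (hypothetical fields, for illustration only)."""
    name: str
    kind: str        # e.g. "report", "database table", "data lake file"
    location: str    # where the asset lives
    owner: str       # who to ask about it
    tags: frozenset = frozenset()


class DataCatalog:
    """A toy in-memory inventory of data assets."""

    def __init__(self):
        self._assets = {}

    def register(self, asset):
        # Add or update an entry, keyed by asset name.
        self._assets[asset.name] = asset

    def find_by_tag(self, tag):
        # Return every asset carrying the given tag, sorted by name.
        matches = [a for a in self._assets.values() if tag in a.tags]
        return sorted(matches, key=lambda a: a.name)

    def search(self, text):
        # Case-insensitive match against asset names and tags.
        needle = text.lower()
        matches = [
            a for a in self._assets.values()
            if needle in a.name.lower()
            or any(needle in t.lower() for t in a.tags)
        ]
        return sorted(matches, key=lambda a: a.name)


# Sample entries (made up for this sketch).
catalog = DataCatalog()
catalog.register(DataAsset("crm_customers", "database table",
                           "warehouse.crm.customers", "sales-ops",
                           frozenset({"personal-data", "customer"})))
catalog.register(DataAsset("churn_report", "report",
                           "bi/reports/churn", "analytics",
                           frozenset({"customer"})))
catalog.register(DataAsset("clickstream_raw", "data lake file",
                           "lake/raw/clickstream/", "web-team",
                           frozenset({"personal-data", "behavioral"})))

# A privacy request: where might personal data live?
for asset in catalog.find_by_tag("personal-data"):
    print(asset.name, "->", asset.location)
```

In this sketch, the tag lookup answers the privacy question ("where is personal data stored?"), while `search("customer")` answers the analyst's question ("what customer data do we have?"). A real catalog would populate and maintain these entries automatically rather than by hand.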
Data catalogs should be easy to implement, within a few weeks to a few months. However, there are a few reasons why companies might experience more painful, time-consuming projects. If you have done your due diligence and selected a data catalog that is cloud-based, “on the stack” and aligned with your EIM and enterprise metadata management strategies, then it should be smooth sailing. However, if you have decided on a catalog that requires up-front customization, specific hardware or a team of specialized developers, then you might be looking at a costly project.
Vendors want to sell you their solution, so weaknesses and limitations are sometimes glossed over. It is your job to make sure that you aren’t falling for “market-tecture”. When deciding on a catalog, check popular review sites like Gartner Peer Insights, speak with analysts and make sure you ask references about implementation.
According to Gartner, companies should “Avoid data catalogs that do not have the ability to scale out beyond tactical use case requirements and connect to the broader enterprise metadata management and EIM initiatives.”¹ Some companies are choosing data catalogs based on a single, tactical use case, such as inventorying the data in their data lakes. It’s important to understand that deploying a catalog for one tool or use will improve data usability, trust and shareability ONLY for that specific tool. This ultimately creates the need for a data catalog of all the data catalogs in your architecture, which is not the way to enable effective monetization in the long term. Before selecting a data catalog for one specific use case, make sure that you have evaluated options that span use cases and connect to your broader EIM needs.
Some catalogs are built for a more technically minded user who works in SQL. These catalogs have some high-tech capabilities and provide a full picture of the technical lineage and provenance of every bit of data in the ecosystem. Others are built more for business users who don’t care about SQL or technical lineage but want to see the data that matters for their initiative in a user-friendly way. Who is going to be using your catalog, and for what reason? Make sure that you don’t try to force your business users into being IT coding experts. This could cause serious issues with adoption and ROI.
It’s important to spend the time up front to identify what functionality is important to your organization. You might find that different groups have different needs. Having this list defined when you start your search will help ensure you’re selecting the right solution. At a bare minimum, data catalogs should be able to:
Once you’ve checked the box on that basic functionality, there are a few other things you should consider to ensure your catalog can be used to add business value in the future. Will it provide real-time integration with your data sources, so that you are continuously populating the data catalog with the data that is critical to you?
Recent research reports outline what to look for in a data catalog and governance solution, and how all of the vendors stack up:
Data Catalogs Are the New Black in Data Management and Analytics by Ehtisham Zaidi, Guido De Simoni, Roxane Edjlali, Alan D. Duncan, December 13, 2017.
Forrester Wave for Data Governance, Stewardship and Discovery Software by Henry Peyret with Alex Cullen, Alex Kramer, and Sam Bartlett, June 26, 2017.
By focusing on what data matters and why,
A data catalog is just the beginning. What will you accomplish with better access to your data?