What is a Data Catalog? (2024)

Simply put, a data catalog is an organized inventory of the data assets in an organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.

Data Catalog Definition and Analogy

We gave a short definition of a data catalog above, as something that uses metadata to help organizations manage their data. But let’s expand upon that with the analogy of a library.

When you go to a library and you need to find a book, you use its catalog to discover whether the book is there, which edition it is, where it’s located, and a description: everything you need to decide whether you want it and, if you do, how to go and find it.

That’s what many object stores, databases, and data warehouses offer today.

But now, think back to the analogy of that library and its catalog, and expand the power of that catalog to cover every library in the country. Imagine that you have just one interface and suddenly you can find every single library in the country that has a copy of the book you’re seeking, and you can find all the details you’d ever want on each one of those books.

That’s what an enterprise data catalog does for all of your data. It gives you a single, overarching view and deeper visibility into all of your data, not just one data store at a time.

Perhaps you might wonder—why would you need a view like that?

Challenges a Data Catalog Can Address

With more data than ever before, finding the right data has become harder than it has ever been. At the same time, there are also more rules and regulations than ever before, with GDPR being just one of them.

So not only is data access becoming a challenge, but data governance has become a challenge as well. It’s critical to understand the kind of data that you have now, who is moving it, what it’s being used for, and how it needs to be protected. But you also have to avoid putting too many layers and wrappers around your data, because data is useless if it’s too difficult to use.

Unfortunately, there are many challenges with finding and accessing the right data. These include:

  • Wasted time and effort on finding and accessing data
  • Data lakes turning into data swamps
  • No common business vocabulary
  • Hard to understand structure and variety of “dark data”
  • Difficult to assess provenance, quality, trustworthiness
  • No way to capture tribal or missing knowledge
  • Difficult to reuse knowledge and data assets
  • Manual and ad-hoc data prep efforts

Data Catalog Users

All of these data management issues frustrate users such as data engineers, data scientists, data stewards, and chief data officers. All of these groups of people want easy access to trusted data. Here are just a few of the challenges that they face:

Data engineers want to know how any changes will affect the system as a whole. They might ask:

  • What will be the impact of a schema change in our CRM application?
  • How different are the PeopleSoft and HCM data structures?

Data scientists want easy access to data and they want to know more about the quality of the data. They are looking for information such as:

  • Where can I find and explore some geo-location data?
  • How can I easily access the data in the data lake?

Data stewards are charged with running a managed data process. They care about concepts, agreements between stakeholders, and managing the lifecycle of the data itself. They will ask questions such as:

  • Are we really improving the quality of our operational data?
  • Have we defined standards for important key data elements?

Chief data officers care about who is doing what in the organization. They’re typically not the ones using a data catalog, but they still want to know answers to questions such as:

  • Who can access customers’ personal information?
  • Do we have retention policies defined for all data?

Enter the data catalog.

Data Catalog Use Cases

In the past few years, the concept of a data catalog has become popular because of the increasingly large amounts of data that now have to be managed and accessed. Cloud, big data analytics, AI, and machine learning have started to change the way we need to see, manage, and leverage our data. The goal is not just to manage it, but to be able to fully use and access it.

Using a data catalog the right way means better data usage, which contributes to:

  • Cost savings
  • Operational efficiency
  • Competitive advantages
  • Better customer experience
  • Fraud and risk advantage
  • And so much more

Here are just a few of the use cases for a data catalog. But really, a data catalog can be used in so many ways because fundamentally, it’s about having wider visibility and deeper access to your data.

Self-service analytics. Many data users have trouble finding the right data, and not just finding it but understanding whether it’s useful. You might discover a file called customer_info.csv, and you might need a file about customers. But that doesn’t mean it’s the right one, because it can be one of 50 similar files. The file may have many fields, and you may not understand what all of those data elements are. You’ll want an easier way to see the business context around it, such as whether it’s a managed resource, whether it comes from the right data store, or what its relationship is with other data artifacts.

Discovery could also entail understanding the shape and characteristics of data, from something as simple as value distributions and statistical information to something as important and complex as the presence of Personally Identifiable Information (PII) or Personal Health Information (PHI).
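To make "shape and characteristics" a little more concrete, here is a minimal profiling sketch in Python. It is only an illustration, not how any particular catalog product works: the `profile` function, the email pattern, and the 0.5 threshold are assumptions made for this example, and real PII detection is far more involved than one regular expression.

```python
import re
import statistics
from collections import Counter

# Deliberately simple email pattern; an assumption for this sketch only.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PII_THRESHOLD = 0.5  # hypothetical cut-off for flagging a column


def profile(values):
    """Summarize one column: counts, value distribution, basic stats, PII hint."""
    non_null = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }
    # Numeric statistics only when every non-null value is a number.
    numeric = [v for v in non_null
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if numeric and len(numeric) == len(non_null):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["mean"] = statistics.mean(numeric)
    # Flag string columns where many values look like email addresses.
    strings = [v for v in non_null if isinstance(v, str)]
    if strings:
        matches = sum(1 for s in strings if EMAIL.match(s))
        summary["email_ratio"] = matches / len(strings)
        summary["possible_pii"] = summary["email_ratio"] >= PII_THRESHOLD
    return summary


age_profile = profile([34, 41, None, 29, 41])
email_profile = profile(
    ["a@example.com", "b@example.org", "not-an-email", None, "c@example.net"]
)
print(age_profile)
print(email_profile)
```

A summary like this, stored alongside the asset, is the kind of information that lets a user judge a file such as customer_info.csv before opening it.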

Audit, compliance, and change management. With ever-increasing government regulations around data, you often need to demonstrate the provenance of data: whether certain data artifacts are coming from this source or that source, or how the data is transformed before reaching its final target. When looking at a table, report, or file, your data users often want to understand where the data is coming from and how it’s moving through the organization. From a change management perspective, it’s important to view how changes in one part of a data pipeline affect other parts of the system. This is why customers seek detailed data lineage.
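Lineage is essentially a directed graph, so both questions in this use case (what does a change affect, and where did this data come from) reduce to graph traversal. The sketch below assumes a tiny hand-written lineage map; the asset names and the `downstream` and `upstream` helpers are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
LINEAGE = {
    "crm.customers": ["staging.customers_clean"],
    "staging.customers_clean": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["report.churn_dashboard", "report.revenue_by_region"],
    "erp.orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["report.revenue_by_region"],
}


def downstream(asset, lineage):
    """Return every asset reachable from `asset`, in breadth-first order."""
    found = []
    visited = {asset}
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in visited:
                visited.add(child)
                found.append(child)
                queue.append(child)
    return found


def upstream(asset, lineage):
    """Return every asset that `asset` is derived from (its provenance)."""
    # Reverse the edges, then reuse the same traversal.
    reverse = {}
    for source, targets in lineage.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return downstream(asset, reverse)


# Change management: what is affected by a schema change in the CRM table?
impacted = downstream("crm.customers", LINEAGE)
# Audit: which sources feed the revenue report?
sources = upstream("report.revenue_by_region", LINEAGE)
print(impacted)
print(sources)
```

The first call answers the data engineer's impact question, and the second answers the auditor's provenance question, both from the same recorded lineage.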

Supporting data governance with business glossaries. Most organizations have a vocabulary that everyone agrees on and a consistent understanding of business concepts. But often, it’s recorded in Excel sheets lying around somewhere, and that’s if the organization is lucky. A data catalog is a much better place to store and manage this vital business information.

A data catalog also allows you to link business terms to one another to establish a taxonomy. Beyond that, it can record relationships between terms and physical assets such as tables and columns, so users can understand which business concepts are relevant to which technical artifacts. These links can be used to classify data assets along business concept lines and then to search and discover data using business concepts instead of technical names. This increases user trust in what they’re looking at, because they can see everything that’s related to their data, and it’s often a good starting point for data governance.
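As a rough sketch of the idea, a glossary can be modeled as terms with optional parent terms (the taxonomy) plus links from terms to physical assets. The `Glossary` class, its methods, and the asset names below are all assumptions invented for this example, not the interface of any real product.

```python
from collections import defaultdict


class Glossary:
    """Toy business glossary: a term taxonomy plus term-to-asset links."""

    def __init__(self):
        self.parent = {}                # term -> broader term (or None)
        self.assets = defaultdict(set)  # term -> linked physical assets

    def add_term(self, term, parent=None):
        self.parent[term] = parent

    def link(self, term, asset):
        if term not in self.parent:
            raise KeyError(f"unknown term: {term}")
        self.assets[term].add(asset)

    def narrower(self, term):
        """Return `term` plus every term below it in the taxonomy."""
        result = [term]
        for child, parent in self.parent.items():
            if parent == term:
                result.extend(self.narrower(child))
        return result

    def find_assets(self, term):
        """Search by business concept instead of technical name."""
        found = set()
        for t in self.narrower(term):
            found |= self.assets.get(t, set())
        return sorted(found)


glossary = Glossary()
glossary.add_term("Customer")
glossary.add_term("Customer Contact", parent="Customer")
glossary.link("Customer", "crm.customers")
glossary.link("Customer Contact", "crm.customers.email")
glossary.link("Customer Contact", "marketing.leads.phone_no")

customer_assets = glossary.find_assets("Customer")
print(customer_assets)
```

Searching for "Customer" returns assets linked to the narrower "Customer Contact" term as well, including a column whose technical name (phone_no) never mentions customers at all.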

What Is Needed to Fully Make Use of Data in a Data Catalog?

So let’s take a step back and quickly explain metadata to those who might not be entirely familiar with it. What is metadata? There are three kinds of metadata:

  • Technical metadata: Schemas, tables, columns, file names, report names: anything that is documented in the source system.
  • Business metadata: This is typically the business knowledge that users have about the assets in the organization. This might include business descriptions, comments, annotations, classifications, fitness-for-use, ratings, and more.
  • Operational metadata: When was this object refreshed? Which ETL job created it? How many times has a table been accessed by users, and by which ones?
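One way to picture the three kinds of metadata is as three records attached to the same catalog entry. The following is a minimal sketch (Python 3.9+); the class and field names are assumptions chosen for illustration, and a real catalog would track far more than this.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class TechnicalMetadata:
    # Harvested from the source system: structure and names.
    schema: str
    table: str
    columns: list[str]


@dataclass
class BusinessMetadata:
    # Contributed by the people who know the data.
    description: str = ""
    tags: list[str] = field(default_factory=list)
    rating: Optional[int] = None  # hypothetical 1-5 fitness-for-use score


@dataclass
class OperationalMetadata:
    # Recorded as the asset is produced and used.
    last_refreshed: Optional[datetime] = None
    created_by_job: Optional[str] = None
    access_counts: dict[str, int] = field(default_factory=dict)

    def record_access(self, user: str) -> None:
        self.access_counts[user] = self.access_counts.get(user, 0) + 1


@dataclass
class CatalogEntry:
    technical: TechnicalMetadata
    business: BusinessMetadata = field(default_factory=BusinessMetadata)
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)


entry = CatalogEntry(
    technical=TechnicalMetadata("crm", "customers", ["id", "email", "country"]),
)
entry.business.tags.append("customer")
entry.operational.created_by_job = "nightly_crm_load"  # hypothetical ETL job name
for user in ["alice", "alice", "bob"]:
    entry.operational.record_access(user)

# "How many times has the table been accessed, and by which users?"
total_accesses = sum(entry.operational.access_counts.values())
top_user = max(entry.operational.access_counts,
               key=entry.operational.access_counts.get)
print(total_accesses, top_user)
```

Keeping the three kinds separate mirrors where they come from: technical metadata is harvested, business metadata is curated by people, and operational metadata accumulates as jobs run and users query.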

In the past few years, we’ve seen a mini-revolution in how we can use this valuable metadata. Once, metadata was mostly used for audit, lineage, and reporting. But today, technological innovations like serverless processing, graph databases, and especially new or more accessible AI and machine learning techniques are pushing the boundaries and making things possible with metadata that simply weren’t possible at this scale before.

Today, metadata can be used to augment data management: everything from self-service data preparation to role-based and data-content-based access control, automated data onboarding, monitoring and alerting on anomalies, and auto-provisioning and auto-scaling of resources. All of this can now be augmented with the help of metadata.

And the data catalog uses metadata to help you achieve more than ever with your data management.

What Should a Data Catalog Offer?

A good data catalog should offer:

Search and discovery. A data catalog should have flexible searching and filtering options to allow users to quickly find relevant sets of data for data science, analytics, or data engineering, or to browse metadata based on a technical hierarchy of data assets. Enabling users to enter technical information, user-defined tags, or business terms also improves the search capabilities.
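A simple way to think about this is keyword search combined with facet filters. The sketch below is an in-memory illustration only; the `Asset` fields, the `search` function, and the sample entries are assumptions for this example, and a production catalog would use a proper search index rather than a linear scan.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    asset_type: str   # facet, e.g. "table" or "file"
    source: str       # facet, e.g. "object_storage"
    description: str = ""
    tags: set = field(default_factory=set)


def search(assets, text="", **facets):
    """Keyword match on name, description, and tags, narrowed by facet values."""
    needle = text.lower()
    hits = []
    for asset in assets:
        haystack = " ".join(
            [asset.name, asset.description, *sorted(asset.tags)]
        ).lower()
        if needle not in haystack:
            continue
        # Every requested facet must match exactly.
        if all(getattr(asset, key) == value for key, value in facets.items()):
            hits.append(asset.name)
    return hits


# Hypothetical catalog contents.
CATALOG = [
    Asset("customer_info.csv", "file", "object_storage",
          "Raw CRM export", {"customer", "raw"}),
    Asset("dim_customer", "table", "data_warehouse",
          "Curated customer dimension", {"customer", "certified"}),
    Asset("web_clicks", "table", "data_warehouse",
          "Clickstream events", {"behavior"}),
]

by_keyword = search(CATALOG, "customer")
by_keyword_and_facet = search(CATALOG, "customer", asset_type="table")
by_tag = search(CATALOG, "certified")
print(by_keyword, by_keyword_and_facet, by_tag)
```

Because user-defined tags are part of the searchable text, a tag such as "certified" lets users find the curated table even though the word appears nowhere in its technical name.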

Harvest metadata from various sources. Make sure your data catalog can harvest technical metadata from a variety of connected data assets, including object storage, self-driving databases, on-premises systems, and much more.

Metadata curation. Provide a way for subject matter experts to contribute business knowledge in the form of an enterprise business glossary, tags, associations, user-defined annotations, classifications, ratings, and more.

Automation and data intelligence. At the data scales that we mentioned, AI and machine learning are often a must. Any manual task that can be automated should be automated with AI and machine learning techniques applied to the collected metadata. In addition, AI and machine learning can begin to truly augment capabilities with data, such as providing data recommendations to data catalog users and the users of other services in a modern data platform.

Enterprise-class capabilities. Your data is important, and you need enterprise-class capabilities to use it properly, such as identity and access management, with the main capabilities exposed via REST APIs. This also means that customers and partners can contribute metadata (such as custom harvesters) and expose data catalog capabilities in their own applications via REST.

In addition to all of that, your data catalog should become your de facto system catalog, providing abstraction across all of your persistence layers, such as object stores, Hadoop, databases, and data warehouses, and serving the querying services that work across all of your data stores.

And that’s also why a data catalog is no longer a nice-to-have. It’s a necessity.

Why Oracle Cloud Infrastructure Data Catalog?

Every organization should have a strong data catalog. But why do you want Oracle Cloud Infrastructure Data Catalog?

Oracle Cloud Infrastructure Data Catalog is included with all Oracle Cloud Infrastructure subscriptions and helps customers organize and govern their data assets. It is a single collaborative solution for data professionals to not just organize and govern data, but also collect, access, enrich, and activate technical, business, and operational metadata to support self-service data discovery and governance for trusted data assets in Oracle Cloud and beyond.

On a practical level, it lets you:

  • Harvest technical metadata about data assets on Oracle Cloud Infrastructure, such as Oracle Cloud Infrastructure Object Storage, Oracle Autonomous Database, and Oracle Database
  • Search and explore appropriate data from a variety of different sources through multi-faceted search and filters
  • Manage a business glossary to capture the business vocabulary of the enterprise
  • Enrich understanding of available data by capturing tribal knowledge in the form of user-defined tags and annotations
  • Gain a holistic view of data assets by associating tags and business terms
  • Integrate capabilities into other apps using REST APIs and SDKs
  • Secure access with IAM group-based policies

Conclusion

Organizations are striving to be data-driven. They want better, faster analytics without sacrificing governance, and that’s what is making data management even more important and challenging. A data catalog makes data management easier and helps meet these many demands. Through Oracle Cloud Infrastructure Data Catalog, Oracle has taken steps to help everyone discover and use data in the way they’ve always wanted.

Author: Sen. Emmett Berge
