Data Catalog overview  |  Data Catalog Documentation  |  Google Cloud (2024)

Dataplex's Data Catalog feature is a central inventory of an organization's data assets. Data Catalog automatically catalogs metadata from Google Cloud sources such as BigQuery, Vertex AI, Pub/Sub, Spanner, Bigtable, and more. Data Catalog also indexes table and fileset metadata from Cloud Storage through discovery.

You can discover data with Dataplex's governed organization-wide metadata search capability. You can further enrich metadata with critical business context, and enable lineage tracking, data profiling, data quality checks, and access control capabilities.

Using Data Catalog, organizations can achieve better data discovery, metadata management, and governance.

Why do you need Data Catalog?

Most organizations today are dealing with a large and growing number of data assets.

Data stakeholders (consumers, producers, and administrators) within an organization face multiple challenges:

  • Searching for insightful data:

    • Data consumers don't know the location and origin of data. They have to navigate data "swamps".
    • Data consumers don't know what data to use to get insights because most data is not well documented and, even if documented, is not well maintained.
    • Data can't be found and is often lost when it resides only in people's minds.
  • Understanding data:

    • Is the data fresh, clean, validated, approved for use in production?
    • Which dataset out of several duplicate sets is relevant and up-to-date?
    • How does one dataset relate to another?
    • Who is using the data and who is the owner?
    • Who and what processes are transforming the data?
  • Making data useful:

    • Data producers don't have an efficient way to put forward their data for consumers. If there's no self-service, consumers may overwhelm producers. A few data engineers can't manually provide data to thousands of data analysts.

    • Valuable time is lost if data consumers have to find out how to request data access, wait without a defined response time, escalate, and wait again.

Without the right tools, the challenges become a major obstacle to the efficient use of data. Data Catalog provides a centralized repository that lets organizations achieve the following:

  • Gain a unified view to reduce the pain of searching for the right data.
  • Support data-driven decision making and shorten the time to insight by enriching data with technical and business metadata.
  • Improve data management to increase operational efficiency and productivity.
  • Take ownership over the data to improve trust and confidence in it.

Data Catalog functions

Data Catalog provides three main functions:

  • Searching for data entries to which you have access
  • Tagging data entries with metadata
  • Providing column-level security for BigQuery tables

In addition, Data Catalog can build on the results of a Sensitive Data Protection scan to identify sensitive data directly within Data Catalog in the form of tag templates.

How Data Catalog works

Data Catalog can catalog asset metadata from different Google Cloud systems.

You can also use Data Catalog APIs to integrate with custom data sources.

After your data is cataloged, you can add your own metadata to these assets using tags.
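As a rough illustration of tagging, the following Python sketch looks up the catalog entry for a BigQuery table and attaches a tag to it. It assumes the google-cloud-datacatalog client library is installed, that credentials are configured, and that a tag template with a string field already exists. The helper names and the example project, dataset, and table IDs are hypothetical.

```python
def bigquery_linked_resource(project_id: str, dataset_id: str, table_id: str) -> str:
    """Build the linked resource name that identifies a BigQuery table."""
    return (
        f"//bigquery.googleapis.com/projects/{project_id}"
        f"/datasets/{dataset_id}/tables/{table_id}"
    )


def tag_table(linked_resource: str, template_name: str, field_id: str, value: str) -> str:
    """Attach a tag with one string field to an already cataloged table.

    Assumption: template_name is the full resource name of an existing tag
    template, and field_id names a string field defined in that template.
    The import is deferred so the helper above works without the library.
    """
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    # Find the entry that Data Catalog created automatically for the table.
    entry = client.lookup_entry(request={"linked_resource": linked_resource})

    tag = datacatalog_v1.Tag()
    tag.template = template_name
    tag.fields[field_id] = datacatalog_v1.TagField()
    tag.fields[field_id].string_value = value
    return client.create_tag(parent=entry.name, tag=tag).name


# Hypothetical identifiers, used only to show the name format.
resource = bigquery_linked_resource("my-project", "sales", "orders")
print(resource)
```

Calling `tag_table(resource, template_name, "source", "crm_export")` would then create the tag, where `template_name` and the field ID are placeholders for values from your own tag template.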


Data Catalog handles two types of metadata: technical metadata and business metadata. To learn more about metadata, see Data Catalog metadata.

Search and discovery

Data Catalog offers a predicate-based search experience for the technical and business metadata associated with a data entry. To search for and discover a data entry, you must have permission to read its metadata. Data Catalog does not index the data within a data entry. Data Catalog only indexes the metadata that describes an asset.
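To make predicate-based search concrete, here is a minimal Python sketch that assembles a query from individual predicates and passes it to the search API. It assumes the google-cloud-datacatalog client library and configured credentials. The helper names and the example predicates are illustrative, not part of the product.

```python
from typing import Iterable, List


def build_search_query(predicates: Iterable[str]) -> str:
    """Join search predicates with spaces, which the search treats as AND."""
    return " ".join(p.strip() for p in predicates if p.strip())


def search_entries(project_id: str, query: str) -> List[str]:
    """Run the query against one project and return matching resource names.

    The import is deferred so that build_search_query can be used without
    the client library or credentials.
    """
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.append(project_id)
    request = datacatalog_v1.SearchCatalogRequest(scope=scope, query=query)
    # Results only include entries whose metadata the caller may read.
    return [r.relative_resource_name for r in client.search_catalog(request=request)]


# Example: BigQuery tables that have a column matching "email".
query = build_search_query(["system=bigquery", "type=table", "column:email"])
print(query)  # system=bigquery type=table column:email
```

Keeping query construction separate from the API call makes the predicates easy to inspect before a search is sent.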

Data Catalog controls some metadata such as user-generated tags. For all metadata sourced from the underlying storage system, Data Catalog is a read-only service that reflects the metadata and permissions provided by the underlying storage system. You can make edits in the underlying storage system to add, update, or delete the metadata of a data entry.

To learn more about Data Catalog search, see Search for data assets with Data Catalog.

Automatic cataloging of assets

For a given project, Data Catalog automatically catalogs the following Google Cloud assets:

  • Analytics Hub linked datasets
  • BigQuery datasets, tables, models, routines, and connections
  • Bigtable instances, clusters, and tables (including column family details)
  • Dataplex lakes, zones, tables, and filesets
  • Dataproc Metastore services, databases, and tables
  • Pub/Sub topics
  • Spanner instances, databases, tables, and views
  • Vertex AI models, datasets, and Vertex AI Feature Store resources

In addition to cataloging assets in the projects for which you have metadata access, Data Catalog can catalog data stored in the BigQuery projects that contain public datasets.

Catalog non-Google Cloud assets

To catalog metadata from non-Google Cloud systems in your organization, you can use the following:

  • Community-contributed connectors to multiple popular on-premises data sources
  • Custom entries that you create manually through the Data Catalog APIs
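For the custom-entry route, the following Python sketch shows roughly what registering an on-premises asset could look like. It assumes the google-cloud-datacatalog client library and configured credentials. The system name, type name, IDs, and linked resource are hypothetical placeholders.

```python
from typing import Dict


def custom_entry_fields(system: str, entry_type: str,
                        display_name: str, linked_resource: str) -> Dict[str, str]:
    """Collect the fields that describe an asset from a non-Google Cloud system."""
    return {
        "user_specified_system": system,
        "user_specified_type": entry_type,
        "display_name": display_name,
        "linked_resource": linked_resource,
    }


def create_custom_entry(project_id: str, location: str, entry_group_id: str,
                        entry_id: str, fields: Dict[str, str]) -> str:
    """Create an entry group and one custom entry in it; return the entry name.

    The import is deferred so the field helper works without the library.
    """
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    # Custom entries live in an entry group that you create yourself.
    entry_group = client.create_entry_group(
        parent=f"projects/{project_id}/locations/{location}",
        entry_group_id=entry_group_id,
        entry_group=datacatalog_v1.EntryGroup(display_name=entry_group_id),
    )
    entry = client.create_entry(
        parent=entry_group.name,
        entry_id=entry_id,
        entry=datacatalog_v1.Entry(**fields),
    )
    return entry.name


# Hypothetical on-premises asset, used only to show the field layout.
fields = custom_entry_fields(
    "onprem_data_system", "onprem_data_asset",
    "Orders archive", "//onprem.example.com/dataAssets/orders_archive",
)
print(fields["user_specified_system"])
```

After an entry like this exists, it can be searched and tagged in the same way as automatically cataloged assets.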

Access Data Catalog

You can access Data Catalog functionality using the following:

  • Dataplex in the Google Cloud console

  • gcloud command-line interface (CLI)

  • Data Catalog APIs

  • Cloud Client Libraries

What's next

  • To get started with Data Catalog tagging, see Create tag templates, tags, overviews, and data stewards.

  • To get started with Data Catalog search and discovery, see Search and view data assets with Data Catalog.

  • To integrate your data sources, follow the steps in Integrate Google Cloud and on-premises data sources.


FAQs

What is a data catalog in Google Cloud?

Data Catalog allows you to discover, manage, and understand data assets across Google Cloud Platform. Data Catalog API natively indexes Cloud BigQuery, Cloud Storage, and Cloud Pub/Sub data assets. The Data Catalog API can be used to: Search for data assets across different projects and GCP resources.

What is the difference between metadata catalog and data catalog?

A data catalog is an organized list of all the data assets which empower data teams throughout the company. Metadata management helps organizations decide how to collect, analyze, and maintain contextual information — metadata. It serves as an organized data inventory for all data sources.

What is included in a data catalog?

A Data Catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate fitness of data for intended uses.

What is the difference between data catalog and data set?

A data catalog references an organization's datasets in various categories for search and discovery. It helps map an organization's data, primarily for compliance with regulations (GDPR / CCPA). It enables data search and discovery of data assets, with the right context.

What does a good data catalogue look like?

A good data catalog uses capabilities such as search, filters, and recommendations to make finding the right data simple regardless of a user's technical knowledge. Data exploration. Sometimes, users need to dive deeper to find related data or mine existing data for insights.

What's the difference between catalog and catalogue?

Both “catalogue” and “catalog” are correct spellings of the word, but their usage depends on the variant of English being used: “Catalogue” is the preferred spelling in British English and some other varieties of English. “Catalog” is the preferred spelling in American English.

What is the difference between data schema and data catalog?

Catalogue: This is the highest level of organization within a database. A catalogue holds one or more schemas and represents the complete set of schemas that a user or application can access. In essence, a catalogue is a database. Schema: Within a catalogue (or database), you have schemas.

What is the difference between data inventory and data catalog?

While these terms are often used interchangeably, they are not the same and perform different functions. A data inventory is a unique set of data detailing the location and type of each data point in a company's collection. A data catalog allows users to locate those datasets by referencing them in various categories.

What is the difference between data catalog and master data?

A data catalog is the backbone of modern data management, enabling organizations to find, understand, trust, and use their data effectively. On the other hand, master data management (MDM) is a method of managing the core data of an organization.

What is the core aim of a data catalogue?

Simply put, a data catalog is an organized inventory of data assets in the organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.

Do you really need a data catalog?

At its core, a data catalog serves as an organized inventory of all the data assets within an organization. Data catalogs play a crucial role in both deriving value from data and ensuring proper data governance.

Which two are capabilities of a data catalog?

Data Catalog Key Capabilities

Harvest technical metadata from a wide range of supported data sources that are accessible using public or private IPs. Create and manage a common enterprise vocabulary with a business glossary.

Why build a data catalog?

Data catalogs make data more visible and understandable and enable self-service access. An intelligent data catalog offers end-to-end visibility into data sources and lineage. This self-sufficiency delivers greater productivity and user satisfaction.

What is the difference between metadata and data catalog?

Whereas metadata describes data characteristics like structure, format, and content, a data catalog is a software tool used to manage and organize metadata about data assets within an organization, which facilitates a range of use cases.

What is the difference between data catalog and data exchange?

Data catalogs are useful to manage and govern the entire data estate and support compliance and governance requirements. Data exchanges curate the subset of that data that can be most useful in driving business insights and outcomes from data.

What is the difference between data catalog and data dictionary?

The main difference between a data catalog and a data dictionary is that a data dictionary documents technical metadata for a specific database, whereas a data catalog acts as a unified context, control, and collaboration layer of all metadata (technical, governance, operational, collaboration, quality, and usage) ...

What is the difference between data catalog and data lineage?

With advanced technologies like artificial intelligence (AI), data lineage can be automatically tracked and visualized, making it easier for data teams to understand the flow of data and identify any potential bottlenecks or risks. Data cataloging involves organizing and categorizing data assets within an organization.

What is the difference between data catalog and data lake?

A data catalog is exactly as it sounds: it is a catalog for all the big data in a data lake. By applying metadata to everything within the data lake, data discovery and governance become much easier tasks.


Author: Clemencia Bogisich Ret
