Data Catalog overview | Data Catalog Documentation

Dataplex's Data Catalog feature is a central inventory of an organization's data assets. Data Catalog automatically catalogs metadata from Google Cloud sources such as BigQuery, Vertex AI, Pub/Sub, Spanner, Bigtable, and more. Data Catalog also indexes table and fileset metadata from Cloud Storage through discovery.

You can discover data with Dataplex's governed, organization-wide metadata search capability. You can further enrich metadata with critical business context, and enable lineage tracking, data profiling, data quality checks, and access control capabilities.

Using Data Catalog, organizations can achieve better data discovery, metadata management, and governance.

Why do you need Data Catalog?

Most organizations today are dealing with a large and growing number of data assets.

Data stakeholders (consumers, producers, and administrators) within an organization face multiple challenges:

Searching for insightful data:
- Data consumers don't know the location and origin of data. They have to navigate data "swamps".
- Data consumers don't know what data to use to get insights because most data is not well documented and, even if documented, is not well maintained.
- Data can't be found and is often lost when it resides only in people's minds.
Understanding data:
- Is the data fresh, clean, validated, approved for use in production?
- Which dataset out of several duplicate sets is relevant and up-to-date?
- How does one dataset relate to another?
- Who is using the data and who is the owner?
- Who and what processes are transforming the data?
Making data useful:
- Data producers don't have an efficient way to put forward their data for consumers. If there's no self-service, consumers may overwhelm producers. Several data engineers can't manually provide data to thousands of data analysts.
- Valuable time is lost if data consumers have to find out how to request data access, wait without a defined response time, escalate, and wait again.

Without the right tools, these challenges become a major obstacle to the efficient use of data. Data Catalog provides a centralized repository that lets organizations achieve the following:

Gain a unified view to reduce the pain of searching for the right data.
Support data-driven decision making and accelerate the insight time by enriching data with technical and business metadata.
Improve data management to increase operational efficiency and productivity.
Take ownership over the data to improve trust and confidence in it.

Data Catalog functions

Data Catalog provides three main functions:

Searching for data entries for which you have access
Tagging data entries with metadata
Providing column-level security for BigQuery tables

In addition, Data Catalog can build on the results of a Sensitive Data Protection scan to identify sensitive data directly within Data Catalog in the form of tag templates.

How Data Catalog works

Data Catalog can catalog asset metadata from different Google Cloud systems.

You can also use Data Catalog APIs to integrate with custom data sources.

After your data is cataloged, you can add your own metadata to these assets using tags.
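
To illustrate how tags layer business metadata on top of cataloged technical metadata, here is a minimal sketch in plain Python. The names `TagTemplate`, `Entry`, and `attach_tag` are hypothetical and are not part of the Data Catalog API; the sketch only assumes that a tag is a set of values checked against the fields a template declares.

```python
from dataclasses import dataclass, field


@dataclass
class TagTemplate:
    """Hypothetical stand-in for a tag template: field name -> expected Python type."""
    template_id: str
    fields: dict


@dataclass
class Entry:
    """Hypothetical cataloged asset holding technical metadata and user-added tags."""
    name: str
    technical: dict
    tags: list = field(default_factory=list)


def attach_tag(entry, template, values):
    """Check values against the template, then attach them as business metadata."""
    for key, value in values.items():
        if key not in template.fields:
            raise KeyError(f"{key!r} is not defined in template {template.template_id!r}")
        if not isinstance(value, template.fields[key]):
            raise TypeError(f"{key!r} must be of type {template.fields[key].__name__}")
    entry.tags.append({"template": template.template_id, "values": dict(values)})
    return entry


# Illustrative values only: a governance template applied to an "orders" table.
governance = TagTemplate("data_governance", {"owner": str, "has_pii": bool})
orders = Entry("orders", {"system": "bigquery", "columns": ["id", "email"]})
attach_tag(orders, governance, {"owner": "sales-data-team", "has_pii": True})
```

The technical metadata (`system`, `columns`) stays untouched in this sketch; the tag only adds context that the source system does not record.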

Data Catalog handles two types of metadata: technical metadata and business metadata. To learn more about metadata, see Data Catalog metadata.

Search and discovery

Data Catalog offers a powerful predicate-based search experience for technical and business metadata associated with a data entry. You must have permission to read the metadata for a data entry before you can search for and discover it. Data Catalog does not index the data within a data entry. Data Catalog only indexes the metadata that describes an asset.

Data Catalog controls some metadata such as user-generated tags. For all metadata sourced from the underlying storage system, Data Catalog is a read-only service that reflects the metadata and permissions provided by the underlying storage system. You can make edits in the underlying storage system to add, update, or delete the metadata of a data entry.

To learn more about Data Catalog search, see Search for data assets with Data Catalog.
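
As a rough illustration of a predicate-based search, the sketch below builds the JSON body for a catalog search request without sending it. It assumes the v1 REST method `catalog:search` and its `query`, `scope`, and `pageSize` fields; the helper name `build_search_request`, the project ID `my-project`, and the sample query are hypothetical, so check the API reference before relying on any of them.

```python
import json

# Assumed endpoint of the v1 search method; no request is sent by this sketch.
SEARCH_URL = "https://datacatalog.googleapis.com/v1/catalog:search"


def build_search_request(query, project_ids, include_public=False, page_size=10):
    """Return a JSON-serializable body for a catalog search call."""
    if not project_ids and not include_public:
        # A scope with nothing in it has no projects or public datasets to search.
        raise ValueError("scope must name a project or include public datasets")
    return {
        "query": query,
        "scope": {
            "includeProjectIds": list(project_ids),
            "includeGcpPublicDatasets": include_public,
        },
        "pageSize": page_size,
    }


# Predicates narrow results by metadata: here, tables with a column matching "email".
body = build_search_request("type=table column:email", ["my-project"])
print(json.dumps(body, indent=2))
```

Because search covers only metadata, the predicates refer to names, types, columns, and tags rather than to the contents of the data itself.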

Automatic cataloging of assets

For a given project, Data Catalog automatically catalogs the following Google Cloud assets:

Analytics Hub linked datasets
BigQuery datasets, tables, models, routines, and connections
Bigtable instances, clusters, and tables (including column family details)
Dataplex lakes, zones, tables, and filesets
Dataproc Metastore services, databases, and tables
Pub/Sub topics
Spanner instances, databases, tables, and views
Vertex AI models, datasets, and Vertex AI Feature Store resources

In addition to cataloging assets within the project IDs for which you have metadata access, Data Catalog can catalog data stored in the BigQuery projects that contain public datasets.

Catalog non-Google Cloud assets

To catalog metadata from non-Google Cloud systems in your organization, you can use the following:

Community-contributed connectors to multiple popular on-premises data sources
Manually build on the Data Catalog APIs for custom entries
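
For the manual route, a custom entry is essentially a description of an external asset. The sketch below assembles such a description as a JSON-style dictionary, assuming the v1 `Entry` resource fields `displayName`, `userSpecifiedSystem`, `userSpecifiedType`, `linkedResource`, and `schema`. The helper `build_custom_entry` and every value passed to it are hypothetical, and nothing is sent to the API.

```python
def build_custom_entry(display_name, system, entry_type, linked_resource, columns):
    """Return a JSON-style description of a custom (non-Google Cloud) entry."""
    return {
        "displayName": display_name,
        # User-specified values mark the entry as coming from a custom source.
        "userSpecifiedSystem": system,
        "userSpecifiedType": entry_type,
        # Pointer back to the asset in its source system.
        "linkedResource": linked_resource,
        "schema": {
            "columns": [{"column": name, "type": col_type} for name, col_type in columns]
        },
    }


# Hypothetical on-premises MySQL table described as a custom entry.
entry = build_custom_entry(
    display_name="orders",
    system="onprem_mysql",
    entry_type="table",
    linked_resource="mysql://onprem-db.example.internal/sales/orders",
    columns=[("id", "INT"), ("email", "VARCHAR")],
)
```

A connector would typically generate one such description per source table and submit each to an entry group through the API.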

Access Data Catalog

You can access Data Catalog functionalities using:

Dataplex in the Google Cloud console
gcloud command-line interface (CLI)
Data Catalog APIs
Cloud Client Libraries

What's next

To get started with Data Catalog tagging, see Create tag templates, tags, overviews, and data stewards.
To get started with Data Catalog search and discovery, see Search and view data assets with Data Catalog.
To integrate your data sources, follow the steps in Integrate Google Cloud and on-premises data sources.

Data Catalog overview | Data Catalog Documentation | Google Cloud (2024)

FAQs

What is a data catalog in Google Cloud?

Data Catalog allows you to discover, manage, and understand data assets across Google Cloud Platform. The Data Catalog API natively indexes Cloud BigQuery, Cloud Storage, and Cloud Pub/Sub data assets. The Data Catalog API can be used to search for data assets across different projects and GCP resources.

What is the difference between metadata catalog and data catalog?

A data catalog is an organized list of all the data assets which empower data teams throughout the company. Metadata management helps organizations decide how to collect, analyze, and maintain contextual information — metadata. It serves as an organized data inventory for all data sources.

What is included in a data catalog?

A Data Catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate fitness of data for intended uses.

What is the difference between data catalog and data set?

A data catalog references an organization's datasets in various categories for search and discovery. It helps map an organization's data, primarily for compliance with regulations (GDPR / CCPA). It enables data search and discovery of data assets, with the right context.

What does a good data catalogue look like?

A good data catalog uses capabilities such as search, filters, and recommendations to make finding the right data simple regardless of a user's technical knowledge. Data exploration. Sometimes, users need to dive deeper to find related data or mine existing data for insights.

What's the difference between catalog and catalogue?

Both “catalogue” and “catalog” are correct spellings of the word, but their usage depends on the variant of English being used: “Catalogue” is the preferred spelling in British English and some other varieties of English. “Catalog” is the preferred spelling in American English.

What is the difference between data schema and data catalog?

Catalogue: This is the highest level of organization within a database. A catalogue holds one or more schemas and represents the complete set of schemas that a user or application can access. In essence, a catalogue is a database. Schema: Within a catalogue (or database), you have schemas.

What is the difference between data inventory and data catalog?

While these terms are often used interchangeably, they are not the same and perform different functions. A data inventory is a unique set of data detailing the location and type of each data point in a company's collection. A data catalog allows users to locate those datasets by referencing them in various categories.

What is the difference between data catalog and master data?

A data catalog is the backbone of modern data management, enabling organizations to find, understand, trust, and use their data effectively. On the other hand, master data management (MDM) is a method of managing the core data of an organization.

What is the core aim of a data catalogue?

Simply put, a data catalog is an organized inventory of data assets in the organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.

Do you really need a data catalog?

At its core, a data catalog serves as an organized inventory of all the data assets within an organization. Data catalogs play a crucial role in both deriving value from data and ensuring proper data governance.

Which two are capabilities of a data catalog?

Data Catalog Key Capabilities

Harvest technical metadata from a wide range of supported data sources that are accessible using public or private IPs. Create and manage a common enterprise vocabulary with a business glossary.

What is the difference between data catalog and data documentation?

The main difference between a data catalog and a data dictionary is that a data dictionary documents technical metadata for a specific database, whereas a data catalog acts as a unified context, control, and collaboration layer of all metadata (technical, governance, operational, collaboration, quality, and usage) ...

Why build a data catalog?

Data catalogs make data more visible and understandable and enable self-service access. An intelligent data catalog offers end-to-end visibility into data sources and lineage. This self-sufficiency delivers greater productivity and user satisfaction.

What is the difference between metadata and data catalog?

Whereas metadata describes data characteristics like structure, format, and content, a data catalog is a software tool used to manage and organize metadata about data assets within an organization, which facilitates a range of use cases.

What is the difference between data catalog and data exchange?

Data catalogs are useful to manage and govern the entire data estate and support compliance and governance requirements. Data exchanges curate the subset of that data that can be most useful in driving business insights and outcomes from data.
