
What Is a Data Catalog?
By Michelle Knight
A data catalog centralizes access to all of an organization’s available data assets through a metadata inventory. This repository facilitates dataset search and retrieval so that users and systems can easily find the information needed for business. A data catalog differs from a data dictionary in its ability to search and retrieve information.
Data catalogs “may have started as little more than repositories for database schema, sometimes accompanied by business documentation around the database tables and columns,” according to Oksana Sokolovsky and Rohit Mahajan of Io-Tahoe. But, “instead of looking up a table name and reading its description, users and systems can search for business entities. Then, these people and machines can find related datasets to quickly perform analysis and derive insights.”
While the business terms found in a data catalog also appear in business glossaries, a data catalog looks more like a directory: it assumes users already know, or have easy access to, the business definitions. Data catalogs’ self-service capabilities make them valuable in business intelligence. In addition, customized data catalogs can speed up data computing and storage, making datasets more readily available for use.
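To make the idea of a searchable metadata inventory concrete, here is a minimal sketch in Python. It is not based on any particular catalog product: the `DatasetEntry` and `DataCatalog` classes, their fields, and the sample dataset names are all hypothetical, chosen only to show how a search by business entity and a lookup of related datasets might work.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One record in the metadata inventory (all fields are illustrative)."""
    name: str
    location: str
    description: str
    business_terms: set[str] = field(default_factory=set)

    def terms(self) -> set[str]:
        # Normalize terms so that searches are case-insensitive.
        return {t.lower() for t in self.business_terms}


class DataCatalog:
    """A toy in-memory catalog: it stores metadata only, never the data itself."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, business_term: str) -> list[DatasetEntry]:
        """Return entries tagged with the business term, sorted by name."""
        term = business_term.lower()
        hits = [e for e in self._entries.values() if term in e.terms()]
        return sorted(hits, key=lambda e: e.name)

    def related(self, name: str) -> list[DatasetEntry]:
        """Return other entries that share at least one business term."""
        source = self._entries[name]
        hits = [e for e in self._entries.values()
                if e.name != name and source.terms() & e.terms()]
        return sorted(hits, key=lambda e: e.name)


# Hypothetical sample metadata, for illustration only.
catalog = DataCatalog()
catalog.register(DatasetEntry("crm_accounts", "warehouse.sales.accounts",
                              "Customer master records", {"Customer", "Account"}))
catalog.register(DatasetEntry("web_orders", "lake/raw/orders/",
                              "Online orders", {"Customer", "Order"}))
catalog.register(DatasetEntry("street_repairs", "lake/raw/repairs/",
                              "Maintenance work orders", {"Infrastructure"}))

customer_names = [e.name for e in catalog.search("customer")]
related_names = [e.name for e in catalog.related("crm_accounts")]
print(customer_names)
print(related_names)
```

A user here searches for the business entity "customer" rather than a table name, then follows shared terms to related datasets. A production catalog would add persistence, access control, and automated metadata harvesting, none of which this sketch attempts.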
Data Catalog Use Cases Include:
- The World Bank designed a data catalog to make its “development data easy to find, download, use, and share.”
- Harvard Open Door Project (HODP) was created “to increase transparency and solve problems on campus.”
- IBM Watson connected customer data and advertising information for an automotive company to better target the right audiences at the right time.
- Kansas City, Mo., used an open data catalog to “drive decisions to save money through more efficient repairs and maintenance of streets, water lines, and other infrastructure.”
- Financial Industry Regulatory Authority (FINRA) created a data catalog “that stores technical metadata to support querying and data fixes. In addition, it features a UI that allows data scientists and other consumers to explore the data sets.”

Other Data Catalog Definitions Include:
- “Business-oriented directories that help users find the data they need, quickly” (Sokolovsky and Mahajan)
- “Solution designed for business users to solve data-centric issues that hold decisions, business processes and outcomes hostage” (TechRepublic)
- Accessible data for self-service analytics and Data Science initiatives through a 360-degree view (IBM)
- A platform to share and discover otherwise hard-to-find datasets while keeping ultimate control over the data in researchers’ hands (Health Sciences Library System, University of Pittsburgh)
- “A searchable and browsable online collection of data sets” (NYU Health Sciences Library)
Businesses Need Data Catalogs To:
- Unlock the potential of a data lake by applying a universal schema to enhance discoverability.
- “Utilize, enrich, manage, and value a company’s information.”
- Find and classify data at scale.
- Drive digital transformation initiatives such as Machine Learning and AI.
- Enhance marketing, sales, operations and just about every other area of an organization.
- “Improve data visibility and better enforce data security policies.”
- “Allow any users, from analysts to data scientists or developers, to discover and consume data sources.”