What is Data Lineage?

2022-10-262022-10-26No tags

152views

What is data lineage? Techniques, best practices and tools

Organizations can bolster data governance efforts by tracking the lineage of data in their systems. Get advice on how to do so and how data lineage tools can help.

By Craig Stedman, Industry Editor David Loshin, Knowledge Integrity Published: 28 Jul 2022

Data lineage documents the journey that data takes through an organization’s IT systems, showing how it flows between them and gets transformed for different uses along the way. It uses metadata — data about the data — to enable both end users and data management professionals to track the history of data assets and get information about their business meaning or technical attributes.

For example, data lineage records can help data scientists, other data analysts and business users understand the data they work with and ensure that it’s relevant to their information needs. Data lineage also plays a valuable role in data governance, master data management and regulatory compliance programs. Among other aspects of those initiatives, it simplifies two critical data governance procedures: analyzing the root causes of data quality issues and the impact of changes to data sets.

Data lineage information is collected from operational systems as data is processed and from the data warehouses and data lakes that store data sets for BI and analytics applications. In addition to the detailed documentation, data flow maps and diagrams can be created to provide visualized views of data lineage mapped to business processes. To simplify end-user access to lineage information, it’s often incorporated into data catalogs, which inventory data assets and the metadata associated with them.

Why is data lineage important?

Information on data lineage is crucial to data management and analytics efforts. Lineage details shine a light on data to help organizations manage and use it effectively. Without access to those details, it becomes much harder to take full advantage of data’s potential business value.

The following are some of the benefits that data lineage provides.

More accurate and useful analytics. By letting analytics teams and business users know where data comes from and what it means, data lineage improves their ability to find the data they need for BI and data science uses. That leads to better analytics results and makes it more likely that data analysis work will deliver meaningful information to drive business decision-making.
Stronger data governance. Data lineage also aids in tracking data and carrying out other key parts of the governance process. It helps data governance managers and team membersmake sure that data is valid, clean and consistent and that it’s secured, managed and used properly.
Tighter data security and privacy protections. Organizations can use data lineage information to identify sensitive data that requires particularly strong security. It can also be used to set different levels of user access privileges based on security and data privacy policies and to assess potential data risks as part of an enterprise risk management strategy.
Improved regulatory compliance. Better security protections informed by data lineage can help organizations ensure that they comply with data privacy laws and other regulations. Well-documented data lineage also makes it easier to conduct internal compliance audits and report on compliance levels.
Streamlined data management. In addition to data quality improvement, data lineage boosts a variety of other data management tasks. Examples include managing data migrations, breaking down data silos and detecting and addressing gaps in data sets.

Sample diagram of the data lineage process — This shows how data lineage information is extracted from metadata collected as data moves through IT systems.

Data lineage vs. data classification and data provenance

Data lineage is also closely aligned with data classification and data provenance, two other data management processes. Here’s a look at what they are and how they differ from and relate to data lineage.

Data classification. This involves assigning data to different categories based on its characteristics, primarily for security and compliance purposes. Classification is used to categorize data based on how sensitive it is — for example, as personal, proprietary, confidential or public information. Doing so separates data sets that need higher levels of security and more restrictive access controls from ones that don’t. Data lineage provides information about data sets that can aid in classifying them.

Data provenance. Sometimes considered to be synonymous with data lineage, data provenance alternatively is seen as being more narrowly focused on the origins of data, including its source system and how it’s generated. In that context, data lineage and data provenance can work hand in hand, with the latter providing high-level documentation of where data comes from and what it entails.

Data lineage and data governance

The essence of data governance is creating corporate data policies and ensuring that people comply with them. Such policies can span an array of intents, including directives on data protection, validation and usage. Data governance managers and data stewards must solicit data requirements from business users and work with members of the decision-making data governance committee to agree on common data definitions, specify data quality metrics and develop the policies and associated procedures.

It’s a big challenge, though, to bridge the gap between defining data governance policies and implementing them. That’s where data lineage comes in. It documents data sources and flows, enabling governance teams to monitor how data moves through systems and is modified and used. The lineage information helps them ensure that proper data security and access controls are in place and that data is stored, maintained and used in accordance with governance policies.

Data lineage can also ease specific governance-related tasks. For example, without a way to determine where data errors are introduced into systems, it’s difficult for data stewards and data quality analysts to identify and fix them. That has consequences: If data flaws aren’t caught, an organization may be plagued by inconsistent or inaccurate analytics results that lead to bad business decisions.

In root cause analysis of data errors, lineage records provide visibility into the sequence of processing stages a data set goes through. Quality levels can be examined at each stage to find where data errors originate. Working backward from where an error is first identified, a data steward can check whether the data conformed to expectations at earlier points or included the error then. By pinpointing the stage at which the data was compliant upon entry but flawed upon exit, workers involved in a data governance program can eliminate the error’s root cause instead of just correcting the bad data.

Data lineage is useful, too, in doing impact analysis to stay on top of issues caused by changes to source data formats and structures, a common problem in today’s increasingly dynamic data environments.

When data is changed, there may be unintended consequences downstream. By working forward from the point of data creation or collection, a data steward can rely on data lineage documentation to help trace data dependencies and identify processing stages that are affected by the changes. Those stages can then be reengineered to accommodate the changes and ensure that data remains consistent in different systems.

Key details on data lineage and how it works — This summarizes what data lineage is and what data lineage tools do.

Key data lineage techniques

Various techniques can be used to collect and document data lineage information. They aren’t necessarily mutually exclusive — an organization may use more than one lineage technique, depending on its application needs and the nature of its data environment. The available methods include the following:

Data tagging. By examining metadata, tags can be applied to data sets to help describe and characterize them for data lineage purposes. Tagging can be done manually by data stewards, other data governance team members and end users or automatically by software. For example, data lineage tools and lineage functionality built into data governance software often include automated algorithms that users can run to tag data sets.
Pattern-based lineage. This approach looks for patterns in multiple data sets, such as similar data elements, rows and columns. Their presence indicates that data sets are related to each other and may be part of a data flow, while differences in data values or attributes are a sign that the data was transformed when it moved from one system to another. The data transformations and data flows can then be documented as part of data lineage records.
Parsing-based lineage. In this case, data lineage tools parse data transformation logic, runtime log files, data integration workflows and other data processing code to identify and extract lineage information. Parsing offers an end-to-end approach to tracing data lineage in different systems and can be more accurate than pattern-based lineage, but it’s also more complex.

Another approach is fully manual: interviewing business users, BI analysts, data scientists, data stewards, data integration developers and other workers about how data moves through systems and gets used and modified. The information that’s gathered can be used to map out data flows and transformations, perhaps as a starting point for a data lineage initiative before implementing more automated techniques.

Read more …

add a comment