
Make data usability a priority in data quality for big data


To help make big data analytics applications more effective, IT teams must augment conventional data quality processes with measures aimed at improving data usability for analysts.

By David Loshin, Knowledge Integrity Inc. | Published: 17 Oct 2022

Data quality processes continue to become more prominent in organizations, often as part of data governance programs. For many companies, the growing interest in quality reflects an increased need to ensure that analytics data is trustworthy and usable.

That’s especially true with data quality for big data; more data usually means more data problems — including data usability issues that should be part of the quality discussion.

One of the main challenges of effective data quality management is articulating what quality really means to a company. The commonly cited dimensions of data quality include accuracy, consistency, timeliness and conformity.

But there are many different lists of dimensions, and even some common terms have different meanings from list to list. As a result, relying solely on a particular list, without an underlying sense of what you’re looking to accomplish, is a simplistic approach to data quality.
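To make the idea of dimensions concrete, the sketch below shows how a few of them might be expressed as simple rule-based checks. The field names, allowed values and freshness threshold are invented for illustration and aren't drawn from any particular tool or data set.

```python
# Hypothetical illustration: rule-based checks for a few common data quality
# dimensions (conformity and timeliness) over a small batch of customer
# records. All field names and thresholds are assumptions for the example.
import re
from datetime import datetime, timedelta

records = [
    {"email": "ana@example.com", "country": "US", "updated_at": "2022-10-01"},
    {"email": "bob@example",     "country": "usa", "updated_at": "2021-03-15"},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
VALID_COUNTRIES = {"US", "CA", "MX"}   # conformity: allowed code list
MAX_AGE = timedelta(days=180)          # timeliness: freshness threshold

def check(record, as_of=datetime(2022, 10, 17)):
    """Return a list of dimension violations found in one record."""
    issues = []
    if not EMAIL_RE.match(record.get("email", "")):
        issues.append("conformity: email format")
    if record.get("country") not in VALID_COUNTRIES:
        issues.append("conformity: country code")
    updated = datetime.fromisoformat(record["updated_at"])
    if as_of - updated > MAX_AGE:
        issues.append("timeliness: record is stale")
    return issues

for i, rec in enumerate(records):
    print(i, check(rec) or "no issues found")
```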

This challenge becomes more acute with big data. In Hadoop clusters and other big data systems, data volumes are exploding and data variety is increasing. An organization might accumulate data from numerous sources for analysis — for example, transaction data from different internal systems, clickstream logs from e-commerce sites and streams of data from social networks.

Time to rethink things

In addition, the design of big data platforms exacerbates the potential problems. A company might create data in on-premises servers, syndicate it to cloud databases and distribute filtered data sets to systems at remote sites. This new world of processing data creates issues that aren’t covered in conventional lists of data quality dimensions.

To compensate, reexamine what is meant by quality in the context of a big data analytics environment. Too often, people equate the concept of data quality with discrete notions such as data correctness or currency, putting in place processes to fix data values and objects that aren’t accurate or up to date.

But the task of managing data quality for big data is also likely to include measures designed to help data scientists and other analysts figure out how to effectively use what they have. In other words, data quality efforts must shift from simply generating a black-and-white specification of good versus bad data to supporting a spectrum of data usability.
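One way to picture that spectrum is to score each data set on a handful of dimensions and publish a weighted usability rating instead of a pass/fail flag, as in the hypothetical sketch below. The dimension names, weights and scores are assumptions made for the example.

```python
# Hypothetical sketch of treating usability as a spectrum rather than a
# pass/fail flag: each dimension contributes a 0-1 score, and a weighted
# average yields an overall rating an analyst can interpret.
WEIGHTS = {"completeness": 0.4, "timeliness": 0.3, "conformity": 0.3}

def usability_score(dimension_scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores, each between 0 and 1."""
    total = sum(weights.values())
    return sum(weights[d] * dimension_scores.get(d, 0.0) for d in weights) / total

# Illustrative scores for a clickstream data set.
clickstream_scores = {"completeness": 0.72, "timeliness": 0.95, "conformity": 0.60}
score = usability_score(clickstream_scores)

# Rather than rejecting the data set outright, publish the score so analysts
# can decide whether it is fit for their particular purpose.
print(f"clickstream usability: {score:.2f}")  # -> 0.75
```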

