6 dimensions of data quality boost data performance
Generate accurate data analysis and predictions by mastering the six dimensions of data quality — accuracy, consistency, validity, completeness, uniqueness and integrity.
Artificial intelligence and machine learning can generate quality predictions and analysis, but the models must first be trained on high-quality data, starting with the six dimensions of data quality.
The old adage of computer programming — garbage in, garbage out — is just as applicable to today’s AI systems as it was to traditional software. Data quality means different things in different contexts, but, in general, good quality data is reliable, accurate and trustworthy.
“Data quality also refers to the business’ ability to use data for operational or management decision-making,” said Musaddiq Rehman, principal in the digital, data and analytics practice at Ernst & Young.
In the past, ensuring data quality meant a team of human beings would fact-check data records, but as the size and number of data sets increase, that approach becomes less practical and scalable.
Many companies are starting to use automated tools, including AI, to help with the problem.
By the end of this year, 60% of organizations will use machine-learning-enabled data quality technology to reduce manual tasks, according to Gartner. To get the most out of these tools, organizations should master the six dimensions of data quality that underpin effective data performance.
1. Accuracy
Data accuracy is a measure of whether the data in a company’s systems matches the real world or another verifiable source.
“For an accuracy metric to provide valuable insights, there is typically the need for reference data to verify its accuracy,” said Rehman.
For example, vendor data could be checked against a third-party data supplier database, or invoice amounts typed into a bookkeeping system could be checked against the paper documents.
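As a rough sketch of how such a check might be scored, the share of records matching a reference source can serve as an accuracy metric. The data sets, field names and vendor_id key below are hypothetical:

```python
# A minimal accuracy metric: the fraction of system records whose checked
# fields match a trusted reference source. All names here are illustrative.
reference = {
    "V001": {"name": "Acme Corp", "address": "100 Main St"},
    "V002": {"name": "Globex Inc", "address": "200 Oak Ave"},
}

system_records = [
    {"vendor_id": "V001", "name": "Acme Corp", "address": "100 Main St"},
    {"vendor_id": "V002", "name": "Globex Inc", "address": "200 Oak Av"},  # typo
]

def accuracy(records, reference, fields=("name", "address")):
    """Return the fraction of records whose fields match the reference."""
    matches = sum(
        all(rec[f] == reference.get(rec["vendor_id"], {}).get(f) for f in fields)
        for rec in records
    )
    return matches / len(records)

print(f"Accuracy: {accuracy(system_records, reference):.0%}")  # Accuracy: 50%
```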
The biggest threat to accuracy is human data entry, whether it’s employees or customers themselves doing the typing.
“One letter separates the postal abbreviations for Alabama, Alaska and Arkansas,” said Doug Henschen, vice president and principal analyst at Constellation Research. “One digit difference in an address or phone number makes the difference in being able to connect to a customer.”
Even with all the recent progress in digitizing back-end systems and improving customer-facing interfaces, systems are still vulnerable to errors, he said. Good user interface design can help a lot here.
For example, many customer-facing address forms have a built-in address checker to confirm that an address does, in fact, exist. Similarly, credit card numbers and email addresses can be checked at the time of entry.
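Two of those entry-time checks are simple enough to sketch. The snippet below is an illustration rather than any production validator: it applies the standard Luhn checksum to card numbers and a basic structural test to email addresses.

```python
import re

def luhn_valid(card_number: str) -> bool:
    """Luhn checksum: catches most single-digit typos in card numbers."""
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    if len(digits) < 13:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def plausible_email(address: str) -> bool:
    """Cheap structural test; only a confirmation email proves deliverability."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", address) is not None

print(luhn_valid("4539 1488 0343 6467"))  # True: passes the checksum
print(plausible_email("user@example"))    # False: no dot in the domain
```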
The latest technology on this front is the customer data platform.
“CDPs are primarily designed to resolve identities and tie together information associated with one person to create a single customer record,” Henschen said.
But they can also help ensure accuracy and keep records up to date as customers change jobs, get married or divorced, move, or get new email addresses. Most data quality tools offer functionality to validate addresses and perform other standard accuracy checks.
They can also be used to profile data, so that an alert fires when someone enters an unexpected value. IBM, SAP, Ataccama, Informatica and other leaders in Gartner’s Magic Quadrant for data quality solutions offer AI-powered data quality rule creation with self-learning engines.
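In miniature, and leaving aside the vendors’ self-learning engines, profile-based alerting amounts to learning a band of expected values from history and warning on anything outside it:

```python
# A toy profiling rule, not any vendor's engine: learn a numeric range
# from historical values, then flag new entries that fall outside it.
historical_amounts = [120.0, 85.5, 99.0, 110.25, 140.0]

lo, hi = min(historical_amounts), max(historical_amounts)
tolerance = 0.5 * (hi - lo)  # widen the band to reduce false alarms

def check_amount(value: float) -> None:
    if not (lo - tolerance) <= value <= (hi + tolerance):
        print(f"ALERT: amount {value} falls outside the profiled range")

check_amount(105.0)   # in range, stays silent
check_amount(9500.0)  # triggers the alert
```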
Unfortunately, despite the new technologies coming on board, the accuracy problem is getting worse, not better, according to a survey of nearly 900 data experts released in September by data quality vendor Talend.
For example, the percentage of respondents who said their data was up to date fell dramatically from the prior year: only 28% rated their data “very good” on timeliness, down from 57% in 2021.
The percentage of respondents who rated their data “very good” on accuracy fell as well, from 46% in 2021 to 39% this year.
2. Consistency
Consistency means that data reflects the same information across all systems and stays in sync across the enterprise.
Consistency can also be a measure of format-related anomalies in the data, which can be difficult to test and may require planned testing across multiple data sets, Rehman said. Business stakeholders may need to get involved and create a set of standards that applies to all data sets, regardless of which business unit they originated in.
“For example, I have changed my address in an organization’s database,” he said. “That should be reflected across all downstream applications that they support.”
Consistency can extend to external sources as well.
“A marketing provider may use vendor data from a source, but once we change any of the data records in the marketing provider’s database, it will not be reflected in that provider’s source,” Rehman said.
Ensuring consistency is difficult to do manually but can be significantly improved with data quality tools. Automated systems can correlate data across different data sets or check that formats conform to company standards.
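A minimal version of such a cross-system check, using hypothetical CRM and billing records keyed by customer ID, might normalize formatting first so that only genuine content differences are flagged:

```python
# Compare the same customer's address across two hypothetical systems.
crm = {
    "C001": {"address": "12 Elm St, Austin, TX"},
    "C002": {"address": "9 Pine Rd, Boise, ID"},
}
billing = {
    "C001": {"address": "12 ELM ST AUSTIN TX"},    # same content, new format
    "C002": {"address": "45 Maple Ln, Boise, ID"}, # stale record
}

def normalize(addr: str) -> str:
    """Strip punctuation and case so format differences don't trip the check."""
    return " ".join(addr.upper().replace(",", " ").split())

for cid in sorted(crm.keys() & billing.keys()):
    if normalize(crm[cid]["address"]) != normalize(billing[cid]["address"]):
        print(f"Inconsistent address for {cid}: "
              f"{crm[cid]['address']!r} vs {billing[cid]['address']!r}")
```

Only C002 is flagged; C001 differs in format but not in content.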
However, consistency has gotten worse over the past year, according to the Talend survey. In 2021, 40% of respondents rated their data “very good” on consistency. This year, only 32% did the same.
3. Validity
Invalid data could throw off any AI trained on that data set, so companies should create a set of systematic business rules to assess validity, Rehman said.
Birthdates are composed of a month, a day and a year. Social Security numbers are nine digits long. U.S. phone numbers begin with a three-digit area code. Unfortunately, in most cases, it’s not as simple as deciding on a format for a birth date.
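Those three rules translate directly into code. The sketch below is illustrative only, with the specific formats (ISO 8601 dates, hyphenated SSNs) chosen as assumptions rather than universal standards:

```python
import re
from datetime import datetime

def valid_birthdate(value: str) -> bool:
    """Require one agreed format, e.g. ISO 8601 (YYYY-MM-DD)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def valid_ssn(value: str) -> bool:
    """U.S. Social Security numbers: nine digits in NNN-NN-NNNN form."""
    return re.fullmatch(r"\d{3}-\d{2}-\d{4}", value) is not None

def valid_us_phone(value: str) -> bool:
    """Ten digits; the three-digit area code cannot start with 0 or 1."""
    digits = re.sub(r"\D", "", value)
    return len(digits) == 10 and digits[0] not in "01"

print(valid_birthdate("1990-02-30"))     # False: not a real date
print(valid_ssn("123-45-6789"))          # True
print(valid_us_phone("(512) 555-0147"))  # True
```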
“In many cases business input is required to understand what the required standards are,” he said. “These standards may evolve over time and should be monitored on a recurring basis.”
Data quality tools designed to ensure accuracy and consistency can also ensure the data is valid. Informatica, for example, offers an API to validate addresses for all countries, formats and languages.