Friday, June 29, 2018

Data Quality - when the back side matters

Author: Yuliia Puzanova


We can develop ever more advanced predictive models, invest huge funds in machine learning algorithms, use the latest hyped technologies, and still end up making the wrong decision. Why? Because our data can be garbage, and the biggest issue is that we may not even be aware of it. The problem of data credibility is now discussed more and more often by companies, countries, and international organizations such as the World Bank and the World Health Organization [1].


The sad part is that in most companies employees waste plenty of time hunting for errors and for trusted data sources, or, simply put, double- and triple-checking every figure they report.

Low data quality means wrong decisions in the future, which in turn slows down economic development. If we estimate the overall losses, the amounts are tremendous: IBM estimates that the cost of poor data quality for the US economy alone is nearly $3.1 trillion per year [2].

However, even losses of this magnitude have not substantially changed how companies behave or how they approach improving the quality of their data. An experiment conducted by HBR showed that only 3% of participants met the minimum acceptable level of data quality [3].


Low data quality should be a nightmare for a company's management board, because it leads not only to wrong figures in annual or monthly reports, lower profitability and considerable fines, but also to lost customers and reputational damage.

To be sure that we have high-quality data, we should go through the following checklist of the main characteristics we want our data to have [4] (a small code sketch after the list shows how such checks might look in practice).

Availability: is the data available? Is it up to date?
Accessibility: do we have access to the data? Under what conditions?
Usability: can we really use this data?
Structure: is the data well structured and ready to be used?
Reliability: can we trust this data?
Consistency and completeness: is the data complete, or are there gaps?
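As a rough illustration, here is what running a single CSV file through some of these questions might look like in Python with pandas. The file name and the expected column names are invented for the example, and a real check would of course go much further.

```python
import pandas as pd

def checklist_report(path, required_columns):
    """Answer some of the checklist questions for a single CSV file."""
    report = {}
    try:
        # Availability / Accessibility: can we even load the file?
        df = pd.read_csv(path)
        report["accessible"] = True
    except FileNotFoundError:
        return {"accessible": False}

    # Structure: are the columns we expect actually present?
    report["expected_columns_present"] = set(required_columns).issubset(df.columns)

    # Consistency and completeness: what share of the values is missing?
    report["missing_share"] = float(df.isna().mean().mean())

    # Usability: how many rows survive after dropping incomplete ones?
    report["usable_rows"] = int(len(df.dropna()))

    return report

# Hypothetical file and columns, purely for illustration.
print(checklist_report("sales_2018.csv", ["date", "region", "revenue"]))
```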

Data quality management is in many ways similar to system administration. Its main functions are to look for inconsistencies, errors, wrong data types or formats, and gaps, and to correct them; it also looks for duplicates and tries to eliminate them. If all data quality processes run effectively, we can be confident that the data used for business analytics, reporting, analysis and regulatory compliance gives us trustworthy results.
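As a minimal sketch of those routine checks, the pandas snippet below detects duplicates, coerces wrong types and formats, and counts gaps on a small invented table; the column names and values are hypothetical.

```python
import pandas as pd

# A small, invented table with the typical problems: a duplicate row,
# a badly formatted date, a missing value, and numbers stored as text.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "signup_date": ["2018-01-05", "2018-02-11", "2018-02-11", "not a date", None],
    "revenue":     ["1200", "950", "950", "1430", "800"],
})

# Duplicates: find them, then eliminate them.
duplicate_rows = df[df.duplicated()]
df = df.drop_duplicates()

# Type/format errors: coerce columns to the expected types;
# values that cannot be converted become NaT/NaN instead of silently staying wrong.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Gaps: count missing values per column before deciding how to correct them.
print("Duplicate rows removed:", len(duplicate_rows))
print(df.isna().sum())
```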

It is always better to take a step back and make sure of the quality of the data, because programming skills and the best data scientists on the labor market won't help when the foundation was built with mistakes - the house will fall anyway.

We should also remember to evaluate data quality beyond the private sector: public economic data is no less important. For example, there is a lot of discussion about countries' GDP figures and their truthfulness, and it is not a one-hour exercise to check whether the reported numbers can be trusted. The reliability of GDP data is highly important for macroeconomic and financial policy analysis, and yet only a few countries meet the high standards set by international organizations.

The IMF developed the Data Quality Assessment Framework (DQAF). It is based on the UN Fundamental Principles of Official Statistics and defines five data quality dimensions: assurances of integrity, methodological soundness, accuracy and reliability, serviceability, and accessibility. The DQAF gives public authorities a guideline for improving the quality of their statistics and allows countries to verify their statistical methodologies and integrity objectively.

The integrity principle means that statistical practices are transparent and guided by ethical and professional standards [5].

Methodological soundness means that the main definitions are used in accordance with internationally defined and accepted global standards [5].

Accuracy and reliability mean that the source data can be trusted, that it is assessed and validated on an ongoing basis, and that the statistical outputs are validated as well [5].

Serviceability means that the statistics cover all relevant information and are consistent [5].

The data satisfies the accessibility principle if it is presented in a clear and understandable manner, is up to date, and its metadata is available [5].

Overall, the DQAF can be used not only for self-assessment by national statistical offices, but also by researchers to evaluate the data they use for policy analysis, assessments of economic performance, or forecasts.
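As a purely illustrative sketch, such a self-assessment could be recorded as a simple scorecard over the five dimensions. The 1-5 scale, the threshold and the scores below are assumptions made for the example, not part of the DQAF itself.

```python
# Dimension names come from the DQAF description above; everything else
# (the scale, the threshold, the example scores) is invented for illustration.
DQAF_DIMENSIONS = [
    "assurances of integrity",
    "methodological soundness",
    "accuracy and reliability",
    "serviceability",
    "accessibility",
]

def summarize(scores, threshold=3):
    """Average the scores and flag dimensions below the chosen threshold."""
    weak = [dim for dim, score in scores.items() if score < threshold]
    return {"average": sum(scores.values()) / len(scores), "needs_attention": weak}

# A hypothetical self-assessment for one statistical dataset.
scores = dict(zip(DQAF_DIMENSIONS, [4, 3, 2, 4, 5]))
print(summarize(scores))
```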

In conclusion, the benefits of improving data quality go far beyond reducing costs or eliminating economic losses. We have no right to discuss the great prospects of Big Data if we keep treating data quality issues as a second priority. We should stop telling ourselves "I will fix it later" whenever we find a small mistake in the figures.
