
Opinion

Ann Lab Med 2024; 44(6): 472-477

Published online July 17, 2024 https://doi.org/10.3343/alm.2024.0084

Copyright © Korean Society for Laboratory Medicine.

Customized Quality Assessment of Healthcare Data

Jieun Shin, Ph.D.1,2,3 and Jong-Yeup Kim, M.D., Ph.D.1,2,4

1Department of Biomedical Informatics, College of Medicine, Konyang University, Daejeon, Korea; 2Konyang Medical Data Research group-KYMERA, Konyang University Hospital, Daejeon, Korea; 3Healthcare Data Verification Center, Konyang University Hospital, Daejeon, Korea; 4Department of Otorhinolaryngology–Head and Neck Surgery, College of Medicine, Konyang University Hospital, Daejeon, Korea

Correspondence to: Jong-Yeup Kim, M.D., Ph.D.
Department of Biomedical Informatics, College of Medicine, Konyang University Hospital, 158, Gwanjeodong-ro, Seo-gu, Daejeon 35365, Korea
E-mail: jykim@kyuh.ac.kr

Received: February 15, 2024; Revised: May 17, 2024; Accepted: June 26, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data are a valuable resource in industrial societies and are becoming increasingly important in the healthcare field. The relevance of big data analysis in the healthcare industry is growing because of paradigm shifts in healthcare services, increasing social needs, and technological advancements. Unlike typical big data characterized by the “5Vs” (volume, variety, velocity, value, and veracity), healthcare data are characterized by heterogeneity, incompleteness, timeliness, longevity, privacy, and ownership [1]. In particular, healthcare data comprise heterogeneous (multimodal) data types, including structured data as well as unstructured data, such as computed tomography scans, magnetic resonance images, and X-ray images [1]. Additionally, numerous challenges exist in terms of data collection in the healthcare domain, including issues related to privacy protection and ownership, and in terms of healthcare data management, including data storage, sharing, and security [1]. Given the inherent complexity of healthcare data, which are not collected for research or industrial purposes, the lack of systematic data collection methods makes it difficult to ensure data quality.

The definition of “data quality” varies, but the most widely accepted definition is “fitness for use” [2], which refers to data that are current, accurate, interconnected, and provide tangible value to users. Low-quality data can lead to operational losses, delays, and increased costs associated with data cleansing, potentially resulting in higher prices for products or services in data-related industries [3]. Additionally, the quality of data can significantly impact the performance of artificial intelligence (AI) models [4]. Particularly in healthcare, the use of low-quality data in areas such as treatment, surgery, research, and policy decisions can result in significant losses. In the United States, the use of big data in healthcare may save approximately $300 billion in economic costs annually [5]. In this context, low-quality healthcare data pose a threat to public life and health and can lead to inefficiencies in the healthcare system. Therefore, high-quality healthcare data are essential for ongoing healthcare data research.

Various data quality management methods have been researched and reported. In 2003, the WHO published guidelines to enhance the quality of healthcare data [6]. These guidelines provide directives for healthcare professionals and information managers, enhancing understanding of all aspects of data collection and creation. In Korea, organizations such as the National Information Society Agency (NIA) [7] have established guidelines and directives for data quality management, and the Korea Data Industry Promotion Agency (KData) (http://dataqual.co.kr/bbs/content.php?co_id=quality_concept) has been operating a data quality certification system since 2006. This highlights the national and social recognition of the need for systematic management to consistently maintain and improve data quality from the user’s perspective.

The NIA is building and releasing AI learning data. After quality assessment by the Telecommunications Technology Association (TTA)—an external organization that does not build the data—according to the guidelines and directives for data quality management, the data are released via AI-Hub (https://aihub.or.kr). In big data research using laboratory data, external quality assessment results are used to evaluate data quality or accuracy [8, 9]. However, current national and international research on healthcare data quality lacks clearly defined standards or criteria. Specifically, standards and quality management criteria reflecting the unique characteristics of healthcare data, such as imaging information and biometric signals, remain insufficient. This gap highlights the need to develop quality management measures and ensure the production and utilization of high-quality healthcare data.

Research on quality indicators and systems that align with the specific nature of healthcare data is limited [10, 11]. Furthermore, healthcare data are not only voluminous but also varied in type, form, and attributes [12]. Therefore, a quality management system suited to the characteristics of healthcare data is necessary for effective management and utilization. We propose three stages of quality management direction (Fig. 1), as follows.

Figure 1. Three stages of quality management direction for customized healthcare data.

First, customized quality assessment indicators for healthcare data must be established. Numerous studies on quality indicators for measuring data quality have been conducted [11]. However, the terminology and definitions of quality assessment indicators lack consistency, leading to confusion and trial-and-error because of the adoption of different quality elements and measurement items. Hwang, et al. [12] searched the literature from 1990 to 2023 using the search terms “data quality,” “data quality assessment,” and “data quality dimensions” in Google Scholar to examine the diversity of quality indicators. Their search yielded 23 publications containing 80 different data quality assessment indicators with various definitions. In other words, studies use the same term with different definitions, or different terms with the same definition, for data quality assessment indicators (Table 1). Therefore, a consensus on the definitions of data quality assessment indicators is needed to quantify data quality. In particular, research on quality assessment indicators that reflect the characteristics of healthcare data is needed: to ensure the validity and reliability of healthcare data quality assessment, the terms used in measuring quality must first be clearly conceptualized. Therefore, further studies are required to redefine customized quality assessment criteria for healthcare data.

Table 1. Data quality evaluation indicators collected by Hwang, et al. [12] via a literature search

Group* | Indicator | Definition
1 | Coherence | The extent to which data are consistent over time and across providers
1 | Compliance | The extent to which data adhere to standards or regulations
1 | Conformity | The extent to which data are presented following a standard format
1 | Consistency | The extent to which data are presented following the same rule, format, and/or structure
1 | Directionality | The extent to which data are consistently represented in the graph
1 | Identifiability | The extent to which data have an identifier, such as a primary key
1 | Integrability | The extent to which data follow the same definitions so that they can be integrated
1 | Integrity | The extent to which the data format adheres to criteria
1 | Isomorphism | The extent to which data are modeled in a compatible way
1 | Joinability | Whether a table contains a primary key of another table
1 | Punctuality | Whether the data are available or reported within the promised time frame
1 | Referential integrity | Whether the data have unique and valid identifiers
1 | Representational adequacy | The extent to which operationalization is consistent
1 | Structuredness | The extent to which data are structured in the correct format and structure
1 | Validity | The extent to which data conform to appropriate standards
2 | Ambiguity | The extent to which data are presented properly to prevent data from being interpreted in more than one way
2 | Clarity | The extent to which data are clear and easy to understand
2 | Comprehensibility | The extent to which data concepts are understandable
2 | Definition | The extent to which data are interpreted
2 | Granularity | The extent to which data are detailed
2 | Interpretability | The extent to which data are defined clearly and presented appropriately
2 | Naturalness | The extent to which data are expressed using conventional, typified terms and forms according to a general-purpose reference source
2 | Presentation, Readability | The extent to which data are clear and understandable
2 | Understandability | The extent to which data have attributes that enable them to be read, interpreted, and understood easily
2 | Vagueness | The extent to which data are unclear or unspecific
3 | Accuracy | The extent to which data are close to the real-world or correct value (by experts)
3 | Believability | The extent to which data are credible
3 | Correctness | The extent to which data are true
3 | Credibility | The extent to which data are true and correct to the content
3 | Plausibility | The extent to which the data make sense based on external knowledge
3 | Precision | The extent to which data are exact
3 | Reliability | Whether the data represent reality accurately
3 | Transformation | The error rate due to data transformation
3 | Typing | Whether the data are typed properly
3 | Verifiability | The extent to which data can be demonstrated to be correct
4 | Concise representation | The extent to which data are represented in a compact manner
4 | Complexity | The extent of data complexity
4 | Redundancy | The extent to which data have a minimum content that represents the reality
5 | Currency | The extent to which data are old
5 | Freshness | The extent to which replicas of data are up to date
5 | Timeliness | The extent to which data are up to date
5 | Distinctness | The extent to which duplicate values exist
5 | Duplication | The extent to which data contain the same entity more than once
5 | Uniqueness | The extent to which data have duplicates
6 | Ease of manipulation | The extent to which data are applicable according to a task
6 | Rectifiability | Whether data can be corrected
6 | Versatility | The extent to which data can be presented using alternative representations
7 | Accessibility | The extent to which data are retrieved easily and quickly
7 | Availability | The extent to which data can be accessed
8 | Authority | The extent to which the data source is credible
8 | License | Whether the data source license is clearly defined
8 | Reputation | The extent to which data are highly regarded in terms of their source or content
9 | Cohesiveness | The extent to which the data content is focused on one topic
9 | Fitness | The extent to which data match the theme
10 | Confidentiality | The extent to which data are for authorized users only
10 | Security | The extent to which data are restricted in terms of access
11 | Performance | The latency time and throughput for coping with data with increasing requests
11 | Storage penalty | The time spent for storage
12 | History | The extent to which the data user can be traced
12 | Traceability | The extent to which access to and changes made to data can be traced
13 | Appropriate amount of data | The extent to which the data volume is appropriate for the task
14 | Completeness | The extent to which data do not contain missing values
15 | Concordance | The extent to which there is agreement between data elements (e.g., a diagnosis of diabetes, but all A1C results are normal)
16 | Connectedness | The extent to which datasets are combined at the correct resource
17 | Fragmentation | The extent to which data are in one place in the record
18 | Objectivity | The extent to which data are not biased
19 | Provenance | Whether data contain sufficient metadata
20 | Volatility | How long the information is valid in the context of a specific activity
21 | Volume | Percentage of values contained in data with respect to the source from which they are extracted
22 | Cleanness | The extent to which data are clean and not polluted with irrelevant information, not duplicated, and formed in a consistent way
23 | Normalization | Whether data are compatible and interpretable
24 | Referential correspondence | Whether the data are described using accurate labels, without duplication
25 | Appropriateness | The extent to which data are appropriate for the task
26 | Efficiency | The extent to which data can be processed and provide the expected level of performance
27 | Portability | The extent to which data can be preserved in existing quality under any circumstance
28 | Recoverability | The extent to which data have attributes that allow the preservation of quality under any circumstance
29 | Relevancy | The extent to which data match the user requirements
30 | Usability | The extent to which data satisfy the user requirements
31 | Value-added | The extent to which data are beneficial

*Indicators with similar meanings were grouped together.

Searched using the search terms “data quality,” “data quality assessment,” and “data quality dimensions” in Google Scholar (publication period: 1990–2023). The 80 data quality assessment indicators were obtained from 23 reports [12].
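Several of the tabulated indicators can be operationalized directly as column-level ratios. The sketch below computes completeness, uniqueness, and validity for a toy laboratory-result table; the field names, identifier pattern, and records are hypothetical illustrations, not taken from Hwang, et al. [12].

```python
# Sketch: computing three of the tabulated indicators (completeness,
# uniqueness, validity) as ratios over a toy laboratory-result table.
# All field names, values, and the ID pattern are hypothetical.
import re

records = [
    {"patient_id": "P001", "hba1c": 5.4},
    {"patient_id": "P002", "hba1c": None},   # missing value
    {"patient_id": "P002", "hba1c": 6.1},    # duplicated identifier
    {"patient_id": "px3",  "hba1c": 7.0},    # malformed identifier
]

def completeness(rows, field):
    """Share of rows whose `field` is not missing."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values among all values of `field`."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)

def validity(rows, field, pattern):
    """Share of rows whose `field` matches the expected format."""
    return sum(bool(re.fullmatch(pattern, r[field])) for r in rows) / len(rows)

print(completeness(records, "hba1c"))              # 0.75
print(uniqueness(records, "patient_id"))           # 0.75
print(validity(records, "patient_id", r"P\d{3}"))  # 0.75
```

Such ratio-style definitions only cover the objectively measurable indicators; indicators such as believability or reputation would still require the qualitative, checklist-based judgment discussed below.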



Second, a method for quantifying customized quality indicators for healthcare data and benchmarks for assigning quality evaluation scores must be established. For example, Lee and Shin [13] reported that labeling accuracy, one of the quality indicators, can be set at different levels depending on the similarity among the characteristic variables of the data and the level of class imbalance. This implies that the level of quality indicators for healthcare data quality assessment should be determined based on characteristics such as data similarity, sample size, and imbalance. Data quality indicators are evaluated both quantitatively and qualitatively. Quantitative evaluation is based on factors such as data completeness, validity, consistency, and accuracy, and quality levels are defined in four classes (ace, high, middle, low) using the Six Sigma concept [7]. In qualitative evaluation, evaluators subjectively judge criteria through a checklist for each indicator by answering yes-or-no questions [7].
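As a rough sketch of how such a Six Sigma-based quantification could work, the snippet below converts an indicator's observed error rate into a (conventionally 1.5-shifted) sigma level and then into one of the four quality classes. The class cut-offs are illustrative assumptions, not the actual thresholds from the NIA guideline [7].

```python
# Sketch: error rate -> sigma level -> quality class.
# The 1.5-sigma shift is the conventional Six Sigma adjustment; the
# class cut-offs below are hypothetical, not the NIA guideline's values.
from statistics import NormalDist

def sigma_level(errors, opportunities):
    """Convert an observed error count into a shifted sigma level."""
    dpmo = errors / opportunities * 1_000_000  # defects per million
    return NormalDist().inv_cdf(1 - dpmo / 1_000_000) + 1.5

def quality_class(sigma):
    """Assign one of the four classes; cut-offs are assumptions."""
    if sigma >= 6.0:
        return "ace"
    if sigma >= 5.0:
        return "high"
    if sigma >= 4.0:
        return "middle"
    return "low"

# e.g., 120 invalid values found among 1,000,000 checked cells
s = sigma_level(120, 1_000_000)
print(round(s, 2), quality_class(s))
```

In practice, per the point above from Lee and Shin [13], the cut-offs themselves might need to vary with data characteristics such as similarity, sample size, and class imbalance rather than being fixed constants.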

Additionally, the measurement of quality indicators varies depending on the data type (structured/unstructured). As healthcare data are multimodal, it is necessary to review the appropriateness of the criteria for quantifying quality indicators considering the data characteristics. Therefore, further studies are required to establish customized quality assessment criteria for healthcare data.

Third, a unified quality score presentation should be established to facilitate an intuitive understanding of data quality by the end user. Moges, et al. [14] analyzed the relative importance of quality indicators by surveying private companies in various countries, which revealed that the importance of quality items varies by industry. The relative importance of quality indicators for healthcare data can be analyzed to establish a unified quality score. Hence, studies on calculating a unified score through the quantification of quality assessments are needed.
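One simple way to produce such a unified score is a weighted average of per-indicator scores, with the weights reflecting surveyed relative importance, in the spirit of Moges, et al. [14]. In the sketch below, the indicator names, scores, and weights are all hypothetical.

```python
# Sketch: collapsing several indicator scores (each on a 0-1 scale) into
# one unified quality score via a weighted average. The weights would in
# practice come from a survey or importance analysis; these are made up.
weights = {"completeness": 0.40, "accuracy": 0.35, "timeliness": 0.25}
scores  = {"completeness": 0.92, "accuracy": 0.88, "timeliness": 0.75}

def unified_score(scores, weights):
    """Weighted mean of indicator scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * scores[k] for k in weights)

# Report on a 0-100 scale for intuitive reading by end users
print(round(unified_score(scores, weights) * 100, 1))
```

A weighted average is only one design choice; because industries weight indicators differently, the same indicator scores could yield different unified scores under healthcare-specific weights.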

Data quality assessment requires a multidisciplinary approach to reflect the demand for sustainable, high-quality data across various domains. Therefore, research on healthcare data quality should not be limited to specific domain knowledge areas but should be conducted jointly by experts in fields such as medicine, data science, statistics, and information technology. A creative approach to data quality necessitates effective interdisciplinary collaboration among experts from various fields [15].

In conclusion, we emphasize the necessity of high-quality data and present a direction for establishing quality indicators and systems suitable for the characteristics of healthcare data. Developing a clear and consistent definition of data quality, along with systematic methods and approaches for data quality assessment, requires more extensive research.

Shin J and Kim JY contributed to the study conceptualization, methodology, investigation, visualization, and project administration; Kim JY acquired funding and supervised the study; Shin J wrote the original draft; and Kim JY reviewed and edited the paper. Both authors have read and approved the final manuscript.

This study was conducted as part of the National Balanced Development Special Account K-Health National Medical AI Service and Industrial Ecosystem Construction Project funded by the Ministry of Science and ICT and the Korea Information and Communications Promotion Agency (grant No. H0503-24-1001).

  1. Hong L, Luo M, Wang R, Lu P, Lu W, Lu L. Big data in health care: applications and challenges. 2018;2:175-97.
  2. Nikiforova A. Definition and evaluation of data quality: user-oriented data object-driven approach to data quality assessment. Balt J Mod Comput 2020;3:391-432.
  3. Dasu T, Johnson T. Data quality: techniques and algorithms. In: Shewart WA, Wilks SS, eds. Exploratory data mining and data cleaning. New York: John Wiley & Sons, 2003:139-88.
  4. Bernardi FA, Alves D, Crepaldi N, Yamada DB, Lima VC, Rijo R. Data quality in health research: integrative literature review. J Med Internet Res 2023;25:e41446.
  5. Kim J, Kim H, Son K, Song Y, Yoon J, Lim H, et al. Medical utilization of big data. Inf Sci Manag 2014;32:18-26.
  6. WHO. Improving data quality: a guide for developing countries. 2003. https://iris.who.int/handle/10665/206974.
  7. National Information Society Agency (NIA). Big data platform and center data quality management guideline v3.1. 2024. https://aihub.or.kr/aihubnews/qlityguidance/view.do?pageIndex=1&nttSn=10269&currMenu=&topMenu=&searchCondition=&searchKeyword=.
  8. Cho EJ, Jeong TD, Kim S, Park HD, Yun YM, Chun S, et al. A new strategy for evaluating the quality of laboratory results for big data research: using external quality assessment survey data (2010-2020). Ann Lab Med 2023;43:425-33.
  9. Kim S, Cho EJ, Jeong TD, Park HD, Yun YM, Lee K, et al. Proposed model for evaluating real-world laboratory results for big data research. Ann Lab Med 2023;43:104-7.
  10. Bae SH, Lim IH. A study on 3G networked pulse measurement system using optical sensor. J Kor Inst Electron Commun Sci 2012;7:1555-60.
  11. Hinrichs H. Datenqualitätsmanagement in data warehouse-systemen. Doctoral dissertation. Universität Oldenburg. 2002.
  12. Hwang P, Lee W, Ryu K, Jung W, Shim S, Kim JY, et al. Research for data quality dimensions of medical data. 2023 Fall Academic Conference of the Korean Society of Medical Informatics; Nov 30, 2023; Gyeonggi-do, Korea. https://www.kosmi.org/bbs/download.php?bo_table=sub4_2&wr_id=86&no=4 (Updated on Feb 2024)
  13. Lee JH, Shin J. AI performance based on learning-data labeling accuracy. J Ind Converg 2024;22:177-83.
  14. Moges HT, Dejaeger K, Lemahieu W, Baesens B. A multidimensional analysis of data quality for credit risk management: new insights and challenges. Inf Manag 2012;50:43-58.
  15. Keller S, Korkmaz G, Orr M, Schroeder A, Shipp S. The evolution of data quality: understanding the transdisciplinary origins of data quality concepts and approaches. Annu Rev Stat Appl 2017;4:85-108.