Laboratory Data Quality Evaluation in the Big Data Era
Ann Lab Med 2023; 43(5): 425-433
Published online April 21, 2023 https://doi.org/10.3343/alm.2023.43.5.425
Copyright © Korean Society for Laboratory Medicine.
Eun-Jung Cho, M.D., Ph.D.1, Tae-Dong Jeong, M.D., Ph.D.2, Sollip Kim, M.D., Ph.D.3, Hyung-Doo Park, M.D., Ph.D.4, Yeo-Min Yun, M.D., Ph.D.5, Sail Chun, M.D., Ph.D.3, and Won-Ki Min, M.D., Ph.D.3
1Department of Laboratory Medicine, Hallym University Dongtan Sacred Heart Hospital, Hallym University College of Medicine, Hwaseong, Korea; 2Department of Laboratory Medicine, Ewha Womans University College of Medicine, Seoul, Korea; 3Department of Laboratory Medicine, University of Ulsan College of Medicine and Asan Medical Center, Seoul, Korea; 4Department of Laboratory Medicine and Genetics, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Korea; 5Department of Laboratory Medicine, Konkuk University School of Medicine, Konkuk University Medical Center, Seoul, Korea
Correspondence to: Won-Ki Min, M.D., Ph.D.
Department of Laboratory Medicine, University of Ulsan College of Medicine and Asan Medical Center, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Korea
Tel: +82-2-3010-4503
Fax: +82-2-478-0884
E-mail: wkmin@amc.seoul.kr
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: To ensure valid results of big data research in the medical field, the input laboratory results need to be of high quality. We aimed to establish a strategy for evaluating the quality of laboratory results suitable for big data research.
Methods: We used Korean Association of External Quality Assessment Service (KEQAS) data to retrospectively review multicenter data. Seven measurands were analyzed using commutable materials: HbA1c, creatinine (Cr), total cholesterol (TC), triglyceride (TG), alpha-fetoprotein (AFP), prostate-specific antigen (PSA), and cardiac troponin I (cTnI). These were classified into three groups based on their standardization or harmonization status. HbA1c, Cr, TC, TG, and AFP were analyzed with respect to peer group values. PSA and cTnI were analyzed in separate peer groups according to the calibrator type and manufacturer, respectively. The acceptance rate and absolute percentage bias at the medical decision level were calculated based on biological variation criteria.
Results: The acceptance rate (22.5%–100%) varied greatly among the test items, and the mean percentage biases were 0.6%–5.6%, 1.0%–9.6%, and 1.6%–11.3% for all items that satisfied optimum, desirable, and minimum criteria, respectively.
Conclusions: The acceptance rate of participants and their external quality assessment (EQA) results exhibited statistically significant differences according to the quality grade for each criterion. Even when they passed the EQA standards, the test results did not guarantee the quality requirements for big data. We suggest that the KEQAS classification can serve as a guide for building big data.
Keywords: Bias, Big data, Biological variation, Data quality, External quality assessment
Research focus on big data in the healthcare systems field has been increasing because the large amounts of data generated in healthcare systems can potentially contribute to population health management and personalized medicine. The growth of healthcare data is associated with the increase in their digital availability [1]. The sources of big data in healthcare include electronic health records, clinical data (medical imaging and laboratory examination), pharmaceutical data, public records, genomic databases, and measurements made by medical devices [2]. Numerous big data projects in the healthcare systems field have focused on clinical decision support, personalized medicine, population health management, cost reduction, and improvement in the quality of healthcare [3].
Big data in healthcare exhibit distinct features, such as heterogeneity, incompleteness, privacy, and data ownership, in addition to the commonly cited “5 V” (volume, velocity, variety, veracity, and value) [3]. The accuracy, completeness, and consistency of such data are crucial to ensure the quality of the output results [4]. Input data of poor quality can lead to poor decision-making and unreliable results.
The quality of laboratory data is particularly important when large amounts of multicenter data are analyzed retrospectively. According to the U.S. Centers for Disease Control and Prevention, 70% of current medical decisions rely on laboratory test results, which underlines the important role of clinical laboratories in the current healthcare system [5, 6]. As most test results in diagnostic laboratory medicine are quantitative, the equivalence of test results among laboratories is pursued through standardization and harmonization. Despite these efforts, a large bias can remain when the same sample is tested in different laboratories. If biased test results are included in multicenter big data, the conclusions of research based on those data become unreliable. Thus, it is essential in big data research to assess the quality or accuracy of laboratory data using external quality assessment (EQA) results.
EQA surveys evaluate the quality of test items in a laboratory. Because EQA surveys apply only minimum quality criteria, EQA results need to be evaluated against stricter criteria before laboratory data are used as big data [7]. We aimed to establish a strategy for evaluating the quality of laboratory results suitable for big data research using Korean Association of External Quality Assessment Service (KEQAS) data as a surrogate for real laboratory data. The acceptance rate of participants and their EQA results were compared considering their quality grade based on the biological variation (BV) or outcome-based criteria for the total error.
This retrospective study was conducted using multicenter EQA results from clinical laboratories. We retrieved KEQAS data of commutable fresh-frozen serum samples from 2010 to 2020 and analyzed more than 30,000 EQA results for seven test items. We categorized the data into three groups depending on whether the measurement procedures had been standardized or harmonized (Fig. 1).
The first group comprised laboratory tests for HbA1c, creatinine (Cr), total cholesterol (TC), and triglyceride (TG) fully standardized in accuracy-based EQAs [8–11]. The target values of these tests were measured using reference measurement procedures in certified reference laboratories [12, 13]. According to the International Consortium for Harmonization of Clinical Laboratory Results, the tests in the second and third groups had maintained their harmonization status or were undergoing harmonization [14]. The second group comprised tests for which relevant international standards exist, including tests for alpha-fetoprotein (AFP) and prostate-specific antigen (PSA). AFP tests were calibrated against the WHO 72/225 International Standard (IS). The PSA tests were calibrated using the WHO 96/670 IS or the Hybritech standard (Beckman Coulter Inc., Brea, CA, USA) [14, 15]. The target values for this group were determined by calculating the mean in accordance with their standards. The third group comprised tests for which harmonization was still ongoing because of the lack of traceable calibrators or the use of various antibodies, such as the cardiac troponin I (cTnI) test. We analyzed the results of major instrument platforms for cTnI that were used by more than 10 EQA survey participants. These platforms included Abbott (Abbott Diagnostics, Abbott Park, IL, USA), Beckman Coulter Inc., LSI Medience (LSI Medience, Chiba, Japan), Radiometer (Radiometer Medical ApS, Brønshøj, Denmark), Roche (Roche Diagnostics, Mannheim, Germany), and Siemens (Siemens Healthineers, Erlangen, Germany). Because these cTnI tests use different calibrators and epitopes, the average value for each manufacturer was considered the target value. The manufacturer names are denoted as letters from A to F.
We selected EQA samples with concentrations close to medical decision levels according to corresponding clinical guidelines and subsequently calculated the absolute percentage bias for each test item (Supplemental Data Table S1) [17–23]. We analyzed 10 samples for HbA1c, Cr, and TC; five for TG and PSA; four for AFP; and eight for cTnI. The analytical performance specifications (APSs) were the optimum, desirable, and minimum goal of total error (TE) based on the BV of the measurand using the latest European Federation of Clinical Chemistry and Laboratory Medicine data [24]. The acceptance criteria are summarized in Supplemental Data Table S2. In addition to the BV, an outcome-based criterion for TE (6.7%) was used for HbA1c [25]. Results that did not meet the minimum criteria (outcome-based criterion for HbA1c) were considered unacceptable.
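As a rough illustration of how the three BV-based tiers relate to one another, the sketch below grades an absolute percentage bias against optimum, desirable, and minimum TE limits. It assumes the conventional scaling in which the optimum and minimum limits are 0.5 and 1.5 times the desirable limit; this is an assumption, although it is consistent with limits quoted in the Discussion (for example, 8.7% desirable and 13.0% minimum for TC). The function names are hypothetical, the additional outcome-based tier for HbA1c is not modeled, and the limits actually used in the study are those in Supplemental Data Table S2.

```python
def absolute_percentage_bias(result: float, target: float) -> float:
    """Absolute percentage bias of one EQA result against its target value."""
    if target == 0:
        raise ValueError("target must be non-zero")
    return abs(result - target) / abs(target) * 100.0


def grade_result(abs_bias_pct: float, desirable_te_pct: float) -> str:
    """Return the strictest BV-based tier that the bias satisfies.

    Assumption: optimum = 0.5 x desirable and minimum = 1.5 x desirable.
    """
    optimum = 0.5 * desirable_te_pct
    minimum = 1.5 * desirable_te_pct
    if abs_bias_pct <= optimum:
        return "optimum"
    if abs_bias_pct <= desirable_te_pct:
        return "desirable"
    if abs_bias_pct <= minimum:
        return "minimum"
    return "unacceptable"


# Illustrative values only: a TC result of 205 mg/dL against a 200 mg/dL target,
# graded with the 8.7% desirable limit quoted for TC in this article.
bias = absolute_percentage_bias(205.0, 200.0)
tier = grade_result(bias, 8.7)
```

In this invented case the bias is 2.5%, which falls within the assumed optimum limit of 4.35%.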
Finally, our analysis focused on EQA results that met the defined KEQAS performance criteria. For HbA1c, Cr, TC, and TG, the acceptable bias limits were ±6.7%, ±11.4%, ±9%, and ±15%, respectively [25–27]. The AFP, PSA, and cTnI acceptance criteria were established within ±3 SD indices.
Microsoft Office Excel 2021 (Microsoft Co., Redmond, WA, USA) and MedCalc version 19.2.6 for Windows (MedCalc Software, Ostend, Belgium) were used for statistical analysis. The mean percentage bias for the BV criteria was compared between groups using one-way ANOVA followed by the Student–Newman–Keuls test or the Kruskal–Wallis test followed by Dunn’s post-hoc test. Statistical significance was defined as
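The two kinds of KEQAS limits described above can be sketched as follows. The percentage limits are those stated in this section. The SD index is computed here as (result − peer-group mean) / peer-group SD, which is the usual definition but is an assumption about the KEQAS calculation, as is treating the limits as inclusive. All function and variable names are hypothetical.

```python
# KEQAS acceptable bias limits (%) quoted in the text for the standardized tests.
KEQAS_BIAS_LIMIT_PCT = {"HbA1c": 6.7, "Cr": 11.4, "TC": 9.0, "TG": 15.0}
SDI_LIMIT = 3.0  # AFP, PSA, and cTnI: within +/-3 SD indices


def passes_bias_limit(item: str, result: float, target: float) -> bool:
    """Check a result against the percentage bias limit for its test item."""
    bias_pct = (result - target) / target * 100.0
    return abs(bias_pct) <= KEQAS_BIAS_LIMIT_PCT[item]


def passes_sdi_limit(result: float, peer_mean: float, peer_sd: float) -> bool:
    """Check a result against the SD index limit (assumed definition)."""
    sdi = (result - peer_mean) / peer_sd
    return abs(sdi) <= SDI_LIMIT


# Illustrative values only, not KEQAS data.
cr_ok = passes_bias_limit("Cr", 1.10, 1.00)      # 10% bias, inside 11.4%
hba1c_ok = passes_bias_limit("HbA1c", 7.0, 6.5)  # about 7.7% bias, outside 6.7%
psa_ok = passes_sdi_limit(4.6, 4.0, 0.3)         # SD index of about 2
```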
Fig. 2 shows the acceptance rates and concentrations expressed in National Glycohemoglobin Standardization Program (NGSP) units of 10 HbA1c samples. The concentrations ranged from 5.8% to 7.1%. Conversion between NGSP (%) and International Federation of Clinical Chemistry (IFCC) (mmol/mol) units requires a linear equation: IFCC unit=10.93×NGSP unit–23.5. The mean acceptance rates were 95.2%, 67.5%, 42.9%, and 22.9% within the outcome-based, minimum, desirable, and optimum criteria, respectively.
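The unit conversion above can be written as a small helper. This is a minimal sketch using only the linear equation given in the text; the function names are hypothetical, and the inverse function is simply that equation solved for the NGSP value.

```python
def ngsp_to_ifcc(ngsp_percent: float) -> float:
    """Convert HbA1c from NGSP units (%) to IFCC units (mmol/mol)."""
    return 10.93 * ngsp_percent - 23.5


def ifcc_to_ngsp(ifcc_mmol_mol: float) -> float:
    """Inverse of the equation above, obtained by solving for the NGSP value."""
    return (ifcc_mmol_mol + 23.5) / 10.93


# The lowest and highest sample concentrations reported in this section.
low = ngsp_to_ifcc(5.8)   # about 39.9 mmol/mol
high = ngsp_to_ifcc(7.1)  # about 54.1 mmol/mol
```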
The mean acceptance rates for the first group for various APSs are presented in Fig. 3A. For Cr, the average acceptance rates for 10 samples with concentrations ranging from 0.66 to 1.40 mg/dL were 70.9%, 56.3%, and 34.4% within the minimum, desirable, and optimum criteria, respectively. The Cr concentration can be converted from mg/dL to the SI unit (µmol/L) by multiplying the value by 88.42. The average acceptance rates for 10 TC samples with concentrations ranging from 197.2 to 246.4 mg/dL were 100.0%, 99.1%, and 86.0% within the minimum, desirable, and optimum criteria, respectively. TC and TG concentrations can be converted from mg/dL to the SI unit (mmol/L) by multiplying the values by 0.0259 and 0.0113, respectively. The TG data were divided into two groups based on whether or not the test method included free glycerol blanking, and the concentrations of the five samples ranged from 93.3 to 205.0 mg/dL. Within the minimum, desirable, and optimum criteria, the average acceptance rates were 100.0%, 100.0%, and 99.5%, respectively, for the non-free glycerol-blanking method, 100.0%, 100.0%, and 99.7%, respectively, for the free glycerol blanking method, and 100.0%, 100.0%, and 99.6%, respectively, for both methods combined.
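For readers who work in SI units, the conversion factors quoted above can be applied as in the following sketch. The factors are those given in the text; the lipid factors yield mmol/L and the Cr factor yields µmol/L. The dictionary and function names are hypothetical.

```python
from typing import Tuple

# Conversion factors quoted in the text (mg/dL -> SI unit).
MGDL_TO_SI = {
    "Cr": (88.42, "umol/L"),
    "TC": (0.0259, "mmol/L"),
    "TG": (0.0113, "mmol/L"),
}


def to_si(item: str, value_mg_dl: float) -> Tuple[float, str]:
    """Convert a concentration in mg/dL to SI units for the given test item."""
    factor, unit = MGDL_TO_SI[item]
    return value_mg_dl * factor, unit


# Sample concentrations taken from the ranges reported in this section.
cr_si, cr_unit = to_si("Cr", 1.40)   # about 123.8 umol/L
tc_si, tc_unit = to_si("TC", 197.2)  # about 5.11 mmol/L
tg_si, tg_unit = to_si("TG", 205.0)  # about 2.32 mmol/L
```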
The AFP data from the 2019–2020 survey demonstrated that the four samples had concentrations ranging from 11.6 to 87.8 ng/mL, which were close to the clinical threshold. Based on the minimum, desirable, and optimum criteria, the average acceptance rates were 99.8%, 99.6%, and 94.7%, respectively.
The PSA data were divided into two groups according to the calibrator. According to the WHO 96/670 IS, the average acceptance rates for five samples with concentrations ranging from 4.011 to 12.989 ng/mL were 99.1%, 92.4%, and 60% within the minimum, desirable, and optimum criteria, respectively. According to the Hybritech standard, the average acceptance rates for five samples with concentrations ranging from 4.176 to 14.329 ng/mL were 98.4%, 89.9%, and 50.3% within the minimum, desirable, and optimum criteria, respectively (Fig. 3B).
We analyzed the survey data for cTnI using samples with values close to the concentrations at which the CV was 20% in the manufacturers’ package inserts. The concentrations of the eight samples ranged from 0.106 to 2.006 ng/mL. The mean acceptance rates for the six manufacturers are shown in Fig. 3C. The mean acceptance rates within the minimum criteria for TE were >95.0% for all manufacturers, except one (F; 89.1%). For manufacturers A, B, and D, the mean acceptance rates were all >95.0% within the desirable bounds. The mean acceptance rates within the optimum criteria were 92.0%, 82.5%, 72.9%, 91.4%, 77.4%, and 50.4% for manufacturers A–F, respectively.
Figs. 4 and 5 and Supplemental Data Fig. S1 show box and whisker plots of the mean percentage bias for all analyte items according to the performance criteria. The mean percentage bias for the BV criteria showed significant differences between the groups (
Mean percentage bias (95% CI) according to analytical performance criteria
Test item | Optimum | Desirable | Minimum | Outcome-based | Unacceptable
---|---|---|---|---|---
HbA1c | 0.6 (0.6–0.6) | 1.0 (1.0–1.0) | 1.6 (1.6–1.7) | 2.6 (2.5–2.6) | 8.8 (8.6–9.1)
Cr | 1.8 (1.7–1.9) | 3.1 (3.0–3.3) | 4.3 (4.1–4.5) | | 20.0 (19.1–20.9)
TC | 1.7 (1.6–1.7) | 2.1 (2.0–2.2) | 2.1 (2.1–2.2) | |
TG | 2.9 (2.8–3.1) | 3.0 (2.8–3.1) | | |
AFP | 5.6 (5.4–5.7) | 6.4 (6.2–6.6) | 6.5 (6.2–6.7) | | 79.6 (49.3–109.9)
PSA-Hybritech calibrator | 3.9 (3.0–4.7) | 7.4 (6.7–8.0) | 8.4 (7.8–9.0) | | 28.8 (23.8–33.8)
PSA-WHO calibrator | 3.7 (3.5–4.0) | 6.4 (6.3–6.6) | 7.3 (7.1–7.5) | | 28.3 (26.5–30.1)
Abbreviations: Cr, creatinine; CI, confidence interval; TC, total cholesterol; TG, triglyceride; AFP, alpha-fetoprotein; PSA, prostate-specific antigen.
The mean percentage bias did not significantly differ between the two calibrator types based on the APS groups for PSA (
Big data research using unreliable laboratory results can result in poor medical decisions, improper risk stratification, inappropriate management, and increased costs for the patient [28, 29]. We investigated the eligibility of test results that met the EQA criteria to be included in big data based on the BV or outcome-based criteria.
We selected seven test items that were measured using commutable frozen human serum pools in the KEQAS program. Depending on the test item, the acceptance rates for EQA results were 67.5%–100%, 42.9%–100%, and 22.9%–99.5% within the minimum, desirable, and optimum criteria, respectively. Among the seven test items, HbA1c and Cr showed low acceptance rates. Based on the minimum criteria, the mean acceptance rates for HbA1c and Cr were 67.5% and 70.9%, respectively, which we attribute to the minimum criteria for HbA1c (3.3%) and Cr (11.1%) being lower than the KEQAS acceptable bias criteria for HbA1c (6.7%) and Cr (11.4%). In this study, the acceptance rate was calculated from all participants, including those with unacceptable KEQAS results. For Cr, few participants showed mean percentage biases between 11.1% and 11.4%.
The minimum criterion for HbA1c was 3.3%, which is 15.8 times more stringent than that of AFP (52.2%); therefore, it had the lowest acceptance rate among the seven test items. The minimum criterion of Cr was 11.1%, which is approximately three times that of HbA1c, while the acceptance rate was comparable to that of HbA1c, which can be attributed to the analytical interference in routine Cr methods. Because the minimum criterion for Cr is high, it is necessary to determine through discussions with data scientists whether any of the criteria based on the BV can be applied as the big data criterion for Cr. Unlike for Cr, it may be possible to use the minimum criterion for HbA1c in big data, unless HbA1c big data require very high accuracy.
The BV criteria of TC were similar to those of Cr, but its acceptance rates were 1.4, 1.8, and 2.5 times higher than those of Cr for the minimum, desirable, and optimum criteria, respectively. Unlike that for Cr, the KEQAS acceptable bias criterion for TC was 9%, which was higher than the desirable criterion (8.7%) and lower than the minimum criterion (13.0%). The difference in medical decision levels between Cr (0.7–1.0 mg/dL) and TC (200–240 mg/dL) was another factor contributing to the higher acceptance rate of TC. Even when the absolute difference was small, a low Cr value was more likely to cause a large relative difference (%bias). For instance, a deviation of 0.05 mg/dL at a Cr value of 0.7 mg/dL corresponds to a bias of approximately 7%. Fully automated enzymatic methods, less interference, and standardization of measurement procedures for cholesterol quantification were additional contributing factors. The mean percentage biases for TC were <3% based on the minimum, desirable, and optimum criteria. Given the high acceptance rates and low mean percentage biases for TC based on all criteria, any criterion can be applied according to the needs in terms of accuracy and size of the TC big data.
TG showed a substantially wider BV criterion than other lipids owing to its high intraindividual BV, which is approximately three times that of TC [30]. The optimum criterion for TG was 13.5%, which is close to the KEQAS acceptable bias criterion for TG (15%). Few EQA results for TG were outside the optimum criterion, regardless of free glycerol blanking. Therefore, for TG, it is crucial to set a new criterion other than the BV-based criterion for use in big data.
The average acceptance rate of AFP was approximately 95.0% based on the optimum criterion (17.4%). The acceptance rate was high even though the AFP target values were derived from all methods used on the different platforms. The mean percentage bias for AFP (approximately 6%) did not significantly differ among the three BV groups. The average acceptance rates of the two PSA calibrator types were approximately 60.0% based on the optimum criterion, which was lower than that of AFP. The difference in the BV criteria was one cause of this discrepancy; the optimum criterion of PSA was 8.1%, which is less than half of that of AFP (17.4%). Accordingly, the mean percentage bias of PSA was <4.0% and that of AFP was 5.6%. The WHO calibrator yields PSA values that are 2%–14% lower than those obtained with the Hybritech calibrator [31, 32]. According to the optimum, desirable, and minimum criteria, there were 3.0%, 14.3%, and 14.7% differences, respectively, in mean percentage bias between the two calibrator types. Therefore, it is essential to construct big data according to the type of calibrator used in the PSA test. Recently, the APSs derived from state-of-the-art tests were shown to be the most suitable because of the lack of high-quality BV data for tumor markers [33]. The average acceptance rate for PSA was approximately 90.0% when applying the 15% criterion recommended in earlier studies [33, 34]. Further research is needed to decide the criteria to be set for AFP or PSA big data.
The acceptance rates for cTnI were 89.1%–99.3%, 81.7%–99.4%, and 50.4%–92.0% for the minimum, desirable, and optimum criteria, respectively. The results varied among manufacturers owing to differences in calibration and antibody specificity [35]; therefore, the bias was calculated using the mean value of each instrument peer group. Even so, the acceptance rate varied significantly among peer groups; in particular, the acceptance rate according to the optimum criterion ranged from 50.4% to 92.0%. If the overall mean of the six cTnI tests was used as the target value and all EQA results were simultaneously analyzed, the acceptance rates based on the BV criteria were 44.3%, 33.0%, and 17.2% for the minimum, desirable, and optimum criteria, respectively. For the cTnI test, acceptance rates should be determined separately for the different manufacturers. The mean percentage biases for cTnI among six platforms were 3.9%–5.5%, 4.8%–9.6%, and 4.8%–11.3% for the optimum, desirable, and minimum criteria, respectively. The mean percentage bias was considerable for the desirable and minimum criteria but relatively small for the optimum criterion. The results in the unacceptable group showed a mean percentage bias of 40.9%–99.0% among the six platforms. For big data construction, one must consider the platform used for the cTnI test to improve clinical outcomes in patients with various cardiovascular conditions. Further research may be needed to decide an outcome-based criterion for cTnI big data.
There have been numerous studies on standardizing terms, result formats, statistical techniques, and data categorization or mapping tools to improve the quality of big data [23, 36–38]. However, big data researchers who are not clinical pathologists may wrongly assume that all quantified data can be aggregated without any quality-assurance checks [39, 40]. We used EQA results as a surrogate for real laboratory data, and we compared and analyzed participants’ EQA results considering their quality grade based on the TE, which revealed statistically significant differences. Even test results that passed the EQA did not guarantee the quality required for inclusion in big data. Therefore, in big data research, it is essential for laboratory medicine experts to ensure that the data meet quality standards; in particular, the reliability of test results should be considered [12]. Big data should be classified according to the state of harmonization or standardization; however, no study has been conducted on this. EQAs evaluate test results using categorization based on the standardization or harmonization status. This classification can guide the construction of big data for each test item.
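The sensitivity of cTnI bias to the choice of target value can be illustrated with a small sketch. The numbers below are invented for illustration and are not KEQAS data; the platform labels and helper names are likewise hypothetical. Each result is scored once against its own peer-group mean and once against the overall mean of all results.

```python
from statistics import mean

# Invented results (ng/mL) for one sample measured on three hypothetical platforms.
results = {
    "A": [0.50, 0.52, 0.48],
    "B": [0.80, 0.78, 0.82],
    "C": [0.30, 0.31, 0.29],
}


def abs_pct_bias(value: float, target: float) -> float:
    """Absolute percentage bias of a value against a target."""
    return abs(value - target) / target * 100.0


# Single target pooled across all platforms.
overall_target = mean(v for values in results.values() for v in values)

peer_bias = {}     # mean bias against each platform's own mean
overall_bias = {}  # mean bias against the pooled mean
for platform, values in results.items():
    peer_target = mean(values)
    peer_bias[platform] = mean(abs_pct_bias(v, peer_target) for v in values)
    overall_bias[platform] = mean(abs_pct_bias(v, overall_target) for v in values)
```

With these invented values, every platform shows a larger mean bias against the pooled target than against its own peer-group mean, which mirrors the drop in acceptance rates described above when the overall mean is used as the target.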
One potential study limitation is that we only used BV based on the test items as the acceptance criteria. Because standards or guidelines for QC of laboratory data are lacking, further research is needed to establish criteria and evaluate the data quality according to test items, test characteristics, and the purpose and amount of big data. According to Kim,
None.
Cho EJ wrote the manuscript and produced the tables and figures; Jeong TD, Kim S, and Min WK revised the manuscript; Park HD, Yun YM, and Chun S conceived and designed the study; Min WK supervised the study. All authors reviewed and approved the final version of the manuscript.
None declared.
None declared.