Improving data quality in large-scale repositories through conflict resolution

Date

2021-10

Publisher

Springer

Abstract

Digital repositories rely on technical metadata to manage their objects. The output of characterization tools is aggregated and analyzed through content profiling. The accuracy and correctness of characterization tools vary; they frequently produce contradictory outputs, resulting in metadata conflicts. These conflicts limit scalable preservation risk assessment and repository management. This article presents and evaluates a rule-based approach to improving data quality in this scenario through expert-conducted conflict resolution. We characterize the data quality challenges and present a method for developing conflict resolution rules to improve data quality. We evaluate the method and the resulting data quality improvements in an experiment on a publicly available document collection. The results demonstrate that our approach enables the effective resolution of conflicts by producing rules that reduce the share of conflicts in the data set from 17% to 3%. This replicable method presents a significant improvement in content profiling technology for digital repositories, since the enhanced data quality can improve risk assessment and preservation management in digital repository systems.

Citation

Kulmukhametov, A., Rauber, A. & Becker, C. Improving data quality in large-scale repositories through conflict resolution. Int J Digit Libr (2021). https://doi.org/10.1007/s00799-021-00311-0

DOI

10.1007/s00799-021-00311-0

ISSN

1432-5012

Creative Commons

Attribution 4.0 International
