
MetricLib: A Modular and Extensible Toolkit for Evaluation of Medical ML Datasets

Martin Seyferth
Katinka Becker
Tobias Schaeffter
Daniel Schwabe
Matthias Boehm

March 24, 2026

Machine Learning (ML) applications are increasingly deployed in the healthcare domain, raising the need for so-called trustworthy AI. One fundamental pillar of trustworthy AI is systematic data quality assessment. However, existing ML data quality (DQ) evaluation tools are typically limited to tabular data, lack extensibility, and assess only a fraction of data quality aspects. Moreover, many existing tools are unaware of the specific requirements of the given ML task, which prevents them from offering a comprehensive DQ evaluation. To address these issues, we present MetricLib: an extensible toolkit for holistic data quality evaluation of medical ML datasets, based on the theoretical METRIC-framework for trustworthy AI in medicine. The toolkit is able to process a range of data modalities in a memory-efficient manner. While a core set of DQ metrics is implemented, MetricLib is easily extensible with custom metrics and therefore allows the investigation of use-case-specific requirements. Additionally, through aggregated DQ scores, the tool enables the efficient identification of data quality gaps. To display all results through a graphical user interface, MetricLib is complemented by MetricLibUI. The UI enables targeted, fit-for-purpose data quality analysis based on quantitative and qualitative information.
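The extensibility and score-aggregation ideas from the abstract can be illustrated with a minimal sketch. The class and function names below (Metric, evaluate, register_metric, aggregate_score) are illustrative assumptions for this sketch, not MetricLib's actual API.

```python
# Hypothetical sketch of a plugin-style DQ metric interface with score
# aggregation; names are assumptions, not MetricLib's real API.
from abc import ABC, abstractmethod

class Metric(ABC):
    """Base class a custom DQ metric would subclass."""
    name: str

    @abstractmethod
    def evaluate(self, samples):
        """Return a score in [0, 1]; higher means better quality."""

class CompletenessMetric(Metric):
    """Fraction of records with no missing (None) fields."""
    name = "completeness"

    def evaluate(self, samples):
        if not samples:
            return 0.0
        complete = sum(
            1 for s in samples if all(v is not None for v in s.values())
        )
        return complete / len(samples)

REGISTRY = {}

def register_metric(metric):
    """Custom metrics plug in by registering under their name."""
    REGISTRY[metric.name] = metric

def aggregate_score(samples):
    """Unweighted mean over all registered metrics: one aggregated DQ score."""
    scores = {name: m.evaluate(samples) for name, m in REGISTRY.items()}
    return sum(scores.values()) / len(scores), scores

register_metric(CompletenessMetric())
data = [{"age": 64, "bp": 120}, {"age": None, "bp": 118}]
overall, per_metric = aggregate_score(data)
print(round(per_metric["completeness"], 2))  # -> 0.5
```

A registry keyed by metric name keeps the core toolkit decoupled from individual metrics, which is one common way such extensible evaluation frameworks are organized.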