What is “Good” Training Data? - Data Quality Dimensions that Matter for Machine Learning

Felix Neutatz
Ziawasch Abedjan

December 01, 2022

Artificial intelligence (AI) today relies heavily on training data, which captures important facets of known reality in order to produce predictions for unknown events. Such AI systems are applied in a wide range of use cases, from the personalization of entertainment to medical diagnosis. The subjects affected by these applications are invariably humans. Hence, it is important to reason about the reliability of such systems and to assess the harm they can inflict on individuals and society. One known danger of data-driven AI is its potential to amplify societal bias in its predictions, which is disproportionately harmful to demographic minorities. In this paper, we give an overview of the types of bias in data and the ways they can be introduced into AI systems. We then present existing technologies that focus on identifying and reducing such bias. We show that treating bias in AI applications requires domain-specific awareness of the types of bias and holistic treatment of the underlying machine learning pipeline.