
DataFarm: Farm Your ML-based Query Optimizer's Food! - Human-Guided Training Data Generation -

Robin Van De Water
Francesco Ventura
Zoi Kaoudi
Jorge-Arnulfo Quiané-Ruiz
Volker Markl

March 09, 2022

Machine learning (ML) is becoming a core component of query optimizers, e.g., to estimate costs or cardinalities. This means that large, heterogeneous sets of labeled query plans or jobs (i.e., plans with their runtime or cardinality output) are needed. However, collecting such a training dataset is a tedious and time-consuming task: it requires both developing numerous jobs and executing them to acquire ground-truth labels. To overcome these issues, we demonstrate DataFarm, a novel framework for efficiently generating and labeling training data for ML-based query optimizers. DataFarm enables generating training data tailored to users' needs by learning from their existing workload patterns, input data, and computational resources. It uses an active learning approach to determine a subset of jobs to be executed and keeps a human in the loop, resulting in higher-quality data. The graphical user interface of DataFarm gives users informative details about the generated jobs and guides them through the generation process step by step. We show how users can intervene and provide feedback to the system in an iterative fashion. As output, users can download both the generated jobs, to use as a benchmark, and the training data (jobs with their labels).
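The abstract does not say how the active learning step chooses which jobs to run, so the following is only a generic illustration of the idea, not DataFarm's actual method. It assumes a hypothetical ensemble of runtime models whose disagreement on a job serves as an uncertainty score. The jobs with the highest scores are executed to obtain ground-truth labels, and the rest are left unexecuted. All names (`select_jobs_to_execute`, `active_learning_round`, `execute_job`) are invented for this sketch.

```python
import statistics


def select_jobs_to_execute(predictions, budget):
    """Pick the `budget` jobs whose runtime estimates disagree the most.

    `predictions` maps a job id to a list of runtime estimates, one per
    model in a (hypothetical) ensemble. Disagreement is measured as the
    population standard deviation of those estimates.
    """
    uncertainty = {
        job: statistics.pstdev(estimates)
        for job, estimates in predictions.items()
    }
    # Highest uncertainty first; ties are broken by job id for determinism.
    ranked = sorted(uncertainty, key=lambda job: (-uncertainty[job], job))
    return ranked[:budget]


def active_learning_round(predictions, budget, execute_job, labeled):
    """Run one round: select uncertain jobs, execute them, store the labels.

    `execute_job` is an assumed callback that runs a job and returns its
    measured label (e.g. runtime). `labeled` is updated in place.
    """
    chosen = select_jobs_to_execute(predictions, budget)
    for job in chosen:
        labeled[job] = execute_job(job)
    return chosen
```

In a human-in-the-loop setting, the list returned by `select_jobs_to_execute` could be shown to the user for review before any job is executed. That review step is omitted here to keep the sketch short.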