
May 12, 2022

Dr. Zoi Kaoudi

A framework to efficiently create training data for optimizers

A demo paper co-authored by a group of BIFOLD researchers, “Farming Your ML-based Query Optimizer’s Food”, presented at the virtual ICDE 2022 conference has won the Best Demo Award. The award committee unanimously chose this demonstration based on the relevance of the problem, the high potential of the proposed approach, and the excellent presentation.


As machine learning is becoming a core component in query optimizers, e.g., to estimate costs or cardinalities, it is critical to collect large amounts of labeled training data to build these machine learning models. The training data should consist of diverse query plans together with their labels (execution time or cardinality). However, collecting such a training dataset is a very tedious and time-consuming task: it requires both developing numerous plans and executing them to acquire ground-truth labels. The latter can take days, if not months, depending on the size of the data.
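To make the labeling cost concrete, here is a minimal sketch (our own illustration, not DataFarm code) of what a single labeled training instance for a learned cost model could look like; the field names and values are hypothetical:

```python
# A minimal sketch (not DataFarm's actual code): a query plan is encoded as a
# feature vector and paired with its measured runtime as the label.
from dataclasses import dataclass

@dataclass
class LabeledPlan:
    plan_id: str            # identifier of the generated plan (hypothetical)
    features: dict          # hypothetical plan features, e.g. operator counts
    runtime_seconds: float  # ground-truth label, obtained only by executing the plan

# Producing the label requires actually running the plan, which is the
# expensive part that motivates generating and labeling data efficiently.
example = LabeledPlan(
    plan_id="q1-variant-3",
    features={"num_joins": 2, "num_filters": 1, "input_rows": 1_000_000},
    runtime_seconds=42.7,
)
print(example)
```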


In a research paper presented last year at SIGMOD 2021, the authors introduced DataFarm, a framework for efficiently creating training data for optimizers with learning-based components. The demo paper extends DataFarm with an intuitive graphical user interface that gives users informative details about the generated plans and guides them through the generation process step by step. As output of DataFarm, users can download both the generated plans, to use as a benchmark, and the training data (jobs with their labels).

The publication in detail:

Robin van de Water, Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Farming Your ML-based Query Optimizer’s Food. ICDE 2022 (to appear)

Machine learning (ML) is becoming a core component in query optimizers, e.g., to estimate costs or cardinalities. This means large heterogeneous sets of labeled query plans or jobs (i.e., plans with their runtime or cardinality output) are needed. However, collecting such a training dataset is a very tedious and time-consuming task: it requires both developing numerous jobs and executing them to acquire ground-truth labels. We demonstrate DATAFARM, a novel framework for efficiently generating and labeling training data for ML-based query optimizers to overcome these issues. DATAFARM enables generating training data tailored to users’ needs by learning from their existing workload patterns, input data, and computational resources. It uses an active learning approach to determine a subset of jobs to be executed and keeps the human in the loop, resulting in higher-quality data. The graphical user interface of DATAFARM allows users to get informative details of the generated jobs and guides them through the generation process step-by-step. We show how users can intervene and provide feedback to the system in an iterative fashion. As an output, users can download both the generated jobs to use as a benchmark and the training data (jobs with their labels).
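
The active-learning step described in the abstract can be illustrated with a small, self-contained sketch. This is not DATAFARM’s implementation; it assumes a random-forest cost model whose per-tree disagreement serves as the uncertainty signal, with synthetic feature vectors standing in for real jobs:

```python
# Sketch of active learning for labeling jobs (our illustration, not DATAFARM):
# iteratively train a cost model, estimate its uncertainty on not-yet-executed
# jobs, and only execute (label) the jobs the model is most unsure about.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_pool = rng.random((200, 5))                       # hypothetical job feature vectors
true_runtime = X_pool @ [3.0, 1.0, 0.5, 2.0, 0.1]   # stand-in for real execution

labeled_idx = list(range(10))                       # start with a few executed jobs
unlabeled_idx = list(range(10, 200))

for iteration in range(5):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_pool[labeled_idx], true_runtime[labeled_idx])

    # Uncertainty estimate: spread of the individual trees' predictions.
    per_tree = np.stack([t.predict(X_pool[unlabeled_idx]) for t in model.estimators_])
    uncertainty = per_tree.std(axis=0)

    # "Execute" only the k most uncertain jobs to obtain their labels.
    k = 10
    chosen = np.argsort(uncertainty)[-k:]
    newly_labeled = [unlabeled_idx[i] for i in chosen]
    labeled_idx += newly_labeled
    unlabeled_idx = [i for i in unlabeled_idx if i not in newly_labeled]
    print(f"iteration {iteration}: {len(labeled_idx)} jobs labeled")
```

The point of such a loop is that only the jobs the model is most uncertain about are ever executed, which is where the savings in labeling time come from; a human can additionally inspect and adjust the selected jobs in each iteration, as the demo shows.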