
DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

Patrick Damme
Marius Birkenbach
Constantinos Bitsakos
Matthias Boehm
Philippe Bonnet
Florina Ciorba
Mark Dokter
Pawel Dowgiallo
Ahmed Eleliemy
Christian Faerber
Georgios Goumas
Dirk Habich
Niclas Hedam
Marlies Hofer
Wenjun Huang
Kevin Innerebner
Vasileios Karakostas
Roman Kern
Tomaž Kosar
Alexander Krause
Daniel Krems
Andreas Laber
Wolfgang Lehner
Eric Mier
Marcus Paradies
Bernhard Peischl
Gabrielle Poerwawinata
Stratos Psomadakis
Tilmann Rabl
Piotr Ratuszniak
Pedro Silva
Nikolai Skuppin
Andreas Starzacher
Benjamin Steinwender
Ilin Tolovski
Pınar Tözün
Wojciech Ulatowski
Yuanyuan Wang
Izajasz Wrosz
Aleš Zamuda
Ce Zhang
Xiao Xiang Zhu

January 09, 2022

Integrated data analysis (IDA) pipelines, which combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring, are becoming increasingly common in practice. Interestingly, systems in these areas share many compilation and runtime techniques, and the increasingly heterogeneous hardware infrastructure they use is converging as well. Yet the programming paradigms, cluster resource management, data formats and representations, and execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage, with the aim of increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines and describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, and local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results.