
ExDRa: Exploratory Data Science on Federated Raw Data

Sebastian Baunsgaard
Matthias Boehm
Ankit Chaudhary
Behrouz Derakhshan
Stefan Geißelsöder
Philipp Marian Grulich
Michael Hildebrand
Kevin Innerebner
Volker Markl
Claus Neubauer
Sarah Osterburg
Olga Ovcharenko
Sergey Redyuk
Tobias Rieger
Alireza Rezaei Mahdiraji
Sebastian Benjamin Wrede
Steffen Zeuch

June 18, 2021

Data science workflows are largely exploratory, dealing with under-specified objectives, open-ended problems, and unknown business value. Therefore, little investment is made in systematic acquisition, integration, and pre-processing of data. This lack of infrastructure results in redundant manual effort and computation. Furthermore, central data consolidation is not always technically or economically desirable, or even feasible (e.g., due to privacy and/or data ownership). The ExDRa system aims to provide system infrastructure for this exploratory data science process on federated, heterogeneous raw data sources. Technical focus areas include (1) ad-hoc and federated data integration on raw data, (2) data organization and reuse of intermediates, and (3) optimization of the data science lifecycle, taking into account that data is only partially accessible. In this paper, we describe use cases, the overall system architecture, selected features of SystemDS' new federated backend (for federated linear algebra programs, federated parameter servers, and federated data preparation), as well as promising initial results. Beyond existing work on federated learning, ExDRa focuses on enterprise federated ML and related data pre-processing challenges. In this context, federated ML has the potential to create a more fine-grained spectrum of data ownership and thus even new markets.