Banner Banner

Enabling Integrated Data Analysis Pipelines on Heterogeneous Hardware through Holistic Extensibility

Patrick Damme
Matthias Böhm

March 06, 2023

Integrated data analysis (IDA) pipelines, that combine data management/query processing, high-performance computing, and machine learning training/scoring, become increasingly common in practice. Systems of these areas share many compilation and runtime techniques, and stress every hardware aspect of storage, computation, and networking. Accordingly, these systems are strongly impacted by hardware challenges such as the end of Dennard scaling and the end of Moore’s law, which ultimately lead to dark silicon and increasing specialization at device level (CPUs, GPUs, FPGAs, ASICs), storage level (computational memory/storage, storage hierarchies), and workload level (data types and sparsity). While this makes research on novel and heterogeneous hardware more exciting than ever, researchers are increasingly confronted with the question of how to integrate their prototypes to evaluate their impact on end-to-end IDA pipelines. Building yet another dedicated system offers a lot of flexibility, but requires substantial infrastructure efforts. However, enhancing an established system requires deep knowledge of the system internals and can be very hard. Thus, already in the 1980/90s, there was a wave of research on extensible DBMSs [CH90]. One of the most famous systems developed at that time is Postgres, which allows adding user-defined data types, functions, and access methods [SAH87]. Since then, concepts for extensibility and variability have been proposed for various system components, at different abstraction levels, and in different kinds of data systems. Recently, extensibility has also gained traction in the context of component-based systems [HD23]. However, to the best of our knowledge, there is no system infrastructure that holistically supports user extensions for all components relevant to the efficient execution of IDA pipelines on today’s heterogeneous compute/storage hardware. To overcome this problem, we propose holistic extensibility. In this talk, we present the concept of holistic extensibility for IDA pipelines, sketch how we approach this concept in DAPHNE, and provide an overview of our ongoing work.