P2D: A Transpiler Framework for Optimizing Data Science Pipelines

Yordan Grigorov

Haralampos Gavriilidis

Sergey Redyuk

Kaustubh Beedkar

Volker Markl

June 18, 2023

In this paper, we propose a transpilation-based approach to optimize data science pipelines that comprise database management systems (DBMSes) and data science runtimes (e.g., Python). Our approach identifies DBMS-supported operations and translates them into SQL, leveraging the DBMS to accelerate data science workloads. The optimization target is twofold: first, to improve data loading by reducing the amount of data transferred between runtimes; second, to exploit DBMS processing capabilities by "pushing down" certain pre-processing operations. Our optimizations are based on an intermediate representation that supports different data science libraries as frontends and different DBMSes as backends, making the approach suitable for a variety of data science pipelines. Our evaluation with real-world and synthetic datasets shows that our approach can accelerate data science workloads by up to an order of magnitude over state-of-the-art approaches.
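
The pushdown idea in the abstract can be illustrated with a minimal sketch. This is not P2D's implementation: the IR node classes (`Scan`, `Filter`, `Project`), the `to_sql` function, and the `trips` table are hypothetical names chosen for illustration, and SQLite (via Python's standard `sqlite3` module) stands in for the DBMS backend. The sketch builds a small operator tree, translates it into one SQL query, and compares the result with a baseline that loads every row and filters in Python.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical intermediate representation (IR): each node is one pipeline step.
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    op: str
    value: object

@dataclass(frozen=True)
class Project:
    child: object
    columns: tuple

ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}

def to_sql(node):
    """Translate an IR tree into (sql_text, parameters) for a SQL backend."""
    if isinstance(node, Scan):
        return f'SELECT * FROM "{node.table}"', []
    if isinstance(node, Filter):
        if node.op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {node.op}")
        sub, params = to_sql(node.child)
        # The comparison value is bound as a parameter, not spliced into the text.
        return (f'SELECT * FROM ({sub}) WHERE "{node.column}" {node.op} ?',
                params + [node.value])
    if isinstance(node, Project):
        sub, params = to_sql(node.child)
        cols = ", ".join(f'"{c}"' for c in node.columns)
        return f"SELECT {cols} FROM ({sub})", params
    raise TypeError(f"unknown IR node: {node!r}")

# Toy data standing in for a table that lives in the DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, distance REAL, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)",
                 [(1, 0.5, 4.0), (2, 3.2, 12.5), (3, 7.1, 25.0)])

# Pipeline: keep trips longer than 1.0, then keep only the id and fare columns.
plan = Project(Filter(Scan("trips"), "distance", ">", 1.0), ("id", "fare"))
sql, params = to_sql(plan)

# Pushed-down execution: only the matching rows and columns leave the DBMS.
pushed = conn.execute(sql, params).fetchall()

# Baseline: transfer the full table, then filter and project in Python.
all_rows = conn.execute("SELECT * FROM trips").fetchall()
baseline = [(r[0], r[2]) for r in all_rows if r[1] > 1.0]
```

In this toy setting both paths yield the same rows, but the pushed-down query transfers two columns of two rows instead of the full three-row table. That reduction in transferred data corresponds to the first optimization target named in the abstract; running the filter inside the DBMS corresponds to the second.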

https://dl.acm.org/doi/abs/10.1145/3595360.3595853

BIFOLD AUTHORS

Dr. Kaustubh Beedkar

Prof. Dr. Volker Markl