Towards Efficient and Secure UDF Execution with BabelfishLib (Lightning Talk)

Philipp M. Grulich
Steffen Zeuch and Volker Markl

August 28, 2023

Today, data scientists, web developers, and application developers build complex data processing pipelines by combining different tools and programming languages. To this end, most data processing systems support user-defined functions (UDFs) in common languages like Java, Python, or JavaScript. These UDFs enable users to express arbitrary business logic in their preferred programming language, leverage third-party libraries, and increase the modularity and testability of their data processing pipelines. Although UDFs provide a large degree of freedom, their flexibility comes at a high performance cost compared to traditional relational queries. As a result, most experts recommend avoiding UDFs whenever possible. To cope with these inefficiencies, research has suggested several strategies. These include translating UDFs to semantically equivalent SQL statements, extending optimizers to account for the unique properties of UDFs, and devising efficient execution strategies that mitigate the bottlenecks of UDFs. These approaches deliver performance improvements, but they require substantial engineering effort and increase system complexity, which hinders their widespread adoption. To improve this situation, in this talk we propose BabelfishLib, which provides our Babelfish Engine as an extensible component for the efficient and secure execution of UDFs. In an environment where virtually every data management system requires UDF support, BabelfishLib can centralize these efforts and provide a unified UDF runtime that can be used across different systems. In particular, BabelfishLib targets three major design goals. First, it provides efficient execution strategies for UDFs in different programming languages. Second, it ensures that the execution of untrusted UDF code is isolated from the data processing system, guaranteeing system security. Third, it analyzes UDFs and provides information for further query optimizations.
As a result, BabelfishLib mitigates the performance overhead of UDFs in state-of-the-art systems while ensuring security and isolation. Currently, we leverage BabelfishLib to accelerate UDFs in our data processing platform NebulaStream.
We believe that BabelfishLib can be a first step towards a unified accelerator for UDFs, which can be integrated across different data processing systems. Furthermore, it provides a playground for further research focusing on specific aspects of UDF acceleration. Finally, through this presentation, we intend to spark a discussion across the community to consolidate requirements for efficient UDF execution and combine different efforts in the same direction.