ICDE 2021 honors BIFOLD researchers with best paper award

The 37th IEEE International Conference on Data Engineering (ICDE) 2021 honored the paper “Efficient Control Flow in Dataflow Systems: When Ease-of-Use Meets High Performance” by six BIFOLD researchers with the Best Paper Award. Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Lorand Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz and Volker Markl were honored during the award session of the conference on April 21, 2021.

In modern data analytics, companies often need to analyze large datasets. For example, a company might want to analyze its entire network of user interactions to better understand how its products are used. Scaling data analysis to large datasets is a widespread need in many contexts, and modern dataflow systems, such as Apache Flink and Apache Spark, are widely used to meet it. But the algorithms used for data analysis are becoming more and more complex. Complex algorithms are often iterative in nature, meaning that they gradually refine their results by repeatedly executing a computation. A well-known example is the PageRank algorithm, which ranks the importance of nodes in a network, for example websites in Google search results. Both Apache Flink and Apache Spark have weaknesses when implementing iterative algorithms: they are either hard to use or have suboptimal performance.
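To make "iterative" concrete, the following is a minimal single-machine sketch of PageRank in plain Python. It is not taken from the paper and does not use Flink, Spark, or Mitos; the function name `pagerank`, its parameters, and the handling of nodes without outgoing links are assumptions chosen for illustration. The point is the loop: each pass refines the previous ranks, and the loop exits once the ranks stop changing.

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iters=100):
    """Illustrative PageRank; `links` maps each node to the nodes it links to."""
    # Collect every node that appears as a source or as a link target.
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(max_iters):
        # Every node starts the round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a node without outgoing links spreads its rank evenly.
                share = damping * ranks[node] / n
                for target in nodes:
                    new_ranks[target] += share

        # Data-dependent exit condition: stop once the ranks barely change.
        delta = sum(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


if __name__ == "__main__":
    example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for node, rank in sorted(pagerank(example).items()):
        print(node, round(rank, 4))
```

In this three-node example, node `c` ends up with the highest rank because both other nodes link to it. In a distributed dataflow system, the loop and its exit condition are the control flow that must be coordinated across machines.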

The paper introduces a new system, Mitos, which combines an easy-to-use language with efficient execution. It keeps the language simple by drawing on techniques from programming language research, in addition to the database and distributed systems research that earlier systems relied on. The simpler language makes it easy for users to run advanced analytics on large datasets. This matters for data scientists, who can then concentrate on the analytics instead of having to become experts in the internal workings of the systems.

The annual IEEE International Conference on Data Engineering (ICDE) is the flagship IEEE conference addressing research issues in designing, building, managing, and evaluating advanced data-intensive systems and applications. For over three decades, IEEE ICDE has been a leading forum for researchers, practitioners, developers, and users to explore cutting-edge ideas and to exchange techniques, tools, and experiences.


Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Lorand Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz, Volker Markl

Modern data analysis tasks often involve control flow statements, such as iterations. Common examples are PageRank and K-means. To achieve scalability, developers usually implement data analysis tasks in distributed dataflow systems, such as Spark and Flink. However, for tasks with control flow statements, these systems still either suffer from poor performance or are hard to use. For example, while Flink supports iterations and Spark provides ease-of-use, Flink is hard to use and Spark has poor performance for iterative tasks. As a result, developers typically have to implement different workarounds to run their jobs with control flow statements in an easy and efficient way. We propose Mitos, a system that achieves the best of both worlds: it achieves both high performance and ease-of-use. Mitos uses an intermediate representation that abstracts away specific control flow statements and is able to represent any imperative control flow. This facilitates building the dataflow graph and coordinating the distributed execution of control flow in a way that is not tied to specific control flow constructs. Our experimental evaluation shows that the performance of Mitos is more than one order of magnitude better than systems that launch new dataflow jobs for every iteration step. Remarkably, it is also up to 10.5 times faster than Flink, which has native iteration support, while matching the ease-of-use of Spark.

To be published in the Proceedings of the 37th IEEE International Conference on Data Engineering, ICDE 2021, April 19 – 22