Data Science Abstractions and Systems, Performance-Accuracy Tradeoffs in Data Science, Data Cleaning Pipelines and Optimization
The mission of the Big Data Engineering group, led by Prof. Dr. Matthias Böhm, is to simplify data science by providing high-level, data-science-centric abstractions and by building systems and tools that execute data science tasks efficiently and at scale. The group's general research interests include the exploration of performance-accuracy tradeoffs, tooling (script generators, label generation, advisors, etc.), seamless data augmentation, cleaning, feature engineering, model debugging and deployment, cost-effective cloud deployments, advanced optimization techniques, adaptive data storage and indexing, and the exploitation of modern hardware.
Current research focuses on:
• Data Cleaning Pipelines: Automatic enumeration of data cleaning pipelines for a target ML application, hyperparameter optimization of cleaning primitives.
• Model Debugging: Finding the top-k data slices where a trained model underperforms, linear-algebra-based enumeration and pruning algorithms.
• Fine-grained Lineage Tracing and Reuse: Fine-grained, multi-level lineage tracing for versioning and reuse, lineage deduplication, full and partial reuse of intermediates.
• Federated Linear Algebra and Parameter Servers: ML model training on federated raw data without central data consolidation, plan generation under awareness of privacy constraints, federated linear algebra programs and parameter servers.
• Workload-aware Data Reorganization: Compression under awareness of data and workload (linear algebra program) characteristics, asynchronous data reorganization in standing executors (e.g., at standing federated workers).
• Code Generation for Heterogeneous HW: Extended operator fusion and code generation for GPUs and heterogeneous devices, including sparsity exploitation across operations.
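To make the model debugging topic above more concrete, the following is a minimal sketch of what "finding the top-k data slices where a trained model underperforms" can mean. It is not the group's method: the research described above uses linear-algebra-based enumeration and pruning, whereas this sketch swaps in plain brute-force enumeration without pruning. The function name `top_k_slices`, its parameters, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def top_k_slices(rows, errors, k=3, min_support=2, max_level=2):
    """Brute-force search for slices with the highest average model error.

    A slice is a conjunction of (feature, value) predicates. This is an
    illustrative stand-in only; it enumerates every slice up to max_level
    predicates and applies no pruning.
    """
    features = sorted(rows[0].keys())
    stats = defaultdict(lambda: [0, 0.0])  # slice -> [row count, summed error]
    for row, err in zip(rows, errors):
        for level in range(1, max_level + 1):
            for feats in combinations(features, level):
                key = tuple((f, row[f]) for f in feats)
                stats[key][0] += 1
                stats[key][1] += err
    # Keep slices with enough rows, then rank by average error (descending),
    # breaking ties by larger slice size and then by slice definition.
    scored = [
        (total / count, count, key)
        for key, (count, total) in stats.items()
        if count >= min_support
    ]
    scored.sort(key=lambda t: (-t[0], -t[1], t[2]))
    return scored[:k]


# Hypothetical toy data: categorical features and a per-row model error.
rows = [
    {"country": "DE", "device": "mobile"},
    {"country": "DE", "device": "desktop"},
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "desktop"},
    {"country": "DE", "device": "mobile"},
    {"country": "US", "device": "mobile"},
]
errors = [0.9, 0.1, 0.2, 0.1, 0.8, 0.3]

result = top_k_slices(rows, errors, k=3)
for avg_error, size, predicates in result:
    print(f"{avg_error:.2f}  n={size}  {predicates}")
```

On this toy data the conjunction of `country = DE` and `device = mobile` ranks first, ahead of either predicate alone. The brute-force search grows exponentially with the number of features and levels, which is why enumeration and pruning strategies matter at realistic scale.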
Matthias Boehm, Matteo Interlandi, Chris Jermaine
Optimizing Tensor Computations: From Applications to Compilation and Runtime Techniques
Saeed Fathollahzadeh, Matthias Boehm
GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example
Sebastian Baunsgaard, Matthias Boehm, Kevin Innerebner, Mito Kehayov, Florian Lackner, Olga Ovcharenko, Arnab Phani, Tobias Rieger, David Weissteiner, Sebastian Benjamin Wrede
Federated Data Preparation, Learning, and Debugging in Apache SystemDS
Arnab Phani, Lukas Erlbacher, Matthias Boehm
UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads

Professor Dr. Pinar Tözün joins BIFOLD as a Visiting Scientist
Beginning on August 28, 2023, Professor Tözün will join BIFOLD as a Visiting Scientist. In particular, she will collaborate with Prof. Dr. Matthias Böhm and the Big Data Engineering group.

8 researchers represented BIFOLD at SIGMOD 2023
Eight members of the BIFOLD team took the chance to showcase their recent work at SIGMOD 2023 in Seattle through a diverse array of presentations, including research papers, workshop papers, and a demo paper – all of them underscoring the institute's commitment to cutting-edge research in the field of data management.

Research Group Lead

Doctoral Researcher

Doctoral Researcher

Doctoral Researcher