Dr. Stefan Grafberger

Snowflake
Management of Data Science Processes (DEEM)

Franklinstraße 28/29, 10623 Berlin

I am a Ph.D. student at BIFOLD and TU Berlin in the DEEM Lab, conducting research at the intersection of data management and machine learning. I mainly publish at conferences like SIGMOD and VLDB.

My Ph.D. advisors are Sebastian Schelter and Paul Groth. I work on responsible data management (also in collaboration with Julia Stoyanovich). I spent the first three years of my Ph.D. at the University of Amsterdam in the Intelligent Data Engineering Lab, before Sebastian transitioned to TU Berlin. Before my Ph.D., I did my master's at TU Munich with Thomas Neumann and Alfons Kemper, focusing on databases.

During my studies, I interned with Microsoft GSL, Amazon Research, and Oracle Labs, and worked as a research assistant at TU Munich. I was also an intern and working student at TNG Technology Consulting in Munich and a teaching assistant at the University of Augsburg.

In the past, I worked on deequ, a library for ‘unit-testing’ large datasets with Apache Spark; PGX, an in-memory graph analytics framework; and Umbra, a disk-based database with in-memory performance. Currently, I work on mlinspect and mlwhatif, which aim to diagnose and mitigate robustness and reliability issues in machine learning pipelines.

Stefan Grafberger, Paul Groth, Sebastian Schelter

mlidea: Interactively Improving ML Data Preparation Code via “Shadow Pipelines”

July 17, 2025

https://deem.berlin/pdf/mlidea-demo-2.pdf

Stefan Grafberger, Madelon Hulsebos, Matteo Interlandi, Shreya Shankar

Ninth Workshop on Data Management for End-to-End Machine Learning (DEEM)

June 22, 2025

https://doi.org/10.1145/3722212.3724483

Stefan Grafberger, Hao Chen, Olga Ovcharenko, Sebastian Schelter

Towards Regaining Control over Messy Machine Learning Pipelines

March 07, 2025

https://deem.berlin/pdf/dais-lester.pdf

Sebastian Schelter, Shubha Guha, Stefan Grafberger

Automated Provenance-Based Screening of ML Data Preparation Pipelines

September 30, 2024

https://doi.org/10.1007/s13222-024-00483-4

Sebastian Schelter, Stefan Grafberger

Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!

September 16, 2024

https://doi.org/10.48550/arXiv.2409.10081

BIFOLD Update | Jan 19, 2025

BIFOLD researcher co-authors paper on next-generation Query Optimization at CIDR

At CIDR 2025 in Amsterdam, BIFOLD researcher Stefan Grafberger presents a paper co-authored during his Microsoft research internship. The study explores "Query Optimizer as a Service," promising simpler, more efficient data system development.

Data Management | Sep 17, 2024

Publication Highlight - Snapcase

At the VLDB 2024 conference, the BIFOLD Research Group DEEM Lab introduced "Snapcase," a demo paper that addresses the concept of machine unlearning.
