
mlidea: Interactively Improving ML Data Preparation Code via “Shadow Pipelines”

Stefan Grafberger
Paul Groth
Sebastian Schelter

July 17, 2025

Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. To address this challenge, we propose to assist data scientists with automatically derived interactive suggestions for pipeline improvements during this development cycle. We demonstrate mlidea, a library to generate interactive suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. Our system uses incremental view maintenance to enable data scientists to quickly iterate on their code and to ensure low-latency maintenance of the shadow pipelines. We demonstrate how our system improves code for various domains with three interactive shadow pipelines: fixing mislabeled rows, enhancing robustness against data quality problems, and improving pipeline performance on data slices with subpar predictions.
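The shadow-pipeline idea can be illustrated with a minimal, self-contained sketch. It does not use mlidea's API, and it recomputes everything from scratch rather than applying incremental view maintenance. All names (`original_pipeline`, `drop_suspected_mislabels`, `shadow_pipeline`) and the toy data are hypothetical. The sketch assumes a one-dimensional nearest-centroid classifier as the "pipeline" and a leave-one-out disagreement filter as the candidate fix for mislabeled rows.

```python
from statistics import mean


def fit_centroids(rows):
    """Train a nearest-centroid model: one mean feature value per label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def original_pipeline(train, test):
    """The user's pipeline (toy stand-in): train, then report test accuracy."""
    centroids = fit_centroids(train)
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in test)


def drop_suspected_mislabels(train):
    """Hypothetical modification: drop rows whose label disagrees with a
    model trained on all other rows (leave-one-out check)."""
    kept = []
    for i, (x, y) in enumerate(train):
        others = train[:i] + train[i + 1:]
        if predict(fit_centroids(others), x) == y:
            kept.append((x, y))
    return kept


def shadow_pipeline(train, test, modification):
    """Run the original pipeline and a hidden variant with the modification
    applied, then suggest the modification only if it improves the score."""
    baseline = original_pipeline(train, test)
    variant = original_pipeline(modification(train), test)
    return {"baseline": baseline, "variant": variant,
            "suggest": variant > baseline}


# Toy data (made up for illustration): the row (9.0, 0) is mislabeled.
TRAIN = [(0.0, 0), (1.0, 0), (2.0, 0), (9.0, 0),
         (8.0, 1), (10.0, 1), (11.0, 1)]
TEST = [(1.0, 0), (3.0, 0), (6.0, 1), (9.0, 1)]

result = shadow_pipeline(TRAIN, TEST, drop_suspected_mislabels)
print(result)
```

In this sketch the user's pipeline is never changed. The variant runs alongside it, and the modification is surfaced as a suggestion only when it measurably improves the test score.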