Data-centric machine learning (ML) pipelines extend traditional ML pipelines (feature transformations, hyper-parameter tuning, and model training) with additional pre-processing steps for data cleaning, data augmentation, and feature engineering to create high-quality data with good coverage. Finding effective data-centric ML pipelines is still a labor- and compute-intensive process, though. While AutoML tools use effective search strategies, they struggle to scale with large datasets. Large language models (LLMs) show promise for code generation but face challenges in generating data-centric ML pipelines due to private datasets not seen during training, complex pre-processing requirements, and the need for mitigating hallucinations. These demands exceed typical code generation, as they require actions tailored to the characteristics and requirements of a particular dataset. This paper introduces CatDB, a comprehensive, LLM-based system for generating effective, error-free, and efficient data-centric ML pipelines. CatDB leverages data catalog information and refined metadata to dynamically create dataset-specific rules (instructions) that guide the LLM. Moreover, CatDB includes a robust mechanism for automatic validation and error handling of the generated pipeline. Our experimental results show that CatDB reliably generates effective ML pipelines across diverse datasets, achieving accuracy comparable to or better than existing LLM-based systems, standalone AutoML tools, and combined workflows of data cleaning and AutoML tools, while delivering up to orders of magnitude faster performance on large datasets.