Banner Banner

Materialization and Reuse Optimizations for Production Data Science Pipelines

June 11 , 2022

Many companies and businesses train and deploy machine learning (ML) pipelines to answer prediction queries. In many applications, new training data continuously becomes available. A typical approach to ensure that ML models are up-to-date is to retrain the ML pipelines following a schedule, e.g., every day on the last seven days of data. Several use cases, such as A/B testing and ensemble learning, require many pipelines to be deployed in parallel. Existing solutions train each pipeline separately, which generates redundant data processing. Our goal is to eliminate redundant data processing in such scenarios using materialization and reuse optimizations. Our solution comprises of two main parts. First, we propose a materialization algorithm that given a storage budget, materializes the subset of the artifacts to minimize the run time of the subsequent executions. Second, we design a reuse algorithm to generate an execution plan by combining the pipelines into a directed acyclic graph (DAG) and reusing the materialized artifacts when appropriate. Our experiments show that our solution can reduce the training time by up to an order of magnitude for different deployment scenarios.