Machine learning (ML) and data science workflows are inherently exploratory. Data scientists pose hypotheses, integrate the necessary data, and run ML pipelines of data cleaning, feature engineering, model selection, and hyperparameter tuning. The repetitive nature of these workflows and their hierarchical composition from building blocks cause high computational redundancy, which can be addressed by reusing previously computed intermediates. Existing reuse approaches are coarse-grained, treating entire algorithms as black boxes. As a result, they fail to eliminate fine-grained redundancy and to handle internal non-determinism. This coarse-grained view also precludes internal parallelization and optimization of complex, monolithic ML workloads such as feature transformations. Feature transformations, which convert raw data into numerical matrices for training or scoring, involve multiple passes and expensive string processing, and thus require special handling for efficient execution. Modern ML systems leverage multiple backends, including CPUs, GPUs, and distributed execution platforms like Apache Spark or Ray. Depending on workload and cluster characteristics, these systems typically compile an ML pipeline into hybrid plans of in-memory CPU, GPU, and distributed operations. Prior work has applied tailor-made techniques for reusing intermediates in specific backend scenarios. However, achieving efficient, holistic reuse in multi-backend ML systems remains a challenge due to its tight coupling with other aspects such as memory management, data exchange, parallelization strategies, and operator scheduling.

In this thesis, we introduce a principled framework for efficient, fine-grained lineage tracing and holistic, application-agnostic, multi-backend reuse and memory management, as well as fine-grained task scheduling for feature transformations. The lineage trace of an intermediate uniquely identifies that intermediate and thus facilitates reuse. However, fine-grained lineage tracing introduces challenges of its own, such as high memory overhead. We address the large size of lineage traces with lineage deduplication for loops and functions, and we exploit full and partial reuse opportunities across the program hierarchy. The core component of our framework is a hierarchical, lineage-based reuse cache, which acts as a unified abstraction and manages reuse, memory recycling, data exchange, and cache eviction across backends. To address backend-related challenges such as lazy evaluation, asynchronous data exchange, and memory allocation overheads, we devise a suite of cache management policies. Moreover, to efficiently execute feature transformation workloads, we construct a fine-grained task graph for a set of transformations and optimize this plan according to data characteristics. At runtime, we execute the plan in a cache-conscious manner while performing sub-operator reuse of internal intermediates. Finally, we extend an optimizing ML system compiler with specialized operators for asynchronous data exchange, speculative cache management, and related operator ordering for concurrent execution.

Our experiments across diverse ML tasks and pipelines show significant improvements over state-of-the-art ML systems and specialized frameworks.
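To make the core idea of lineage-based reuse concrete, the following is a minimal, illustrative Python sketch under simplifying assumptions (deterministic operators, full reuse only, an in-memory cache without eviction or memory management); the names LineageItem, ReuseCache, and trace_id are hypothetical and do not reflect the actual implementation described in this thesis.

```python
# Illustrative sketch: a lineage trace as a recursive hash over an operator,
# its literal arguments, and the lineage of its inputs, used as the key of a
# reuse cache for intermediates. All names are hypothetical.
import hashlib
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class LineageItem:
    """A node in the lineage DAG: an operator applied to input lineages."""
    opcode: str
    inputs: Tuple["LineageItem", ...] = ()
    literals: Tuple[str, ...] = ()

    def trace_id(self) -> str:
        # Recursively hash the operator, its literals, and the traces of its
        # inputs; equal traces imply equal results for deterministic operators.
        h = hashlib.sha256(self.opcode.encode())
        for lit in self.literals:
            h.update(lit.encode())
        for inp in self.inputs:
            h.update(inp.trace_id().encode())
        return h.hexdigest()

class ReuseCache:
    """Maps lineage trace ids to cached intermediates (full reuse only)."""
    def __init__(self):
        self._store = {}

    def get_or_compute(self, item: LineageItem, compute_fn):
        key = item.trace_id()
        if key in self._store:       # hit: reuse, skip recomputation
            return self._store[key]
        value = compute_fn()         # miss: compute and cache
        self._store[key] = value
        return value

# Example: two pipeline runs sharing the same read-and-transform prefix.
read = LineageItem("read", literals=("data.csv",))
scale = LineageItem("scale", inputs=(read,))
cache = ReuseCache()
x1 = cache.get_or_compute(scale, lambda: "scaled-matrix")  # computed
x2 = cache.get_or_compute(scale, lambda: "scaled-matrix")  # reused
assert x1 is x2
```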