Accelerating the Data Cleaning Systems Raha and Baran through Task and Data Parallelism

Fatemeh Ahmadi
Yusuf Mandirali
Ziawasch Abedjan

August 26, 2024

The semi-supervised approaches Raha and Baran display competitive performance in general cleaning scenarios. However, the effectiveness comes at high runtime costs. In this paper, we show how we improve the runtimes of Raha and Baran by proposing a new Dask-based parallel architecture that enhances CPU utilization. Further, we propose a shared memory model, allowing concurrently running workers to access shared objects, thereby reducing memory consumption by avoiding duplicated data for each worker. Our approach demonstrates significant runtime improvements compared to the previous versions of Raha and Baran, which are end-to-end holistic systems.
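
The shared memory idea can be illustrated with a minimal sketch. This is not the Raha/Baran implementation: it uses only the Python standard library, threads stand in for Dask workers so the example stays self-contained, and the names `count_missing` and `parallel_count` as well as the byte-encoded column are hypothetical. Each worker attaches to one named shared block and reads its own slice, so the data is stored once rather than duplicated per worker.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import shared_memory


def count_missing(shm_name, start, stop, missing=0):
    """Attach to the shared block by name and count `missing` bytes in [start, stop)."""
    shm = shared_memory.SharedMemory(name=shm_name)  # attach, no copy of the data
    try:
        # The memoryview slice is released before the block is closed.
        with shm.buf[start:stop] as view:
            return sum(1 for value in view if value == missing)
    finally:
        shm.close()


def parallel_count(column, n_workers=4):
    """Count missing markers in a non-empty bytes column using one shared copy."""
    size = len(column)
    shm = shared_memory.SharedMemory(create=True, size=size)
    try:
        shm.buf[:size] = column  # the single copy that all workers read
        step = -(-size // n_workers)  # ceiling division
        bounds = [(a, min(a + step, size)) for a in range(0, size, step)]
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # Workers receive only the block name and slice bounds, not the data.
            futures = [pool.submit(count_missing, shm.name, a, b) for a, b in bounds]
            return sum(f.result() for f in futures)
    finally:
        shm.close()
        shm.unlink()  # free the block once all workers are done
```

In a Dask setting the same pattern would presumably pass the block name to tasks submitted to worker processes; that mapping is an assumption, not something the abstract specifies.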