Deduplicated Sampling On-Demand

Luca Zecchini

Vasilis Efthymiou

Felix Naumann

Giovanni Simonini

September 01, 2025

Data practitioners often sample their datasets to produce representative subsets for their downstream tasks. When entities in a dataset can be partitioned into multiple groups, stratified sampling is commonly used to produce subsets that match a target group distribution, e.g., to select a balanced subset for training a machine learning model. However, real-world data frequently contains duplicates (multiple representations of the same real-world entity) that can bias sampling, necessitating deduplication. We define deduplicated sampling as the task of producing a clean sample of a dirty dataset according to a target group distribution. The naïve approach to deduplicated sampling would first deduplicate the entire dataset upfront, then perform sampling ex post. However, that approach might be prohibitively expensive for large datasets and under time/resource constraints. Deduplicated sampling on demand with RadlER is a novel approach to produce a clean sample by focusing the cleaning effort only on entities required to appear in that sample. Our experimental evaluation, performed on multiple datasets from different domains, demonstrates that RadlER consistently outperforms baseline approaches, providing data scientists with an efficient solution to quickly produce a clean sample of a dirty dataset according to a target group distribution.
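To make the task concrete, the naïve approach described in the abstract (deduplicate everything upfront, then sample) might be sketched as follows. This is only an illustration of that baseline, not of RadlER itself. The function name, the record fields `entity_id` and `group`, and the toy records are all hypothetical, and the sketch assumes duplicates are already linked by a shared `entity_id`, whereas in practice finding them is the expensive step.

```python
import random
from collections import defaultdict


def naive_deduplicated_sample(records, target_counts, seed=0):
    """Deduplicate the whole dataset upfront, then draw a stratified sample."""
    # Step 1: keep one representative per real-world entity.
    # Assumption: duplicates already share an "entity_id" (hypothetical field).
    representatives = {}
    for record in records:
        representatives.setdefault(record["entity_id"], record)

    # Step 2: partition the clean entities by group.
    by_group = defaultdict(list)
    for record in representatives.values():
        by_group[record["group"]].append(record)

    # Step 3: sample each group according to the target group distribution,
    # taking fewer records if a group has too few clean entities.
    rng = random.Random(seed)
    sample = []
    for group, count in target_counts.items():
        candidates = by_group.get(group, [])
        sample.extend(rng.sample(candidates, min(count, len(candidates))))
    return sample


# Toy dirty dataset: entities 1 and 3 each appear twice.
records = [
    {"entity_id": 1, "group": "A", "name": "Acme Inc."},
    {"entity_id": 1, "group": "A", "name": "ACME Incorporated"},
    {"entity_id": 2, "group": "A", "name": "Beta LLC"},
    {"entity_id": 3, "group": "B", "name": "Gamma GmbH"},
    {"entity_id": 3, "group": "B", "name": "Gamma"},
    {"entity_id": 4, "group": "B", "name": "Delta SpA"},
]

# Target distribution: one entity from group A, two from group B.
sample = naive_deduplicated_sample(records, {"A": 1, "B": 2})
```

Every record passes through the deduplication step here, including those that never reach the sample. That is the cost the abstract says the on-demand approach avoids by cleaning only the entities required to appear in the sample.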

https://www.vldb.org/pvldb/vol18/p2482-zecchini.pdf

BIFOLD AUTHORS

Luca Zecchini