
Finding What You're Looking For: A Distribution-Aware Dataset Search Engine in Action

Lennart Behme
Leonard Geißler
Pratham Agrawal
Emil Badura
Benjamin Ueber
Kaustubh Beedkar
Volker Markl

June 22, 2025

The growing volume of academic, commercial, and governmental datasets distributed across countless independent repositories calls for dataset search engines that can answer queries using only publicly shared metadata rather than raw data access. However, the keyword search interfaces of existing metadata-based search engines fail to capture complex user needs, such as distributional requirements, which limits their effectiveness. In this demonstration, we present the first end-to-end system for distribution-aware dataset search over decentralized data repositories. Our prototype combines existing search techniques with recently proposed percentile predicates to provide more powerful query capabilities. Built on our novel Dataset Query Language and a distribution-aware index, the system enables efficient, flexible search without access to raw data. To demonstrate its utility, we curated over 150,000 profiles of tabular datasets from Kaggle and enriched them with statistical information, enabling attendees to explore distribution-aware search and the trade-offs involved in system configuration.
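To make the idea of a percentile predicate concrete, the following is a minimal Python sketch of how such a predicate might be evaluated against metadata alone. It is not the system's Dataset Query Language or its index. The profile layout (column values recorded at evenly spaced percentiles), the function names `fraction_at_most` and `search`, and the example datasets are all assumptions made for illustration. The sketch estimates the fraction of a column's values at or below a threshold by linear interpolation between the stored percentiles, then keeps the datasets where that fraction reaches a requested minimum.

```python
from bisect import bisect_right

# Hypothetical profile layout (an assumption, not the system's format):
# dataset name -> column name -> values at the 0%, 10%, ..., 100% percentiles.
Profiles = dict[str, dict[str, list[float]]]


def fraction_at_most(percentiles: list[float], threshold: float) -> float:
    """Estimate the fraction of values <= threshold from a percentile summary."""
    if threshold < percentiles[0]:
        return 0.0
    if threshold >= percentiles[-1]:
        return 1.0
    # Find i such that percentiles[i] <= threshold < percentiles[i + 1].
    i = bisect_right(percentiles, threshold) - 1
    step = 1.0 / (len(percentiles) - 1)
    # The bucket width is strictly positive because of how i was chosen.
    within = (threshold - percentiles[i]) / (percentiles[i + 1] - percentiles[i])
    return (i + within) * step


def search(profiles: Profiles, column: str, threshold: float,
           min_fraction: float) -> list[str]:
    """Return datasets where at least min_fraction of `column` is <= threshold."""
    matches = []
    for name, columns in profiles.items():
        summary = columns.get(column)
        if summary is None:
            continue  # the dataset does not publish this column's distribution
        if fraction_at_most(summary, threshold) >= min_fraction:
            matches.append(name)
    return matches


# Made-up example profiles, for illustration only.
profiles: Profiles = {
    "survey_a": {"age": [18, 20, 22, 25, 27, 29, 33, 40, 48, 60, 90]},
    "survey_b": {"age": [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]},
}

# "At least half of the values in `age` are at most 30."
matching = search(profiles, column="age", threshold=30, min_fraction=0.5)
print(matching)
```

Here only `survey_a` qualifies: the interpolated estimate puts roughly 52.5% of its `age` values at or below 30, while `survey_b` has essentially none. The answer is an estimate whose accuracy depends on how finely the percentiles are recorded, which is one example of the kind of configuration trade-off a metadata-only search has to make.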