Query Processing and Interoperability Mechanisms for Federated Data Systems

Haralampos Gavriilidis

July 22, 2025

Federated data systems support modern analytics pipelines by enabling access to data spread across heterogeneous and autonomous database management systems (DBMSes) as well as raw files in various formats. These systems operate in complex environments, with DBMSes distributed across physical sites, connected via diverse networks, and exposing data in various formats. To function effectively in such settings, federated data systems require efficient query processing capabilities and robust interoperability mechanisms for integrating diverse systems and file formats. However, existing approaches face significant limitations in enabling efficient analytics in these settings. Traditional systems rely on the centralized mediator-wrapper architecture, which introduces an additional execution layer, increases resource usage, and causes redundant data movement and high latency. Data movement mechanisms are typically non-configurable and cannot adapt to heterogeneous, dynamic environments, resulting in poor performance. Although spreadsheets remain widely used for data management and exchange, their integration into pipelines is limited by the inefficiency of XML-based parsers. In this thesis, we make three core contributions to improve the efficiency of federated data systems. First, we present XDB, a middleware system for federated query processing without a mediating execution engine. XDB delegates the full execution, including cross-database joins, to the underlying DBMSes. It features an optimizer that rewrites queries into delegation plans, and a delegation engine that executes these plans on the underlying systems through their declarative interfaces, enabling decentralized query execution with reduced data movement, lower latency, and more efficient resource usage. Second, we introduce XDBC, a modular data transfer framework for fast, scalable data movement in heterogeneous and dynamic environments.
XDBC decomposes the transfer pipeline into configurable components with multiple physical implementations. Its optimizer selects efficient configurations based on workload and environment characteristics, outperforming existing solutions in both runtime and resource efficiency. Third, we propose SheetReader, a high-performance spreadsheet parser optimized for memory efficiency and low latency. It introduces low-level, spreadsheet-specific optimizations and employs parallelism to significantly reduce memory usage and latency compared to traditional XML-based parsers. Its modular design makes it suitable for integration into data science environments and DBMSes, enabling efficient spreadsheet ingestion across diverse runtimes. Together, these contributions address core challenges in federated query processing, adaptive data movement, and file ingestion. They improve the efficiency and flexibility of analytics over federated data systems in complex, heterogeneous environments.