PhD Defense: Enhancing Modern Query Federation Systems: Execution Optimization, Performance Prediction, and Systems Assessment

Chujun Song
07.09.2024 13:00 to 15:00

Modern applications distribute operational data across various storage systems in different locations, simplifying application development but complicating data analytics. The prevalent solution has been to use an ETL (Extract, Transform, Load) process to consolidate data from different locations into a centralized data warehouse for further analytical processing. However, this method is computationally intensive, error-prone, compromises data freshness, and creates a scalability bottleneck in performing analytical tasks within such a centralized data warehouse. In contrast, query federation offers a promising alternative by allowing direct analysis of data in its original location, thereby bypassing the ETL process and avoiding its drawbacks. However, current query federation systems are still far from optimal. We describe our work on enhancing modern query federation systems in three aspects: systems assessment, execution optimization, and performance prediction.Systems Assessment: The concept of query federation is not new, with early implementations such as Mariposa paving the way. Modern systems, including Presto, Trino, and Spark, have further developed and refined this feature, significantly enhancing its functionality and efficiency. Despite these advancements, best practices for designing and implementing these systems remain largely unexplored. To address this gap, we introduce a benchmark specifically designed to evaluate the effects of various design strategies on desirable features of query federation systems. It assesses five representative systems, each employing different design strategies, against this benchmark. This part of the work identifies key bottlenecks in different designs, examines the impact of various query optimization and execution strategies, and explores optimal practices for designing the interface between the execution engine and data sources in query federation systems. Additionally, mitigation strategies for these identified bottlenecks are proposed.Execution Optimization: Among the many design choices in query federation systems, we delve deeper into the efficient workload assignment strategy between the query federation system and the data sources, considering that data sources may also possess query processing capabilities. In response to a query, current approaches typically follow one of two paradigms: they either push as much computation as possible down to the underlying system or treat the underlying system solely as a storage engine, attempting to execute the majority of the query processing work within the query federation system itself. We have observed that these approaches tend to result in CPU underutilization on one side—either within the query federation engine or at the data sources. To tackle this inefficiency, we have developed algorithms capable of adjusting the workload distribution either statically or dynamically on both sides to optimize CPU usage and reduce end-to-end query execution latency.Performance Prediction: Accurate and expedient performance estimation before query execution is vital for tasks such as index recommendation, query optimization, query tuning, and cluster management. However, this task continues to pose significant challenges for query federation systems, which typically integrate computation engines and delegate storage management to the underlying system. This architecture often results in unavailable statistics for some tables, rendering many traditional cost estimation methods—those that rely heavily on detailed statistics—impractical. Furthermore, traditional cost estimation methods frequently encounter substantial errors, particularly in complex queries involving multiple joins. In contrast, machine learning-based approaches offer an alternative strategy by leveraging learned models for cost prediction, which have demonstrated superior performance over traditional methods. However, these models are generally evaluated on synthetic workloads, such as TPC-H, within experimental clusters. Real industrial workloads introduce numerous challenges to the application of such methods. In this segment, we assess these methods using actual industrial workloads from the query federation system deployed at LinkedIn to evaluate the performance of these models. We also introduce a new multi-task learning approach that better utilizes operator-level statistics to improve the accuracy of model prediction. Additionally, we empirically investigate the upper bounds of accuracy achievable by these models.