PhD Proposal: Elevating Data Systems: A Cost-Driven Architectural Approach to Query Processing and Schema Compilation
IRB-5137 https://umd.zoom.us/j/7123055142
Many graph workloads and ML pipelines rely on queries involving many-to-many relationships, which often produce large intermediate results even when final outputs are small. This intermediate-result explosion leads to high memory usage, long runtime, and statistical distortions in ML feature engineering. These challenges highlight the ubiquity and difficulty of handling many-to-many joins across graph analytics and ML workflows, motivating a range of prior techniques such as worst-case optimal joins, factorized representations, semi-join reduction, and predicate transfer. This proposal addresses the core query optimization problem underlying these approaches. We develop a refined cost model that more accurately captures the execution behavior of plans involving many-to-many joins, and we introduce optimization algorithms that leverage these cost functions to choose effective execution strategies. We integrate these techniques into a modern vectorized execution engine, and our experiments show that the optimal execution strategy depends strongly on query structure and data characteristics. This demonstrates the need for a fully cost-based approach to selecting the most efficient execution techniques in the presence of many-to-many joins.
We next introduce a cost-based mapping compilation approach for translating a conceptual schema into an efficient relational schema. Although relational databases provide strong physical data independence, they offer far less logical independence: decisions about representing relationships, multi-valued attributes, inheritance, and polymorphic associations must be fixed at design time and can deeply affect performance. These choices are typically made manually and become locked into the logical schema. Our goal is to automate and optimize these decisions using the conceptual schema and observed workload. By making the mapping process systematic and data-driven, the logical schema can adapt to changing workloads while remaining independent of both the application-level model and the underlying physical storage.
Finally, we show that integrity enforceability depends critically on schema compilation choices. Semantically equivalent relational representations arising from different mappings of inheritance, polymorphic associations, relationships, and weak entities can differ fundamentally in whether constraints remain declarative or require procedural enforcement. We plan to address this gap through a systematic framework by analyzing how relational mapping strategies influence integrity enforcement.