I am writing to seek guidance and potential collaboration on a project to improve cardinality estimation in PostgreSQL. The project aims to improve join result cardinality estimates by using HyperLogLog (HLL) distinct-count estimates alongside the existing join estimation logic.
Project Overview:
Goal: Improve the accuracy of join cardinality estimation using HLL sketches
Scope: Modify the existing join estimation logic to consider HLL-based distinct count estimates
Expected benefit: More accurate query plans for joins on high-cardinality columns
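
To make the goal concrete, here is a minimal sketch of where a distinct-count estimate enters a join size estimate. It uses the textbook equi-join formula (uniform value distribution, with the smaller set of distinct values contained in the larger one) rather than the actual PostgreSQL code. The function name and parameters are hypothetical, and the ndistinct inputs are where an HLL-derived estimate would be substituted.

```c
#include <assert.h>

/*
 * Hypothetical helper, not a PostgreSQL function.
 *
 * Textbook equi-join size estimate:
 *     rows(R join S) ~= rows(R) * rows(S) / max(nd(R.a), nd(S.b))
 *
 * The two ndistinct arguments are the inputs that an HLL sketch
 * would supply in the proposed approach.
 */
static double
estimate_equijoin_rows(double outer_rows, double inner_rows,
                       double outer_ndistinct, double inner_ndistinct)
{
    double max_nd;

    /* A distinct count of zero would make the formula meaningless. */
    assert(outer_ndistinct > 0.0 && inner_ndistinct > 0.0);

    max_nd = (outer_ndistinct > inner_ndistinct) ? outer_ndistinct
                                                 : inner_ndistinct;
    return outer_rows * inner_rows / max_nd;
}
```

Because the estimate is divided by the larger distinct count, any error in that count carries over proportionally into the row estimate, which is why I expect better distinct-count estimates to matter most for high-cardinality join columns.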
Technical Areas of Interest:
Current implementation of join selectivity estimation in src/backend/optimizer, together with the selectivity functions (such as eqjoinsel) in src/backend/utils/adt/selfuncs.c
Integration points for HLL sketches within the existing statistics framework
Potential modifications needed to the join operator logic
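
For reference, the kind of sketch I have in mind looks roughly like the following. This is a simplified, standalone illustration and not code from the PostgreSQL tree: the type and function names, the precision (12 bits, 4096 registers), and the use of a splitmix64-style hash are all my own assumptions. It implements only the raw HyperLogLog estimator and omits the small-range and large-range corrections, so it is not reliable below roughly 2.5 times the register count. The property I care about for joins is that two sketches can be merged by taking the register-wise maximum.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HLL_P 12                 /* index bits (assumed precision) */
#define HLL_M (1u << HLL_P)      /* number of registers: 4096 */

/* Hypothetical sketch type for illustration only. */
typedef struct { uint8_t reg[HLL_M]; } hll_sketch;

/* splitmix64 finalizer, used here as a stand-in 64-bit hash. */
static uint64_t hll_hash(uint64_t x)
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static void hll_init(hll_sketch *s) { memset(s->reg, 0, sizeof(s->reg)); }

static void hll_add(hll_sketch *s, uint64_t value)
{
    uint64_t h = hll_hash(value);
    uint32_t idx = (uint32_t) (h >> (64 - HLL_P));  /* top P bits pick a register */
    uint64_t rest = h << HLL_P;                     /* remaining 64 - P bits */
    uint8_t rank = 1;                               /* position of first 1 bit */

    while (rank <= 64 - HLL_P && !(rest & (1ULL << 63)))
    {
        rank++;
        rest <<= 1;
    }
    if (rank > s->reg[idx])
        s->reg[idx] = rank;
}

/* Merge src into dst: the result equals the sketch of the union of inputs. */
static void hll_merge(hll_sketch *dst, const hll_sketch *src)
{
    for (uint32_t i = 0; i < HLL_M; i++)
        if (src->reg[i] > dst->reg[i])
            dst->reg[i] = src->reg[i];
}

/* Raw HLL estimate: alpha_m * m^2 / sum(2^-reg[i]); no range corrections. */
static double hll_estimate(const hll_sketch *s)
{
    double sum = 0.0;
    double alpha = 0.7213 / (1.0 + 1.079 / (double) HLL_M);

    for (uint32_t i = 0; i < HLL_M; i++)
    {
        assert(s->reg[i] <= 64 - HLL_P + 1);
        sum += 1.0 / (double) (1ULL << s->reg[i]);
    }
    return alpha * (double) HLL_M * (double) HLL_M / sum;
}
```

With 4096 registers the expected relative error is about 1.04 / sqrt(4096), or roughly 1.6%, for a sketch of 4 kB per column. Since merging is a register-wise maximum, the distinct count of the union of two join columns can be estimated without rescanning either table, which is one possible way to feed the join estimate.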
Questions for the Community:
Has similar work been attempted or discussed previously?
What would be the preferred approach to integrate HLL estimates with the existing join estimation framework?
Are there specific areas of the codebase I should focus on initially?
Would this enhancement align with the project's current direction for query optimization?
I have previously modified the buffer replacement policy in Postgres, replacing the clock sweep algorithm with a LazyBufferReplacementPolicy built on FIFO queues, so I have some familiarity with the Postgres codebase.
I would greatly appreciate any guidance, feedback, or suggestions from the community. I'm happy to provide more detailed information about the proposed approach or clarify any aspects of the project.