[RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog - Mailing list pgsql-hackers

From | Matthias van de Meent
Subject | [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
Msg-id | CAEze2Wgd6WLhksHjOmoPnfhZVw8T8_EjEOXz-Ek-xx_D5rHfCQ@mail.gmail.com
Responses | Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
List | pgsql-hackers
Note: I am not (currently) planning on implementing this rough idea, just putting it up to share and document the idea, at the request of Tomas (cc-ed).

The excellent pgconf.de presentation on PostgreSQL's extended statistics system by Tomas Vondra [0] talked about how the current default statistics assume the MCVs of columns to be fully independent, i.e. values of column A do not imply any values of columns B and C, and that for accurate estimates on correlated values the user needs to manually create statistics on the combined columns (by either CREATE STATISTICS or an index).

This is said to be due to limitations in our statistics collector: to determine the fraction of the table that contains a value, we store the N most common values together with the fraction of the table they occupy. Each fraction on its own is quite exact, but combining fractions from different columns proves difficult: there is nothing in the stored data that can confidently include or exclude parts of the table from a predicate using that MCV, so we can only assume that the values of two columns are independent.

After the presentation it came to me that if we were to add an estimator for the number of rows containing each value to the MCV lists, in the form of HLL sketches (in addition to, or replacing, the current most_common_elem_freqs fractions), we would be able to make better estimates for multi-column filters by combining the HLL row cardinality sketches of the MCVs being filtered on. This would remove the immediate need for manual statistics that effectively store the Cartesian product of those columns' MCVs with their occurrence fractions, which significantly reduces the need to create statistics by hand, a need that exists only because of planner mis-estimates on correlated columns. Custom statistics would still be required for expression statistics, but column correlation estimation _within MCVs_ would be much improved.

How I imagine this would work is that for each value in the MCV list, an HLL is maintained that estimates the number of distinct tuples containing that value. The counted element can be h(TID) or h(PK), or anything else that would uniquely identify returned tuples. Because all HLL sketches generated for a table are built over the same keyspace, you can apply union and intersection operations to the HLLs of the MCVs (for OR and AND clauses, respectively; HLL unions are lossless, while intersections have to be estimated, e.g. through inclusion-exclusion), and thereby provide fairly accurate estimates for the number of tuples that would be returned by the filter on that table. An illustrative sketch of this combination step follows at the end of this mail.

The required size of the HLL sketches can be determined by the number of tuples scanned during ANALYZE, potentially reducing the size required to store these sketches from the usual 1.5kB per sketch to something smaller: we will only ever need to count nTuples distinct values, so low values of default_statistics_target would allow smaller values of m in the HLL sketches, whilst still providing fairly accurate result estimates.

Kind regards,

Matthias van de Meent

PS: Several later papers correctly point out that HLL can only count up to 2^32 distinct values due to its use of a hash function that outputs only 32 bits, which is not enough for large tables. HLL++ solves this by using a hash function that outputs 64 bits, and can thus be considered a better alternative that provides the same features. But any other sketch that provides an accurate (though not necessarily perfect) count-distinct, and whose results can be combined, should be fine as well.

[0] https://www.postgresql.eu/events/pgconfde2022/schedule/session/3704-an-overview-of-extended-statistics-in-postgresql/
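To make the union/intersection step above concrete, here is a minimal toy HLL in C. This is only an illustration of the estimation scheme under the assumptions of this mail, not proposed PostgreSQL code: all names (hll_t, hll_add, hll_intersect_estimate, ...) are made up for this sketch, HLL_B is an arbitrary register count, and the small-range/large-range bias corrections of the real estimator are omitted for brevity.

    /*
     * Toy HLL sketch, hypothetical names; gcc/clang only (__builtin_clzll).
     * Each MCV would carry one such sketch over hashed tuple identifiers.
     */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define HLL_B 10                /* 2^10 = 1024 registers (m) */
    #define HLL_M (1 << HLL_B)

    typedef struct { uint8_t reg[HLL_M]; } hll_t;

    /* splitmix64 finalizer, standing in for a real 64-bit hash h(TID) */
    static uint64_t hll_hash(uint64_t x)
    {
        x += UINT64_C(0x9E3779B97F4A7C15);
        x = (x ^ (x >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
        x = (x ^ (x >> 27)) * UINT64_C(0x94D049BB133111EB);
        return x ^ (x >> 31);
    }

    /* Record one tuple identifier in the sketch */
    static void hll_add(hll_t *h, uint64_t tuple_id)
    {
        uint64_t x = hll_hash(tuple_id);
        uint32_t idx = (uint32_t) (x >> (64 - HLL_B)); /* top B bits pick a register */
        uint64_t rest = x << HLL_B | 1;                /* |1 avoids clz(0) */
        uint8_t  rho = (uint8_t) (__builtin_clzll(rest) + 1);
        if (rho > h->reg[idx])
            h->reg[idx] = rho;
    }

    /* Standard HLL estimator, bias corrections omitted */
    static double hll_estimate(const hll_t *h)
    {
        double sum = 0.0;
        for (int i = 0; i < HLL_M; i++)
            sum += ldexp(1.0, -h->reg[i]);             /* 2^-reg[i] */
        double alpha = 0.7213 / (1.0 + 1.079 / HLL_M);
        return alpha * HLL_M * HLL_M / sum;
    }

    /* OR clause: the union of two sketches is lossless, register-wise max */
    static void hll_union(hll_t *dst, const hll_t *a, const hll_t *b)
    {
        for (int i = 0; i < HLL_M; i++)
            dst->reg[i] = a->reg[i] > b->reg[i] ? a->reg[i] : b->reg[i];
    }

    /* AND clause: the intersection is estimated via inclusion-exclusion */
    static double hll_intersect_estimate(const hll_t *a, const hll_t *b)
    {
        hll_t u;
        hll_union(&u, a, b);
        double est = hll_estimate(a) + hll_estimate(b) - hll_estimate(&u);
        return est < 0.0 ? 0.0 : est;
    }

    int main(void)
    {
        hll_t a = {{0}}, b = {{0}};

        /* Two MCV sketches over the same tuple-id keyspace: "a = x" matches
         * tuples 0..9999, "b = y" matches 5000..14999, so the true
         * intersection is 5000 tuples. */
        for (uint64_t t = 0; t < 10000; t++)
            hll_add(&a, t);
        for (uint64_t t = 5000; t < 15000; t++)
            hll_add(&b, t);

        printf("|A| ~ %.0f, |B| ~ %.0f, |A AND B| ~ %.0f\n",
               hll_estimate(&a), hll_estimate(&b),
               hll_intersect_estimate(&a, &b));
        return 0;
    }

For a clause like WHERE a = x AND b = y where both values appear in their column's MCV list, the planner could then use hll_intersect_estimate(...) divided by the table's row count as the selectivity, instead of multiplying two fractions that assume independence. One known caveat: the inclusion-exclusion estimate inherits the absolute error of the union estimate, so its relative error grows when the true intersection is small compared to the union.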