[RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog - Mailing list pgsql-hackers

From Matthias van de Meent
Subject [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
Date
Msg-id CAEze2Wgd6WLhksHjOmoPnfhZVw8T8_EjEOXz-Ek-xx_D5rHfCQ@mail.gmail.com
Whole thread Raw
Responses Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
List pgsql-hackers
Note: I am not (currently) planning on implementing this rough idea,
just putting it up to share and document the idea, on request of Tomas
(cc-ed).

The excellent pgconf.de presentation on PostgreSQL's extended
statistics system by Tomas Vondra [0] talked about how the current
default statistics assume the MCVs of columns to be fully independent,
i.e. values of column A do not imply any value of columns B and C, and
that for accurate data on correllated values the user needs to
manually create statistics on the combined columns (by either
STATISTICS or by INDEX).

This is said to be due to limitations in our statistics collector: to
determine the fraction of the table that contains the value, we store
the N most common values with the fraction of their occurrance in the
table. This value is quite exact, but combining these values proves
difficult: there is nothing in the stored value that can confidently
include or exclude parts of the table from a predicate using that MCV,
so we can only assume that the values of two columns are independent.

After the presentation it came to me that if we were to add an
estimator for the number of rows with that value to the MCV lists in
the form of HLL sketches (in addition to or replacing the current
most_common_elem_freqs fractions), we would be able to make better
estimates for multi-column filters by combining the HLL row
cardinality sketches for filters that filter on these MCVs. This would
remove the immediate need for manual statistics with an cartesian
product of the MCVs of those columns with their occurrance fractions,
which significantly reduces the need for the creation of manual
statistics - the need that exists due to planner mis-estimates in
correlated columns. Custom statistics will still be required for
expression statistics, but column correlation estimations _within
MCVs_ is much improved.

How I imagine this would work is that for each value in the MCV, an
HLL is maintained that estimates the amount of distinct tuples
containing that value. This can be h(TID) or h(PK), or anything else
that would uniquely identify returned tuples. Because the keyspace of
all HLLs that are generated are on the same table, you can apply join
and intersection operations on the HLLs of the MCVs (for OR and
AND-operations respectively), and provide fairly accurately estimates
for the amount of tuples that would be returned by the filter on that
table.

The required size of the HLL sketches can be determined by the amount
of tuples scanned during analyze, potentially reducing the size
required to store these HLL sketches from the usual 1.5kB per sketch
to something smaller - we'll only ever need to count nTuples distinct
values, so low values for default_statistics_target would allow for
smaller values for m in the HLL sketches, whilst still providing
fairly accurate result estimates.

Kind regards,

Matthias van de Meent

PS: Several later papers correctly point out that HLL can only count
up to 2^32 due to the use of a hash function that outputs only 32
bits; which is not enough for large tables. HLL++ solves this by using
a hash function that outputs 64 bits, and can thus be considered a
better alternative which provides the same features. But, any other
sketch that provides an accurate (but not necessarily: perfect)
count-distinct of which results can be combined should be fine as
well.

[0]
https://www.postgresql.eu/events/pgconfde2022/schedule/session/3704-an-overview-of-extended-statistics-in-postgresql/



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Reproducible coliisions in jsonb_hash
Next
From: Tomas Vondra
Date:
Subject: Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog