Thread: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog

From: Matthias van de Meent

Note: I am not (currently) planning on implementing this rough idea,
just putting it up to share and document the idea, on request of Tomas
(cc-ed).

The excellent pgconf.de presentation on PostgreSQL's extended
statistics system by Tomas Vondra [0] talked about how the current
default statistics assume the MCVs of columns to be fully independent,
i.e. values of column A do not imply any value of columns B and C, and
that for accurate estimates on correlated values the user needs to
manually create statistics on the combined columns (by either CREATE
STATISTICS or an INDEX).

This is said to be due to limitations in our statistics collector: to
determine the fraction of the table that contains a value, we store
the N most common values together with the fraction of their
occurrence in the table. This value is quite exact, but combining
these values proves difficult: there is nothing in the stored value
that can confidently include or exclude parts of the table from a
predicate using that MCV, so we can only assume that the values of two
columns are independent.

After the presentation it came to me that if we were to add an
estimator for the number of rows with each value to the MCV lists in
the form of HLL sketches (in addition to, or replacing, the current
most_common_elem_freqs fractions), we would be able to make better
estimates for multi-column filters by combining the HLL row
cardinality sketches of the filtered MCVs. This would remove the
immediate need for manual statistics holding the cartesian product of
those columns' MCVs with their occurrence fractions, which
significantly reduces the need to create manual statistics - a need
that exists because of planner mis-estimates on correlated columns.
Custom statistics will still be required for expression statistics,
but column correlation estimation _within MCVs_ would be much
improved.

How I imagine this would work is that for each value in the MCV, an
HLL is maintained that estimates the number of distinct tuples
containing that value. This can be h(TID) or h(PK), or anything else
that would uniquely identify returned tuples. Because all of these
HLLs are built over the same keyspace (the same table), you can apply
union and intersection operations to the HLLs of the MCVs (for OR and
AND operations, respectively) and get fairly accurate estimates for
the number of tuples that would be returned by that filter on the
table.
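
To make the bookkeeping concrete, here is a rough sketch in Python,
using exact sets of sampled row identifiers as a stand-in for the
per-value HLL sketches (names and data layout are made up for
illustration; the set union/intersection below are exactly the
operations the real sketches would only be able to estimate):

    from collections import defaultdict

    # Per column: MCV value -> set of sampled row identifiers.  In the
    # proposal each of these sets would be an HLL sketch of h(TID)
    # rather than an exact set.
    mcv_sketches = defaultdict(dict)

    def analyze(sample_rows, mcv_values):
        """sample_rows: iterable of (row_id, {column: value});
        mcv_values: {column: set of that column's MCV values}."""
        for row_id, row in sample_rows:
            for column, value in row.items():
                if value in mcv_values.get(column, ()):
                    mcv_sketches[column].setdefault(value, set()).add(row_id)

    def estimate_rows_and(col_a, val_a, col_b, val_b):
        # WHERE col_a = val_a AND col_b = val_b: intersect the sketches
        a = mcv_sketches[col_a].get(val_a, set())
        b = mcv_sketches[col_b].get(val_b, set())
        return len(a & b)

    def estimate_rows_or(col_a, val_a, col_b, val_b):
        # WHERE col_a = val_a OR col_b = val_b: union of the sketches
        a = mcv_sketches[col_a].get(val_a, set())
        b = mcv_sketches[col_b].get(val_b, set())
        return len(a | b)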

The required size of the HLL sketches can be determined by the number
of tuples scanned during ANALYZE, potentially reducing the size
required to store these sketches from the usual 1.5kB per sketch to
something smaller - we only ever need to count nTuples distinct
values, so low values for default_statistics_target would allow
smaller values of m in the HLL sketches, whilst still providing fairly
accurate estimates.

Kind regards,

Matthias van de Meent

PS: Several later papers correctly point out that HLL can only count
up to 2^32 due to the use of a hash function that outputs only 32
bits, which is not enough for large tables. HLL++ solves this by using
a hash function that outputs 64 bits, and can thus be considered a
better alternative that provides the same features. That said, any
other sketch that provides an accurate (though not necessarily exact)
count-distinct whose results can be combined should work just as well.

[0]
https://www.postgresql.eu/events/pgconfde2022/schedule/session/3704-an-overview-of-extended-statistics-in-postgresql/



On 5/15/22 21:55, Matthias van de Meent wrote:
> Note: I am not (currently) planning on implementing this rough idea,
> just putting it up to share and document the idea, on request of Tomas
> (cc-ed).
> 
> The excellent pgconf.de presentation on PostgreSQL's extended
> statistics system by Tomas Vondra [0] talked about how the current
> default statistics assume the MCVs of columns to be fully independent,
> i.e. values of column A do not imply any value of columns B and C, and
> that for accurate data on correllated values the user needs to
> manually create statistics on the combined columns (by either
> STATISTICS or by INDEX).
> 
> This is said to be due to limitations in our statistics collector: to
> determine the fraction of the table that contains the value, we store
> the N most common values with the fraction of their occurrance in the
> table. This value is quite exact, but combining these values proves
> difficult: there is nothing in the stored value that can confidently
> include or exclude parts of the table from a predicate using that MCV,
> so we can only assume that the values of two columns are independent.
> 
> After the presentation it came to me that if we were to add an
> estimator for the number of rows with that value to the MCV lists in
> the form of HLL sketches (in addition to or replacing the current
> most_common_elem_freqs fractions), we would be able to make better
> estimates for multi-column filters by combining the HLL row
> cardinality sketches for filters that filter on these MCVs. This would
> remove the immediate need for manual statistics with an cartesian
> product of the MCVs of those columns with their occurrance fractions,
> which significantly reduces the need for the creation of manual
> statistics - the need that exists due to planner mis-estimates in
> correlated columns. Custom statistics will still be required for
> expression statistics, but column correlation estimations _within
> MCVs_ is much improved.
> 
> How I imagine this would work is that for each value in the MCV, an
> HLL is maintained that estimates the amount of distinct tuples
> containing that value. This can be h(TID) or h(PK), or anything else
> that would uniquely identify returned tuples. Because the keyspace of
> all HLLs that are generated are on the same table, you can apply join
> and intersection operations on the HLLs of the MCVs (for OR and
> AND-operations respectively), and provide fairly accurately estimates
> for the amount of tuples that would be returned by the filter on that
> table.
> 
> The required size of the HLL sketches can be determined by the amount
> of tuples scanned during analyze, potentially reducing the size
> required to store these HLL sketches from the usual 1.5kB per sketch
> to something smaller - we'll only ever need to count nTuples distinct
> values, so low values for default_statistics_target would allow for
> smaller values for m in the HLL sketches, whilst still providing
> fairly accurate result estimates.
> 

I think it's an interesting idea. In principle it allows deducing the
multi-column MCV for an arbitrary combination of columns, not
determined in advance. We'd have the MCV with HLL instead of
frequencies for columns A, B and C:

(a1, hll(a1))
(a2, hll(a2))
(...)
(aK, hll(aK))


(b1, hll(b1))
(b2, hll(b2))
(...)
(bL, hll(bL))

(c1, hll(c1))
(c2, hll(c2))
(...)
(cM, hll(cM))

and from this we'd be able to build MCV for any combination of those
three columns.

And in some sense it might even be more efficient/accurate, because the
MCV on (A,B,C) might have up to K*L*M items. If there are 100 items in
each column, that'd be 1,000,000 combinations, which we can't really
store (target is up to 10k). And even if we could, it'd be 1M
combinations with frequencies (so ~8-16B per combination).

While with the per-column MCV/HLL we'd have only 300 items, each with
an HLL. Assuming a 256-512B HLL is enough, that's still way smaller
than the multi-column MCV.
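
As a quick back-of-the-envelope check of the 100-items-per-column case,
using the ~8-16B per combination and 256-512B per HLL assumed above:

    # multi-column MCV on (A, B, C) with 100 MCV items per column
    combinations = 100 * 100 * 100                     # 1,000,000
    mcv_abc_bytes = (combinations * 8, combinations * 16)   # ~8 MB .. ~16 MB

    # per-column MCVs with one HLL per item instead
    items = 100 + 100 + 100                            # 300
    per_column_hll_bytes = (items * 256, items * 512)       # ~75 kB .. ~150 kB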

Even with target=10k it'd still be cheaper to store the separate MCV
with HLL values, if I count right, and there'd be no items omitted from
the MCV.

> Kind regards,
> 
> Matthias van de Meent
> 
> PS: Several later papers correctly point out that HLL can only count
> up to 2^32 due to the use of a hash function that outputs only 32
> bits; which is not enough for large tables. HLL++ solves this by using
> a hash function that outputs 64 bits, and can thus be considered a
> better alternative which provides the same features. But, any other
> sketch that provides an accurate (but not necessarily: perfect)
> count-distinct of which results can be combined should be fine as
> well.
> 

I don't think the 32-bit limitation is a problem for us, because we'd
only ever build the HLL on a sample, not the whole table. And the
samples are limited to 3M rows (with statistics target = 10k), so we're
nowhere near the scale requiring 64-bit hashes.

Presumably the statistics target value would determine the necessary HLL
parameters (and size), because e.g. with 30k rows we can't possibly see
more than 30k distinct values.

One possible problem is this all works only when all the columns are
analyzed at the same time / using the same sample. If you do this:

  ANALYZE t(a);
  ANALYZE t(b);

then the HLL sketches for the two columns would be built from different
ctid/PK samples, and hence can't be combined.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, 16 May 2022 at 00:09, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 5/15/22 21:55, Matthias van de Meent wrote:
> > Note: I am not (currently) planning on implementing this rough idea,
> > just putting it up to share and document the idea, on request of Tomas
> > (cc-ed).
> >
> > The excellent pgconf.de presentation on PostgreSQL's extended
> > statistics system by Tomas Vondra [0] talked about how the current
> > default statistics assume the MCVs of columns to be fully independent,
> > i.e. values of column A do not imply any value of columns B and C, and
> > that for accurate data on correllated values the user needs to
> > manually create statistics on the combined columns (by either
> > STATISTICS or by INDEX).
> >
> > This is said to be due to limitations in our statistics collector: to
> > determine the fraction of the table that contains the value, we store
> > the N most common values with the fraction of their occurrance in the
> > table. This value is quite exact, but combining these values proves
> > difficult: there is nothing in the stored value that can confidently
> > include or exclude parts of the table from a predicate using that MCV,
> > so we can only assume that the values of two columns are independent.
> >
> > After the presentation it came to me that if we were to add an
> > estimator for the number of rows with that value to the MCV lists in
> > the form of HLL sketches (in addition to or replacing the current
> > most_common_elem_freqs fractions), we would be able to make better
> > estimates for multi-column filters by combining the HLL row
> > cardinality sketches for filters that filter on these MCVs. This would
> > remove the immediate need for manual statistics with an cartesian
> > product of the MCVs of those columns with their occurrance fractions,
> > which significantly reduces the need for the creation of manual
> > statistics - the need that exists due to planner mis-estimates in
> > correlated columns. Custom statistics will still be required for
> > expression statistics, but column correlation estimations _within
> > MCVs_ is much improved.
> >
> > How I imagine this would work is that for each value in the MCV, an
> > HLL is maintained that estimates the amount of distinct tuples
> > containing that value. This can be h(TID) or h(PK), or anything else
> > that would uniquely identify returned tuples. Because the keyspace of
> > all HLLs that are generated are on the same table, you can apply join
> > and intersection operations on the HLLs of the MCVs (for OR and
> > AND-operations respectively), and provide fairly accurately estimates
> > for the amount of tuples that would be returned by the filter on that
> > table.
> >
> > The required size of the HLL sketches can be determined by the amount
> > of tuples scanned during analyze, potentially reducing the size
> > required to store these HLL sketches from the usual 1.5kB per sketch
> > to something smaller - we'll only ever need to count nTuples distinct
> > values, so low values for default_statistics_target would allow for
> > smaller values for m in the HLL sketches, whilst still providing
> > fairly accurate result estimates.
> >
>
> I think it's an interesting idea. In principle it allows deducing the
> multi-column MCV for arbitrary combination of columns, not determined in
> advance. We'd have the MCV with HLL instead of frequencies for columns
> A, B and C:
>
> (a1, hll(a1))
> (a2, hll(a2))
> (...)
> (aK, hll(aK))
>
>
> (b1, hll(b1))
> (b2, hll(b2))
> (...)
> (bL, hll(bL))
>
> (c1, hll(c1))
> (c2, hll(c2))
> (...)
> (cM, hll(cM))
>
> and from this we'd be able to build MCV for any combination of those
> three columns.
>
> And in some sense it might even be more efficient/accurate, because the
> MCV on (A,B,C) might have up to K*L*M items. if there's 100 items in
> each column, that'd be 1,000,000 combinations, which we can't really
> store (target is up to 10k). And even if we could, it'd be 1M
> combinations with frequencies (so ~8-16B per combination).
>
> While with the MCV/HLL, we'd have 300 items and HLL. Assuming 256-512B
> HLL would be enough, that's still way smaller than the multi-column MCV.

HLLs for statistics_target=100 could use 4 bits per bucket, but any target above 218 needs 5 bits (nbits = ceil(log2(log2(target * 300)))), so this saves only about 20% on storage.

Accuracy improves with sqrt(m), so while we can shrink the number of buckets, that is only OK if we accept the corresponding decrease in accuracy.
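
To put numbers on both of these (the 1.04 constant is the usual HLL
error factor; the bucket counts are just examples):

    import math

    def register_bits(stats_target, rows_per_target=300):
        # a register only has to hold ranks up to ~log2(sample size)
        sample_size = stats_target * rows_per_target
        return math.ceil(math.log2(math.log2(sample_size)))

    def relative_standard_error(m):
        # the usual HLL accuracy estimate, ~1.04 / sqrt(m) for m buckets
        return 1.04 / math.sqrt(m)

    print(register_bits(100))              # 4 bits per bucket (30,000-row sample)
    print(register_bits(219))              # 5 bits once the sample exceeds 65,536 rows
    print(relative_standard_error(512))    # ~0.046, i.e. ~4.6% error
    print(relative_standard_error(2048))   # ~0.023: 4x the buckets halves the error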

> Even with target=10k it'd still be cheaper to store the separate MCV
> with HLL values, if I count right, and there'd be no items omitted from
> the MCV.

There are more options, though: count-min sketches have been proposed as a replacement for MCV lists, and they provide a guaranteed maximum count for each value. If, instead of an actual counter, each bucket contained an HLL, we'd not only have occurrence estimates (though no longer hard max-count limits) for all values, but we'd also be able to do the cross-column result correlation estimates.
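
A rough sketch of that variant, with exact row-id sets again standing
in for the per-cell HLLs, and with made-up hash and dimension choices:

    import hashlib

    DEPTH, WIDTH = 4, 256        # illustrative count-min dimensions

    def _cell(d, value):
        digest = hashlib.sha256(f"{d}:{value}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % WIDTH

    class CountMinOfSketches:
        """Count-min-shaped table whose cells hold row-id sets instead of counters."""
        def __init__(self):
            # each cell would be an HLL sketch in the real proposal
            self.cells = [[set() for _ in range(WIDTH)] for _ in range(DEPTH)]

        def add(self, value, row_id):
            for d in range(DEPTH):
                self.cells[d][_cell(d, value)].add(row_id)

        def rows_with(self, value):
            # count-min's "take the least-polluted cell", on sketches
            return min((self.cells[d][_cell(d, value)] for d in range(DEPTH)), key=len)

    # Cross-column estimate for WHERE a = x AND b = y: intersect the two
    # per-column sketches (with HLLs this would be the inclusion-exclusion
    # intersection estimate rather than an exact set operation).
    def estimate_and(sketch_a, x, sketch_b, y):
        return len(sketch_a.rows_with(x) & sketch_b.rows_with(y))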

> > Kind regards,
> >
> > Matthias van de Meent
> >
> > PS: Several later papers correctly point out that HLL can only count
> > up to 2^32 due to the use of a hash function that outputs only 32
> > bits; which is not enough for large tables. HLL++ solves this by using
> > a hash function that outputs 64 bits, and can thus be considered a
> > better alternative which provides the same features. But, any other
> > sketch that provides an accurate (but not necessarily: perfect)
> > count-distinct of which results can be combined should be fine as
> > well.
> >
>
> I don't think the 32-bit limitation is a problem for us, because we'd be
> only ever build HLL on a sample, not the whole table. And the samples
> are limited to 3M rows (with statistics target = 10k), so we're nowhere
> near the scale requiring 64-bit hashes.

I was thinking towards incremental statistics with this, which would imply that inserts and updates also update the statistics. Appending to an HLL is fairly inexpensive, whereas that is not so trivial for table-fraction-based statistics.

> Presumably the statistics target value would determine the necessary HLL
> parameters (and size), because e.g. with 30k rows we can't possibly see
> more than 30k distinct values.
>
> One possible problem is this all works only when all the columns are
> analyzed at the same time / using the same sample. If you do this:
>
>   ANALYZE t(a);
>   ANALYZE t(b);
>
> then HLL filters sketches for the columns would use different ctid/PK
> values, and hence can't be combined.

Correct. This could be improved with a deterministic (but still pseudo-random) sample - as opposed to the current PRNG seeded with random() - where the sampled set is always the same for a given relation of constant size.

This would result in statistics that always overlap while the relation size is constant, and that still show a correlation when the relation size changes (with correlation = (1 - %delta) * (1 - %delta)). Only in the worst case would the correlation be non-existent and the resulting combinations no better than random - effectively regressing the estimator to P(A & B) = P(A) * P(B), which is what we currently have.

Example: pages could be sampled in order of increasing value of hash(PageNo || relid). This hash can be anything, but a reversible hash function would probably be best, because it helps us by not having to sort nBlocks hashed values for large tables (we can run the equivalent of [0], which might be cheaper than top-n-sorting an array of relsize blocks). The resulting set of scanned data would be stable for any unchanging relation size and would thus consistently select the same blocks when you analyze the table.
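
A simplified sketch of such a deterministic block sample (the hash and
the sort are placeholders - as noted above, a reversible hash would
avoid having to sort all block hashes on large tables):

    import hashlib

    def sample_blocks(relid: int, nblocks: int, n_sample_blocks: int):
        """Pick a stable pseudo-random subset of block numbers for one relation.

        The choice depends only on (relid, block number), so re-running
        ANALYZE on an unchanged relation samples exactly the same blocks,
        and a relation that grew or shrank a little still mostly overlaps
        with the previous sample.
        """
        def block_hash(blkno):
            digest = hashlib.sha256(f"{relid}:{blkno}".encode()).digest()
            return int.from_bytes(digest[:8], "big")

        # equivalent in spirit to the generate_series() query in [0]
        return sorted(range(nblocks), key=block_hash)[:n_sample_blocks]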

Regards,

Matthias

PS. I was looking through papers on different types of sketches, and found that (according to some papers) HLL has severe drawbacks when it comes to sketch intersection estimates. The Apache DataSketches project [1] states that it can provide accurate combined estimates through the Theta sketch [2][3] - an adaptation of the KMV sketch that does seem to provide accurate union, intersection and 'A not B' estimates, as well as exact results for low input row counts - something HLL also doesn't handle well.

There is a patent covering this (US9152691, expected to expire in 2033), and the project containing the code is licensed under Apache 2.

[0] SELECT reverse_hash(n) FROM generate_series(0, MaxBlockNumber, 1) t(n) WHERE reverse_hash(n) < tablesize LIMIT n_sample_blocks
[1] https://datasketches.apache.org/docs/Architecture/SketchFeaturesMatrix.html
[2] https://raw.githubusercontent.com/apache/datasketches-website/master/docs/pdf/ThetaSketchFramework.pdf
[3] https://raw.githubusercontent.com/apache/datasketches-website/master/docs/pdf/ThetaSketchEquations.pdf

On 5/19/22 19:59, Matthias van de Meent wrote:
> On Mon, 16 May 2022 at 00:09, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 5/15/22 21:55, Matthias van de Meent wrote:
>> > Note: I am not (currently) planning on implementing this rough idea,
>> > just putting it up to share and document the idea, on request of Tomas
>> > (cc-ed).
>> >
>> > The excellent pgconf.de presentation on PostgreSQL's extended
>> > statistics system by Tomas Vondra [0] talked about how the current
>> > default statistics assume the MCVs of columns to be fully independent,
>> > i.e. values of column A do not imply any value of columns B and C, and
>> > that for accurate data on correllated values the user needs to
>> > manually create statistics on the combined columns (by either
>> > STATISTICS or by INDEX).
>> >
>> > This is said to be due to limitations in our statistics collector: to
>> > determine the fraction of the table that contains the value, we store
>> > the N most common values with the fraction of their occurrance in the
>> > table. This value is quite exact, but combining these values proves
>> > difficult: there is nothing in the stored value that can confidently
>> > include or exclude parts of the table from a predicate using that MCV,
>> > so we can only assume that the values of two columns are independent.
>> >
>> > After the presentation it came to me that if we were to add an
>> > estimator for the number of rows with that value to the MCV lists in
>> > the form of HLL sketches (in addition to or replacing the current
>> > most_common_elem_freqs fractions), we would be able to make better
>> > estimates for multi-column filters by combining the HLL row
>> > cardinality sketches for filters that filter on these MCVs. This would
>> > remove the immediate need for manual statistics with an cartesian
>> > product of the MCVs of those columns with their occurrance fractions,
>> > which significantly reduces the need for the creation of manual
>> > statistics - the need that exists due to planner mis-estimates in
>> > correlated columns. Custom statistics will still be required for
>> > expression statistics, but column correlation estimations _within
>> > MCVs_ is much improved.
>> >
>> > How I imagine this would work is that for each value in the MCV, an
>> > HLL is maintained that estimates the amount of distinct tuples
>> > containing that value. This can be h(TID) or h(PK), or anything else
>> > that would uniquely identify returned tuples. Because the keyspace of
>> > all HLLs that are generated are on the same table, you can apply join
>> > and intersection operations on the HLLs of the MCVs (for OR and
>> > AND-operations respectively), and provide fairly accurately estimates
>> > for the amount of tuples that would be returned by the filter on that
>> > table.
>> >
>> > The required size of the HLL sketches can be determined by the amount
>> > of tuples scanned during analyze, potentially reducing the size
>> > required to store these HLL sketches from the usual 1.5kB per sketch
>> > to something smaller - we'll only ever need to count nTuples distinct
>> > values, so low values for default_statistics_target would allow for
>> > smaller values for m in the HLL sketches, whilst still providing
>> > fairly accurate result estimates.
>> >
>>
>> I think it's an interesting idea. In principle it allows deducing the
>> multi-column MCV for arbitrary combination of columns, not determined in
>> advance. We'd have the MCV with HLL instead of frequencies for columns
>> A, B and C:
>>
>> (a1, hll(a1))
>> (a2, hll(a2))
>> (...)
>> (aK, hll(aK))
>>
>>
>> (b1, hll(b1))
>> (b2, hll(b2))
>> (...)
>> (bL, hll(bL))
>>
>> (c1, hll(c1))
>> (c2, hll(c2))
>> (...)
>> (cM, hll(cM))
>>
>> and from this we'd be able to build MCV for any combination of those
>> three columns.
>>
>> And in some sense it might even be more efficient/accurate, because the
>> MCV on (A,B,C) might have up to K*L*M items. if there's 100 items in
>> each column, that'd be 1,000,000 combinations, which we can't really
>> store (target is up to 10k). And even if we could, it'd be 1M
>> combinations with frequencies (so ~8-16B per combination).
>>
>> While with the MCV/HLL, we'd have 300 items and HLL. Assuming 256-512B
>> HLL would be enough, that's still way smaller than the multi-column MCV.
> 
> HLLs for statistics_target=100 could use 4 bits per bucket, but any
> target above 218 should use 5 bits: nbits = ceil(log2(log2(target *
> 300))), and this saves only 20% on storage.
> 

I think the size estimates are somewhat misleading, as they ignore how
correlated the columns actually are. If they are strongly correlated,
there are probably much fewer combinations than the cartesian product.
That is, given two correlated columns with 100 items MCVs, the combined
MCV is likely much smaller than 10000 items.

And for non-correlated columns it doesn't really matter, because people
would not need to create the multi-column statistics at all.


> Accuracy increases with root(m), so while we can shrink the amount of
> buckets, this is only OK if we're accepting the corresponding decrease
> in accuracy.
> 

Hmm. So what's the rough size estimate in such a case?

FWIW I think we shouldn't be focusing on the size too much - I don't see
this as a simple alternative to multi-column MCV. The main benefits, I
think, are that it doesn't require creating such extended statistics in
advance, and the flexibility of combining arbitrary columns.

>> Even with target=10k it'd still be cheaper to store the separate MCV
>> with HLL values, if I count right, and there'd be no items omitted from
>> the MCV.
> 
> There are more options, though: Count-min was proposed as a replacement
> for MCV lists, and they work as guaranteed max count of distinct values.
> If, instead of an actual counter in each bucket, we would use HLL in
> each bucket, we'd not only have occurance estimates (but not: actual
> max-count limits) for all values, but we'd also be able to do the
> cross-column result correlation estimations.
> 

Right. I did actually write a PoC using a count-min sketch for join
estimates [1] about a year ago. My conclusion from that work was that CM
sketches are more a complement to MCVs than a replacement. That is, it
might be useful to have an MCV and then a CM sketch on the rows not
represented by the MCV.

[1]
https://www.postgresql.org/message-id/a08dda4c-aad4-a6b4-2cec-91363da73183%40enterprisedb.com


>> > Kind regards,
>> >
>> > Matthias van de Meent
>> >
>> > PS: Several later papers correctly point out that HLL can only count
>> > up to 2^32 due to the use of a hash function that outputs only 32
>> > bits; which is not enough for large tables. HLL++ solves this by using
>> > a hash function that outputs 64 bits, and can thus be considered a
>> > better alternative which provides the same features. But, any other
>> > sketch that provides an accurate (but not necessarily: perfect)
>> > count-distinct of which results can be combined should be fine as
>> > well.
>> >
>>
>> I don't think the 32-bit limitation is a problem for us, because we'd be
>> only ever build HLL on a sample, not the whole table. And the samples
>> are limited to 3M rows (with statistics target = 10k), so we're nowhere
>> near the scale requiring 64-bit hashes.
> 
> I was thinking towards incremental statistics with this, which would
> imply that inserts and updates do trigger updates in the statistics. It
> is fairly inexpensive to append to an HLL, whereas that is not so
> trivial for tablefraction-based statistics.
> 
>> Presumably the statistics target value would determine the necessary HLL
>> parameters (and size), because e.g. with 30k rows we can't possibly see
>> more than 30k distinct values.
>>
>> One possible problem is this all works only when all the columns are
>> analyzed at the same time / using the same sample. If you do this:
>>
>>   ANALYZE t(a);
>>   ANALYZE t(b);
>>
>> then HLL filters sketches for the columns would use different ctid/PK
>> values, and hence can't be combined.
> 
> Correct. This could be improved through deterministic random sample (as
> opposed to the current PRNG seeded with random()), where the sampled set
> is only pseudo-random and always the same for a given relation of
> constant size.
> 
> This would result in statistics that always overlap while the relation
> size is constant, and still show a corellation when the relation size
> changes (with correllation = (1 - %delta) * (1 - %delta) ). Only in the
> worst cases the correlation would be non-existent and the resulting
> combinations would be no different than random - effectively regressing
> the estimator function to P(A & B) = P(A) * P(B), which is the same as
> what we currently have.
> 
> Example: pages could be sampled in order of increasing value of
> hash(PageNo || relid). This hash can be anything, but a reversible hash
> function would probably be best, because it helps us by not having to
> sort nBlocks hashed values for large tables (we can run the equivalent
> of [0], which might be cheaper than top-n-sorting an array of relsize
> blocks). The resulting set of scanned data would be stable for any
> unchanging relation size and would thus consistently select the same
> blocks when you analyze the table.
> 

Not sure such deterministic sampling really solves the issue, because we
don't know what happened between the times the two statistics were
built. Rows may be moved due to UPDATE, new rows may be inserted into
the pages, etc.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, May 16, 2022 at 12:09:41AM +0200, Tomas Vondra wrote:
> I think it's an interesting idea. In principle it allows deducing the
> multi-column MCV for arbitrary combination of columns, not determined in
> advance. We'd have the MCV with HLL instead of frequencies for columns
> A, B and C:
> 
> (a1, hll(a1))
> (a2, hll(a2))
> (...)
> (aK, hll(aK))
> 
> 
> (b1, hll(b1))
> (b2, hll(b2))
> (...)
> (bL, hll(bL))
> 
> (c1, hll(c1))
> (c2, hll(c2))
> (...)
> (cM, hll(cM))
> 
> and from this we'd be able to build MCV for any combination of those
> three columns.

Sorry, but I am lost here.  I read about HLL here:

    https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869

However, I don't see how they can be combined for multiple columns. 
Above, I know A,B,C are columns, but what is a1, a2, etc?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





On 5/25/22 00:16, Bruce Momjian wrote:
> On Mon, May 16, 2022 at 12:09:41AM +0200, Tomas Vondra wrote:
>> I think it's an interesting idea. In principle it allows deducing the
>> multi-column MCV for arbitrary combination of columns, not determined in
>> advance. We'd have the MCV with HLL instead of frequencies for columns
>> A, B and C:
>>
>> (a1, hll(a1))
>> (a2, hll(a2))
>> (...)
>> (aK, hll(aK))
>>
>>
>> (b1, hll(b1))
>> (b2, hll(b2))
>> (...)
>> (bL, hll(bL))
>>
>> (c1, hll(c1))
>> (c2, hll(c2))
>> (...)
>> (cM, hll(cM))
>>
>> and from this we'd be able to build MCV for any combination of those
>> three columns.
> 
> Sorry, but I am lost here.  I read about HLL here:
> 
>     https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
> 
> However, I don't see how they can be combined for multiple columns. 

It's the same as combining multiple HLL sketches. An HLL is essentially
just an array of counters, and to calculate a union (i.e. the HLL for
elements in at least one of the input sketches), you can just take the
Max() of the counters. For the intersection, you have to use the
inclusion-exclusion principle, i.e.

    intersection(HLL1, HLL2)
        = estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1,HLL2))

which is exactly the same as

    P(A & B) = P(A) + P(B) - P(A | B)

There's more in:

    https://github.com/citusdata/postgresql-hll

    https://agkn.wordpress.com/2012/12/17/hll-intersections-2/

which also mentions the weakness - the error is proportional to the size
of the union, while the intersection may be much smaller. That might be
an issue, especially when combining multiple columns.
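
For reference, a minimal self-contained toy version of exactly those two
operations (512 registers and SHA-256 are arbitrary choices here, and
only the raw estimator plus the small-range correction is included):

    import hashlib, math

    P = 9                      # 2^9 = 512 registers, roughly 4.6% relative error
    M = 1 << P

    def hll_new():
        return [0] * M

    def hll_add(regs, value: str):
        h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
        idx = h >> (64 - P)                       # first P bits pick a register
        rest = (h << P) & ((1 << 64) - 1)         # remaining bits, left-aligned
        rank = 1                                  # position of the leftmost 1-bit
        while rank <= 64 - P and not (rest >> 63) & 1:
            rest = (rest << 1) & ((1 << 64) - 1)
            rank += 1
        regs[idx] = max(regs[idx], rank)

    def hll_union(a, b):
        return [max(x, y) for x, y in zip(a, b)]  # register-wise Max()

    def hll_estimate(regs):
        alpha = 0.7213 / (1 + 1.079 / M)
        raw = alpha * M * M / sum(2.0 ** -r for r in regs)
        zeros = regs.count(0)
        if raw <= 2.5 * M and zeros:              # small-range (linear counting) correction
            return M * math.log(M / zeros)
        return raw

    def hll_intersection(a, b):
        # inclusion-exclusion: |A ^ B| ~= |A| + |B| - |A u B|
        return hll_estimate(a) + hll_estimate(b) - hll_estimate(hll_union(a, b))

Note that hll_intersection() inherits the weakness above: its error
scales with the size of the union, so the estimate for a small
intersection of two large sketches can even come out negative.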


> Above, I know A,B,C are columns, but what is a1, a2, etc?

a1 is a value in column A, common enough to make it into the MCV. But
instead of just a frequency, we store an HLL tracking unique rows (by
adding the CTID to the HLL).

So for example for a "city" column, you'd have

("NY", HLL of CTIDs for rows with city = NY)
("Philadephia", HLL of CTIDs for rows with city = Philadelphia)
...


etc.

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, May 25, 2022 at 11:55:40AM +0200, Tomas Vondra wrote:
> It's the same as combining multiple HLL filters. HLL is essentially just
> an array of counters, and to calculate a union (i.e. HLL for elements in
> at least one of the input HLL sketches), you can just do Max() of the
> counters. For intersection, you have to use inclusion-exclusion
> principle, i.e.
> 
>     intersection(HLL1, HLL2)
>         = estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1,HLL2))
> 
> which is exactly the same as
> 
>     P(A & B) = P(A) + P(B) - P(A | B)
> 
> There's more in:
> 
>     https://github.com/citusdata/postgresql-hll
> 
>     https://agkn.wordpress.com/2012/12/17/hll-intersections-2/
> 
> which also mentions the weakness - error is proportional to the size of
> the union, while the intersection may be much smaller. Which might be an
> issue, especially when combining multiple columns.
> 
> 
> > Above, I know A,B,C are columns, but what is a1, a2, etc?
> 
> a1 is a value in column A, common enough to make it into the MCV. But
> instead of just a frequency, we store a HLL tracking unique rows (by
> adding CTID to the HLL).
> 
> So for example for a "city" column, you'd have
> 
> ("NY", HLL of CTIDs for rows with city = NY)
> ("Philadephia", HLL of CTIDs for rows with city = Philadelphia)
> ...

I read this email today and participated in an unconference session on
this topic today too.  Let me explain what I learned.

Currently we store the 100 (by default) most common values (MCV) for
each column, and a float4 with the fraction of rows that have that
value.  When we restrict multiple columns in a query, we multiply these
fractions to estimate the fraction of matching rows, e.g.:

    Philadelphia: 0.05
    USA:  0.15
    
    WHERE city = 'Philadelphia' and country = 'USA'
    estimate 0.05 * 0.15 = 0.0075

However, if we assume every "Philadelphia" is in the "USA", it should be
0.05, but we don't know that from the statistics.  We do have extended
statistics that allow us to create statistics on the combined
city/country columns.

The new idea here is to store a compressed bitmap of all of the sampled
rows that match each MCV value, rather than just a fraction.  This would
allow us to AND the bitmaps of city/country and determine how many rows
actually match both restrictions.

        Philadelphia:           0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
        USA:                    0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 1
    
    WHERE city = 'Philadelphia' and country = 'USA'
        Philadelphia & USA:     0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
    estimate 0.05

While hyper-log-log sketches could be used, a simpler compressed bitmap
might be sufficient.
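
A toy version of that bitmap arithmetic (uncompressed Python integers as
bitmaps, with the bit positions taken from the 16-row illustration
above):

    SAMPLE_SIZE = 16

    def bitmap(positions):
        """One bit per sampled row; bit i is set if sampled row i matched."""
        bits = 0
        for p in positions:
            bits |= 1 << p
        return bits

    philadelphia = bitmap([6, 9])                        # city = 'Philadelphia'
    usa = bitmap([1, 4, 5, 6, 8, 9, 11, 12, 15])         # country = 'USA'

    both = philadelphia & usa                            # AND of the restrictions
    selectivity = bin(both).count("1") / SAMPLE_SIZE     # 2 matching rows out of 16

Because every sampled Philadelphia row here is also a USA row, the AND
equals the Philadelphia bitmap - exactly the correlation that
multiplying the two fractions misses.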

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Sorry for waking a dead thread; I had this in my drafts folder that I
was cleaning out, and wanted to share it so anyone interested can reuse
these thoughts.

On Thu, 26 May 2022 at 03:19, Bruce Momjian <bruce@momjian.us> wrote:
> I read this email today and participated in an unconference session on
> this topic today too.  Let me explain what I learned.

My takeaways from this thread and that unconference session (other
notes from the session: [0]):

- Lossy count-distinct sketches require at least 100 "buckets" to get
an RSE of <10%, and 1000 buckets for an RSE of <2%.
The general formula for the RSE of the most popular of these sketches
is within a constant factor of 1 / sqrt(m) for m "buckets", which is
theorized to be the theoretical limit for count-distinct sketches that
use m "buckets".
RSE is the relative standard error, i.e. the accuracy of the estimate,
so with the popular sketches you need 4x as many buckets to double the
accuracy.
A "bucket" is a distinct counting value, e.g. the log-counters in
(H)LL, and bits in HyperBitBit.
Most sketches use anywhere from several bits to several bytes per
bucket: HLL uses 5 and 6 bits per bucket for 32- and 64-bit hashes,
respectively.

- If we implement sketches, it would be preferable for them to support
the common set operations [union, intersect, imply] while retaining
their properties, so that we don't lose the increased accuracy.
HLL does not support intersect or imply operations directly, which
makes it a bad choice as an estimator of rows returned.

- Bitmaps would be a good candidate for an initial sketch-based
statistics implementation.
Assuming the implementation compresses these bitmaps, the average
size would definitely not be larger than 900 bytes (+ administrative
overhead) per MCV entry - 3 bytes per sampled row for the index in the
bitmap, times 300 rows on average per MCV entry. Rows would be
identified by their index in the sampled list of rows.

- It is not clear that we need to be able to combine statistics from
multiple runs of ANALYZE.
We considered that it is rare for people to analyze only a subset of
columns, or for people to otherwise need to combine statistics from
distinct samples of the same table.

- Accurately combining statistics from two different runs of ANALYZE
requires stable sampling and lossless tuple identification.
The above-mentioned bitmap-based statistics would not allow this,
because the index of a sampled row will likely shift between runs,
even assuming the row appears in both samples.

Summary:
We shouldn't use HLL; (compressed) bitmaps will work fine for an
initial implementation of combined sketch-based MCV estimates.


Kind regards,

Matthias van de Meent

[0] https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Improving_Statistics_Accuracy