Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
Date
Msg-id 76f80a24-fe89-c6f9-5e79-239a02e8a2b7@enterprisedb.com
Whole thread Raw
In response to Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog  (Bruce Momjian <bruce@momjian.us>)
Responses Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
List pgsql-hackers

On 5/25/22 00:16, Bruce Momjian wrote:
> On Mon, May 16, 2022 at 12:09:41AM +0200, Tomas Vondra wrote:
>> I think it's an interesting idea. In principle it allows deducing the
>> multi-column MCV for arbitrary combination of columns, not determined in
>> advance. We'd have the MCV with HLL instead of frequencies for columns
>> A, B and C:
>>
>> (a1, hll(a1))
>> (a2, hll(a2))
>> (...)
>> (aK, hll(aK))
>>
>>
>> (b1, hll(b1))
>> (b2, hll(b2))
>> (...)
>> (bL, hll(bL))
>>
>> (c1, hll(c1))
>> (c2, hll(c2))
>> (...)
>> (cM, hll(cM))
>>
>> and from this we'd be able to build MCV for any combination of those
>> three columns.
> 
> Sorry, but I am lost here.  I read about HLL here:
> 
>     https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
> 
> However, I don't see how they can be combined for multiple columns. 

It's the same as combining multiple HLL filters. HLL is essentially just
an array of counters, and to calculate a union (i.e. HLL for elements in
at least one of the input HLL sketches), you can just do Max() of the
counters. For intersection, you have to use inclusion-exclusion
principle, i.e.

    intersection(HLL1, HLL2)
        = estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1,HLL2))

which is exactly the same as

    P(A & B) = P(A) + P(B) - P(A | B)

There's more in:

    https://github.com/citusdata/postgresql-hll

    https://agkn.wordpress.com/2012/12/17/hll-intersections-2/

which also mentions the weakness - error is proportional to the size of
the union, while the intersection may be much smaller. Which might be an
issue, especially when combining multiple columns.


> Above, I know A,B,C are columns, but what is a1, a2, etc?

a1 is a value in column A, common enough to make it into the MCV. But
instead of just a frequency, we store a HLL tracking unique rows (by
adding CTID to the HLL).

So for example for a "city" column, you'd have

("NY", HLL of CTIDs for rows with city = NY)
("Philadephia", HLL of CTIDs for rows with city = Philadelphia)
...


etc.

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)
Next
From: Tomas Vondra
Date:
Subject: Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)