Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
Msg-id Yo7VpxsV6ZecZ+At@momjian.us
In response to Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
List pgsql-hackers
On Wed, May 25, 2022 at 11:55:40AM +0200, Tomas Vondra wrote:
> It's the same as combining multiple HLL sketches. HLL is essentially just
> an array of counters, and to calculate a union (i.e. HLL for elements in
> at least one of the input HLL sketches), you can just do Max() of the
> counters. For intersection, you have to use inclusion-exclusion
> principle, i.e.
> 
>     intersection(HLL1, HLL2)
>         = estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1,HLL2))
> 
> which is exactly the same as
> 
>     P(A & B) = P(A) + P(B) - P(A | B)
> 
> There's more in:
> 
>     https://github.com/citusdata/postgresql-hll
> 
>     https://agkn.wordpress.com/2012/12/17/hll-intersections-2/
> 
> which also mentions the weakness - error is proportional to the size of
> the union, while the intersection may be much smaller. Which might be an
> issue, especially when combining multiple columns.
> 
> 
> > Above, I know A,B,C are columns, but what is a1, a2, etc?
> 
> a1 is a value in column A, common enough to make it into the MCV list.
> But instead of just a frequency, we store an HLL tracking the unique
> rows (by adding their CTIDs to the HLL).
> 
> So for example for a "city" column, you'd have
> 
> ("NY", HLL of CTIDs for rows with city = NY)
> ("Philadephia", HLL of CTIDs for rows with city = Philadelphia)
> ...

I read this email today, and also participated in an unconference
session on this topic today.  Let me explain what I learned.

Currently we store the 100 (by default) most common values (MCVs) for
each column, along with a float4 giving the fraction of rows that have
each value.  When a query restricts multiple columns, we multiply these
fractions to estimate the number of matching rows, e.g.:

    Philadelphia: 0.05
    USA:  0.15
    
    WHERE city = 'Philadelphia' and country = 'USA'
    estimate 0.05 * 0.15 = 0.0075
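
In code form, that independence assumption is just a product of the
per-column MCV fractions.  A tiny sketch (illustrative Python, using the
numbers from the example above; "mcv_freq" is a made-up stand-in for the
per-column statistics):

    mcv_freq = {
        ("city", "Philadelphia"): 0.05,
        ("country", "USA"): 0.15,
    }

    def independent_selectivity(*preds):
        # WHERE city = 'Philadelphia' AND country = 'USA', assuming
        # the columns are statistically independent
        sel = 1.0
        for col, val in preds:
            sel *= mcv_freq[(col, val)]
        return sel

    print(independent_selectivity(("city", "Philadelphia"),
                                  ("country", "USA")))    # 0.0075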

However, if every "Philadelphia" row is also a "USA" row, the estimate
should be 0.05, but we can't know that from the per-column statistics.
We do already have extended statistics, which allow us to build
statistics on the combined city/country columns.

The new idea here is to store, for each MCV entry, a compressed bitmap
of the sampled rows that match that value, rather than just a fraction.
This would allow us to AND the city and country bitmaps and determine
how many sampled rows actually match both restrictions.

        Philadelphia:           0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
        USA:                    0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 1
    
    WHERE city = 'Philadelphia' and country = 'USA'
        Philadelphia & USA:     0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
    estimate 0.05 (the Philadelphia fraction, since its bitmap is a
    subset of the USA bitmap)
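
As a sketch of that AND (plain Python integers standing in for the toy
16-row bitmaps above; a real implementation would presumably use some
compressed bitmap format):

    SAMPLE_ROWS = 16

    # the bitmaps from the example, leftmost bit = first sampled row
    philly = int("0000001001000000", 2)   # city = 'Philadelphia'
    usa    = int("0100111011011001", 2)   # country = 'USA'

    both = philly & usa            # rows matching both restrictions
    assert both == philly          # every Philadelphia row is a USA row

    matches = bin(both).count("1")         # popcount
    print(matches, "of", SAMPLE_ROWS, "sampled rows match both")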

While HyperLogLog sketches could be used, a simpler compressed bitmap
might be sufficient.
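
If HLL sketches were used instead, the union/intersection arithmetic
Tomas describes above (union = register-wise max, intersection by
inclusion-exclusion) could look roughly like this.  This is only a
minimal illustrative sketch (no bias corrections, a toy register count,
and a stand-in hash), not the postgresql-hll implementation:

    import hashlib

    M = 256  # number of registers; real sketches tune this for accuracy

    def _hash64(value):
        # 64 uniform bits from a stable hash of the value (e.g. a CTID)
        return int.from_bytes(
            hashlib.sha256(repr(value).encode()).digest()[:8], "big")

    class HLL:
        def __init__(self):
            self.reg = [0] * M

        def add(self, value):
            h = _hash64(value)
            idx = h % M          # which register this value lands in
            rest = h // M or 1   # remaining bits (avoid the all-zero case)
            rank = 1             # 1-based position of the lowest set bit
            while rest & 1 == 0:
                rank += 1
                rest >>= 1
            self.reg[idx] = max(self.reg[idx], rank)

        def estimate(self):
            # raw HyperLogLog estimator, no small/large-range corrections
            alpha = 0.7213 / (1 + 1.079 / M)
            return alpha * M * M / sum(2.0 ** -r for r in self.reg)

    def hll_union(a, b):
        u = HLL()                # union is lossless: register-wise max
        u.reg = [max(x, y) for x, y in zip(a.reg, b.reg)]
        return u

    def hll_intersection(a, b):
        # inclusion-exclusion: |A & B| ~= |A| + |B| - |A union B|
        return a.estimate() + b.estimate() - hll_union(a, b).estimate()

    # one HLL per MCV value, fed the CTIDs of the matching sampled rows
    philly, usa = HLL(), HLL()
    for ctid in range(10000):    # pretend these are the USA rows' CTIDs
        usa.add(ctid)
        if ctid < 500:           # the first 500 are also Philadelphia rows
            philly.add(ctid)
    print(hll_intersection(philly, usa))  # ~500; error scales with the union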

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson



