Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog - Mailing list pgsql-hackers
From: Bruce Momjian
Subject: Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
Date:
Msg-id: Yo7VpxsV6ZecZ+At@momjian.us
In response to: Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
List: pgsql-hackers
On Wed, May 25, 2022 at 11:55:40AM +0200, Tomas Vondra wrote:
> It's the same as combining multiple HLL filters. HLL is essentially
> just an array of counters, and to calculate a union (i.e. HLL for
> elements in at least one of the input HLL sketches), you can just do
> Max() of the counters. For intersection, you have to use the
> inclusion-exclusion principle, i.e.
>
>     intersection(HLL1, HLL2)
>       = estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1, HLL2))
>
> which is exactly the same as
>
>     P(A & B) = P(A) + P(B) - P(A | B)
>
> There's more in:
>
>     https://github.com/citusdata/postgresql-hll
>     https://agkn.wordpress.com/2012/12/17/hll-intersections-2/
>
> which also mentions the weakness - the error is proportional to the
> size of the union, while the intersection may be much smaller. That
> might be an issue, especially when combining multiple columns.
>
> > Above, I know A,B,C are columns, but what is a1, a2, etc?
>
> a1 is a value in column A, common enough to make it into the MCV. But
> instead of just a frequency, we store a HLL tracking unique rows (by
> adding CTID to the HLL).
>
> So for example for a "city" column, you'd have
>
>     ("NY", HLL of CTIDs for rows with city = NY)
>     ("Philadelphia", HLL of CTIDs for rows with city = Philadelphia)
>     ...

I read this email and attended an unconference session on this topic
today. Let me explain what I learned.

Currently we store the 100 (by default) most common values (MCVs) for
each column, along with a float4 recording the fraction of sampled rows
that have each value. When a query restricts multiple columns, we
multiply these fractions to estimate the fraction of matching rows,
e.g.:

    Philadelphia:  0.05
    USA:           0.15

    WHERE city = 'Philadelphia' AND country = 'USA'

    estimate 0.05 * 0.15 = 0.0075

However, if every "Philadelphia" row is also a "USA" row, the estimate
should be 0.05, but we can't know that from per-column statistics. We do
have extended statistics, which allow creating statistics on the
combined city/country columns.

The new idea here is to store, for each MCV entry, a compressed bitmap
of the sampled rows that match the value, rather than just a fraction.
This would allow us to AND the city and country bitmaps and determine
how many sampled rows actually match both restrictions:

    Philadelphia:        0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
    USA:                 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 1

    WHERE city = 'Philadelphia' AND country = 'USA'

    Philadelphia & USA:  0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0

    estimate 0.05

While HyperLogLog sketches could be used, a simpler compressed bitmap
might be sufficient. (Toy sketches of both approaches follow below.)

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
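As an appended illustration, here is a minimal toy HyperLogLog in
Python, just enough to make the register-wise Max() union and the
inclusion-exclusion intersection quoted above concrete. Everything in it
is invented for the example (the class name, the register count, SHA-256
as a stand-in 64-bit hash), and real implementations use more careful
bias corrections than the single linear-counting fallback shown here:

    import hashlib
    import math

    class HLL:
        """Toy HyperLogLog: 2**b registers of max leading-1 positions."""
        def __init__(self, b=10):
            self.b = b                  # index bits
            self.m = 1 << b             # number of registers ("counters")
            self.regs = [0] * self.m

        def add(self, item):
            # 64-bit hash of the item (e.g. a CTID)
            h = int.from_bytes(hashlib.sha256(repr(item).encode()).digest()[:8], "big")
            idx = h & (self.m - 1)      # low b bits pick a register
            w = h >> self.b             # remaining 64-b bits
            rho = (64 - self.b) - w.bit_length() + 1  # leftmost 1-bit position
            self.regs[idx] = max(self.regs[idx], rho)

        def estimate(self):
            alpha = 0.7213 / (1 + 1.079 / self.m)
            e = alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)
            if e <= 2.5 * self.m:       # small-range correction (linear counting)
                zeros = self.regs.count(0)
                if zeros:
                    e = self.m * math.log(self.m / zeros)
            return e

    def hll_union(a, b):
        # union is just the element-wise Max() of the registers
        u = HLL(a.b)
        u.regs = [max(x, y) for x, y in zip(a.regs, b.regs)]
        return u

    def hll_intersection(a, b):
        # inclusion-exclusion: |A & B| ~= |A| + |B| - |A union B|
        return a.estimate() + b.estimate() - hll_union(a, b).estimate()

    # 10,000 sampled rows: every row is in the USA, 5% are in Philadelphia
    phila, usa = HLL(), HLL()
    for ctid in range(10000):
        usa.add(ctid)
        if ctid % 20 == 0:
            phila.add(ctid)

    print(round(hll_intersection(phila, usa)))  # roughly 500, the true overlap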
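The bitmap variant is even simpler. This sketch reuses the 16-row
bitmaps from the example above; with only 16 sampled rows the numbers
work out to 0.1250 and 0.0703 rather than the illustrative 0.05, but the
point stands: ANDing the per-value bitmaps captures the correlation that
the independence assumption misses. A real implementation would
presumably store one compressed bitmap per MCV entry over the statistics
sample, not Python lists:

    philadelphia = [0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0]  # city = 'Philadelphia'
    usa          = [0,1,0,0,1,1,1,0,1,1,0,1,1,0,0,1]  # country = 'USA'

    def selectivity(bits):
        # fraction of sampled rows whose bit is set
        return sum(bits) / len(bits)

    # today's assumption: columns are independent, so multiply
    independent = selectivity(philadelphia) * selectivity(usa)

    # the proposal: AND the bitmaps, then count the surviving rows
    both = [p & u for p, u in zip(philadelphia, usa)]

    print(f"independence: {independent:.4f}")        # 0.0703
    print(f"bitmap AND:   {selectivity(both):.4f}")  # 0.1250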