Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id 0a2de724-ef48-b0b6-d2dd-7a6bde31f970@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers

On 07/16/2018 12:16 AM, Tomas Vondra wrote:
> 
> On 07/15/2018 04:43 PM, Dean Rasheed wrote:
>> On 15 July 2018 at 14:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> It's quite unclear to me how this algorithm could reliably end up with
>>> hist_sel=0 (in cases where we already don't end up with that). I mean,
>>> if a bucket matches the conditions, then the only way to eliminate is by
>>> deducing that MCV already contains all the matches - and that's rather
>>> difficult for complex clauses ...
>>
>> Ah, I didn't realise that you were using histograms for equality
>> clauses as well. I had assumed that they would only use the MCV stats,
>> as in the univariate case. Using histograms for equality seems
>> problematic -- if bucket_contains_value() returns STATS_MATCH_PARTIAL,
>> as things stand that would end up with an estimate of half the
>> bucket's frequency, which seems excessive.
> 
> Yeah, I think you're right - this is likely to produce over-estimates
> and needs rethinking. With top-level equality clauses it should be
> fairly trivial to use approach similar to the univariate case, i.e.
> computing ndistinct and using
> 
>      (1 - mcv_totalsel) / ndistinct
> 
> If there are ndistinct coefficients this might be pretty good estimate
> of the non-MCV part, I think. But it only works for top-level
> equalities, not for complex clauses as in your examples.
> 

On further thought, it's a bit more complicated, actually. Firstly, we 
already do that when there's no histogram (as in your example), and 
clearly it does not help. I initially thought it's a mistake to use the 
histogram in this case, but I can think of cases where it helps a lot.

1) when the equality clauses match nothing

In this case we may not find any buckets possibly matching the 
combination of values, producing selectivity estimate 0.0. While by 
using 1/ndistinct would give us something else.

2) when there are equality and inequality clauses

Similarly to the previous case, the equality clauses are useful in 
eliminating some of the buckets.

Now, I agree estimating equality clauses using histogram is tricky, so 
perhaps what we should do is using them as "conditions" to eliminate 
histogram buckets, but use ndistinct to estimate the selectivity. That 
is something like this:

   P(a=1 & b=1 & c<10 & d>=100)
    = P(a=1 & b=1) * P(c<10 & d>=100 | a=1 & b=1)
    = 1/ndistinct(a,b) * P(c<10 & d>=100 | a=1 & b=1)

where the second part is estimated using the histogram.

Of course, this still only works for the top-level equality clauses :-(

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Finding database for pg_upgrade missing library
Next
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists