Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id 5721b874-35f7-89aa-6c17-4f9a0e8063d0@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists
List pgsql-hackers
On 07/15/2018 04:43 PM, Dean Rasheed wrote:
> On 15 July 2018 at 14:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> It's quite unclear to me how this algorithm could reliably end up with
>> hist_sel=0 (in cases where we already don't end up with that). I mean,
>> if a bucket matches the conditions, then the only way to eliminate is by
>> deducing that MCV already contains all the matches - and that's rather
>> difficult for complex clauses ...
> 
> Ah, I didn't realise that you were using histograms for equality
> clauses as well. I had assumed that they would only use the MCV stats,
> as in the univariate case. Using histograms for equality seems
> problematic -- if bucket_contains_value() returns STATS_MATCH_PARTIAL,
> as things stand that would end up with an estimate of half the
> bucket's frequency, which seems excessive.

Yeah, I think you're right - this is likely to produce over-estimates
and needs rethinking. With top-level equality clauses it should be
fairly trivial to use approach similar to the univariate case, i.e.
computing ndistinct and using

    (1 - mcv_totalsel) / ndistinct

If there are ndistinct coefficients this might be pretty good estimate
of the non-MCV part, I think. But it only works for top-level
equalities, not for complex clauses as in your examples.

While looking at the statext_clauselist_selectivity code I think I see
two more bugs:

1) the histogram_clauselist_selectivity() should probably use
'stat_clauses' and not 'clauses'

2) the early returns fail to estimate the neqclauses

It's a bit too late here but I'll look at it tomorrow.

> Also, if I'm reading it correctly, the code for histograms with
> not-equals will return STATS_MATCH_PARTIAL for all but one of the
> buckets, which isn't great either.
> 

Ummm, why?

> 
>> I don't know, really. I'll have to try hacking on this a bit I guess.
>> But there's one obvious question - in what order should we add the
>> clauses? Does it matter at all, or what is the optimal order? We don't
>> need to worry about it now, because we simply consider all clauses at
>> once, but I guess the proposed algorithm is more sensitive to this.
> 
> I don't know. That's definitely one of the least satisfactory parts of
> that idea.
> 
> The alternative seems to be to improve the match tracking in your
> current algorithm so that it keeps more detailed information about the
> kinds of matches seen at each level, and combines them appropriately.
> Maybe that's possible, but I'm struggling to see exactly how. Counting
> equality clauses seen on each column might be a start. But it would
> also need to track inequalities, with min/max values or fractions of
> the non-MCV total, or some such thing.
> 

Yeah, I agree, I'm not happy with this part either. But I'm grateful
there's someone else thinking about the issues and proposing alternative
approaches. Thanks for doing that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Usage of epoch in txid_current
Next
From: Tom Lane
Date:
Subject: Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"