Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Date | |
Msg-id | 5721b874-35f7-89aa-6c17-4f9a0e8063d0@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
|
List | pgsql-hackers |
On 07/15/2018 04:43 PM, Dean Rasheed wrote: > On 15 July 2018 at 14:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> It's quite unclear to me how this algorithm could reliably end up with >> hist_sel=0 (in cases where we already don't end up with that). I mean, >> if a bucket matches the conditions, then the only way to eliminate is by >> deducing that MCV already contains all the matches - and that's rather >> difficult for complex clauses ... > > Ah, I didn't realise that you were using histograms for equality > clauses as well. I had assumed that they would only use the MCV stats, > as in the univariate case. Using histograms for equality seems > problematic -- if bucket_contains_value() returns STATS_MATCH_PARTIAL, > as things stand that would end up with an estimate of half the > bucket's frequency, which seems excessive. Yeah, I think you're right - this is likely to produce over-estimates and needs rethinking. With top-level equality clauses it should be fairly trivial to use approach similar to the univariate case, i.e. computing ndistinct and using (1 - mcv_totalsel) / ndistinct If there are ndistinct coefficients this might be pretty good estimate of the non-MCV part, I think. But it only works for top-level equalities, not for complex clauses as in your examples. While looking at the statext_clauselist_selectivity code I think I see two more bugs: 1) the histogram_clauselist_selectivity() should probably use 'stat_clauses' and not 'clauses' 2) the early returns fail to estimate the neqclauses It's a bit too late here but I'll look at it tomorrow. > Also, if I'm reading it correctly, the code for histograms with > not-equals will return STATS_MATCH_PARTIAL for all but one of the > buckets, which isn't great either. > Ummm, why? > >> I don't know, really. I'll have to try hacking on this a bit I guess. >> But there's one obvious question - in what order should we add the >> clauses? Does it matter at all, or what is the optimal order? We don't >> need to worry about it now, because we simply consider all clauses at >> once, but I guess the proposed algorithm is more sensitive to this. > > I don't know. That's definitely one of the least satisfactory parts of > that idea. > > The alternative seems to be to improve the match tracking in your > current algorithm so that it keeps more detailed information about the > kinds of matches seen at each level, and combines them appropriately. > Maybe that's possible, but I'm struggling to see exactly how. Counting > equality clauses seen on each column might be a start. But it would > also need to track inequalities, with min/max values or fractions of > the non-MCV total, or some such thing. > Yeah, I agree, I'm not happy with this part either. But I'm grateful there's someone else thinking about the issues and proposing alternative approaches. Thanks for doing that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: