Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id adc47878-2927-2d96-3efc-71d283052a2d@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers

On 07/15/2018 11:36 AM, Dean Rasheed wrote:
> ...
>
> What I'm considering is an algorithm where we simultaneously compute 3 things:
> 
> simple_sel - The result we would get without multivariate stats (*)
> mcv_sel - The multivariate MCV result
> hist_sel - The multivariate histogram result
> 
> (*) except that at each stage where we add a new clause to the
> simple_sel value, we improve upon that estimate by factoring in a
> lower bound from the multivariate stats so far, so that even if the
> multivariate stats fail to generate anything at the end, we've managed
> to account for some of the non-independence of the columns.
> 
> If the MCV matches represented all the matches, then at the end we
> would have simple_sel = mcv_sel, and hist_sel = 0, and we'd be done.
> 

It's quite unclear to me how this algorithm could reliably end up with
hist_sel=0 (in cases where we already don't end up with that). I mean,
if a bucket matches the conditions, then the only way to eliminate is by
deducing that MCV already contains all the matches - and that's rather
difficult for complex clauses ...

> Otherwise, we'd have simple_sel >= mcv_sel, and a possibly non-zero
> hist_sel, but if we had managed to factor in both mcv_sel and hist_sel
> to simple_sel at each stage as we went along, then simple_sel is the
> best overall estimate we can return.
> 

Hmm. I may not be understanding the algorithm yet, but I find it hard to
believe applying the stats incrementally is going to produce reliably
better estimates that looking at all clauses at once. I understand
deducing upper/lower boundaries is useful, but I wonder if we could do
that somehow with the current algorithm.

> Perhaps this is not so very different from what you're currently
> doing, except that with this approach we might also end up with
> mcv_sel = 0 and hist_sel = 0, but still have an improved simple_sel
> estimate that had accounted for some the multivariate stats available
> along the way.
> 

I don't know, really. I'll have to try hacking on this a bit I guess.
But there's one obvious question - in what order should we add the
clauses? Does it matter at all, or what is the optimal order? We don't
need to worry about it now, because we simply consider all clauses at
once, but I guess the proposed algorithm is more sensitive to this.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Andrey Borodin
Date:
Subject: Re: [HACKERS] [PATCH] kNN for SP-GiST
Next
From: David Rowley
Date:
Subject: Re: [HACKERS] Removing LEFT JOINs in more cases