Re: Additional improvements to extended statistics - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Additional improvements to extended statistics |
Date | |
Msg-id | 20200321215946.epx3dfi74stlwiab@development Whole thread Raw |
In response to | Re: Additional improvements to extended statistics (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: Additional improvements to extended statistics
|
List | pgsql-hackers |
On Thu, Mar 19, 2020 at 07:08:07PM +0000, Dean Rasheed wrote: >On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> Attached is a rebased patch series, addressing both those issues. >> >> I've been wondering why none of the regression tests failed because of >> the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to >> make the tests stable, all the MCV lists we use are "perfect" i.e. it >> represents 100% of the data. But this selectivity is used to compute >> selectivity only for the part not represented by the MCV list, i.e. it's >> not really used. I suppose we could add a test that would use larger >> MCV item, but I'm afraid that'd be inherently unstable :-( >> > >I think it ought to be possible to write stable tests for this code >branch -- I think all you need is for the number of rows to remain >small, so that the stats sample every row and are predictable, while >the MCVs don't cover all values, which can be achieved with a mix of >some common values, and some that only occur once. > >I haven't tried it, but it seems like it would be possible in principle. > Ah, right. Yeah, I think that should work. I thought there would be some volatility due to groups randomly not making it into the MCV list, but you're right it's possible to construct the data in a way to make it perfectly deterministic. So that's what I've done in the attached patch. >> Another thing I was thinking about is the changes to the API. We need to >> pass information whether the clauses are connected by AND or OR to a >> number of places, and 0001 does that in two ways. For some functions it >> adds a new parameter (called is_or), and for other functiosn it creates >> a new copy of a function. So for example >> >> - statext_mcv_clauselist_selectivity >> - statext_clauselist_selectivity >> >> got the new flag, while e.g. clauselist_selectivity gets a new "copy" >> sibling called clauselist_selectivity_or. >> >> There were two reasons for not using flag. First, clauselist_selectivity >> and similar functions have to do very different stuff for these two >> cases, so it'd be just one huge if/else block. Second, minimizing >> breakage of third-party code - pretty much all the extensions I've seen >> only work with AND clauses, and call clauselist_selectivity. Adding a >> flag would break that code. (Also, there's a bit of laziness, because >> this was the simplest thing to do during development.) >> >> But I wonder if that's sufficient reason - maybe we should just add the >> flag in all cases. It might break some code, but the fix is trivial (add >> a false there). >> >> Opinions? >> > >-1 > >I think of clause_selectivity() and clauselist_selectivity() as the >public API that everyone is using, whilst the functions that support >lists of clauses to be combined using OR are internal (to the planner) >implementation details. I think callers of public API tend to either >have implicitly AND'ed list of clauses, or a single OR clause. > OK, thanks. That was mostly my reasoning too - not wanting to cause unnecessary breakage. And yes, I agree most people just call clauselist_selectivity with a list of clauses combined using AND. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: