Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Additional improvements to extended statistics
Date
Msg-id 20200318193128.53soruotqdqnlaoy@development
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Additional improvements to extended statistics
List pgsql-hackers
On Sun, Mar 15, 2020 at 12:37:37PM +0000, Dean Rasheed wrote:
>On Sun, 15 Mar 2020 at 00:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:
>> >
>> >Attached is a patch series rebased on top of the current master, after
>> >committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
>> >to get rid of the code duplication, and barring objections I'll get it
>> >committed shortly together with the two parts improving test coverage.
>>
>> I've pushed the two patches improving test coverage for functional
>> dependencies and MCV lists, which seems mostly non-controversial. I'll
>> wait a bit more with the two patches actually changing behavior (rebased
>> version attached, to keep cputube happy).
>>
>
>Patch 0001 looks to be mostly ready. Just a couple of final comments:
>
>+       if (is_or)
>+           simple_sel = clauselist_selectivity_simple_or(root,
>stat_clauses, varRelid,
>+                                                         jointype,
>sjinfo, NULL, 1.0);
>+       else
>
>Surely that should be passing 0.0 as the final argument, otherwise it
>will always just return simple_sel = 1.0.
>
>
>+        *
>+        * XXX We can't multiply with current value, because for OR clauses
>+        * we start with 0.0, so we simply assign to s1 directly.
>+        */
>+       s = statext_clauselist_selectivity(root, clauses, varRelid,
>+                                          jointype, sjinfo, rel,
>+                                          &estimatedclauses, true);
>
>That final part of the comment is no longer relevant (variable s1 no
>longer exists). Probably it could now just be deleted, since I think
>there are sufficient comments elsewhere to explain what's going on.
>
>Otherwise it looks good, and I think this will lead to some very
>worthwhile improvements.
>

Attached is a rebased patch series, addressing both those issues.

I've been wondering why none of the regression tests failed because of
the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
make the tests stable, all the MCV lists we use are "perfect" i.e. it
represents 100% of the data. But this selectivity is used to compute
selectivity only for the part not represented by the MCV list, i.e. it's
not really used. I suppose we could add a test that would use larger
MCV item, but I'm afraid that'd be inherently unstable :-(

Another thing I was thinking about is the changes to the API. We need to
pass information whether the clauses are connected by AND or OR to a
number of places, and 0001 does that in two ways. For some functions it
adds a new parameter (called is_or), and for other functiosn it creates
a new copy of a function. So for example

   - statext_mcv_clauselist_selectivity
   - statext_clauselist_selectivity

got the new flag, while e.g. clauselist_selectivity gets a new "copy"
sibling called clauselist_selectivity_or.

There were two reasons for not using flag. First, clauselist_selectivity
and similar functions have to do very different stuff for these two
cases, so it'd be just one huge if/else block. Second, minimizing
breakage of third-party code - pretty much all the extensions I've seen
only work with AND clauses, and call clauselist_selectivity. Adding a
flag would break that code. (Also, there's a bit of laziness, because
this was the simplest thing to do during development.)

But I wonder if that's sufficient reason - maybe we should just add the
flag in all cases. It might break some code, but the fix is trivial (add
a false there).

Opinions?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: WAL usage calculation patch
Next
From: Alvaro Herrera
Date:
Subject: Re: Minimal logical decoding on standbys