Re: Additional improvements to extended statistics - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Additional improvements to extended statistics |
Date | |
Msg-id | 20200318193128.53soruotqdqnlaoy@development Whole thread Raw |
In response to | Re: Additional improvements to extended statistics (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: Additional improvements to extended statistics
|
List | pgsql-hackers |
On Sun, Mar 15, 2020 at 12:37:37PM +0000, Dean Rasheed wrote: >On Sun, 15 Mar 2020 at 00:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote: >> > >> >Attached is a patch series rebased on top of the current master, after >> >committing the ScalarArrayOpExpr enhancements. I've updated the OR patch >> >to get rid of the code duplication, and barring objections I'll get it >> >committed shortly together with the two parts improving test coverage. >> >> I've pushed the two patches improving test coverage for functional >> dependencies and MCV lists, which seems mostly non-controversial. I'll >> wait a bit more with the two patches actually changing behavior (rebased >> version attached, to keep cputube happy). >> > >Patch 0001 looks to be mostly ready. Just a couple of final comments: > >+ if (is_or) >+ simple_sel = clauselist_selectivity_simple_or(root, >stat_clauses, varRelid, >+ jointype, >sjinfo, NULL, 1.0); >+ else > >Surely that should be passing 0.0 as the final argument, otherwise it >will always just return simple_sel = 1.0. > > >+ * >+ * XXX We can't multiply with current value, because for OR clauses >+ * we start with 0.0, so we simply assign to s1 directly. >+ */ >+ s = statext_clauselist_selectivity(root, clauses, varRelid, >+ jointype, sjinfo, rel, >+ &estimatedclauses, true); > >That final part of the comment is no longer relevant (variable s1 no >longer exists). Probably it could now just be deleted, since I think >there are sufficient comments elsewhere to explain what's going on. > >Otherwise it looks good, and I think this will lead to some very >worthwhile improvements. > Attached is a rebased patch series, addressing both those issues. I've been wondering why none of the regression tests failed because of the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to make the tests stable, all the MCV lists we use are "perfect" i.e. it represents 100% of the data. But this selectivity is used to compute selectivity only for the part not represented by the MCV list, i.e. it's not really used. I suppose we could add a test that would use larger MCV item, but I'm afraid that'd be inherently unstable :-( Another thing I was thinking about is the changes to the API. We need to pass information whether the clauses are connected by AND or OR to a number of places, and 0001 does that in two ways. For some functions it adds a new parameter (called is_or), and for other functiosn it creates a new copy of a function. So for example - statext_mcv_clauselist_selectivity - statext_clauselist_selectivity got the new flag, while e.g. clauselist_selectivity gets a new "copy" sibling called clauselist_selectivity_or. There were two reasons for not using flag. First, clauselist_selectivity and similar functions have to do very different stuff for these two cases, so it'd be just one huge if/else block. Second, minimizing breakage of third-party code - pretty much all the extensions I've seen only work with AND clauses, and call clauselist_selectivity. Adding a flag would break that code. (Also, there's a bit of laziness, because this was the simplest thing to do during development.) But I wonder if that's sufficient reason - maybe we should just add the flag in all cases. It might break some code, but the fix is trivial (add a false there). Opinions? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: