Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: Additional improvements to extended statistics
Date
Msg-id CAEZATCWFFU1NYqJ3pRDohDJZaDbv4NWXMgyRtph8zomagzeZag@mail.gmail.com
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Additional improvements to extended statistics
Re: Additional improvements to extended statistics
List pgsql-hackers
On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I wonder how much of the comment before clauselist_selectivity should
> move to clauselist_selectivity_ext - it does talk about range clauses
> and so on, but clauselist_selectivity does not really deal with that.
> But maybe that's just an implementation detail and it's better to keep
> the comment the way it is.

I think it's better to keep it the way it is, so that the entirety of
what clauselist_selectivity() does (via clauselist_selectivity_ext())
can be read in one place, but I have added separate comments for the
new "_ext" functions to explain how they differ. That matches similar
examples elsewhere.


> I noticed this outdated comment:
>
>   /* Always compute the selectivity using clause_selectivity */
>   s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,

Updated.


> Also, the comment at clauselist_selectivity_or seems to not follow the
> usual pattern, which I think is
>
> /*
>  * function name
>  *      short one-sentence description
>  *
>  * ... longer description ...
>  */

Hmm, it seems OK to me. The first part is basically copied from
clauselist_selectivity(). The "longer description" doesn't really have
much more to say because it's much simpler than
clauselist_selectivity(), but it seems neater to keep the two roughly
in sync.


I've been hacking on this a bit more and attached is an updated
(hopefully final) version with some other comment improvements and
also a couple of other tweaks:

The previous version had duplicated code blocks that implemented the
same MCV-correction algorithm using simple_sel, mcv_sel, base_sel,
other_sel and total_sel, which was quite messy. So I refactored that
into a new function mcv_combine_selectivities(). About half the
comments from statext_mcv_clauselist_selectivity() then move over to
mcv_combine_selectivities(). I also updated the comments for
mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to
explain how their outputs are expected to be used by
mcv_combine_selectivities(). That hopefully makes for a clean
separation of concerns, and makes it easier to tweak the way MCV stats
are applied on top of simple stats, if someone thinks of a better
approach in the future.

In the previous version, for an ORed list of clauses, the MCV
correction was only applied to the overlaps between clauses. That's OK
as long as each clause only refers to a single column, since the
per-column statistics ought to be the best way to estimate each
individual clause in that case. However, if the individual clauses
refer to more than one column, I think the MCV correction should be
applied to each individual clause as well as to the overlaps. That
turns out to be pretty straightforward, since we're already doing all
the hard work of computing the match bitmap for each clause. The sort
of queries I had in mind were things like this:

  WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)

I added a new test case along those lines and the new estimates are
much better than they are without this patch, but not for the reason I
thought --- the old code consistently over-estimated queries like that
because it actually applied the MCV correction twice (once while
processing each AND list, via clause_selectivity(), called from
clauselist_selectivity_simple(), and once for the top-level OR clause,
contained in a single-element implicitly-ANDed list). The way the new
code is structured avoids any kind of double application of extended
stats, producing a much better estimate, which is good.

However, the new code doesn't apply the extended stats directly using
clauselist_selectivity_or() for this kind of query because there are
no RestrictInfos for the nested AND clauses, so
find_single_rel_for_clauses() (and similarly
statext_is_compatible_clause()) regards those clauses as not
compatible with extended stats. So what ends up happening is that
extended stats are used only when we descend down to the two AND
clauses, and their results are combined using the original "s1 + s2 -
s1 * s2" formula. That actually works OK in this case, because there
is no overlap between the two AND clauses, but it wouldn't work so
well if there was.

I'm pretty sure that can be fixed by teaching
find_single_rel_for_clauses() and statext_is_compatible_clause() to
handle BoolExpr clauses, looking for RestrictInfos underneath them,
but I think that should be left for a follow-in patch. I have left a
regression test in place, whose estimates ought to be improved by such
a fix.

The upshot of all that is that the new code that applies the MCV
correction to the individual clauses in an ORed list doesn't help with
queries like the one above at the moment, and it's not obvious whether
it is currently reachable, but I think it's worth leaving in because
it seems more principled, and makes that code more future-proof. I
also think it's neater because now the signature of
mcv_clause_selectivity_or() is more natural --- it's primary return
value is now the clause's MCV selectivity, as suggested by the
function's name, rather than the overlap selectivity that the previous
version was returning. Also, after your "Var Op Var" patch is applied,
I think it would be possible to construct queries that would benefit
from this, so it would be good to get that committed too.

Barring any further comments, I'll push this sometime soon.

Regards,
Dean

Attachment

pgsql-hackers by date:

Previous
From: Dmitry Dolgov
Date:
Subject: Re: [HACKERS] [PATCH] Generic type subscripting
Next
From: Heikki Linnakangas
Date:
Subject: Deprecate custom encoding conversions