Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Additional improvements to extended statistics
Date
Msg-id ec21ebff-67ff-3d80-462d-131bdde91b58@enterprisedb.com
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
On 11/29/20 3:57 PM, Dean Rasheed wrote:
>> On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
>>> AFAICS we use "_internal" for similar functions.
>>>
> 
> I have been thinking about this some more. The one part of this that I
> still wasn't happy with was the way that base frequencies were used to
> compute the selectivity correction to apply. As noted in [1], using
> base frequencies in this way isn't really appropriate for clauses
> combined using "OR". The reason is that an item's base frequency is
> computed as the product of the per-column selectivities, so that (freq
> - base_freq) is the right correction to apply for a set of clauses
> combined with "AND", but it doesn't really work properly for clauses
> combined with "OR". This is why a number of the estimates in the
> regression tests end up being significant over-estimates.
> 
> I speculated in [1] that we might fix that by tracking which columns
> of the match bitmap actually matched the clauses being estimated, and
> then only use those base frequencies. Unfortunately that would also
> mean changing the format of the stats that we store, and so would be a
> rather invasive change.
> 
> It occurred to me though, that there is another, much more
> straightforward way to do it. We can rewrite the "OR" clauses, and
> turn them into "AND" clauses using the fact that
> 
>   P(A OR B) = P(A) + P(B) - P(A AND B)
> 
> and then use the multivariate stats to estimate the P(A AND B) part in
> the usual way.
> 

OK, that seems quite reasonable.

> Attached is the resulting patch doing it that way. The main change is
> in the way that statext_mcv_clauselist_selectivity() works, combined
> with a new function mcv_clause_selectivity_or() that does the
> necessary MCV bitmap manipulations.
> 
> Doing it this way also means that clausesel.c doesn't need to export
> clauselist_selectivity_or(), and the new set of exported functions
> seem a bit neater now.
> 

Nice. I agree this looks way better than the version I hacked together.

I wonder how much of the comment before clauselist_selectivity should
move to clauselist_selectivity_ext - it does talk about range clauses
and so on, but clauselist_selectivity does not really deal with that.
But maybe that's just an implementation detail and it's better to keep
the comment the way it is.

I noticed this outdated comment:

  /* Always compute the selectivity using clause_selectivity */
  s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,

Also, the comment at clauselist_selectivity_or seems to not follow the
usual pattern, which I think is

/*
 * function name
 *    short one-sentence description
 *
 * ... longer description ...
 */

Those are fairly minor issues. I don't have any deeper objections, and
it seems committable. Do you plan to do that sometime soon?

> A handful of regression test results change, and in all cases except
> one the new estimates are much better. One estimate is made worse, but
> in that case we only have 2 sets of partial stats:
> 
>   SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0
> 
> with stats on (a,b) and (c,d) so it's not surprising that combining (a
> = 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the
> old MV stats estimate was more down to chance in this case.
> 

Yeah, that's quite possible - we're multiplying two estimates, but
there's a clear correlation. So it was mostly luck we had over-estimated
the clauses before, which gave us higher product and thus accidentally
better overall estimate.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Paul A Jungwirth
Date:
Subject: Re: range_agg
Next
From: Tomas Vondra
Date:
Subject: Re: Why does create_gather_merge_plan need make_sort?