Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Additional improvements to extended statistics
Date
Msg-id 01a57d32-e442-b57f-6154-fd42129d6307@enterprisedb.com
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Additional improvements to extended statistics
List pgsql-hackers
On 12/2/20 4:51 PM, Dean Rasheed wrote:
> On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I wonder how much of the comment before clauselist_selectivity should
>> move to clauselist_selectivity_ext - it does talk about range clauses
>> and so on, but clauselist_selectivity does not really deal with that.
>> But maybe that's just an implementation detail and it's better to keep
>> the comment the way it is.
> 
> I think it's better to keep it the way it is, so that the entirety of
> what clauselist_selectivity() does (via clauselist_selectivity_ext())
> can be read in one place, but I have added separate comments for the
> new "_ext" functions to explain how they differ. That matches similar
> examples elsewhere.
> 

+1

> 
>> I noticed this outdated comment:
>>
>>   /* Always compute the selectivity using clause_selectivity */
>>   s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,
> 
> Updated.
> 
> 
>> Also, the comment at clauselist_selectivity_or seems to not follow the
>> usual pattern, which I think is
>>
>> /*
>>  * function name
>>  *      short one-sentence description
>>  *
>>  * ... longer description ...
>>  */
> 
> Hmm, it seems OK to me. The first part is basically copied from
> clauselist_selectivity(). The "longer description" doesn't really have
> much more to say because it's much simpler than
> clauselist_selectivity(), but it seems neater to keep the two roughly
> in sync.
> 

I see. In that case it's OK, I guess.

> 
> I've been hacking on this a bit more and attached is an updated
> (hopefully final) version with some other comment improvements and
> also a couple of other tweaks:
> 
> The previous version had duplicated code blocks that implemented the
> same MCV-correction algorithm using simple_sel, mcv_sel, base_sel,
> other_sel and total_sel, which was quite messy. So I refactored that
> into a new function mcv_combine_selectivities(). About half the
> comments from statext_mcv_clauselist_selectivity() then move over to
> mcv_combine_selectivities(). I also updated the comments for
> mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to
> explain how their outputs are expected to be used by
> mcv_combine_selectivities(). That hopefully makes for a clean
> separation of concerns, and makes it easier to tweak the way MCV stats
> are applied on top of simple stats, if someone thinks of a better
> approach in the future.
> 
> In the previous version, for an ORed list of clauses, the MCV
> correction was only applied to the overlaps between clauses. That's OK
> as long as each clause only refers to a single column, since the
> per-column statistics ought to be the best way to estimate each
> individual clause in that case. However, if the individual clauses
> refer to more than one column, I think the MCV correction should be
> applied to each individual clause as well as to the overlaps. That
> turns out to be pretty straightforward, since we're already doing all
> the hard work of computing the match bitmap for each clause. The sort
> of queries I had in mind were things like this:
> 
>   WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)
> 
> I added a new test case along those lines and the new estimates are
> much better than they are without this patch, but not for the reason I
> thought --- the old code consistently over-estimated queries like that
> because it actually applied the MCV correction twice (once while
> processing each AND list, via clause_selectivity(), called from
> clauselist_selectivity_simple(), and once for the top-level OR clause,
> contained in a single-element implicitly-ANDed list). The way the new
> code is structured avoids any kind of double application of extended
> stats, producing a much better estimate, which is good.
> 

Nice.

> However, the new code doesn't apply the extended stats directly using
> clauselist_selectivity_or() for this kind of query because there are
> no RestrictInfos for the nested AND clauses, so
> find_single_rel_for_clauses() (and similarly
> statext_is_compatible_clause()) regards those clauses as not
> compatible with extended stats. So what ends up happening is that
> extended stats are used only when we descend down to the two AND
> clauses, and their results are combined using the original "s1 + s2 -
> s1 * s2" formula. That actually works OK in this case, because there
> is no overlap between the two AND clauses, but it wouldn't work so
> well if there was.
> 
> I'm pretty sure that can be fixed by teaching
> find_single_rel_for_clauses() and statext_is_compatible_clause() to
> handle BoolExpr clauses, looking for RestrictInfos underneath them,
> but I think that should be left for a follow-in patch. I have left a
> regression test in place, whose estimates ought to be improved by such
> a fix.
> 

Yeah, I agree with leaving this for a separate patch. We can't do
everything at the same time.

> The upshot of all that is that the new code that applies the MCV
> correction to the individual clauses in an ORed list doesn't help with
> queries like the one above at the moment, and it's not obvious whether
> it is currently reachable, but I think it's worth leaving in because
> it seems more principled, and makes that code more future-proof. I
> also think it's neater because now the signature of
> mcv_clause_selectivity_or() is more natural --- it's primary return
> value is now the clause's MCV selectivity, as suggested by the
> function's name, rather than the overlap selectivity that the previous
> version was returning. Also, after your "Var Op Var" patch is applied,
> I think it would be possible to construct queries that would benefit
> from this, so it would be good to get that committed too.
> 
> Barring any further comments, I'll push this sometime soon.
> 

+1


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Chapman Flack
Date:
Subject: Re: proposal: unescape_text function
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] [PATCH] Generic type subscripting