Re: Additional improvements to extended statistics - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Additional improvements to extended statistics |
Date | |
Msg-id | 01a57d32-e442-b57f-6154-fd42129d6307@enterprisedb.com Whole thread Raw |
In response to | Re: Additional improvements to extended statistics (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: Additional improvements to extended statistics
|
List | pgsql-hackers |
On 12/2/20 4:51 PM, Dean Rasheed wrote: > On Sun, 29 Nov 2020 at 21:02, Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I wonder how much of the comment before clauselist_selectivity should >> move to clauselist_selectivity_ext - it does talk about range clauses >> and so on, but clauselist_selectivity does not really deal with that. >> But maybe that's just an implementation detail and it's better to keep >> the comment the way it is. > > I think it's better to keep it the way it is, so that the entirety of > what clauselist_selectivity() does (via clauselist_selectivity_ext()) > can be read in one place, but I have added separate comments for the > new "_ext" functions to explain how they differ. That matches similar > examples elsewhere. > +1 > >> I noticed this outdated comment: >> >> /* Always compute the selectivity using clause_selectivity */ >> s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo, > > Updated. > > >> Also, the comment at clauselist_selectivity_or seems to not follow the >> usual pattern, which I think is >> >> /* >> * function name >> * short one-sentence description >> * >> * ... longer description ... >> */ > > Hmm, it seems OK to me. The first part is basically copied from > clauselist_selectivity(). The "longer description" doesn't really have > much more to say because it's much simpler than > clauselist_selectivity(), but it seems neater to keep the two roughly > in sync. > I see. In that case it's OK, I guess. > > I've been hacking on this a bit more and attached is an updated > (hopefully final) version with some other comment improvements and > also a couple of other tweaks: > > The previous version had duplicated code blocks that implemented the > same MCV-correction algorithm using simple_sel, mcv_sel, base_sel, > other_sel and total_sel, which was quite messy. So I refactored that > into a new function mcv_combine_selectivities(). About half the > comments from statext_mcv_clauselist_selectivity() then move over to > mcv_combine_selectivities(). I also updated the comments for > mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to > explain how their outputs are expected to be used by > mcv_combine_selectivities(). That hopefully makes for a clean > separation of concerns, and makes it easier to tweak the way MCV stats > are applied on top of simple stats, if someone thinks of a better > approach in the future. > > In the previous version, for an ORed list of clauses, the MCV > correction was only applied to the overlaps between clauses. That's OK > as long as each clause only refers to a single column, since the > per-column statistics ought to be the best way to estimate each > individual clause in that case. However, if the individual clauses > refer to more than one column, I think the MCV correction should be > applied to each individual clause as well as to the overlaps. That > turns out to be pretty straightforward, since we're already doing all > the hard work of computing the match bitmap for each clause. The sort > of queries I had in mind were things like this: > > WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2) > > I added a new test case along those lines and the new estimates are > much better than they are without this patch, but not for the reason I > thought --- the old code consistently over-estimated queries like that > because it actually applied the MCV correction twice (once while > processing each AND list, via clause_selectivity(), called from > clauselist_selectivity_simple(), and once for the top-level OR clause, > contained in a single-element implicitly-ANDed list). The way the new > code is structured avoids any kind of double application of extended > stats, producing a much better estimate, which is good. > Nice. > However, the new code doesn't apply the extended stats directly using > clauselist_selectivity_or() for this kind of query because there are > no RestrictInfos for the nested AND clauses, so > find_single_rel_for_clauses() (and similarly > statext_is_compatible_clause()) regards those clauses as not > compatible with extended stats. So what ends up happening is that > extended stats are used only when we descend down to the two AND > clauses, and their results are combined using the original "s1 + s2 - > s1 * s2" formula. That actually works OK in this case, because there > is no overlap between the two AND clauses, but it wouldn't work so > well if there was. > > I'm pretty sure that can be fixed by teaching > find_single_rel_for_clauses() and statext_is_compatible_clause() to > handle BoolExpr clauses, looking for RestrictInfos underneath them, > but I think that should be left for a follow-in patch. I have left a > regression test in place, whose estimates ought to be improved by such > a fix. > Yeah, I agree with leaving this for a separate patch. We can't do everything at the same time. > The upshot of all that is that the new code that applies the MCV > correction to the individual clauses in an ORed list doesn't help with > queries like the one above at the moment, and it's not obvious whether > it is currently reachable, but I think it's worth leaving in because > it seems more principled, and makes that code more future-proof. I > also think it's neater because now the signature of > mcv_clause_selectivity_or() is more natural --- it's primary return > value is now the clause's MCV selectivity, as suggested by the > function's name, rather than the overlap selectivity that the previous > version was returning. Also, after your "Var Op Var" patch is applied, > I think it would be possible to construct queries that would benefit > from this, so it would be good to get that committed too. > > Barring any further comments, I'll push this sometime soon. > +1 regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: