Re: Additional improvements to extended statistics - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: Additional improvements to extended statistics |
Date | |
Msg-id | CAEZATCWFFU1NYqJ3pRDohDJZaDbv4NWXMgyRtph8zomagzeZag@mail.gmail.com Whole thread Raw |
In response to | Re: Additional improvements to extended statistics (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: Additional improvements to extended statistics
Re: Additional improvements to extended statistics |
List | pgsql-hackers |
On Sun, 29 Nov 2020 at 21:02, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I wonder how much of the comment before clauselist_selectivity should > move to clauselist_selectivity_ext - it does talk about range clauses > and so on, but clauselist_selectivity does not really deal with that. > But maybe that's just an implementation detail and it's better to keep > the comment the way it is. I think it's better to keep it the way it is, so that the entirety of what clauselist_selectivity() does (via clauselist_selectivity_ext()) can be read in one place, but I have added separate comments for the new "_ext" functions to explain how they differ. That matches similar examples elsewhere. > I noticed this outdated comment: > > /* Always compute the selectivity using clause_selectivity */ > s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo, Updated. > Also, the comment at clauselist_selectivity_or seems to not follow the > usual pattern, which I think is > > /* > * function name > * short one-sentence description > * > * ... longer description ... > */ Hmm, it seems OK to me. The first part is basically copied from clauselist_selectivity(). The "longer description" doesn't really have much more to say because it's much simpler than clauselist_selectivity(), but it seems neater to keep the two roughly in sync. I've been hacking on this a bit more and attached is an updated (hopefully final) version with some other comment improvements and also a couple of other tweaks: The previous version had duplicated code blocks that implemented the same MCV-correction algorithm using simple_sel, mcv_sel, base_sel, other_sel and total_sel, which was quite messy. So I refactored that into a new function mcv_combine_selectivities(). About half the comments from statext_mcv_clauselist_selectivity() then move over to mcv_combine_selectivities(). I also updated the comments for mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to explain how their outputs are expected to be used by mcv_combine_selectivities(). That hopefully makes for a clean separation of concerns, and makes it easier to tweak the way MCV stats are applied on top of simple stats, if someone thinks of a better approach in the future. In the previous version, for an ORed list of clauses, the MCV correction was only applied to the overlaps between clauses. That's OK as long as each clause only refers to a single column, since the per-column statistics ought to be the best way to estimate each individual clause in that case. However, if the individual clauses refer to more than one column, I think the MCV correction should be applied to each individual clause as well as to the overlaps. That turns out to be pretty straightforward, since we're already doing all the hard work of computing the match bitmap for each clause. The sort of queries I had in mind were things like this: WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2) I added a new test case along those lines and the new estimates are much better than they are without this patch, but not for the reason I thought --- the old code consistently over-estimated queries like that because it actually applied the MCV correction twice (once while processing each AND list, via clause_selectivity(), called from clauselist_selectivity_simple(), and once for the top-level OR clause, contained in a single-element implicitly-ANDed list). The way the new code is structured avoids any kind of double application of extended stats, producing a much better estimate, which is good. However, the new code doesn't apply the extended stats directly using clauselist_selectivity_or() for this kind of query because there are no RestrictInfos for the nested AND clauses, so find_single_rel_for_clauses() (and similarly statext_is_compatible_clause()) regards those clauses as not compatible with extended stats. So what ends up happening is that extended stats are used only when we descend down to the two AND clauses, and their results are combined using the original "s1 + s2 - s1 * s2" formula. That actually works OK in this case, because there is no overlap between the two AND clauses, but it wouldn't work so well if there was. I'm pretty sure that can be fixed by teaching find_single_rel_for_clauses() and statext_is_compatible_clause() to handle BoolExpr clauses, looking for RestrictInfos underneath them, but I think that should be left for a follow-in patch. I have left a regression test in place, whose estimates ought to be improved by such a fix. The upshot of all that is that the new code that applies the MCV correction to the individual clauses in an ORed list doesn't help with queries like the one above at the moment, and it's not obvious whether it is currently reachable, but I think it's worth leaving in because it seems more principled, and makes that code more future-proof. I also think it's neater because now the signature of mcv_clause_selectivity_or() is more natural --- it's primary return value is now the clause's MCV selectivity, as suggested by the function's name, rather than the overlap selectivity that the previous version was returning. Also, after your "Var Op Var" patch is applied, I think it would be possible to construct queries that would benefit from this, so it would be good to get that committed too. Barring any further comments, I'll push this sometime soon. Regards, Dean
Attachment
pgsql-hackers by date: