Re: Additional improvements to extended statistics - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Additional improvements to extended statistics |
Date | |
Msg-id | ec21ebff-67ff-3d80-462d-131bdde91b58@enterprisedb.com Whole thread Raw |
In response to | Re: Additional improvements to extended statistics (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: Additional improvements to extended statistics
Re: Additional improvements to extended statistics |
List | pgsql-hackers |
On 11/29/20 3:57 PM, Dean Rasheed wrote: >> On Wed, 18 Nov 2020 at 22:37, Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> Seems fine to me, although the "_opt_ext_stats" is rather cryptic. >>> AFAICS we use "_internal" for similar functions. >>> > > I have been thinking about this some more. The one part of this that I > still wasn't happy with was the way that base frequencies were used to > compute the selectivity correction to apply. As noted in [1], using > base frequencies in this way isn't really appropriate for clauses > combined using "OR". The reason is that an item's base frequency is > computed as the product of the per-column selectivities, so that (freq > - base_freq) is the right correction to apply for a set of clauses > combined with "AND", but it doesn't really work properly for clauses > combined with "OR". This is why a number of the estimates in the > regression tests end up being significant over-estimates. > > I speculated in [1] that we might fix that by tracking which columns > of the match bitmap actually matched the clauses being estimated, and > then only use those base frequencies. Unfortunately that would also > mean changing the format of the stats that we store, and so would be a > rather invasive change. > > It occurred to me though, that there is another, much more > straightforward way to do it. We can rewrite the "OR" clauses, and > turn them into "AND" clauses using the fact that > > P(A OR B) = P(A) + P(B) - P(A AND B) > > and then use the multivariate stats to estimate the P(A AND B) part in > the usual way. > OK, that seems quite reasonable. > Attached is the resulting patch doing it that way. The main change is > in the way that statext_mcv_clauselist_selectivity() works, combined > with a new function mcv_clause_selectivity_or() that does the > necessary MCV bitmap manipulations. > > Doing it this way also means that clausesel.c doesn't need to export > clauselist_selectivity_or(), and the new set of exported functions > seem a bit neater now. > Nice. I agree this looks way better than the version I hacked together. I wonder how much of the comment before clauselist_selectivity should move to clauselist_selectivity_ext - it does talk about range clauses and so on, but clauselist_selectivity does not really deal with that. But maybe that's just an implementation detail and it's better to keep the comment the way it is. I noticed this outdated comment: /* Always compute the selectivity using clause_selectivity */ s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo, Also, the comment at clauselist_selectivity_or seems to not follow the usual pattern, which I think is /* * function name * short one-sentence description * * ... longer description ... */ Those are fairly minor issues. I don't have any deeper objections, and it seems committable. Do you plan to do that sometime soon? > A handful of regression test results change, and in all cases except > one the new estimates are much better. One estimate is made worse, but > in that case we only have 2 sets of partial stats: > > SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0 > > with stats on (a,b) and (c,d) so it's not surprising that combining (a > = 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the > old MV stats estimate was more down to chance in this case. > Yeah, that's quite possible - we're multiplying two estimates, but there's a clear correlation. So it was mostly luck we had over-estimated the clauses before, which gave us higher product and thus accidentally better overall estimate. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: