Re: Implement missing join selectivity estimation for range types - Mailing list pgsql-hackers
From | Mahmoud Sakr |
---|---|
Subject | Re: Implement missing join selectivity estimation for range types |
Date | |
Msg-id | CAB4o4aud47V_iRyWtA8+ZAmdXDjCF165R73AeCjx2RL0nzQzHA@mail.gmail.com Whole thread Raw |
In response to | Re: Implement missing join selectivity estimation for range types (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: Implement missing join selectivity estimation for range types
|
List | pgsql-hackers |
Hi Tomas, Thanks for picking up the patch and for the interesting discussions that you bring ! > Interesting. Are there any particular differences compared to how we > estimate for example range clauses on regular columns? The theory is the same for scalar types. Yet, the statistics that are currently collected for scalar types include other synopsis than the histogram, such as MCV, which should be incorporated in the estimation. The theory for using the additional statistics is ready in the paper, but we didn't yet implement it. We thought of sharing the ready part, till the time allows us to implement the rest, or other developers continue it. > Right. I think 0.5% is roughly expected for the default statistics > target, which creates 100 histogram bins, each representing ~1% of the > values. Which on average means ~0.5% error. Since this patch deals with join selectivity, we are then crossing 100*100 bins. The ~0.5% error estimation comes from our experiments, rather than a mathematical analysis. > independence means we take (P1*P2). But maybe we should be very > pessimistic and use e.g. Min(P1,P2)? Or maybe something in between? > > Another option is to use the length histogram, right? I mean, we know > what the average length is, and it should be possible to use that to > calculate how "far" ranges in a histogram can overlap. The independence assumption exists if we use the lower and upper histograms. It equally exists if we use the lower and length histograms. In both cases, the link between the two histograms is lost during their construction. You discussion brings an interesting trade-off of optimistic v.s. pessimistic estimations. A typical way to deal with such a trade-off is to average the two, for example is model validation in machine learning, Do you think we should implement something like average( (P1*P2), Min(P1,P2) )? > OK. But ideally we'd cross-check elements of the two multiranges, no? > So if the column(s) contain a couple very common (multi)ranges that make > it into an MCV, we'll ignore those? That's a bit unfortunate, because > those MCV elements are potentially the main contributors to selectivity. Both ideas would require collecting more detailed statistics, for example similar to arrays. In this patch, we restricted ourselves to the existing statistics. > Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs > a proper comment, not just "this is a copy from rangetypes". Right, the comment should elaborate more that the collected statistics are currently that same as rangetypes but may potentially deviate. > However, it seems the two functions are exactly the same. Would the > functions diverge in the future? If not, maybe there should be just a > single shared function? Indeed, it is possible that the two functions will deviate if that statistics of multirange types will be refined. -- Best regards Mahmoud SAKR On Wed, Jan 18, 2023 at 7:07 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs > a proper comment, not just "this is a copy from rangetypes". > > However, it seems the two functions are exactly the same. Would the > functions diverge in the future? If not, maybe there should be just a > single shared function? > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
pgsql-hackers by date: