Re: Implement missing join selectivity estimation for range types - Mailing list pgsql-hackers

From Mahmoud Sakr
Subject Re: Implement missing join selectivity estimation for range types
Date
Msg-id CAB4o4aud47V_iRyWtA8+ZAmdXDjCF165R73AeCjx2RL0nzQzHA@mail.gmail.com
Whole thread Raw
In response to Re: Implement missing join selectivity estimation for range types  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Implement missing join selectivity estimation for range types
List pgsql-hackers
Hi Tomas,
Thanks for picking up the patch and for the interesting discussions that
you bring !

> Interesting. Are there any particular differences compared to how we
> estimate for example range clauses on regular columns?
The theory is the same for scalar types. Yet, the statistics that are currently
collected for scalar types include other synopsis than the histogram, such
as MCV, which should be incorporated in the estimation. The theory for using
the additional statistics is ready in the paper, but we didn't yet implement it.
We thought of sharing the ready part, till the time allows us to implement the
rest, or other developers continue it.

> Right. I think 0.5% is roughly expected for the default statistics
> target, which creates 100 histogram bins, each representing ~1% of the
> values. Which on average means ~0.5% error.
Since this patch deals with join selectivity, we are then crossing 100*100 bins.
The ~0.5% error estimation comes from our experiments, rather than a
mathematical analysis.

> independence means we take (P1*P2). But maybe we should be very
> pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?
>
> Another option is to use the length histogram, right? I mean, we know
> what the average length is, and it should be possible to use that to
> calculate how "far" ranges in a histogram can overlap.
The independence assumption exists if we use the lower and upper
histograms. It equally exists if we use the lower and length histograms.
In both cases, the link between the two histograms is lost during their
construction.
You discussion brings an interesting trade-off of optimistic v.s. pessimistic
estimations. A typical way to deal with such a trade-off is to average the
two, for example is model validation in machine learning, Do you think we
should implement something like
average( (P1*P2), Min(P1,P2) )?

> OK. But ideally we'd cross-check elements of the two multiranges, no?

> So if the column(s) contain a couple very common (multi)ranges that make
> it into an MCV, we'll ignore those? That's a bit unfortunate, because
> those MCV elements are potentially the main contributors to selectivity.
Both ideas would require collecting more detailed statistics, for
example similar
to arrays. In this patch, we restricted ourselves to the existing statistics.


> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
Right, the comment should elaborate more that the collected statistics are
currently that same as rangetypes but may potentially deviate.

> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
Indeed, it is possible that the two functions will deviate if that statistics
of multirange types will be refined.

--
Best regards
Mahmoud SAKR

On Wed, Jan 18, 2023 at 7:07 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
>
> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Parallelize correlated subqueries that execute within each worker
Next
From: Laurenz Albe
Date:
Subject: Re: document the need to analyze partitioned tables