Thread: Implement missing join selectivity estimation for range types

Implement missing join selectivity estimation for range types

From
Mahmoud Sakr
Date:
Hi,
Given a query:
SELECT * FROM t1, t2 WHERE t1.r << t2.r
where t1.r, t2.r are of range type,
currently PostgreSQL will estimate a constant selectivity for the << predicate,
which is equal to 0.005, not utilizing the statistics that the optimizer
collects for range attributes.

We have worked out a theory for inequality join selectivity estimation
(http://arxiv.org/abs/2206.07396), and implemented it for range
types in this patch.

The algorithm in this patch re-uses the currently collected statistics for
range types, which is the bounds histogram. It works fairly accurately for the
operations <<, >>, &&, &<, &>, <=, >=, with an estimation error of about 0.5%.
The patch also implements selectivity estimation for the
operations @>, <@ (contains and is contained in), but their accuracy is not
stable, since the bounds histograms assume independence between the range
bounds. A point to discuss is whether or not to keep these last two operations.
The patch also includes the selectivity estimation for multirange types,
treating a multirange as a single range which is its bounding box.

The same algorithm in this patch is applicable to inequality joins of scalar
types. We, however, don't implement it for scalars, since more work is needed
to make use of the other statistics available for scalars, such as the MCV.
This is left as a future work.

--
Mahmoud SAKR - Université Libre de Bruxelles
This work was done by Diogo Repas, Zhicheng Luo, Maxime Schoemans, and myself

Attachment

Re: Implement missing join selectivity estimation for range types

From
Tomas Vondra
Date:
Hello Mahmoud,

Thanks for the patch and sorry for not taking a look earlier.

On 6/30/22 16:31, Mahmoud Sakr wrote:
> Hi,
> Given a query:
> SELECT * FROM t1, t2 WHERE t1.r << t2.r
> where t1.r, t2.r are of range type,
> currently PostgreSQL will estimate a constant selectivity for the << predicate,
> which is equal to 0.005, not utilizing the statistics that the optimizer
> collects for range attributes.
> 
> We have worked out a theory for inequality join selectivity estimation
> (http://arxiv.org/abs/2206.07396), and implemented it for range
> types in this patch.
> 

Interesting. Are there any particular differences compared to how we
estimate for example range clauses on regular columns?

> The algorithm in this patch re-uses the currently collected statistics for
> range types, which is the bounds histogram. It works fairly accurately for the
> operations <<, >>, &&, &<, &>, <=, >=, with an estimation error of about 0.5%.

Right. I think 0.5% is roughly expected for the default statistics
target, which creates 100 histogram bins, each representing ~1% of the
values. Which on average means ~0.5% error.

> The patch also implements selectivity estimation for the
> operations @>, <@ (contains and is contained in), but their accuracy is not
> stable, since the bounds histograms assume independence between the range
> bounds. A point to discuss is whether or not to keep these last two operations.

That's a good question. I think the independence assumption is rather
foolish in this case, so I wonder if we could "stabilize" this by making
some different - less optimistic - assumption. Essentially, we have
estimates for the lower/upper boundaries:

  P1 = P(lower(var1) <= lower(var2))
  P2 = P(upper(var2) <= upper(var1))

and independence means we take (P1*P2). But maybe we should be very
pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?

Another option is to use the length histogram, right? I mean, we know
what the average length is, and it should be possible to use that to
calculate how "far" ranges in a histogram can overlap.

> The patch also includes the selectivity estimation for multirange types,
> treating a multirange as a single range which is its bounding box.
> 

OK. But ideally we'd cross-check elements of the two multiranges, no?

> The same algorithm in this patch is applicable to inequality joins of scalar
> types. We, however, don't implement it for scalars, since more work is needed
> to make use of the other statistics available for scalars, such as the MCV.
> This is left as a future work.
> 

So if the column(s) contain a couple very common (multi)ranges that make
it into an MCV, we'll ignore those? That's a bit unfortunate, because
those MCV elements are potentially the main contributors to selectivity.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Implement missing join selectivity estimation for range types

From
Tomas Vondra
Date:
Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
a proper comment, not just "this is a copy from rangetypes".

However, it seems the two functions are exactly the same. Would the
functions diverge in the future? If not, maybe there should be just a
single shared function?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Implement missing join selectivity estimation for range types

From
Mahmoud Sakr
Date:
Hi Tomas,
Thanks for picking up the patch and for the interesting discussions that
you bring!

> Interesting. Are there any particular differences compared to how we
> estimate for example range clauses on regular columns?
The theory is the same for scalar types. Yet, the statistics that are currently
collected for scalar types include synopses other than the histogram, such
as the MCV, which should be incorporated in the estimation. The theory for using
the additional statistics is ready in the paper, but we haven't implemented it yet.
We thought of sharing the part that is ready, until time allows us to implement
the rest, or other developers continue it.

> Right. I think 0.5% is roughly expected for the default statistics
> target, which creates 100 histogram bins, each representing ~1% of the
> values. Which on average means ~0.5% error.
Since this patch deals with join selectivity, we are then crossing 100*100 bins.
The ~0.5% error estimation comes from our experiments, rather than a
mathematical analysis.
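
For intuition, here is a toy sketch of how such an inequality between bounds
can be estimated by crossing two equi-depth histograms. It uses plain double
values and the midpoint of each bin; the actual patch works on RangeBound
values and handles the bins more carefully, so this is only an illustration
of the idea, not the patch's code.

/*
 * Toy estimate of P(x < y), given equi-depth histograms of x and y.
 * hist[] holds nbounds ascending, distinct bound values; each of the
 * (nbounds - 1) bins holds the same fraction of rows, and values are
 * assumed uniformly distributed within a bin.
 */
static double
toy_hist_cdf(const double *hist, int nbounds, double v)
{
	if (v <= hist[0])
		return 0.0;
	if (v >= hist[nbounds - 1])
		return 1.0;
	for (int i = 1; i < nbounds; i++)
	{
		if (v < hist[i])
		{
			double		binfrac = (v - hist[i - 1]) / (hist[i] - hist[i - 1]);

			return (i - 1 + binfrac) / (double) (nbounds - 1);
		}
	}
	return 1.0;
}

/* P(x < y) ~= average over y's bins of P(x < bin midpoint) */
static double
toy_ineq_join_sel(const double *histx, int nx, const double *histy, int ny)
{
	double		sum = 0.0;

	for (int j = 0; j < ny - 1; j++)
	{
		double		ymid = 0.5 * (histy[j] + histy[j + 1]);

		sum += toy_hist_cdf(histx, nx, ymid);
	}
	return sum / (ny - 1);
}

For t1.r << t2.r, the relevant probability (ignoring bound inclusivity) is
P(upper(t1.r) < lower(t2.r)), so the inputs would be the upper-bound histogram
of t1.r and the lower-bound histogram of t2.r.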

> independence means we take (P1*P2). But maybe we should be very
> pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?
>
> Another option is to use the length histogram, right? I mean, we know
> what the average length is, and it should be possible to use that to
> calculate how "far" ranges in a histogram can overlap.
The independence assumption exists if we use the lower and upper
histograms. It equally exists if we use the lower and length histograms.
In both cases, the link between the two histograms is lost during their
construction.
Your discussion brings up an interesting trade-off of optimistic vs. pessimistic
estimations. A typical way to deal with such a trade-off is to average the
two, as is done for example in model validation in machine learning. Do you
think we should implement something like
average( (P1*P2), Min(P1,P2) )?
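
In code form, that would be roughly the following (just a sketch of the idea;
the function and fmin are illustrative, not the patch's actual code):

#include <math.h>

/*
 * Sketch: combining the two bound probabilities for @> / <@, where
 *   p1 = P(lower(var1) <= lower(var2))
 *   p2 = P(upper(var2) <= upper(var1))
 */
static double
containment_selectivity(double p1, double p2)
{
	double		independent = p1 * p2;		/* optimistic: independent bounds */
	double		correlated = fmin(p1, p2);	/* pessimistic: fully correlated */

	/* the proposed compromise: average of the two */
	return 0.5 * (independent + correlated);
}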

> OK. But ideally we'd cross-check elements of the two multiranges, no?

> So if the column(s) contain a couple very common (multi)ranges that make
> it into an MCV, we'll ignore those? That's a bit unfortunate, because
> those MCV elements are potentially the main contributors to selectivity.
Both ideas would require collecting more detailed statistics, for
example similar
to arrays. In this patch, we restricted ourselves to the existing statistics.


> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
Right, the comment should elaborate that the collected statistics are
currently the same as for range types but may potentially deviate.

> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
Indeed, it is possible that the two functions will deviate if the statistics
of multirange types are refined.

--
Best regards
Mahmoud SAKR

On Wed, Jan 18, 2023 at 7:07 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
>
> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Implement missing join selectivity estimation for range types

From
Tomas Vondra
Date:
On 1/18/23 20:23, Mahmoud Sakr wrote:
> Hi Tomas,
> Thanks for picking up the patch and for the interesting discussions that
> you bring !
> 
>> Interesting. Are there any particular differences compared to how we
>> estimate for example range clauses on regular columns?
>
> The theory is the same for scalar types. Yet, the statistics that are currently
> collected for scalar types include synopses other than the histogram, such
> as the MCV, which should be incorporated in the estimation. The theory for using
> the additional statistics is ready in the paper, but we haven't implemented it yet.
> We thought of sharing the part that is ready, until time allows us to implement
> the rest, or other developers continue it.
> 

I see. We don't have MCV stats for range types, so the algorithms don't
include that. But we have that for scalars, so the code would need to be
modified to consider that too.

However, I wonder how much that could improve the estimates for range
queries on scalar types. I mean, we already get pretty good estimates
for those, so I guess we wouldn't get much.

>> Right. I think 0.5% is roughly expected for the default statistics
>> target, which creates 100 histogram bins, each representing ~1% of the
>> values. Which on average means ~0.5% error.
> Since this patch deals with join selectivity, we are then crossing 100*100 bins.
> The ~0.5% error estimation comes from our experiments, rather than a
> mathematical analysis.
> 

Ah, understood. Even for joins there's probably a fairly close
relationship between the bin size and estimation error, but it's
certainly more complex.

BTW the experiments are those described in section 6 of the paper,
correct? I wonder how uniform (or skewed) the data was - in terms of
range length, etc. Or how it works for other operators, not just for
"<<" as in the paper.

>> independence means we take (P1*P2). But maybe we should be very
>> pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?
>>
>> Another option is to use the length histogram, right? I mean, we know
>> what the average length is, and it should be possible to use that to
>> calculate how "far" ranges in a histogram can overlap.
> The independence assumption exists if we use the lower and upper
> histograms. It equally exists if we use the lower and length histograms.
> In both cases, the link between the two histograms is lost during their
> construction.
> Your discussion brings up an interesting trade-off of optimistic vs. pessimistic
> estimations. A typical way to deal with such a trade-off is to average the
> two, as is done for example in model validation in machine learning. Do you
> think we should implement something like
> average( (P1*P2), Min(P1,P2) )?
> 

I don't know.

AFAICS the independence assumption is used not only because it's very
cheap/simple to implement, but also because it actually is a reasonable
assumption for a fair number of data sets (particularly in OLTP).

You're right it's an optimistic estimate, but for many data sets  it's
actually quite reasonable.

I'm not sure that applies to range boundaries - the upper/lower bounds
seem pretty strongly correlated. So maybe using a more pessimistic
formula would be appropriate.

I was thinking the length histogram might allow an alternative
approach, because it says what fraction of ranges has what length. So
for a "fixed" lower boundary, we may check each of those fractions. Of
course, this assumes consistent range length distribution (so if ranges
at one end are much longer, that won't work).

>> OK. But ideally we'd cross-check elements of the two multiranges, no?
> 
>> So if the column(s) contain a couple very common (multi)ranges that make
>> it into an MCV, we'll ignore those? That's a bit unfortunate, because
>> those MCV elements are potentially the main contributors to selectivity.
> Both ideas would require collecting more detailed statistics, for
> example similar
> to arrays. In this patch, we restricted ourselves to the existing statistics.
> 

Ah, I didn't realize we don't actually build MCV for range types. In
that case the current behavior makes perfect sense.

> 
>> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
>> a proper comment, not just "this is a copy from rangetypes".
> Right, the comment should elaborate that the collected statistics are
> currently the same as for range types but may potentially deviate.
> 
>> However, it seems the two functions are exactly the same. Would the
>> functions diverge in the future? If not, maybe there should be just a
>> single shared function?
> Indeed, it is possible that the two functions will deviate if the statistics
> of multirange types are refined.
> 

Right, but are there any such plans? Also, what's the likelihood we'll
add new statistics to only one of the places (e.g. for multiranges but
not plain ranges)?

I'd keep a single function until we actually need two. That's also
easier for maintenance - with two it's easy to fix a bug in one place
and forget about the other, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Implement missing join selectivity estimation for range types

From
Tomas Vondra
Date:
Hi Mahmoud,

I finally had time to properly read the paper today - the general
approach mostly matches how I imagined the estimation would work for
inequalities, but it's definitely nice to see the algorithm properly
formalized and analyzed.

What seems a bit strange to me is that the patch only deals with range
types, leaving the scalar cases unchanged. I understand why (not having
a MCV simplifies it a lot), but I'd bet joins on range types are waaaay
less common than inequality joins on scalar types. I don't even remember
seeing an inequality join on a range column, TBH.

That doesn't mean the patch is wrong, of course. But I'd expect users to
be surprised we handle range types better than "old" scalar types (which
range types build on, in some sense).

Did you have any plans to work on improving estimates for the scalar
case too? Or did you do the patch needed for the paper, and have no
plans to continue working on this?

I'm also wondering about not having MCV for ranges. I was a bit
surprised we don't build MCV in compute_range_stats(), and perhaps we
should start building those - if there are common ranges, this might
significantly improve some of the estimates (just like for scalar
columns). Which would mean the estimates for range types are just as
complex as for scalars. Of course, we don't do that now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Implement missing join selectivity estimation for range types

From
Mahmoud Sakr
Date:
Hi Tomas,

> I finally had time to properly read the paper today - the general
> approach mostly matches how I imagined the estimation would work for
> inequalities, but it's definitely nice to see the algorithm properly
> formalized and analyzed.

Awesome, thanks for this interest!

> What seems a bit strange to me is that the patch only deals with range
> types, leaving the scalar cases unchanged. I understand why (not having
> a MCV simplifies it a lot), but I'd bet joins on range types are waaaay
> less common than inequality joins on scalar types. I don't even remember
> seeing an inequality join on a range column, TBH.
>
> That doesn't mean the patch is wrong, of course. But I'd expect users to
> be surprised we handle range types better than "old" scalar types (which
> range types build on, in some sense).
>
> Did you have any plans to work on improving estimates for the scalar
> case too? Or did you do the patch needed for the paper, and have no
> plans to continue working on this?

I fully agree. Scalars are way more important.
I join you in the call for Diogo and Zhicheng to continue this implementation,
especially after the interest you have shown in this patch. The current patch
was a course project (taught by me and Maxime), which was specific to range
types. But indeed the solution generalizes well to scalars. Hopefully after the
current exam session (Feb), there will be time to continue the implementation.
Nevertheless, it makes sense to do two separate patches: this one, and
one for the scalars. The code of the two patches is located in different files, and
the estimation algorithms have slight differences.

> I'm also wondering about not having MCV for ranges. I was a bit
> surprised we don't build MCV in compute_range_stats(), and perhaps we
> should start building those - if there are common ranges, this might
> significantly improve some of the estimates (just like for scalar
> columns). Which would mean the estimates for range types are just as
> complex as for scalars. Of course, we don't do that now.

Good question. Our intuition is that MCV will not be useful for ranges.
Maxime has done an experiment and confirmed this intuition. Here is his
experiment and explanation:
Create a table with 126000 int4ranges.
All ranges have their middle between 0 and 1000 and a length between 90
and 110.
The ranges are created in the following way:
- 10 different ranges, each duplicated 1000 times
- 20 different ranges, each duplicated 500 times
- 40 different ranges, each duplicated 100 times
- 200 different ranges, each duplicated 10 times
- 100000 different ranges, not duplicated
Two such tables (t1 and t2) were created in the same way but with different
data. Then he ran the following query:

EXPLAIN ANALYZE SELECT count(*)
FROM t1, t2 WHERE t1.r && t2.r

The results (using our patch) were the following:
Plan rows: 2991415662
Actual rows: 2981335423

So the estimation accuracy is still fairly good for such data with a lot
of repeated values, even without having MCV statistics.
The only error that can happen in our algorithm comes from the last bin, in
which we assume a uniform distribution of values. Duplicate values
(duplicate bounds, not duplicate ranges) might make this assumption wrong,
which would create inaccurate estimations. However, this only affects the
last bin; all the others are correct.
MCVs are mainly useful for equality, which is not an operation we cover, and
probably not an important predicate for ranges. What do you think?

Best regards,
Mahmoud



Re: Implement missing join selectivity estimation for range types

From
Schoemans Maxime
Date:
Hi Tomas,

As a quick update, the paper related to this work has finally been published in Mathematics (https://www.mdpi.com/2227-7390/11/6/1383).
During revision we also added a figure showing a comparison of our algorithm vs the existing algorithms in Oracle, SQL Server, MySQL and PostgreSQL, which can be found in the experiments section of the paper.
As can be seen, our algorithm outperforms even Oracle and SQL Server.

During this revision we also found a small bug, so we are working on a revision of the patch, which fixes this.

> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
> Right, the comment should elaborate that the collected statistics are
> currently the same as for range types but may potentially deviate.
>
> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
> Indeed, it is possible that the two functions will deviate if the statistics
> of multirange types are refined.
>
> Right, but are there any such plans? Also, what's the likelihood we'll
> add new statistics to only one of the places (e.g. for multiranges but
> not plain ranges)?
>
> I'd keep a single function until we actually need two. That's also
> easier for maintenance - with two it's easy to fix a bug in one place
> and forget about the other, etc.

Regarding our previous discussion about the duplication of calc_hist_join_selectivity in rangetypes_selfuncs.c and multirangetypes_selfuncs.c, we can also remove this duplication in the revision if needed.
Note that currently, there are no external functions shared between rangetypes_selfuncs.c and multirangetypes_selfuncs.c.
Any function that was used in both files was duplicated as a static function.
The functions calc_hist_selectivity_scalar, calc_length_hist_frac, calc_hist_selectivity_contained and calc_hist_selectivity_contains are examples of this, where the function is identical but has been declared static in both files.
That said, we can remove the duplication of calc_hist_join_selectivity if it is still needed.
We would, however, require some guidance as to where to put the external definition of this function, as there does not appear to be a rangetypes_selfuncs.h header.
Should it simply go into utils/selfuncs.h or should we create a new header file?

Best regards,
Maxime Schoemans

Re: Implement missing join selectivity estimation for range types

From
Schoemans Maxime
Date:
Hi,

In the selectivity algorithm, the division by two was applied after adding the remaining histogram buckets of histogram2 that don't overlap with histogram1.
This could incorrectly halve the selectivity, e.g., in the case where histogram2 lies completely to the right of histogram1.
The correct calculation is to divide by two before adding the remainder.
The attached patch fixes this.
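
To illustrate the order of operations (the variable names here are
hypothetical, not the identifiers used in the patch):

static double
join_sel(double overlap_sel, double remainder)
{
	/* before (wrong): the non-overlapping remainder was halved as well */
	/* return (overlap_sel + remainder) / 2.0; */

	/* after (fixed): halve only the overlapping part, then add the remainder */
	return overlap_sel / 2.0 + remainder;
}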

Best regards,
Maxime Schoemans


Attachment

Re: Implement missing join selectivity estimation for range types

From
Schoemans Maxime
Date:
This is a quick correction as the last patch contained a missing semicolon.

Regards,
Maxime Schoemans
Attachment

Re: Implement missing join selectivity estimation for range types

From
Damir Belyalov
Date:
Hello!

Thank you for the patch, very interesting article.
The patch doesn't apply to the current postgres version. Could you please update it?

Regards,
Damir Belyalov,
Postgres Professional

Re: Implement missing join selectivity estimation for range types

From
Schoemans Maxime
Date:
Hi,

Thank you for picking up this patch.

> The patch doesn't apply to the current postgres version. Could you
> please update it?
Indeed, the code was initially written on pg15.
You can find attached a new version of the patch that can be applied on 
the current master branch of postgres.

Please let us know if there is anything else we can do.

Best regards,
Maxime Schoemans
Attachment

Re: Implement missing join selectivity estimation for range types

From
Tom Lane
Date:
Schoemans Maxime <maxime.schoemans@ulb.be> writes:
> You can find attached a new version of the patch that can be applied on 
> the current master branch of postgres.

I took a brief look through this very interesting work.  I concur
with Tomas that it feels a little odd that range join selectivity
would become smarter than scalar inequality join selectivity, and
that we really ought to prioritize applying these methods to that
case.  Still, that's a poor reason to not take the patch.

I also agree with the upthread criticism that having two identical
functions in different source files will be a maintenance nightmare.
Don't do it.  When and if there's a reason for the behavior to
diverge between the range and multirange cases, it'd likely be
better to handle that by passing in a flag to say what to do.

But my real unhappiness with the patch as-submitted is the test cases,
which require rowcount estimates to be reproduced exactly.  We know
very well that ANALYZE estimates are not perfectly stable and tend to
vary across platforms.  As a quick check I tried the patch within a
32-bit VM, and it passed, which surprised me a bit ... but it would
surprise me a lot if we got these same numbers on every machine in
the buildfarm.  We need a more forgiving test method.  Usually the
approach is to set up a test case where the improved accuracy of
the estimate changes the planner's choice of plan compared to what
you got before, since that will normally not be too prone to change
from variations of a percent or two in the estimates.  Another idea
could be something like

    SELECT (estimate/actual BETWEEN 0.9 AND 1.1) AS ok FROM ...

which just gives a true/false output instead of an exact number.

            regards, tom lane



Re: Implement missing join selectivity estimation for range types

From
Schoemans Maxime
Date:
On 14/11/2023 20:46, Tom Lane wrote:
> I took a brief look through this very interesting work.  I concur
> with Tomas that it feels a little odd that range join selectivity
> would become smarter than scalar inequality join selectivity, and
> that we really ought to prioritize applying these methods to that
> case.  Still, that's a poor reason to not take the patch.

Indeed, we started with ranges as this was the simpler case (no MCV) and 
was the topic of a course project.
The idea is to later write a second patch that applies these ideas to 
scalar inequality while also handling MCV's correctly.

> I also agree with the upthread criticism that having two identical
> functions in different source files will be a maintenance nightmare.
> Don't do it.  When and if there's a reason for the behavior to
> diverge between the range and multirange cases, it'd likely be
> better to handle that by passing in a flag to say what to do.

The duplication is indeed not ideal. However, there are already 8 other 
duplicate functions between the two files.
I would thus suggest to leave the duplication in this patch and create a 
second one that removes all duplication from the two files, instead of 
just removing the duplication for our new function.
What are your thoughts on this? If we do this, should the function 
definitions go in rangetypes.h or should we create a new 
rangetypes_selfuncs.h header?

> But my real unhappiness with the patch as-submitted is the test cases,
> which require rowcount estimates to be reproduced exactly.

> We need a more forgiving test method. Usually the
> approach is to set up a test case where the improved accuracy of
> the estimate changes the planner's choice of plan compared to what
> you got before, since that will normally not be too prone to change
> from variations of a percent or two in the estimates.

I have changed the test method to produce query plans for a 3-way range 
join.
The plans for the different operators differ due to the computed 
selectivity estimation, which was not the case before this patch.

Regards,
Maxime Schoemans
Attachment

Re: Implement missing join selectivity estimation for range types

From
Alena Rybakina
Date:
Hi!

Thank you for your work on the subject. I think it's a really useful
feature, and it allows the optimizer to estimate clauses with this type of
operator more correctly.

I reviewed your patch and noticed that some comments are repeated in the
multirangejoinsel functions; I suggest combining them.

The proposed changes are in the attached patch.


Since this topic is about calculating selectivity, have you thought about
adding cardinality estimation tests for queries with this type of operator?

For example, you could form queries similar to those that you use in 
src/test/regress/sql/multirangetypes.sql and 
src/test/regress/sql/rangetypes.sql.

I added a few in the attached patch.

-- 
Regards,
Alena Rybakina

Attachment

Re: Implement missing join selectivity estimation for range types

From
vignesh C
Date:
On Tue, 21 Nov 2023 at 01:47, Schoemans Maxime <maxime.schoemans@ulb.be> wrote:
>
> On 14/11/2023 20:46, Tom Lane wrote:
> > I took a brief look through this very interesting work.  I concur
> > with Tomas that it feels a little odd that range join selectivity
> > would become smarter than scalar inequality join selectivity, and
> > that we really ought to prioritize applying these methods to that
> > case.  Still, that's a poor reason to not take the patch.
>
> Indeed, we started with ranges as this was the simpler case (no MCV) and
> was the topic of a course project.
> The idea is to later write a second patch that applies these ideas to
> scalar inequality while also handling MCV's correctly.
>
> > I also agree with the upthread criticism that having two identical
> > functions in different source files will be a maintenance nightmare.
> > Don't do it.  When and if there's a reason for the behavior to
> > diverge between the range and multirange cases, it'd likely be
> > better to handle that by passing in a flag to say what to do.
>
> The duplication is indeed not ideal. However, there are already 8 other
> duplicate functions between the two files.
> I would thus suggest to leave the duplication in this patch and create a
> second one that removes all duplication from the two files, instead of
> just removing the duplication for our new function.
> What are your thoughts on this? If we do this, should the function
> definitions go in rangetypes.h or should we create a new
> rangetypes_selfuncs.h header?
>
> > But my real unhappiness with the patch as-submitted is the test cases,
> > which require rowcount estimates to be reproduced exactly.
>
> > We need a more forgiving test method. Usually the
> > approach is to set up a test case where the improved accuracy of
> > the estimate changes the planner's choice of plan compared to what
> > you got before, since that will normally not be too prone to change
> > from variations of a percent or two in the estimates.
>
> I have changed the test method to produce query plans for a 3-way range
> join.
> The plans for the different operators differ due to the computed
> selectivity estimation, which was not the case before this patch.

One of the tests was aborted at [1], kindly post an updated patch for the same:
[04:55:42.797] src/tools/ci/cores_backtrace.sh linux /tmp/cores
[04:56:03.640] dumping /tmp/cores/postgres-6-24094.core for
/tmp/cirrus-ci-build/tmp_install/usr/local/pgsql/bin/postgres

[04:57:24.199] Core was generated by `postgres: old_node: postgres
regression [local] EXPLAIN '.
[04:57:24.199] Program terminated with signal SIGABRT, Aborted.
[04:57:24.199] #0 __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
[04:57:24.199] Download failed: Invalid argument. Continuing without
source file ./signal/../sysdeps/unix/sysv/linux/raise.c.
[04:57:26.803]
[04:57:26.803] Thread 1 (Thread 0x7f121d9ec380 (LWP 24094)):
[04:57:26.803] #0 __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
[04:57:26.803] set = {__val = {4194304, 0, 4636737291354636288,
4636737291354636288, 0, 0, 64, 64, 128, 128, 192, 192, 256, 256, 0,
0}}
[04:57:26.803] pid = <optimized out>
[04:57:26.803] tid = <optimized out>
[04:57:26.803] ret = <optimized out>
[04:57:26.803] #1 0x00007f122003d537 in __GI_abort () at abort.c:79
...
...
[04:57:38.774] #6 0x00007f357ad95788 in __asan::__asan_report_load1
(addr=addr@entry=107477261711120) at
../../../../src/libsanitizer/asan/asan_rtl.cpp:117
[04:57:38.774] bp = 140731433585840
[04:57:38.774] pc = <optimized out>
[04:57:38.774] local_stack = 139867680821632
[04:57:38.774] sp = 140731433585832
[04:57:38.774] #7 0x000055d5b155c37c in range_cmp_bound_values
(typcache=typcache@entry=0x629000030b60, b1=b1@entry=0x61c000017708,
b2=b2@entry=0x61c0000188b8) at rangetypes.c:2090
[04:57:38.774] No locals.
[04:57:38.774] #8 0x000055d5b1567bb2 in calc_hist_join_selectivity
(typcache=typcache@entry=0x629000030b60,
hist1=hist1@entry=0x61c0000188b8, nhist1=nhist1@entry=101,
hist2=hist2@entry=0x61c0000170b8, nhist2=nhist2@entry=101) at
rangetypes_selfuncs.c:1298
[04:57:38.774] i = 0
[04:57:38.774] j = 101
[04:57:38.774] selectivity = <optimized out>
[04:57:38.774] cur_sel1 = <optimized out>
[04:57:38.774] cur_sel2 = <optimized out>
[04:57:38.774] prev_sel1 = <optimized out>
[04:57:38.774] prev_sel2 = <optimized out>
[04:57:38.774] cur_sync = {val = <optimized out>, infinite =
<optimized out>, inclusive = <optimized out>, lower = <optimized out>}
[04:57:38.774] #9 0x000055d5b1569190 in rangejoinsel
(fcinfo=<optimized out>) at rangetypes_selfuncs.c:1495

[1] - https://cirrus-ci.com/task/5507789477380096

Regards,
Vignesh



Re: Implement missing join selectivity estimation for range types

From
Schoemans Maxime
Date:
On 05/01/2024 11:37, vignesh C wrote:
 > One of the tests was aborted at [1], kindly post an updated patch for 
the same:

Thank you for notifying us.
I believe I fixed the issue but it is hard to be certain as the issue 
did not arise when running the regression tests locally.

Regards,
Maxime
Attachment

Re: Implement missing join selectivity estimation for range types

From
vignesh C
Date:
On Fri, 5 Jan 2024 at 23:09, Schoemans Maxime <maxime.schoemans@ulb.be> wrote:
>
> On 05/01/2024 11:37, vignesh C wrote:
>  > One of the tests was aborted at [1], kindly post an updated patch for
> the same:
>
> Thank you for notifying us.
> I believe I fixed the issue but it is hard to be certain as the issue
> did not arise when running the regression tests locally.

I'm noticing this issue is not yet resolved, the CFBot is still
failing at [1] with:
#7 0x000055cddc25cd93 in range_cmp_bound_values
(typcache=typcache@entry=0x629000030b60, b1=b1@entry=0x61c000016f08,
b2=b2@entry=0x61c0000180b8) at rangetypes.c:2090
[19:55:02.591] No locals.
[19:55:02.591] #8 0x000055cddc2685c1 in calc_hist_join_selectivity
(typcache=typcache@entry=0x629000030b60,
hist1=hist1@entry=0x61c0000180b8, nhist1=nhist1@entry=101,
hist2=hist2@entry=0x61c0000168b8, nhist2=nhist2@entry=101) at
rangetypes_selfuncs.c:1295
[19:55:02.591] i = 0
[19:55:02.591] j = 101
[19:55:02.591] selectivity = 0
[19:55:02.591] prev_sel1 = -1
[19:55:02.591] prev_sel2 = 0
[19:55:02.591] #9 0x000055cddc269aaa in rangejoinsel
(fcinfo=<optimized out>) at rangetypes_selfuncs.c:1479
[19:55:02.591] root = <optimized out>
[19:55:02.591] operator = <optimized out>
[19:55:02.591] args = <optimized out>
[19:55:02.591] sjinfo = <optimized out>
[19:55:02.591] vardata1 = {var = <optimized out>, rel = <optimized
out>, statsTuple = <optimized out>, freefunc = <optimized out>,
vartype = <optimized out>, atttype = <optimized out>, atttypmod =
<optimized out>, isunique = <optimized out>, acl_ok = <optimized out>}
[19:55:02.591] vardata2 = {var = <optimized out>, rel = <optimized
out>, statsTuple = <optimized out>, freefunc = <optimized out>,
vartype = <optimized out>, atttype = <optimized out>, atttypmod =
<optimized out>, isunique = <optimized out>, acl_ok = <optimized out>}
[19:55:02.591] hist1 = {staop = <optimized out>, stacoll = <optimized
out>, valuetype = <optimized out>, values = <optimized out>, nvalues =
<optimized out>, numbers = <optimized out>, nnumbers = <optimized
out>, values_arr = <optimized out>, numbers_arr = <optimized out>}
[19:55:02.591] hist2 = {staop = <optimized out>, stacoll = <optimized
out>, valuetype = <optimized out>, values = <optimized out>, nvalues =
<optimized out>, numbers = <optimized out>, nnumbers = <optimized
out>, values_arr = <optimized out>, numbers_arr = <optimized out>}
[19:55:02.591] sslot = {staop = <optimized out>, stacoll = <optimized
out>, valuetype = <optimized out>, values = <optimized out>, nvalues =
<optimized out>, numbers = <optimized out>, nnumbers = <optimized
out>, values_arr = <optimized out>, numbers_arr = <optimized out>}
[19:55:02.591] reversed = <optimized out>
[19:55:02.591] selec = 0.001709375000000013
[19:55:02.591] typcache = 0x629000030b60
[19:55:02.591] stats1 = <optimized out>
[19:55:02.591] stats2 = <optimized out>
[19:55:02.591] empty_frac1 = 0
[19:55:02.591] empty_frac2 = 0
[19:55:02.591] null_frac1 = 0
[19:55:02.591] null_frac2 = 0
[19:55:02.591] nhist1 = 101
[19:55:02.591] nhist2 = 101
[19:55:02.591] hist1_lower = 0x61c0000168b8
[19:55:02.591] hist1_upper = 0x61c0000170b8
[19:55:02.591] hist2_lower = 0x61c0000178b8
[19:55:02.591] hist2_upper = 0x61c0000180b8
[19:55:02.591] empty = <optimized out>
[19:55:02.591] i = <optimized out>
[19:55:02.591] __func__ = "rangejoinsel"
[19:55:02.591] #10 0x000055cddc3b761f in FunctionCall5Coll
(flinfo=flinfo@entry=0x7ffc1628d710, collation=collation@entry=0,
arg1=arg1@entry=107545982648856, arg2=arg2@entry=3888,
arg3=arg3@entry=107820862916056, arg4=arg4@entry=0, arg5=<optimized
out>) at fmgr.c:1242
[19:55:02.591] fcinfodata = {fcinfo = {flinfo = <optimized out>,
context = <optimized out>, resultinfo = <optimized out>, fncollation =
<optimized out>, isnull = <optimized out>, nargs = <optimized out>,
args = 0x0}, fcinfo_data = {<optimized out> <repeats 112 times>}}
[19:55:02.591] fcinfo = 0x7ffc1628d5e0
[19:55:02.591] result = <optimized out>
[19:55:02.591] __func__ = "FunctionCall5Coll"
[19:55:02.591] #11 0x000055cddc3b92ee in OidFunctionCall5Coll
(functionId=8355, collation=collation@entry=0,
arg1=arg1@entry=107545982648856, arg2=arg2@entry=3888,
arg3=arg3@entry=107820862916056, arg4=arg4@entry=0, arg5=<optimized
out>) at fmgr.c:1460
[19:55:02.591] flinfo = {fn_addr = <optimized out>, fn_oid =
<optimized out>, fn_nargs = <optimized out>, fn_strict = <optimized
out>, fn_retset = <optimized out>, fn_stats = <optimized out>,
fn_extra = <optimized out>, fn_mcxt = <optimized out>, fn_expr =
<optimized out>}
[19:55:02.591] #12 0x000055cddbe834ae in join_selectivity
(root=root@entry=0x61d00017c218, operatorid=operatorid@entry=3888,
args=0x6210003bc5d8, inputcollid=0,
jointype=jointype@entry=JOIN_INNER,
sjinfo=sjinfo@entry=0x7ffc1628db30) at
../../../../src/include/postgres.h:324
[19:55:02.591] oprjoin = <optimized out>
[19:55:02.591] result = <optimized out>
[19:55:02.591] __func__ = "join_selectivity"
[19:55:02.591] #13 0x000055cddbd8c18c in clause_selectivity_ext
(root=root@entry=0x61d00017c218, clause=0x6210003bc678,
varRelid=varRelid@entry=0, jointype=jointype@entry=JOIN_INNER,
sjinfo=sjinfo@entry=0x7ffc1628db30,
use_extended_stats=use_extended_stats@entry=true) at clausesel.c:841

I have changed the status to "Waiting on Author", feel free to post an
updated version, check CFBot and update the Commitfest entry
accordingly.

[1] - https://cirrus-ci.com/task/5698117824151552

Regards,
Vignesh



I cannot figure out why it aborts.

As Tom mentioned upthread about the test cases: similar to the
check_estimated_rows function in src/test/regress/sql/stats_ext.sql,
we can test it with something like:

create or replace function check_estimated_rows(text) returns table (ok bool)
language plpgsql as
$$
declare
    ln text;
    tmp text[];
    first_row bool := true;
begin
    for ln in
        execute format('explain analyze %s', $1)
    loop
        if first_row then
            first_row := false;
            tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
            return query select 0.2 < tmp[1]::float8 / tmp[2]::float8
and tmp[1]::float8 / tmp[2]::float8 < 5;
        end if;
    end loop;
end;
$$;

select * from check_estimated_rows($$select * from test_range_join_1,
test_range_join_2 where ir1 && ir2$$);
select * from check_estimated_rows($$select * from test_range_join_1,
test_range_join_2 where ir1 << ir2$$);
select * from check_estimated_rows($$select * from test_range_join_1,
test_range_join_2 where ir1 >> ir2$$);

Do you need 3 tables to do the test? Because we need to actually run
the query and then compare the estimated rows with the actually returned rows.
If you really execute the query with a 3-table join, it will take a lot of time.
So a two-table join with a WHERE qual should be fine?

/* Fast-forwards i and j to start of iteration */
+ for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0; i++);
+ for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0; j++);
+
+ /* Do the estimation on overlapping regions */
+ while (i < nhist1 && j < nhist2)
+ {
+ double cur_sel1,
+ cur_sel2;
+ RangeBound cur_sync;
+
+ if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) < 0)
+ cur_sync = hist1[i++];
+ else if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) > 0)
+ cur_sync = hist2[j++];
+ else
+ {
+ /* If equal, skip one */
+ cur_sync = hist1[i];
+

In this part, the range_cmp_bound_values() result in the "if / else if" is
computed twice; you can just do:
`
	int		cmp = range_cmp_bound_values(typcache, &hist1[i], &hist2[j]);

	if (cmp < 0)
		cur_sync = hist1[i++];
	else if (cmp > 0)
		cur_sync = hist2[j++];
	else
	{
		/* If equal, skip one */
		cur_sync = hist1[i];
		...
	}
`

Also, I think you can fold the following fast-forward loops into the main while loop:
+ for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0; i++);
+ for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0; j++);

Splitting range and multirange into 2 patches might be a good idea.
It seems the same function (calc_hist_join_selectivity) with the same
signature appears in both src/backend/utils/adt/multirangetypes_selfuncs.c
and src/backend/utils/adt/rangetypes_selfuncs.c, so the complaint from
previous mails has not been resolved.