Thread: Re: [HACKERS] Range Merge Join v1

Re: [HACKERS] Range Merge Join v1

From
Jeff Davis
Date:
Version 2 attached. Fixed a few issues, expanded tests, added docs.

A simple performance test (script attached) shows about a 5X
improvement when comparing against a nested loop with an inner
index-only scan over a gist index.

Even better, this doesn't require an index, so it will work even if
the input relations are subqueries.

Regards,
    Jeff Davis

On Thu, Apr 6, 2017 at 1:43 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> ========
> Example:
> ========
>
> Find different people using the same website at the same time:
>
> create table session(sessionid text, username text, during tstzrange);
> SELECT s1.username, s2.username, s1.during * s2.during
>   FROM session s1, session s2
>   WHERE s1.during && s2.during AND s1.username < s2.username
>
>
> =====================================
> Brief summary of previous discussion:
> =====================================
>
> http://www.postgresql.org/message-id/1334554850.10878.52.camel@jdavis
>
> - can indexes solve it already (parameterized paths, etc.)?
> - planner complexity (track path keys, mergejoinable equality, etc.)
> - spatial join algorithm choice
> - compare range merge join against existing indexes to see if any
>   indexing approach is viable
>
> ==========
> Externals:
> ==========
>
> No new syntax or other externals. Just improved execution strategy for
> joining ranges using the overlaps operator (&&).
>
> New syntax is possible, but I don't see a strong reason to support
> special range join syntax at this time.
>
> =============
> Optimizer Design
> =============
>
> I took the path of least resistance, so if I am doing something wrong
> please let me know. I hardwired it to look for the range overlaps
> operator when checking if it could do a merge join, and then add
> range_ops to restrictinfo->mergeopfamilies.
>
> Then, I have to suppress adding it as an equivalence class, because
> overlaps is not equality. It still adds single-member ECs to
> restrictinfo->right_ec and left_ec, but (I think) that's OK.
>
> Also, I have to prevent generating paths for right/full join, because
> the range join algorithm can't (currently) handle those.
>
> =============
> Costing
> =============
>
> Needs more consideration. Seems to do reasonable things in the few
> examples I tried.
>
> =============
> Executor Design
> =============
>
> See detailed comments in nodeMergejoin.c
>
> =============
> Performance
> =============
>
> Seems much better than index nest loop join when the join is not
> selective. I will post detailed numbers soon.
>
> If no index is available, range merge join is the only reasonable way
> to execute a range join. For instance, if the inner is not a leaf in
> the plan.
>
> =============
> Alternatives:
> =============
>
> It was suggested that I approach it as a general spatial-join
> problem. That has merit, but I rejected it for now because the
> algorithm that I think has the most promise is range-oriented
> anyway. By that I mean we would need to extract all of the dimensions
> into ranges first, and then perform the join. With that in mind, it
> makes sense to implement range joins first; and then later provide the
> APIs to get at the spatial dimensions so that we can implement other
> spatial joins as range joins.
>
> Another suggestion was to base it on hash join, but I never understood
> that proposal and it didn't seem to map very well to a spatial join.
>
> Yet another suggestion was to use some kind of temporary index. Some
> brief experiments I did indicated that it would be fairly slow (though
> more investigation might be useful here). Also, it doesn't provide any
> alternative to the nestloop-with-inner-index we already offer at the
> leaf level today.
>
> Regards,
>      Jeff Davis

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Range Merge Join v1

From
Jeff Davis
Date:
On Tue, Apr 11, 2017 at 12:17 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> Version 2 attached. Fixed a few issues, expanded tests, added docs.

It looks like the CF app only listed my perf test script. Re-attaching
rangejoin-v2.patch so that it appears in the CF app. Identical to
other rangejoin-v2.patch.

Regards,
     Jeff Davis

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Range Merge Join v1

From
Andrew Borodin
Date:
Hi, Jeff!

I'm planning to provide the review for this patch in future. I've done
some performance tesing (see attachment).
All in all, nothing important, everything works fine.
Tests were executed under current HEAD on Ubuntu 16 over MacBook Air.

I observe that:
1. Patch brings performance improvement for specified query
2. Performance improvement of Range Merge Join vs GiST Nested Loop
Join (on Index Only Scan) is between 4x and 6x
3. Performance improvement of RMJ with B-tree index vs GiST is between
4x and 10x
(due to lack of actually competing cases, here I omit T-tests, they
are not that important when speaking about 4x difference)
4. Range Merge Join is capable to consume expressional index with
exactly same effect as direct index
5. Other attributes residing in joined tables do not affect
performance improvement

I'll make code review some time later, probably next week. (I hope
there is no urge in this, since commitfest is in July, or is it
urgent?) BTW fixe typo in index specification of original performance
test (both indexes were for same table)

Best regards, Andrey Borodin.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Range Merge Join v1

From
Andrew Borodin
Date:
Hi, Jeff!

Sorry for being late. Actually, I had several unsuccessful attempts to
find something wrong with the patch.
Here's my review.

in pathkey.c

ecs = (EquivalenceClass **) palloc(nClauses * sizeof(EquivalenceClass *));
scores = (int *) palloc(nClauses * sizeof(int));
range_ecs = palloc(nClauses * sizeof(bool));

Third assignment has no cast.

And I have few questions:
1. Are there any types, which could benefit from Range Merge and are
not covered by this patch?
2. Can Range Merge handle merge of different ranges? Like int4range()
&& int8range() ?

My perf test script from the previous message was broken, here's fixed
one in the attachment.

This patch implements feature, contains new tests and passes old
tests, is documented and spec compliant. I do not see any reason why
not mark it "Ready for committer".

Best regards, Andrey Borodin.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Range Merge Join v1

From
Jeff Davis
Date:
On Tue, May 30, 2017 at 11:44 PM, Andrew Borodin <borodin@octonica.com> wrote:
> Hi, Jeff!

Hi!

> Sorry for being late. Actually, I had several unsuccessful attempts to
> find something wrong with the patch.
> Here's my review.
>
> in pathkey.c
>
> ecs = (EquivalenceClass **) palloc(nClauses * sizeof(EquivalenceClass *));
> scores = (int *) palloc(nClauses * sizeof(int));
> range_ecs = palloc(nClauses * sizeof(bool));
>
> Third assignment has no cast.

Will fix.

> And I have few questions:
> 1. Are there any types, which could benefit from Range Merge and are
> not covered by this patch?

I thought about this for a while, and the only thing I can think of
are range joins that don't explicitly use range types.

> 2. Can Range Merge handle merge of different ranges? Like int4range()
> && int8range() ?

Right now, there aren't even casts between range types. I think the
best way to handle that at this point would be to add casts among the
numeric ranges. There may be advantages to supporting any two ranges
where the contained types are part of the same opfamily, but it seems
a little early to add that complication.

> My perf test script from the previous message was broken, here's fixed
> one in the attachment.
>
> This patch implements feature, contains new tests and passes old
> tests, is documented and spec compliant. I do not see any reason why
> not mark it "Ready for committer".

Great!

I think there are a couple more things that could be done if we want
to. Let me know if you think these things should be done now, or if
they should be a separate patch later when the need arises:

* Support for r1 @> r2 joins (join on "contains" rather than "overlaps").
* Better integration with the catalog so that users could add their
own types that support range merge join.

Thank you for the review.

Regards,    Jeff Davis



Re: [HACKERS] Range Merge Join v1

From
Andrew Borodin
Date:
2017-06-02 19:42 GMT+05:00 Jeff Davis <pgsql@j-davis.com>:
> On Tue, May 30, 2017 at 11:44 PM, Andrew Borodin <borodin@octonica.com> wrote:
>> 1. Are there any types, which could benefit from Range Merge and are
>> not covered by this patch?
>
> I thought about this for a while, and the only thing I can think of
> are range joins that don't explicitly use range types.

Let me try to write && in SQL

select * from a join b where (a.min<=b.max and a.min>=b.min) or
(a.max<=b.max and a.max>=b.min);

Quite complicated. Here user knows that min <= max, but DB don't. If
we write it precisely we get hell lot of permutations.
Here's what I think:
1. For me, this feature seems hard to implement.
2. This feature also will be hard to commit solely, since it's use
case will be rare.
3. If this could yield unexpected performance for queries like
select * from t where x<y and b>c or a!=z or [other conditions]
and optimizer could think: "Aha! I'll sort it and do it fast" it'd be cool.

I do not think range joins that don't explicitly use range types are
possible right now...

>> 2. Can Range Merge handle merge of different ranges? Like int4range()
>> && int8range() ?
>
> Right now, there aren't even casts between range types. I think the
> best way to handle that at this point would be to add casts among the
> numeric ranges. There may be advantages to supporting any two ranges
> where the contained types are part of the same opfamily, but it seems
> a little early to add that complication.
Agree.


> I think there are a couple more things that could be done if we want
> to. Let me know if you think these things should be done now, or if
> they should be a separate patch later when the need arises:
>
> * Support for r1 @> r2 joins (join on "contains" rather than "overlaps").
> * Better integration with the catalog so that users could add their
> own types that support range merge join.

1. I believe this changes can be incremental. They constitute value
for the end user, then they are committable. This patch is not overly
complicated, but it's easier to do this stuff in small pieces.
2. The commitfest is 3 months away. If you will have new versions of
the patch, I'll review again. Maybe will spot some new things :)

Best regards, Andrey Borodin, Octonica.