Thread: 9.3 Pre-proposal: Range Merge Join
I hope this is not an inappropriate time for 9.3 discussions. The flip side of asking for submissions in the first couple commitfests means that I need to submit proposals now.

What is a Range Join?

See attached SQL for example. The key difference is that the join condition is not equality, but overlaps (&&).

Problem statement: slow. Nested loops are the only option, although they can benefit from an inner GiST index if available. But if the join is happening up in the plan tree somewhere, then it's impossible for any index to be available.

Proposed solution: a modified merge join that can handle ranges.

1. Order the ranges on both sides by the lower bound, then upper bound.
   Empty ranges can be excluded entirely.
2. Left := first range on left, Right := first range on right
3. If Left or Right is empty, terminate.
4. If lower(Left) > upper(Right), discard Right, goto 2
5. If lower(Right) > upper(Left), discard Left, goto 2
6. return (Left, Right) as joined tuple
7. Right := next range on right
8. goto 3

If step 4 or step 5 keeps getting triggered, and a btree index is available (ordered by lower bound), we can re-probe to go to the correct position, and consider that the new top range on that side. This is an optimization for the case where there are large numbers of ranges with no match on the other side.

Thanks to Nathan Boley for helping me devise this algorithm. However, any bugs are mine alone ;)

Weaknesses: I haven't thought through the optimizer support, but I suspect it will be hard to be very accurate in the costing. That might be OK, because there aren't very many options anyway, but I'll need to think it through.

Questions:

* Is this idea sane? -- that is, are ranges important enough that people are willing to maintain a new operator?
* The more general problem might be "spatial joins", which can operate in N dimensions, and I doubt this would work very well in that case. Does someone know of a spatial join algorithm (without IP claims) that would be as good as this one for ranges?
* Other thoughts?

Regards,
	Jeff Davis
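As a concrete illustration of the steps above, here is a minimal Python sketch (hypothetical, not from the proposal or the attached SQL). It assumes ranges are (lower, upper) tuples with inclusive bounds, empty ranges already removed, and both inputs in memory; "goto 2" on the right side is modeled as rewinding to a single mark, and the btree re-probe optimization is omitted.

    def range_merge_join(left, right):
        # Step 1: order both sides by lower bound, then upper bound.
        left, right = sorted(left), sorted(right)
        out = []
        mark = 0  # first right-side range not yet discarded
        for lo_l, hi_l in left:
            ri = mark  # step 2 / restore: rewind Right to the mark
            while ri < len(right):
                lo_r, hi_r = right[ri]
                if lo_l > hi_r:
                    # Step 4: Right ends before Left starts, and later left
                    # ranges start no earlier, so it can never match again.
                    # It can only be discarded from the front (the mark).
                    if ri == mark:
                        mark += 1
                    ri += 1
                elif lo_r > hi_l:
                    # Step 5: Right (and everything after it) starts after
                    # Left ends; move on to the next left range.
                    break
                else:
                    out.append(((lo_l, hi_l), (lo_r, hi_r)))  # step 6
                    ri += 1                                   # step 7
        return out

    # range_merge_join([(1, 3), (5, 7)], [(2, 4), (6, 9), (8, 9)])
    # -> [((1, 3), (2, 4)), ((5, 7), (6, 9))]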
Your proposal makes me think of something similar which might be useful: INclusion constraints. As "exclusion constraints" might be thought of as a generalization of unique/key constraints, "inclusion constraints" are like a generalization of foreign key constraints. Inclusion constraints basically allow some comparison operator other than is-equal to test whether values in one table match values in another table, and the constraint allows the former if the test results in true. An example of such an inclusion test is whether the range in one table is contained in a range in another table. I assume/hope that, similarly, now that we have range types in 9.2, the existing exclusion constraints can be used with range comparison operators.

As to your actual proposal, it sounds like a generalization of the relational join or set intersection operator where, instead of comparing sets defined in terms of an enumeration of discrete values, we are comparing sets defined by a range, which conceptually have infinite values depending on the data type the range is defined over. But if we're doing this, then it would seem to make sense to go further and see if we have set analogies for all of our relational or set operators, should we want to work with non-discrete sets.

Now this sounds interesting in theory, but I would also assume that these could be implemented by an extension in terms of existing normal relational operators, where each range value is a discrete value, combined with operators for unioning, differencing, etc. of ranges. A relation of ranges can effectively represent a discontinuous range; in that case, the empty discontinuous range is also canonically representable by a relation with zero tuples.

Jeff, I get the impression your proposal is partly about helping performance by supporting this internally, rather than just defining it as a SQL function; am I right?

-- Darren Duncan
On 16.04.2012 08:40, Jeff Davis wrote:
> Does someone know of a spatial join algorithm (without IP claims) that
> would be as good as this one for ranges?

I'd be happy with an algorithm that's specific to ranges, too, but my gut feeling is that there has to be a lot of research on spatial join algorithms out there. Implementing one of them would be more widely applicable, so I think we should look into spatial join algorithms first and see how far that gets us, before we write something specific to ranges.

-- Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Jeff Davis <pgsql@j-davis.com> writes:
> Proposed solution: a modified merge join that can handle ranges.
>
> 1. Order the ranges on both sides by the lower bound, then upper bound.
>    Empty ranges can be excluded entirely.
> 2. Left := first range on left, Right := first range on right
> 3. If Left or Right is empty, terminate.
> 4. If lower(Left) > upper(Right), discard Right, goto 2
> 5. If lower(Right) > upper(Left), discard Left, goto 2
> 6. return (Left, Right) as joined tuple
> 7. Right := next range on right
> 8. goto 3

This is surely not correct in detail. As written, it will be impossible for any tuple on the right side to be joined to more than one left-side tuple. You will need something analogous to the mark-and-restore rewinding logic in standard mergejoin to make this work.

I can't escape the feeling that we could do this with more-or-less standard mergejoin logic paying attention to some fuzzy notion of equality, with the question of whether a given pair of ranges actually overlap being deferred to a filter condition. Not quite sure how to make that work, though.

> * Is this idea sane? -- that is, are ranges important enough that
>   people are willing to maintain a new operator?

Dunno. It might be easier to sell the idea of adding support for range joins in a couple of years, after we've seen how much use ranges get.

I will note that mergejoin is not only one of the more complicated executor modules, but it accounts for a huge percentage of the planner's complexity; which doesn't exactly make me sympathize with the notion of let's-copy-and-paste-all-that. It'd be a lot better if we could build it as a small tweak to mergejoin rather than an independent thing. (Having said that, I have no idea how the planner support could work at all, because its treatment of mergejoins is intimately tied to mergejoinable equality and EquivalenceClasses and PathKeys, which don't seem readily extensible to the range situation.)

			regards, tom lane
On Mon, Apr 16, 2012 at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dunno. It might be easier to sell the idea of adding support for range
> joins in a couple of years, after we've seen how much use ranges get.

Once we've started the journey towards range types, we must complete it reasonably quickly. Having a partially implemented, unoptimised feature is the same as not having the feature at all, so it will remain unused until it really works.

We could say that about many features, but if Jeff is championing this, I'd say go for it.

-- Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 16, 2012 at 10:29 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 16.04.2012 08:40, Jeff Davis wrote:
>> Does someone know of a spatial join algorithm (without IP claims) that
>> would be as good as this one for ranges?
>
> I'd be happy with an algorithm that's specific to ranges, too, but my gut
> feeling is that there has to be a lot of research on spatial join
> algorithms out there. Implementing one of them would be more widely
> applicable, so I think we should look into spatial join algorithms first
> and see how far that gets us, before we write something specific to
> ranges.
There is a good overview article about spatial joins:
http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf

It shows that there are a lot of methods based on building temporary indexes. That might mean that building a temporary GiST index is not a bad idea.

Also, there are methods which look like an extension of Jeff Davis's proposal to the multidimensional case: the Plane Sweep and External Plane Sweep methods. However, they might require sophisticated data structures for the "sweep", and I believe they would need them for good performance.
With best regards,
Alexander Korotkov.
On Mon, 2012-04-16 at 02:52 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > 1. Order the ranges on both sides by the lower bound, then upper bound.
> >    Empty ranges can be excluded entirely.
> > 2. Left := first range on left, Right := first range on right
> > 3. If Left or Right is empty, terminate.
> > 4. If lower(Left) > upper(Right), discard Right, goto 2
> > 5. If lower(Right) > upper(Left), discard Left, goto 2
> > 6. return (Left, Right) as joined tuple
> > 7. Right := next range on right
> > 8. goto 3
>
> This is surely not correct in detail. As written, it will be impossible
> for any tuple on the right side to be joined to more than one left-side
> tuple. You will need something analogous to the mark-and-restore
> rewinding logic in standard mergejoin to make this work.

Every time you discard a tuple on the left, you go to step 2, which rewinds the right side back to the first non-discarded tuple.

So, implemented using mark and restore:

* start off with the marks at the first tuple on each side
* "discard" means move the mark down a tuple
* setting it back to the first range means restoring to the mark
* going to the next range (step 7) just means getting another tuple, without changing the mark

Only one side really needs the mark and restore logic, but it was easier to write the pseudocode in a symmetrical way (except step 7).

> I will note that mergejoin is not only one of the more complicated
> executor modules, but it accounts for a huge percentage of the planner's
> complexity; which doesn't exactly make me sympathize with the notion of
> let's-copy-and-paste-all-that. It'd be a lot better if we could build
> it as a small tweak to mergejoin rather than an independent thing.
>
> (Having said that, I have no idea how the planner support could work
> at all, because its treatment of mergejoins is intimately tied to
> mergejoinable equality and EquivalenceClasses and PathKeys, which don't
> seem readily extensible to the range situation.)

I think this is the more important problem area. It's clear that I'll need to do some more investigation into the planning.

Regards,
	Jeff Davis
On Mon, 2012-04-16 at 16:22 +0400, Alexander Korotkov wrote:
> There is a good overview article about spatial joins:
> http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf

Thank you, that's exactly the kind of overview I was looking for.

> It shows that there are a lot of methods based on building temporary
> indexes. That might mean that building a temporary GiST index is not a
> bad idea.

That had occurred to me, but I was hesitant to only use temp indexes. It still doesn't really offer a good solution when both sides of the join are relatively large (because of random I/O). Also, the build speed of the index would be more important than it is now.

On the other hand, if I tackle the problem using temp indexes, it would be a more general solution useful for many problems (for instance, we really need a better solution when the result of a subquery doesn't fit in a work_mem-sized hashtable). We may end up with some kind of combination (as the sweep seems to be; see below).

> Also, there are methods which look like an extension of Jeff Davis's
> proposal to the multidimensional case: the Plane Sweep and External
> Plane Sweep methods. However, they might require sophisticated data
> structures for the "sweep", and I believe they would need them for
> good performance.

That method looks closer to what I had in mind. My Range Join proposal is essentially the same as the sweep method, except that it only joins on one dimension, and the rest of the dimensions have to be checked with RECHECK. So this still works for an N-dimensional join, as long as a single dimension is fairly selective.

The sweep method seems to do the first dimension with the sweep, and maintains a changing index that it uses to do the lookup. In other words, it's essentially just a way to do the RECHECK on the other dimensions in less than O(n) time. So, if the sweep dimension is not selective at all, this degenerates to the temporary index method (or something close, anyway).

The fact that, in the sweep method, there is still one "special" dimension makes me think that it will be easier to make the API work based on ranges (which is a big win because postgres already knows about ranges). If so, that makes it easier to divide the project into stages, because we could get it working for ranges and then extend it to support other dimensions more efficiently later.

Regards,
	Jeff Davis
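To make the sweep-plus-RECHECK idea concrete, here is a hypothetical Python sketch along the same lines as the earlier range join sketch. 2-D boxes, assumed to be ((xlo, xhi), (ylo, yhi)) tuples with inclusive bounds, are merge-joined on the x range (the "special" dimension), and y overlap is checked as a RECHECK-style filter:

    def box_join(left, right):
        # Sort by the x range, the sweep dimension.
        left, right = sorted(left), sorted(right)
        out = []
        mark = 0
        for lbox in left:
            (xlo_l, xhi_l), (ylo_l, yhi_l) = lbox
            ri = mark
            while ri < len(right):
                rbox = right[ri]
                (xlo_r, xhi_r), (ylo_r, yhi_r) = rbox
                if xlo_l > xhi_r:      # rbox entirely before lbox in x
                    if ri == mark:
                        mark += 1
                    ri += 1
                elif xlo_r > xhi_l:    # rbox (and the rest) after lbox in x
                    break
                else:
                    # RECHECK: x ranges overlap; emit only if y ranges do too.
                    if ylo_l <= yhi_r and ylo_r <= yhi_l:
                        out.append((lbox, rbox))
                    ri += 1
        return out

If the x dimension is not selective, the filter rejects most candidate pairs and the work degenerates toward examining every x-overlapping pair, matching the observation above about non-selective sweep dimensions.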
On Sun, 2012-04-15 at 23:18 -0700, Darren Duncan wrote:
> Your proposal makes me think of something similar which might be useful:
> INclusion constraints. As "exclusion constraints" might be thought of as
> a generalization of unique/key constraints, "inclusion constraints" are
> like a generalization of foreign key constraints. Inclusion constraints
> basically allow some comparison operator other than is-equal to test
> whether values in one table match values in another table, and the
> constraint allows the former if the test results in true. An example of
> such an inclusion test is whether the range in one table is contained in
> a range in another table. I assume/hope that, similarly, now that we have
> range types in 9.2, the existing exclusion constraints can be used with
> range comparison operators.

Yes, Range Foreign Keys are one of the loose ends for Range Types that I'd like to wrap up.

> As to your actual proposal, it sounds like a generalization of the
> relational join or set intersection operator where, instead of comparing
> sets defined in terms of an enumeration of discrete values, we are
> comparing sets defined by a range, which conceptually have infinite
> values depending on the data type the range is defined over. But if
> we're doing this, then it would seem to make sense to go further and see
> if we have set analogies for all of our relational or set operators,
> should we want to work with non-discrete sets.
>
> Now this sounds interesting in theory, but I would also assume that these
> could be implemented by an extension in terms of existing normal
> relational operators, where each range value is a discrete value,
> combined with operators for unioning, differencing, etc. of ranges. A
> relation of ranges can effectively represent a discontinuous range; in
> that case, the empty discontinuous range is also canonically
> representable by a relation with zero tuples.
>
> Jeff, I get the impression your proposal is partly about helping
> performance by supporting this internally, rather than just defining it
> as a SQL function; am I right?

Per my proposal, the problem statement is that it's "slow", so speeding it up is really the entire proposal ;)

Extending the syntax might be interesting as well, but I don't have a proposal for that.

Regards,
	Jeff Davis
On Tue, Apr 17, 2012 at 12:12 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2012-04-16 at 16:22 +0400, Alexander Korotkov wrote:
> > There is a good overview article about spatial joins:
> > http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf
>
> Thank you, that's exactly the kind of overview I was looking for.
>
> > It shows that there are a lot of methods based on building temporary
> > indexes. That might mean that building a temporary GiST index is not
> > a bad idea.
>
> That had occurred to me, but I was hesitant to only use temp indexes. It
> still doesn't really offer a good solution when both sides of the join
> are relatively large (because of random I/O). Also, the build speed of
> the index would be more important than it is now.
Note that the amount of random I/O during GiST index build significantly decreased after my GSoC project on buffering GiST index build. However, GiST index build is still quite CPU-expensive. This problem could be avoided by supporting other methods of index build (other than producing a lot of penalty and picksplit calls). Hilbert curve and similar methods could help here. Implementation of such methods would require an extension of the GiST interface.
--
With best regards,
Alexander Korotkov.
On Mon, Apr 16, 2012 at 9:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> That had occurred to me, but I was hesitant to only use temp indexes. It
> still doesn't really offer a good solution when both sides of the join
> are relatively large (because of random I/O). Also, the build speed of
> the index would be more important than it is now.

The thing I like most about temp indexes is that they needn't be temporary.

I'd like to see something along the lines of demand-created optional indexes, where we reclaim the space/maintenance overhead according to some cache management scheme. The more space you have, the more of the important ones hang around. Roughly the same idea applies to materialised views.

-- Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
> I'd like to see something along the lines of demand-created optional
> indexes, where we reclaim the space/maintenance overhead according to
> some cache management scheme. The more space you have, the more of the
> important ones hang around. Roughly the same idea applies to
> materialised views.

+10; this sort of demand-driven optimization could be the database equivalent of Java's HotSpot (which accomplished the amazing task of making Java kinda fastish sometimes).

Jay
On Apr 16, 2012, at 1:40 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> See attached SQL for example.
>
> Problem statement: slow. Nested loops are the only option, although they
> can benefit from an inner GiST index if available. But if the join is
> happening up in the plan tree somewhere, then it's impossible for any
> index to be available.

Hmm. This sounds like something that Tom's recent work on parameterized plans ought to have fixed, or if not, it seems closely related. And by "this" I mean specifically the ability to use a GiST index to drive a nested loop that is higher up in the plan tree than the immediate parent of the index scan.

This is not an argument against your proposal, just an observation.

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> On Apr 16, 2012, at 1:40 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> See attached SQL for example.
>>
>> Problem statement: slow. Nested loops are the only option, although they
>> can benefit from an inner GiST index if available. But if the join is
>> happening up in the plan tree somewhere, then it's impossible for any
>> index to be available.

> Hmm. This sounds like something that Tom's recent work on
> parameterized plans ought to have fixed, or if not, it seems closely
> related.

Not really. It's still going to be a nestloop, and as such not terribly well suited for queries where there are a lot of matchable rows on both sides. The work I've been doing is really about making nestloops usable in cases where join order restrictions formerly prevented it --- but Jeff's complaint has nothing to do with that. (This thought also makes me a bit dubious about the nearby suggestions that more indexes will fix it.)

			regards, tom lane
On Mon, 2012-04-16 at 22:20 +0100, Simon Riggs wrote:
> On Mon, Apr 16, 2012 at 9:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > That had occurred to me, but I was hesitant to only use temp indexes. It
> > still doesn't really offer a good solution when both sides of the join
> > are relatively large (because of random I/O). Also, the build speed of
> > the index would be more important than it is now.
>
> The thing I like most about temp indexes is that they needn't be temporary.
>
> I'd like to see something along the lines of demand-created optional
> indexes, where we reclaim the space/maintenance overhead according to
> some cache management scheme. The more space you have, the more of the
> important ones hang around. Roughly the same idea applies to
> materialised views.

I think to make an informed decision, the next thing I need to do is hack up a prototype version of my merge join idea and see how well it performs against an index nestloop today. It seems to me that both approaches may have merit independently.

Regards,
	Jeff Davis
On Mon, Apr 16, 2012 at 10:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The thing I like most about temp indexes is that they needn't be temporary.
>
> I'd like to see something along the lines of demand-created optional
> indexes, where we reclaim the space/maintenance overhead according to
> some cache management scheme. The more space you have, the more of the
> important ones hang around. Roughly the same idea applies to
> materialised views.

I find this a) really scary, b) a huge increase in scope, and c) kind of pointless.

a) DDL and shared caches require all kinds of additional synchronization that would be really annoying to incur in the middle of DML queries unexpectedly. If these indexes are to be useful for other transactions, they would also impose a hidden and uncontrollable cost on all updates and inserts on the table.

b) I think it would be a lot harder to get working cleanly. You would need to invent a UI to control the lifetime and resources used by these objects, and deal with duplication between manually created and dynamically created objects. It's fundamentally a huge shift in the design of the database, changing what things the user maintains as part of their schema and what things the database maintains transparently. This is a big project on its own, aside from the technical side of things.

c) If these indexes are useful for many queries, then the user can choose to create them themselves. The indexes you're proposing would be just as expensive as any of those indexes and offer few advantages aside from their automaticness. The same could be accomplished with some external daemon that just looked at the query logs and determined which indexes to create or not.

The main advantage of creating dynamic indexes as part of the query would be lost: namely, that these would be local data structures that don't need to be synchronized with other transactions and don't need to be updated by other transactions. They're just part of the query execution, the way spilled hash tables and tuplesort tapes are. You could build indexes on materialized data resulting from earlier joins or aggregates, and so on. The point is that if you make them equivalent to normal indexes, just dynamically maintained, then all you've done is change the way normal indexes work, but you haven't really changed the set of queries they're useful for. That might be a neat UI feature, but if you want to change the set of queries postgres can handle efficiently at all, then you need something that's fundamentally different from a table index.

As an aside, I think some of what you're looking for could be better handled with some kind of query result cache that could keep around the materialized data from plan nodes until an event happens that changes the data. This might be related to the underlying infrastructure needed for materialized views, but I haven't thought too deeply about it. It seems like a lot of work and a big change from the current very isolated per-process execution model.

-- greg
On Mon, Apr 16, 2012 at 1:43 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2012-04-16 at 02:52 -0400, Tom Lane wrote:
>> This is surely not correct in detail. As written, it will be impossible
>> for any tuple on the right side to be joined to more than one left-side
>> tuple. You will need something analogous to the mark-and-restore
>> rewinding logic in standard mergejoin to make this work.
>
> Every time you discard a tuple on the left, you go to step 2, which
> rewinds the right side back to the first non-discarded tuple.
>
> So, implemented using mark and restore:
>
> * start off with the marks at the first tuple on each side
> * "discard" means move the mark down a tuple
> * setting it back to the first range means restoring to the mark
> * going to the next range (step 7) just means getting another tuple,
>   without changing the mark
>
> Only one side really needs the mark and restore logic, but it was easier
> to write the pseudocode in a symmetrical way (except step 7).

I'm actually not sure these are equivalent formulations. Suppose one side has [i,i] where i ranges from 1 to 10000000, and the other side the exact same thing plus [1,10000000]. That one really big range will come up second on the right side, and you will not be able to discard it until you reach the end of the join. If you just keep rewinding the right side, you're going to end up with O(n^2) behavior, whereas if you can discard tuples "from the middle" on the right side, then you will get O(n) behavior, which is a big difference.

In other words, in your original algorithm, the tuples that you discard in step 4 or 5 need not be the first remaining tuple on whichever side of the join we're talking about.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
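Robert's point can be reproduced with a small, hypothetical comparison counter built on the mark-and-restore sketch from earlier in the thread:

    def count_comparisons(left, right):
        left, right = sorted(left), sorted(right)
        mark = comparisons = 0
        for lo_l, hi_l in left:
            ri = mark
            while ri < len(right):
                comparisons += 1
                lo_r, hi_r = right[ri]
                if lo_l > hi_r:
                    if ri == mark:   # can only discard at the mark
                        mark += 1
                    ri += 1
                elif lo_r > hi_l:
                    break
                else:
                    ri += 1
        return comparisons

    n = 1000
    points = [(i, i) for i in range(1, n + 1)]
    print(count_comparisons(points, points))             # ~3n (linear)
    print(count_comparisons(points, points + [(1, n)]))  # ~n*n/2 (quadratic)

The big range sorts near the front of the right side and overlaps every left tuple, so it never satisfies the discard test; the mark stays pinned behind it, and every left tuple rescans from there.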
On Mon, Apr 16, 2012 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm. This sounds like something that Tom's recent work on
>> parameterized plans ought to have fixed, or if not, it seems closely
>> related.
>
> Not really. It's still going to be a nestloop, and as such not terribly
> well suited for queries where there are a lot of matchable rows on both
> sides. The work I've been doing is really about making nestloops usable
> in cases where join order restrictions formerly prevented it --- but
> Jeff's complaint has nothing to do with that. (This thought also makes
> me a bit dubious about the nearby suggestions that more indexes will
> fix it.)

I thought Jeff was parenthetically complaining about cases like A LEFT JOIN (B INNER JOIN C ON b.y = c.y) ON a.x && b.x. That presumably would require the parameterized-path stuff to have any chance of doing partial index scans over B. However, I understand that's not the main issue here.

One thing that I think needs some analysis is when the range join idea is better or worse than a nested loop with inner index-scan, because potentially those are the options the planner has to choose between, and the costing model had better know enough to make the right thing happen. It strikes me that the nested loop with inner index-scan is likely to be a win when there are large chunks of the indexed relation that the nestloop never needs to visit at all - imagine small JOIN big ON small.a && big.a, for example. I suppose the really interesting question is how much we can save when the entirety of both relations has to be visited anyway - it seems promising, but I guess we won't know for sure without testing it.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 16, 2012 at 1:43 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> ... Only one side really needs the mark and restore logic, but it was
>> easier to write the pseudocode in a symmetrical way (except step 7).

> I'm actually not sure these are equivalent formulations. Suppose one
> side has [i,i] where i ranges from 1 to 10000000, and the other side
> the exact same thing plus [1,10000000]. That one really big range
> will come up second on the right side, and you will not be able to
> discard it until you reach the end of the join. If you just keep
> rewinding the right side, you're going to end up with O(n^2) behavior,
> whereas if you can discard tuples "from the middle" on the right side,
> then you will get O(n) behavior, which is a big difference. In other
> words, in your original algorithm, the tuples that you discard in step
> 4 or 5 need not be the first remaining tuple on whichever side of the
> join we're talking about.

It would be a pretty weird implementation of mergejoin that could "discard tuples from the middle" of an input stream. Or to be more specific, it wouldn't be the mergejoin itself that could do that at all --- you'd need the input plan node to be some sort of tweaked version of tuplestore or tuplesort that could respond to a request like that.

I can't escape the feeling that Jeff has chosen the wrong basis for his thought experiment, and that what he really ought to be thinking about is hashjoin, which keeps an in-memory table that it could easily modify on the fly if it chose to. The multi-batch aspect of hybrid hashjoin could be useful too (IOW, when you're out of better ideas, throw the tuple back in the water to process later).

This is just handwaving of course. I think some digging in the spatial-join literature would likely find ideas better than any of these.

			regards, tom lane
On Tue, 2012-04-17 at 14:24 -0400, Robert Haas wrote:
> I thought Jeff was parenthetically complaining about cases like A LEFT
> JOIN (B INNER JOIN C ON b.y = c.y) ON a.x && b.x. That presumably
> would require the parameterized-path stuff to have any chance of doing
> partial index scans over B. However, I understand that's not the main
> issue here.

To take the mystery out of it, I was talking about any case where an index scan is impossible or impractical. For instance, let's say the ranges are computed values. Just to make it really impossible, let's say the ranges are computed from columns in two different tables joined in a subquery.

But yes, the ability of the planner to find the plan is also an issue (hopefully less of one with the recent improvements).

> One thing that I think needs some analysis is when the range join idea
> is better or worse than a nested loop with inner index-scan, because
> potentially those are the options the planner has to choose between,
> and the costing model had better know enough to make the right thing
> happen. It strikes me that the nested loop with inner index-scan is
> likely to be a win when there are large chunks of the indexed relation
> that the nestloop never needs to visit at all - imagine small JOIN big
> ON small.a && big.a, for example. I suppose the really interesting
> question is how much we can save when the entirety of both relations
> has to be visited anyway - it seems promising, but I guess we won't
> know for sure without testing it.

Right, I will need to come up with a prototype that can at least test the executor piece. I suspect that the plan choice won't be all that different from an ordinary index nestloop versus mergejoin case, but with much worse cardinality estimates to work with.

Regards,
	Jeff Davis
On Tue, 2012-04-17 at 14:03 -0400, Robert Haas wrote:
> I'm actually not sure these are equivalent formulations. Suppose one
> side has [i,i] where i ranges from 1 to 10000000, and the other side
> the exact same thing plus [1,10000000]. That one really big range
> will come up second on the right side, and you will not be able to
> discard it until you reach the end of the join. If you just keep
> rewinding the right side, you're going to end up with O(n^2) behavior,
> whereas if you can discard tuples "from the middle" on the right side,
> then you will get O(n) behavior, which is a big difference. In other
> words, in your original algorithm, the tuples that you discard in step
> 4 or 5 need not be the first remaining tuple on whichever side of the
> join we're talking about.

To illustrate the problem (slightly modified), I'll write the sets out verbatim rather than use range syntax:

  {1,2,3}     {1,2,3,4,5,6,7,8,9}
  { 2,3,4}    { 2,3,4}. . . . .
  { 3,4,5}    { 3,4,5}. . . .
  { 4,5,6}    { 4,5,6}. . .
  { 5,6,7}    { 5,6,7}. .
  { 6,7,8}    { 6,7,8}.
  { 7,8,9}    { 7,8,9}

The "." are supposed to represent a "shadow" that the large range [1,9] casts below it. This shadow prevents the discarding of [2,4] on the RHS even when processing range [5,7] on the LHS, because we can't discard out of the middle.

Note that, if you just have some large ranges and a lot of overlap, that's not really a problem with the algorithm; it's just a large result to compute. The problem comes when the ranges vary in size by quite a lot, and there are many ranges that could be eliminated but can't be because of the shadow.

This problem can be mitigated substantially, I believe. Let me change the algorithm (I don't like the way the pseudocode was structured anyway), and word it a little more loosely:

1. Look at the top ranges on each side. Choose the one with the greater upper bound, and call that Range1 from Side1, and the other range Range2 from Side2. If either Range1 or Range2 is empty, terminate.
2. Scan down Side2, discarding ranges that are strictly before, joining with ranges that overlap, and stopping when you hit a range that is strictly after.
3. Now, discard Range1, reset Side2 to the first non-discarded range, and goto 1.

The benefit is that, in step 1, we choose a side that will *always* discard the top tuple. And if we choose the one with the greater upper bound, then we are going to eliminate the largest shadow. That doesn't eliminate the problem entirely, but it seems like it would reduce it a lot.

Regarding the idea of discarding tuples in the middle, that might be an interesting approach as well. It might be as simple as setting a flag in the tuple header (like was done for full hash joins). Still not perfect, but it would make redundant checks very cheap. Combined with my strategy, there's a good chance that we practically eliminate the problem.

Regards,
	Jeff Davis
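Here is a hypothetical Python sketch of the reworded algorithm, under the same assumptions as the earlier sketches. "Discard" is still implemented as advancing a per-side mark, so shadowed ranges in the middle are skipped rather than removed, and the flag-in-the-tuple-header mitigation is not modeled:

    def range_merge_join_v2(a, b):
        a, b = sorted(a), sorted(b)
        out = []
        ai = bi = 0  # marks: first non-discarded range on each side
        while ai < len(a) and bi < len(b):
            # Step 1: Side1 is the side whose top range has the greater
            # upper bound -- the one casting the larger shadow.
            swap = a[ai][1] < b[bi][1]
            s1, i1, s2, i2 = (b, bi, a, ai) if swap else (a, ai, b, bi)
            lo1, hi1 = s1[i1]
            # Step 2: scan Side2, discarding ranges strictly before Range1,
            # joining overlaps, stopping at the first range strictly after.
            j = i2
            while j < len(s2):
                lo2, hi2 = s2[j]
                if hi2 < lo1:
                    if j == i2:          # discardable from the front only
                        i2 += 1
                    j += 1
                elif lo2 > hi1:
                    break
                else:
                    out.append((s2[j], s1[i1]) if swap else (s1[i1], s2[j]))
                    j += 1
            # Step 3: Range1 is *always* discarded, so every pass makes
            # progress on the side with the larger shadow.
            i1 += 1
            ai, bi = (i2, i1) if swap else (i1, i2)
        return out

On Robert's example, once the big range reaches the top of its side it is chosen as Range1, joins everything it overlaps in a single pass, and is then discarded; it never pins the mark, so the behavior is roughly linear again.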
On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote:
> It would be a pretty weird implementation of mergejoin that could
> "discard tuples from the middle" of an input stream. Or to be more
> specific, it wouldn't be the mergejoin itself that could do that at all
> --- you'd need the input plan node to be some sort of tweaked version of
> tuplestore or tuplesort that could respond to a request like that.

As I said in my reply to Robert, I think there are some ways we can make this idea work.

> I can't escape the feeling that Jeff has chosen the wrong basis for his
> thought experiment, and that what he really ought to be thinking about
> is hashjoin, which keeps an in-memory table that it could easily modify
> on the fly if it chose to. The multi-batch aspect of hybrid hashjoin
> could be useful too (IOW, when you're out of better ideas, throw the
> tuple back in the water to process later).

Obviously hashing is not going to be much use for anything but equality, so I believe this approach is very similar to the temporary-index method, except with batching, and always keeping the index in memory.

I don't think we would get the partitioning benefit of hashjoin, because we'd have to put the same tuple in multiple partitions, so it's probably better to just leave the outer side intact. But in-memory indexes and multiple passes over the outer side seem like a reasonable alternative, particularly because an in-memory index might be very fast (to build as well as query).

> This is just handwaving of course. I think some digging in the
> spatial-join literature would likely find ideas better than any of
> these.

I will look in some more detail. The merge-like approach did seem to be represented in the paper referenced by Alexander (the external plane sweep), but it also refers to several methods based on partitioning.

I'm beginning to think that more than one of these ideas has merit.

Regards,
	Jeff Davis
Hi Jeff,

2012/4/19 Jeff Davis <pgsql@j-davis.com>:
> On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote:
(...)
>> This is just handwaving of course. I think some digging in the
>> spatial-join literature would likely find ideas better than any of
>> these.
>
> I will look in some more detail. The merge-like approach did seem to be
> represented in the paper referenced by Alexander (the external plane
> sweep), but it also refers to several methods based on partitioning.
>
> I'm beginning to think that more than one of these ideas has merit.

I'm perhaps really late in this discussion, but I was just made aware of it via the tweet from Josh Berkus about "PostgreSQL 9.3: Current Feature Status".

What is the reason to dig into spatial joins when there is PostGIS, a bullet-proof and fast implementation?

Yours, Stefan
On Thu, 2013-01-17 at 21:03 +0100, Stefan Keller wrote:
> I'm perhaps really late in this discussion, but I was just made aware
> of it via the tweet from Josh Berkus about "PostgreSQL 9.3: Current
> Feature Status".
>
> What is the reason to dig into spatial joins when there is PostGIS,
> a bullet-proof and fast implementation?

Hi Stefan,

You are certainly not too late.

PostGIS uses the existing postgres infrastructure to do spatial joins. That means it either does a cartesian product and filters the results, or it uses a nested loop with an inner index scan.

That isn't too bad, but it could be better. I am trying to introduce a new way to do spatial joins which will perform better in more circumstances. For instance, we can't use an inner index if the input tables are actually subqueries, because we can't index a subquery.

Regards,
	Jeff Davis
Hi Jeff,

2013/1/18 Jeff Davis <pgsql@j-davis.com>:
> PostGIS uses the existing postgres infrastructure to do spatial joins.
> That means it either does a cartesian product and filters the results,
> or it uses a nested loop with an inner index scan.
>
> That isn't too bad, but it could be better. I am trying to introduce a
> new way to do spatial joins which will perform better in more
> circumstances. For instance, we can't use an inner index if the input
> tables are actually subqueries, because we can't index a subquery.

Sounds good. Have you already had contact with, e.g., Paul (cc'ed just in case)? And will this clever index also be available within all these hundreds of PostGIS functions?

Regards, Stefan
On Fri, 2013-01-18 at 12:25 +0100, Stefan Keller wrote:
> Sounds good. Have you already had contact with, e.g., Paul (cc'ed just
> in case)? And will this clever index also be available within all these
> hundreds of PostGIS functions?

Yes, I've brought the idea up to Paul before, but thank you.

It's not an index exactly; it's a new join algorithm that should be more efficient for spatial joins. That being said, it's not done, so it could be an index by the time I'm finished ;)

When it is done, it will probably need some minor work in PostGIS to make use of it. But that work is done at the datatype level, and PostGIS only has a couple data types, so I don't think it will be a lot of work.

Regards,
	Jeff Davis