Re: 9.3 Pre-proposal: Range Merge Join - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: 9.3 Pre-proposal: Range Merge Join
Date
Msg-id 1334607138.19915.89.camel@jdavis
Whole thread Raw
In response to Re: 9.3 Pre-proposal: Range Merge Join  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: 9.3 Pre-proposal: Range Merge Join  (Alexander Korotkov <aekorotkov@gmail.com>)
Re: 9.3 Pre-proposal: Range Merge Join  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Mon, 2012-04-16 at 16:22 +0400, Alexander Korotkov wrote:

> There is a good overview article about spatial joins.
> http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf 

Thank you, that's exactly the kind of overview I was looking for.

> It shows that there is a lot of methods based on building temporaty
> indexes. It might mean that building of temporary GiST index is not a
> bad idea.

That had occurred to me, but I was hesitant to only use temp indexes. It
still doesn't really offer a good solution when both sides of the join
are relatively large (because of random I/O). Also the build speed of
the index would be more important than it is now.

On the other hand, if I tackle the problem using temp indexes, it would
be a more general solution useful for many problems (for instance, we
really need a better solution when the result of a subquery doesn't fit
in a work_mem-sized hashtable).

We may end up with some kind of combination (as the sweep seems to be;
see below).

> Also there are methods which looks like extension of Jeff Davis
> proposal to the multidimensional case. It is Plane Sweep and
> External Plane Sweep methods. However, it might use sophisticated data
> structures for the "sweep". And I believe it should use it for good
> performance.

That method looks closer to what I had in mind.

My Range Join proposal is essentially the same as the sweep method
except that it only joins on one dimension, and the rest of the
dimensions have to be checked with RECHECK. So, this still works for an
N-dimensional join, as long as a single dimension is fairly selective.

The sweep method seems to do the first dimension with the sweep method,
and maintains a changing index that it uses to do the lookup. In other
words, it's essentially just a way to do the RECHECK on the other
dimensions in less than O(n) time. So, if the sweep dimension is not
selective at all, this degenerates to the temporary index method (or
something close, anyway).

The fact that, in the sweep method, there is still one "special"
dimension, makes me think that it will be easier to make the API work
based on ranges (which is a big win because postgres already knows about
ranges). If so, that makes it easier to divide the project into stages,
because we could get it working for ranges and then extend it to support
other dimensions more efficiently later.

Regards,Jeff Davis




pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: index-only scans vs. Hot Standby, round two
Next
From: Robert Haas
Date:
Subject: Re: index-only scans vs. Hot Standby, round two