Re: 9.3 Pre-proposal: Range Merge Join - Mailing list pgsql-hackers

From Darren Duncan
Subject Re: 9.3 Pre-proposal: Range Merge Join
Date
Msg-id 4F8BB9B0.5090708@darrenduncan.net
Whole thread Raw
In response to 9.3 Pre-proposal: Range Merge Join  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: 9.3 Pre-proposal: Range Merge Join  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
Your proposal makes me think of something similar which might be useful, 
INclusion constraints.  As "exclusion constraints" might be thought of like a 
generalization of unique/key constraints, "inclusion constraints" are like a 
generalization of foreign key constraints.  The "inclusion constraints" 
basically allow some comparison operator other than is-equal to test if values 
in one table match values in another table, and the constraint allows the former 
if the test results in true.  An example of said inclusion test is whether the 
range in one table is contained in a range in another table.  I assume/hope 
that, similarly, now that we have range types in 9.2, that the existing 
"exclusion constraints" can be used with range comparison operators.

As to your actual proposal, it sounds like a generalization of the relational 
join or set intersection operator where instead of comparing sets defined in 
terms of an enumeration of discrete values we are comparing sets defined by a 
range, which conceptually have infinite values depending on the data type the 
range is defined over.  But if we're doing this, then it would seem to make 
sense to go further and see if we have set analogies for all of our relational 
or set operators, should we want to do work with non-discrete sets.

Now this sounds interesting in theory, but I would also assume that these could 
be implemented by an extension in terms of existing normal relational operators, 
where each range value is a discrete value, combined with operators for unioning 
or differencing etc ranges.  A relation of ranges effectively can represent a 
discontinuous range; in that case, the empty discontinuous range is also 
canonically representable by a relation with zero tuples.

Jeff, I get the impression your proposal is partly about helping performance by 
supporting this internally, rather than one just defining it as a SQL function, 
am I right?

-- Darren Duncan

Jeff Davis wrote:
> I hope this is not an inappropriate time for 9.3 discussions. The flip
> side of asking for submissions in the first couple commitfests means
> that I need to submit proposals now.
> 
> What is a Range Join?
> 
> See attached SQL for example. The key difference is that the join
> condition is not equality, but overlaps (&&).
> 
> Problem statement: slow. Nested loops are the only option, although they
> can benefit from an inner GiST index if available. But if the join is
> happening up in the plan tree somewhere, then it's impossible for any
> index to be available.
> 
> Proposed solution: a modified merge join that can handle ranges.
> 
>  1. Order the ranges on both sides by the lower bound, then upper bound.
> Empty ranges can be excluded entirely.
>  2. Left := first range on left, Right := first range on right
>  3. If Left or Right is empty, terminate.
>  4. If lower(Left) > upper(Right), discard Right, goto 2
>  5. If lower(Right) > upper(Left), discard Left, goto 2
>  6. return (Left, Right) as joined tuple
>  7. Right := next range on right
>  8. goto 3
> 
> If we get step 4 or step 5 keeps getting triggered, and a btree index is
> available (ordered by lower bound), we can re-probe to go to the correct
> position, and consider that the new top range on that side. This is an
> optimization for the case where there are large numbers of ranges with
> no match on the other side.
> 
> Thanks to Nathan Boley for helping me devise this algorithm. However,
> any bugs are mine alone ;)
> 
> Weaknesses: I haven't thought through the optimization, but I suspect it
> will be hard to be very accurate in the costing. That might be OK,
> because there aren't very many options anyway, but I'll need to think it
> through.
> 
> Questions:
> 
>  * Is this idea sane? -- that is, are ranges important enough that
> people are willing to maintain a new operator?
>  * The more general problem might be "spatial joins" which can operate
> in N dimensions, and I doubt this would work very well in that case.
> Does someone know of a spatial join algorithm (without IP claims) that
> would be as good as this one for ranges?
>  * Other thoughts?
> 
> Regards,
>     Jeff Davis


pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: 9.3 Pre-proposal: Range Merge Join
Next
From: Heikki Linnakangas
Date:
Subject: Re: 9.3 Pre-proposal: Range Merge Join