Thread: 9.3 Pre-proposal: Range Merge Join
I hope this is not an inappropriate time for 9.3 discussions. The flip side of asking for submissions in the first couple commitfests means that I need to submit proposals now.

What is a Range Join?

See attached SQL for example. The key difference is that the join condition is not equality, but overlaps (&&).

Problem statement: slow. Nested loops are the only option, although they can benefit from an inner GiST index if available. But if the join is happening up in the plan tree somewhere, then it's impossible for any index to be available.

Proposed solution: a modified merge join that can handle ranges.

1. Order the ranges on both sides by the lower bound, then upper bound.
   Empty ranges can be excluded entirely.
2. Left := first range on left, Right := first range on right
3. If Left or Right is empty, terminate.
4. If lower(Left) > upper(Right), discard Right, goto 2
5. If lower(Right) > upper(Left), discard Left, goto 2
6. return (Left, Right) as joined tuple
7. Right := next range on right
8. goto 3

If step 4 or step 5 keeps getting triggered, and a btree index is available (ordered by lower bound), we can re-probe to go to the correct position, and consider that the new top range on that side. This is an optimization for the case where there are large numbers of ranges with no match on the other side.

Thanks to Nathan Boley for helping me devise this algorithm. However, any bugs are mine alone ;)

Weaknesses: I haven't thought through the optimizer support, but I suspect it will be hard to be very accurate in the costing. That might be OK, because there aren't very many options anyway, but I'll need to think it through.

Questions:

* Is this idea sane? -- that is, are ranges important enough that people are willing to maintain a new operator?
* The more general problem might be "spatial joins", which can operate in N dimensions, and I doubt this would work very well in that case. Does someone know of a spatial join algorithm (without IP claims) that would be as good as this one for ranges?
* Other thoughts?

Regards,
	Jeff Davis
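As a concrete illustration of the steps above, here is a minimal Python sketch (hypothetical, not from the proposal or the attached SQL). It assumes ranges are (lower, upper) tuples with inclusive bounds, empty ranges already removed, and both inputs in memory; "goto 2" on the right side is modeled as rewinding to a single mark, and the btree re-probe optimization is omitted.

    def range_merge_join(left, right):
        # Step 1: order both sides by lower bound, then upper bound.
        left, right = sorted(left), sorted(right)
        out = []
        mark = 0  # first right-side range not yet discarded
        for lo_l, hi_l in left:
            ri = mark  # step 2 / restore: rewind Right to the mark
            while ri < len(right):
                lo_r, hi_r = right[ri]
                if lo_l > hi_r:
                    # Step 4: Right ends before Left starts, and later left
                    # ranges start no earlier, so it can never match again.
                    # It can only be discarded from the front (the mark).
                    if ri == mark:
                        mark += 1
                    ri += 1
                elif lo_r > hi_l:
                    # Step 5: Right (and everything after it) starts after
                    # Left ends; move on to the next left range.
                    break
                else:
                    out.append(((lo_l, hi_l), (lo_r, hi_r)))  # step 6
                    ri += 1                                   # step 7
        return out

    # range_merge_join([(1, 3), (5, 7)], [(2, 4), (6, 9), (8, 9)])
    # -> [((1, 3), (2, 4)), ((5, 7), (6, 9))]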
Your proposal makes me think of something similar which might be useful: INclusion constraints. As "exclusion constraints" might be thought of as a generalization of unique/key constraints, "inclusion constraints" are like a generalization of foreign key constraints. Inclusion constraints basically allow some comparison operator other than is-equal to test whether values in one table match values in another table, and the constraint allows the former if the test results in true. An example of such an inclusion test is whether the range in one table is contained in a range in another table. I assume/hope that, similarly, now that we have range types in 9.2, the existing exclusion constraints can be used with range comparison operators.

As to your actual proposal, it sounds like a generalization of the relational join or set intersection operator where, instead of comparing sets defined in terms of an enumeration of discrete values, we are comparing sets defined by a range, which conceptually have infinite values depending on the data type the range is defined over. But if we're doing this, then it would seem to make sense to go further and see if we have set analogies for all of our relational or set operators, should we want to work with non-discrete sets.

Now this sounds interesting in theory, but I would also assume that these could be implemented by an extension in terms of existing normal relational operators, where each range value is a discrete value, combined with operators for unioning, differencing, etc. of ranges. A relation of ranges can effectively represent a discontinuous range; in that case, the empty discontinuous range is also canonically representable by a relation with zero tuples.

Jeff, I get the impression your proposal is partly about helping performance by supporting this internally, rather than just defining it as a SQL function; am I right?

-- Darren Duncan
On 16.04.2012 08:40, Jeff Davis wrote:
> Does someone know of a spatial join algorithm (without IP claims) that
> would be as good as this one for ranges?

I'd be happy with an algorithm that's specific to ranges, too, but my gut feeling is that there has to be a lot of research on spatial join algorithms out there. Implementing one of them would be more widely applicable, so I think we should look into spatial join algorithms first and see how far that gets us, before we write something specific to ranges.

-- Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Jeff Davis <pgsql@j-davis.com> writes:
> Proposed solution: a modified merge join that can handle ranges.
>
> 1. Order the ranges on both sides by the lower bound, then upper bound.
>    Empty ranges can be excluded entirely.
> 2. Left := first range on left, Right := first range on right
> 3. If Left or Right is empty, terminate.
> 4. If lower(Left) > upper(Right), discard Right, goto 2
> 5. If lower(Right) > upper(Left), discard Left, goto 2
> 6. return (Left, Right) as joined tuple
> 7. Right := next range on right
> 8. goto 3

This is surely not correct in detail. As written, it will be impossible for any tuple on the right side to be joined to more than one left-side tuple. You will need something analogous to the mark-and-restore rewinding logic in standard mergejoin to make this work.

I can't escape the feeling that we could do this with more-or-less standard mergejoin logic paying attention to some fuzzy notion of equality, with the question of whether a given pair of ranges actually overlap being deferred to a filter condition. Not quite sure how to make that work, though.

> * Is this idea sane? -- that is, are ranges important enough that
>   people are willing to maintain a new operator?

Dunno. It might be easier to sell the idea of adding support for range joins in a couple of years, after we've seen how much use ranges get.

I will note that mergejoin is not only one of the more complicated executor modules, but it accounts for a huge percentage of the planner's complexity; which doesn't exactly make me sympathize with the notion of let's-copy-and-paste-all-that. It'd be a lot better if we could build it as a small tweak to mergejoin rather than an independent thing. (Having said that, I have no idea how the planner support could work at all, because its treatment of mergejoins is intimately tied to mergejoinable equality and EquivalenceClasses and PathKeys, which don't seem readily extensible to the range situation.)

			regards, tom lane
On Mon, Apr 16, 2012 at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dunno. It might be easier to sell the idea of adding support for range
> joins in a couple of years, after we've seen how much use ranges get.

Once we've started the journey towards range types, we must complete it reasonably quickly. Having a partially implemented, unoptimised feature is the same as not having the feature at all, so it will remain unused until it really works.

We could say that about many features, but if Jeff is championing this, I'd say go for it.

-- Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 16, 2012 at 10:29 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 16.04.2012 08:40, Jeff Davis wrote:
>> Does someone know of a spatial join algorithm (without IP claims) that
>> would be as good as this one for ranges?
>
> I'd be happy with an algorithm that's specific to ranges, too, but my gut
> feeling is that there has to be a lot of research on spatial join
> algorithms out there. Implementing one of them would be more widely
> applicable, so I think we should look into spatial join algorithms first
> and see how far that gets us, before we write something specific to
> ranges.
There is a good overview article about spatial joins:
http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf

It shows that there are a lot of methods based on building temporary indexes. That might mean that building a temporary GiST index is not a bad idea.

Also, there are methods which look like an extension of Jeff Davis's proposal to the multidimensional case: the Plane Sweep and External Plane Sweep methods. However, they might require sophisticated data structures for the "sweep", and I believe they would need them for good performance.
With best regards,
Alexander Korotkov.
On Mon, 2012-04-16 at 02:52 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > 1. Order the ranges on both sides by the lower bound, then upper bound.
> >    Empty ranges can be excluded entirely.
> > 2. Left := first range on left, Right := first range on right
> > 3. If Left or Right is empty, terminate.
> > 4. If lower(Left) > upper(Right), discard Right, goto 2
> > 5. If lower(Right) > upper(Left), discard Left, goto 2
> > 6. return (Left, Right) as joined tuple
> > 7. Right := next range on right
> > 8. goto 3
>
> This is surely not correct in detail. As written, it will be impossible
> for any tuple on the right side to be joined to more than one left-side
> tuple. You will need something analogous to the mark-and-restore
> rewinding logic in standard mergejoin to make this work.

Every time you discard a tuple on the left, you go to step 2, which rewinds the right side back to the first non-discarded tuple.

So, implemented using mark and restore:

* start off with the marks at the first tuple on each side
* "discard" means move the mark down a tuple
* setting it back to the first range means restoring to the mark
* going to the next range (step 7) just means getting another tuple, without changing the mark

Only one side really needs the mark and restore logic, but it was easier to write the pseudocode in a symmetrical way (except step 7).

> I will note that mergejoin is not only one of the more complicated
> executor modules, but it accounts for a huge percentage of the planner's
> complexity; which doesn't exactly make me sympathize with the notion of
> let's-copy-and-paste-all-that. It'd be a lot better if we could build
> it as a small tweak to mergejoin rather than an independent thing.
>
> (Having said that, I have no idea how the planner support could work
> at all, because its treatment of mergejoins is intimately tied to
> mergejoinable equality and EquivalenceClasses and PathKeys, which don't
> seem readily extensible to the range situation.)

I think this is the more important problem area. It's clear that I'll need to do some more investigation into the planning.

Regards,
	Jeff Davis
On Mon, 2012-04-16 at 16:22 +0400, Alexander Korotkov wrote:
> There is a good overview article about spatial joins:
> http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf

Thank you, that's exactly the kind of overview I was looking for.

> It shows that there are a lot of methods based on building temporary
> indexes. That might mean that building a temporary GiST index is not a
> bad idea.

That had occurred to me, but I was hesitant to only use temp indexes. It still doesn't really offer a good solution when both sides of the join are relatively large (because of random I/O). Also, the build speed of the index would be more important than it is now.

On the other hand, if I tackle the problem using temp indexes, it would be a more general solution useful for many problems (for instance, we really need a better solution when the result of a subquery doesn't fit in a work_mem-sized hashtable). We may end up with some kind of combination (as the sweep seems to be; see below).

> Also, there are methods which look like an extension of Jeff Davis's
> proposal to the multidimensional case: the Plane Sweep and External
> Plane Sweep methods. However, they might require sophisticated data
> structures for the "sweep", and I believe they would need them for
> good performance.

That method looks closer to what I had in mind. My Range Join proposal is essentially the same as the sweep method, except that it only joins on one dimension, and the rest of the dimensions have to be checked with RECHECK. So this still works for an N-dimensional join, as long as a single dimension is fairly selective.

The sweep method seems to do the first dimension with the sweep, and maintains a changing index that it uses to do the lookup. In other words, it's essentially just a way to do the RECHECK on the other dimensions in less than O(n) time. So, if the sweep dimension is not selective at all, this degenerates to the temporary index method (or something close, anyway).

The fact that, in the sweep method, there is still one "special" dimension makes me think that it will be easier to make the API work based on ranges (which is a big win because postgres already knows about ranges). If so, that makes it easier to divide the project into stages, because we could get it working for ranges and then extend it to support other dimensions more efficiently later.

Regards,
	Jeff Davis
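To make the sweep-plus-RECHECK idea concrete, here is a hypothetical Python sketch along the same lines as the earlier range join sketch. 2-D boxes, assumed to be ((xlo, xhi), (ylo, yhi)) tuples with inclusive bounds, are merge-joined on the x range (the "special" dimension), and y overlap is checked as a RECHECK-style filter:

    def box_join(left, right):
        # Sort by the x range, the sweep dimension.
        left, right = sorted(left), sorted(right)
        out = []
        mark = 0
        for lbox in left:
            (xlo_l, xhi_l), (ylo_l, yhi_l) = lbox
            ri = mark
            while ri < len(right):
                rbox = right[ri]
                (xlo_r, xhi_r), (ylo_r, yhi_r) = rbox
                if xlo_l > xhi_r:      # rbox entirely before lbox in x
                    if ri == mark:
                        mark += 1
                    ri += 1
                elif xlo_r > xhi_l:    # rbox (and the rest) after lbox in x
                    break
                else:
                    # RECHECK: x ranges overlap; emit only if y ranges do too.
                    if ylo_l <= yhi_r and ylo_r <= yhi_l:
                        out.append((lbox, rbox))
                    ri += 1
        return out

If the x dimension is not selective, the filter rejects most candidate pairs and the work degenerates toward examining every x-overlapping pair, matching the observation above about non-selective sweep dimensions.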
On Sun, 2012-04-15 at 23:18 -0700, Darren Duncan wrote:
> Your proposal makes me think of something similar which might be useful:
> INclusion constraints. As "exclusion constraints" might be thought of as
> a generalization of unique/key constraints, "inclusion constraints" are
> like a generalization of foreign key constraints. Inclusion constraints
> basically allow some comparison operator other than is-equal to test
> whether values in one table match values in another table, and the
> constraint allows the former if the test results in true. An example of
> such an inclusion test is whether the range in one table is contained in
> a range in another table. I assume/hope that, similarly, now that we have
> range types in 9.2, the existing exclusion constraints can be used with
> range comparison operators.

Yes, Range Foreign Keys are one of the loose ends for Range Types that I'd like to wrap up.

> As to your actual proposal, it sounds like a generalization of the
> relational join or set intersection operator where, instead of comparing
> sets defined in terms of an enumeration of discrete values, we are
> comparing sets defined by a range, which conceptually have infinite
> values depending on the data type the range is defined over. But if
> we're doing this, then it would seem to make sense to go further and see
> if we have set analogies for all of our relational or set operators,
> should we want to work with non-discrete sets.
>
> Now this sounds interesting in theory, but I would also assume that these
> could be implemented by an extension in terms of existing normal
> relational operators, where each range value is a discrete value,
> combined with operators for unioning, differencing, etc. of ranges. A
> relation of ranges can effectively represent a discontinuous range; in
> that case, the empty discontinuous range is also canonically
> representable by a relation with zero tuples.
>
> Jeff, I get the impression your proposal is partly about helping
> performance by supporting this internally, rather than just defining it
> as a SQL function; am I right?

Per my proposal, the problem statement is that it's "slow", so speeding it up is really the entire proposal ;)

Extending the syntax might be interesting as well, but I don't have a proposal for that.

Regards,
	Jeff Davis
On Tue, Apr 17, 2012 at 12:12 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2012-04-16 at 16:22 +0400, Alexander Korotkov wrote:
> > There is a good overview article about spatial joins:
> > http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf
>
> Thank you, that's exactly the kind of overview I was looking for.
>
> > It shows that there are a lot of methods based on building temporary
> > indexes. That might mean that building a temporary GiST index is not
> > a bad idea.
>
> That had occurred to me, but I was hesitant to only use temp indexes. It
> still doesn't really offer a good solution when both sides of the join
> are relatively large (because of random I/O). Also, the build speed of
> the index would be more important than it is now.
Note that the amount of random I/O during GiST index build significantly decreased after my GSoC project on buffering GiST index build. However, GiST index build is still quite CPU-expensive. This problem could be avoided by supporting other methods of index build (other than producing a lot of penalty and picksplit calls). Hilbert curve and similar methods could help here. Implementation of such methods would require an extension of the GiST interface.
--
With best regards,
Alexander Korotkov.
On Mon, Apr 16, 2012 at 9:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> That had occurred to me, but I was hesitant to only use temp indexes. It
> still doesn't really offer a good solution when both sides of the join
> are relatively large (because of random I/O). Also, the build speed of
> the index would be more important than it is now.

The thing I like most about temp indexes is that they needn't be temporary.

I'd like to see something along the lines of demand-created optional indexes, where we reclaim the space/maintenance overhead according to some cache management scheme. The more space you have, the more of the important ones hang around. Roughly the same idea applies to materialised views.

-- Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
> I'd like to see something along the lines of demand-created optional
> indexes, where we reclaim the space/maintenance overhead according to
> some cache management scheme. The more space you have, the more of the
> important ones hang around. Roughly the same idea applies to
> materialised views.

+10; this sort of demand-driven optimization could be the database equivalent of Java's HotSpot (which accomplished the amazing task of making Java kinda fastish sometimes).

Jay
On Apr 16, 2012, at 1:40 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> See attached SQL for example.
>
> Problem statement: slow. Nested loops are the only option, although they
> can benefit from an inner GiST index if available. But if the join is
> happening up in the plan tree somewhere, then it's impossible for any
> index to be available.

Hmm. This sounds like something that Tom's recent work on parameterized plans ought to have fixed, or if not, it seems closely related. And by "this" I mean specifically the ability to use a GiST index to drive a nested loop that is higher up in the plan tree than the immediate parent of the index scan.

This is not an argument against your proposal, just an observation.

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> On Apr 16, 2012, at 1:40 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> See attached SQL for example.
>>
>> Problem statement: slow. Nested loops are the only option, although they
>> can benefit from an inner GiST index if available. But if the join is
>> happening up in the plan tree somewhere, then it's impossible for any
>> index to be available.

> Hmm. This sounds like something that Tom's recent work on
> parameterized plans ought to have fixed, or if not, it seems closely
> related.

Not really. It's still going to be a nestloop, and as such not terribly well suited for queries where there are a lot of matchable rows on both sides. The work I've been doing is really about making nestloops usable in cases where join order restrictions formerly prevented it --- but Jeff's complaint has nothing to do with that. (This thought also makes me a bit dubious about the nearby suggestions that more indexes will fix it.)

			regards, tom lane
On Mon, 2012-04-16 at 22:20 +0100, Simon Riggs wrote:
> On Mon, Apr 16, 2012 at 9:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > That had occurred to me, but I was hesitant to only use temp indexes. It
> > still doesn't really offer a good solution when both sides of the join
> > are relatively large (because of random I/O). Also, the build speed of
> > the index would be more important than it is now.
>
> The thing I like most about temp indexes is that they needn't be temporary.
>
> I'd like to see something along the lines of demand-created optional
> indexes, where we reclaim the space/maintenance overhead according to
> some cache management scheme. The more space you have, the more of the
> important ones hang around. Roughly the same idea applies to
> materialised views.

I think to make an informed decision, the next thing I need to do is hack up a prototype version of my merge join idea and see how well it performs against an index nestloop today. It seems to me that both approaches may have merit independently.

Regards,
	Jeff Davis
On Mon, Apr 16, 2012 at 10:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The thing I like most about temp indexes is that they needn't be temporary.
>
> I'd like to see something along the lines of demand-created optional
> indexes, where we reclaim the space/maintenance overhead according to
> some cache management scheme. The more space you have, the more of the
> important ones hang around. Roughly the same idea applies to
> materialised views.

I find this a) really scary, b) a huge increase in scope, and c) kind of pointless.

a) DDL and shared caches require all kinds of additional synchronization that would be really annoying to incur in the middle of DML queries unexpectedly. If these indexes are to be useful for other transactions, they would also impose a hidden and uncontrollable cost on all updates and inserts on the table.

b) I think it would be a lot harder to get working cleanly. You would need to invent a UI to control the lifetime and resources used by these objects, and deal with duplication between manually created and dynamically created objects. It's fundamentally a huge shift in the design of the database, changing what things the user maintains as part of their schema and what things the database maintains transparently. This is a big project on its own, aside from the technical side of things.

c) If these indexes are useful for many queries, then the user can choose to create them themselves. The indexes you're proposing would be just as expensive as any of those indexes and offer few advantages aside from their automaticness. The same could be accomplished with some external daemon that just looked at the query logs and determined which indexes to create or not.

The main advantage of creating dynamic indexes as part of the query would be lost: namely, that these would be local data structures that don't need to be synchronized with other transactions and don't need to be updated by other transactions. They're just part of the query execution, the way spilled hash tables and tuplesort tapes are. You could build indexes on materialized data resulting from earlier joins or aggregates, and so on. The point is that if you make them equivalent to normal indexes, just dynamically maintained, then all you've done is change the way normal indexes work, but you haven't really changed the set of queries they're useful for. That might be a neat UI feature, but if you want to change the set of queries postgres can handle efficiently at all, then you need something that's fundamentally different from a table index.

As an aside, I think some of what you're looking for could be better handled with some kind of query result cache that could keep around the materialized data from plan nodes until an event happens that changes the data. This might be related to the underlying infrastructure needed for materialized views, but I haven't thought too deeply about it. It seems like a lot of work and a big change from the current very isolated per-process execution model.

-- greg
On Mon, Apr 16, 2012 at 1:43 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2012-04-16 at 02:52 -0400, Tom Lane wrote:
>> This is surely not correct in detail. As written, it will be impossible
>> for any tuple on the right side to be joined to more than one left-side
>> tuple. You will need something analogous to the mark-and-restore
>> rewinding logic in standard mergejoin to make this work.
>
> Every time you discard a tuple on the left, you go to step 2, which
> rewinds the right side back to the first non-discarded tuple.
>
> So, implemented using mark and restore:
>
> * start off with the marks at the first tuple on each side
> * "discard" means move the mark down a tuple
> * setting it back to the first range means restoring to the mark
> * going to the next range (step 7) just means getting another tuple,
>   without changing the mark
>
> Only one side really needs the mark and restore logic, but it was easier
> to write the pseudocode in a symmetrical way (except step 7).

I'm actually not sure these are equivalent formulations. Suppose one side has [i,i] where i ranges from 1 to 10000000, and the other side the exact same thing plus [1,10000000]. That one really big range will come up second on the right side, and you will not be able to discard it until you reach the end of the join. If you just keep rewinding the right side, you're going to end up with O(n^2) behavior, whereas if you can discard tuples "from the middle" on the right side, then you will get O(n) behavior, which is a big difference.

In other words, in your original algorithm, the tuples that you discard in step 4 or 5 need not be the first remaining tuple on whichever side of the join we're talking about.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
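Robert's point can be reproduced with a small, hypothetical comparison counter built on the mark-and-restore sketch from earlier in the thread:

    def count_comparisons(left, right):
        left, right = sorted(left), sorted(right)
        mark = comparisons = 0
        for lo_l, hi_l in left:
            ri = mark
            while ri < len(right):
                comparisons += 1
                lo_r, hi_r = right[ri]
                if lo_l > hi_r:
                    if ri == mark:   # can only discard at the mark
                        mark += 1
                    ri += 1
                elif lo_r > hi_l:
                    break
                else:
                    ri += 1
        return comparisons

    n = 1000
    points = [(i, i) for i in range(1, n + 1)]
    print(count_comparisons(points, points))             # ~3n (linear)
    print(count_comparisons(points, points + [(1, n)]))  # ~n*n/2 (quadratic)

The big range sorts near the front of the right side and overlaps every left tuple, so it never satisfies the discard test; the mark stays pinned behind it, and every left tuple rescans from there.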
On Mon, Apr 16, 2012 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm. This sounds like something that Tom's recent work on
>> parameterized plans ought to have fixed, or if not, it seems closely
>> related.
>
> Not really. It's still going to be a nestloop, and as such not terribly
> well suited for queries where there are a lot of matchable rows on both
> sides. The work I've been doing is really about making nestloops usable
> in cases where join order restrictions formerly prevented it --- but
> Jeff's complaint has nothing to do with that. (This thought also makes
> me a bit dubious about the nearby suggestions that more indexes will
> fix it.)

I thought Jeff was parenthetically complaining about cases like A LEFT JOIN (B INNER JOIN C ON b.y = c.y) ON a.x && b.x. That presumably would require the parameterized-path stuff to have any chance of doing partial index scans over B. However, I understand that's not the main issue here.

One thing that I think needs some analysis is when the range join idea is better or worse than a nested loop with inner index-scan, because potentially those are the options the planner has to choose between, and the costing model had better know enough to make the right thing happen. It strikes me that the nested loop with inner index-scan is likely to be a win when there are large chunks of the indexed relation that the nestloop never needs to visit at all - imagine small JOIN big ON small.a && big.a, for example. I suppose the really interesting question is how much we can save when the entirety of both relations has to be visited anyway - it seems promising, but I guess we won't know for sure without testing it.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 16, 2012 at 1:43 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> ... Only one side really needs the mark and restore logic, but it was
>> easier to write the pseudocode in a symmetrical way (except step 7).

> I'm actually not sure these are equivalent formulations. Suppose one
> side has [i,i] where i ranges from 1 to 10000000, and the other side
> the exact same thing plus [1,10000000]. That one really big range
> will come up second on the right side, and you will not be able to
> discard it until you reach the end of the join. If you just keep
> rewinding the right side, you're going to end up with O(n^2) behavior,
> whereas if you can discard tuples "from the middle" on the right side,
> then you will get O(n) behavior, which is a big difference. In other
> words, in your original algorithm, the tuples that you discard in step
> 4 or 5 need not be the first remaining tuple on whichever side of the
> join we're talking about.

It would be a pretty weird implementation of mergejoin that could "discard tuples from the middle" of an input stream. Or to be more specific, it wouldn't be the mergejoin itself that could do that at all --- you'd need the input plan node to be some sort of tweaked version of tuplestore or tuplesort that could respond to a request like that.

I can't escape the feeling that Jeff has chosen the wrong basis for his thought experiment, and that what he really ought to be thinking about is hashjoin, which keeps an in-memory table that it could easily modify on the fly if it chose to. The multi-batch aspect of hybrid hashjoin could be useful too (IOW, when you're out of better ideas, throw the tuple back in the water to process later).

This is just handwaving of course. I think some digging in the spatial-join literature would likely find ideas better than any of these.

			regards, tom lane
On Tue, 2012-04-17 at 14:24 -0400, Robert Haas wrote:
> I thought Jeff was parenthetically complaining about cases like A LEFT
> JOIN (B INNER JOIN C ON b.y = c.y) ON a.x && b.x. That presumably
> would require the parameterized-path stuff to have any chance of doing
> partial index scans over B. However, I understand that's not the main
> issue here.

To take the mystery out of it, I was talking about any case where an index scan is impossible or impractical. For instance, let's say the ranges are computed values. Just to make it really impossible, let's say the ranges are computed from columns in two different tables joined in a subquery.

But yes, the ability of the planner to find the plan is also an issue (hopefully less of one with the recent improvements).

> One thing that I think needs some analysis is when the range join idea
> is better or worse than a nested loop with inner index-scan, because
> potentially those are the options the planner has to choose between,
> and the costing model had better know enough to make the right thing
> happen. It strikes me that the nested loop with inner index-scan is
> likely to be a win when there are large chunks of the indexed relation
> that the nestloop never needs to visit at all - imagine small JOIN big
> ON small.a && big.a, for example. I suppose the really interesting
> question is how much we can save when the entirety of both relations
> has to be visited anyway - it seems promising, but I guess we won't
> know for sure without testing it.

Right, I will need to come up with a prototype that can at least test the executor piece. I suspect that the plan choice won't be all that different from an ordinary index nestloop versus mergejoin case, but with much worse cardinality estimates to work with.

Regards,
	Jeff Davis
On Tue, 2012-04-17 at 14:03 -0400, Robert Haas wrote:
> I'm actually not sure these are equivalent formulations. Suppose one
> side has [i,i] where i ranges from 1 to 10000000, and the other side
> the exact same thing plus [1,10000000]. That one really big range
> will come up second on the right side, and you will not be able to
> discard it until you reach the end of the join. If you just keep
> rewinding the right side, you're going to end up with O(n^2) behavior,
> whereas if you can discard tuples "from the middle" on the right side,
> then you will get O(n) behavior, which is a big difference. In other
> words, in your original algorithm, the tuples that you discard in step
> 4 or 5 need not be the first remaining tuple on whichever side of the
> join we're talking about.

To illustrate the problem (slightly modified), I'll write the sets out verbatim rather than use range syntax:

  {1,2,3}     {1,2,3,4,5,6,7,8,9}
  { 2,3,4}    { 2,3,4}. . . . .
  { 3,4,5}    { 3,4,5}. . . .
  { 4,5,6}    { 4,5,6}. . .
  { 5,6,7}    { 5,6,7}. .
  { 6,7,8}    { 6,7,8}.
  { 7,8,9}    { 7,8,9}

The "." are supposed to represent a "shadow" that the large range [1,9] casts below it. This shadow prevents the discarding of [2,4] on the RHS even when processing range [5,7] on the LHS, because we can't discard out of the middle.

Note that, if you just have some large ranges and a lot of overlap, that's not really a problem with the algorithm; it's just a large result to compute. The problem comes when the ranges vary in size by quite a lot, and there are many ranges that could be eliminated but can't be because of the shadow.

This problem can be mitigated substantially, I believe. Let me change the algorithm (I don't like the way the pseudocode was structured anyway), and word it a little more loosely:

1. Look at the top ranges on each side. Choose the one with the greater upper bound, and call that Range1 from Side1, and the other range Range2 from Side2. If either Range1 or Range2 is empty, terminate.
2. Scan down Side2, discarding ranges that are strictly before, joining with ranges that overlap, and stopping when you hit a range that is strictly after.
3. Now, discard Range1, reset Side2 to the first non-discarded range, and goto 1.

The benefit is that, in step 1, we choose a side that will *always* discard the top tuple. And if we choose the one with the greater upper bound, then we are going to eliminate the largest shadow. That doesn't eliminate the problem entirely, but it seems like it would reduce it a lot.

Regarding the idea of discarding tuples in the middle, that might be an interesting approach as well. It might be as simple as setting a flag in the tuple header (like was done for full hash joins). Still not perfect, but it would make redundant checks very cheap. Combined with my strategy, there's a good chance that we practically eliminate the problem.

Regards,
	Jeff Davis
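Here is a hypothetical Python sketch of the reworded algorithm, under the same assumptions as the earlier sketches. "Discard" is still implemented as advancing a per-side mark, so shadowed ranges in the middle are skipped rather than removed, and the flag-in-the-tuple-header mitigation is not modeled:

    def range_merge_join_v2(a, b):
        a, b = sorted(a), sorted(b)
        out = []
        ai = bi = 0  # marks: first non-discarded range on each side
        while ai < len(a) and bi < len(b):
            # Step 1: Side1 is the side whose top range has the greater
            # upper bound -- the one casting the larger shadow.
            swap = a[ai][1] < b[bi][1]
            s1, i1, s2, i2 = (b, bi, a, ai) if swap else (a, ai, b, bi)
            lo1, hi1 = s1[i1]
            # Step 2: scan Side2, discarding ranges strictly before Range1,
            # joining overlaps, stopping at the first range strictly after.
            j = i2
            while j < len(s2):
                lo2, hi2 = s2[j]
                if hi2 < lo1:
                    if j == i2:          # discardable from the front only
                        i2 += 1
                    j += 1
                elif lo2 > hi1:
                    break
                else:
                    out.append((s2[j], s1[i1]) if swap else (s1[i1], s2[j]))
                    j += 1
            # Step 3: Range1 is *always* discarded, so every pass makes
            # progress on the side with the larger shadow.
            i1 += 1
            ai, bi = (i2, i1) if swap else (i1, i2)
        return out

On Robert's example, once the big range reaches the top of its side it is chosen as Range1, joins everything it overlaps in a single pass, and is then discarded; it never pins the mark, so the behavior is roughly linear again.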
On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote:
> It would be a pretty weird implementation of mergejoin that could
> "discard tuples from the middle" of an input stream. Or to be more
> specific, it wouldn't be the mergejoin itself that could do that at all
> --- you'd need the input plan node to be some sort of tweaked version of
> tuplestore or tuplesort that could respond to a request like that.

As I said in my reply to Robert, I think there are some ways we can make this idea work.

> I can't escape the feeling that Jeff has chosen the wrong basis for his
> thought experiment, and that what he really ought to be thinking about
> is hashjoin, which keeps an in-memory table that it could easily modify
> on the fly if it chose to. The multi-batch aspect of hybrid hashjoin
> could be useful too (IOW, when you're out of better ideas, throw the
> tuple back in the water to process later).

Obviously hashing is not going to be much use for anything but equality, so I believe this approach is very similar to the temporary-index method, except with batching, and always keeping the index in memory.

I don't think we would get the partitioning benefit of hashjoin, because we'd have to put the same tuple in multiple partitions, so it's probably better to just leave the outer side intact. But in-memory indexes and multiple passes over the outer side seem like a reasonable alternative, particularly because an in-memory index might be very fast (to build as well as query).

> This is just handwaving of course. I think some digging in the
> spatial-join literature would likely find ideas better than any of
> these.

I will look in some more detail. The merge-like approach did seem to be represented in the paper referenced by Alexander (the external plane sweep), but it also refers to several methods based on partitioning.

I'm beginning to think that more than one of these ideas has merit.

Regards,
	Jeff Davis
Hi Jeff,

2012/4/19 Jeff Davis <pgsql@j-davis.com>:
> On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote:
(...)
>> This is just handwaving of course. I think some digging in the
>> spatial-join literature would likely find ideas better than any of
>> these.
>
> I will look in some more detail. The merge-like approach did seem to be
> represented in the paper referenced by Alexander (the external plane
> sweep), but it also refers to several methods based on partitioning.
>
> I'm beginning to think that more than one of these ideas has merit.

I'm perhaps really late in this discussion, but I was just made aware of it via the tweet from Josh Berkus about "PostgreSQL 9.3: Current Feature Status".

What is the reason to dig into spatial joins when there is PostGIS, a bullet-proof and fast implementation?

Yours, Stefan
On Thu, 2013-01-17 at 21:03 +0100, Stefan Keller wrote:
> I'm perhaps really late in this discussion, but I was just made aware
> of it via the tweet from Josh Berkus about "PostgreSQL 9.3: Current
> Feature Status".
>
> What is the reason to dig into spatial joins when there is PostGIS,
> a bullet-proof and fast implementation?

Hi Stefan,

You are certainly not too late.

PostGIS uses the existing postgres infrastructure to do spatial joins. That means it either does a cartesian product and filters the results, or it uses a nested loop with an inner index scan.

That isn't too bad, but it could be better. I am trying to introduce a new way to do spatial joins which will perform better in more circumstances. For instance, we can't use an inner index if the input tables are actually subqueries, because we can't index a subquery.

Regards,
	Jeff Davis
Hi Jeff,

2013/1/18 Jeff Davis <pgsql@j-davis.com>:
> PostGIS uses the existing postgres infrastructure to do spatial joins.
> That means it either does a cartesian product and filters the results,
> or it uses a nested loop with an inner index scan.
>
> That isn't too bad, but it could be better. I am trying to introduce a
> new way to do spatial joins which will perform better in more
> circumstances. For instance, we can't use an inner index if the input
> tables are actually subqueries, because we can't index a subquery.

Sounds good. Have you already had contact with, e.g., Paul (cc'ed just in case)? And will this clever index also be available within all these hundreds of PostGIS functions?

Regards, Stefan
On Fri, 2013-01-18 at 12:25 +0100, Stefan Keller wrote:
> Sounds good. Have you already had contact with, e.g., Paul (cc'ed just
> in case)? And will this clever index also be available within all these
> hundreds of PostGIS functions?

Yes, I've brought the idea up to Paul before, but thank you.

It's not an index exactly; it's a new join algorithm that should be more efficient for spatial joins. That being said, it's not done, so it could be an index by the time I'm finished ;)

When it is done, it will probably need some minor work in PostGIS to make use of it. But that work is done at the datatype level, and PostGIS only has a couple data types, so I don't think it will be a lot of work.

Regards,
	Jeff Davis