Thread: Mark/Restore and avoiding RandomAccess sorts
Merge Joins require us to potentially Mark and Restore positions in the
tuples arriving from executor sub-nodes.

This currently means that if the tuples arrive from a Sort node, as they
often do in an MJ, the sort node will be instructed to prepare a random
access version of the sort result. That requires a full final merge of
the output, so as to allow rewinding the input when a Restore operation
is called.

An MJ doesn't actually need random access; it just needs to be able to
rewind. The question is: how far does it need to rewind? In many cases,
the Restore operation moves back a small number of tuples, with a unique
inner scan requiring a rewind of just one tuple.

It would certainly be cheaper, in most cases, for the Sort node to
maintain a variable-size rewind buffer, where the history of prior
tuples is truncated each time we perform a Mark operation. This could be
implemented as a modified Tuplestore that could then be trimmed down
each time a Mark operation took place. If the tuplestore has not yet
spilled to disk, this could be a trivial operation.

Doing that would almost completely remove the overhead of the final
merge step in the sort. The final merge often doubles elapsed time in
cases where the sort is larger than work_mem, which it often is.

Implementing the variable mark/restore buffer as a dumb Tuplestore would
mean that the space usage of the Sort could in the worst case go as high
as 2x total space. The worst case is where the inner scan is all a
single value. The best case is where the inner scan is sufficiently
unique over all its values that it never writes back to disk at all.

So a further refinement of this idea would be to simply defer the final
merge operation for the sort until the history required for the Mark
operation exceeded, say, 10% of the sort size. That would then be
sufficient to improve performance for most common cases, without risking
massive space overflow for large and highly non-unique data. There's no
problem with running the final merge slightly later than before;
everything's still there to allow it. Reusing space in the tuplestore is
also straightforward, since that's exactly what the final merge already
does, so some rework of that code should be sufficient.

A separate, but related, idea is to avoid mark/restores completely when
the outer scan is provably unique.

I'm not intending to implement this idea just yet, but it seemed worth
recording since it occurred to me - and discussing it as a TODO item.

Comments?

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
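To make the proposal concrete, here is a minimal sketch of such a
trimmed-on-Mark rewind buffer, covering only the in-memory case.
Spilling to disk is omitted, error checking is elided, and all names are
hypothetical - this is not the actual tuplestore.c API:

    #include <stdlib.h>
    #include <string.h>

    typedef struct RewindBuf
    {
        void  **tuples;     /* buffered tuples, oldest first */
        int     ntuples;    /* number of buffered tuples */
        int     capacity;   /* allocated slots */
        int     readpos;    /* next tuple to hand to the caller */
    } RewindBuf;

    /* Buffer a tuple as it streams out of the sort. */
    static void
    rb_put(RewindBuf *rb, void *tup)
    {
        if (rb->ntuples == rb->capacity)
        {
            rb->capacity = rb->capacity ? rb->capacity * 2 : 64;
            rb->tuples = realloc(rb->tuples,
                                 rb->capacity * sizeof(void *));
        }
        rb->tuples[rb->ntuples++] = tup;
    }

    /* Return the next buffered tuple, or NULL if the caller must
     * pull another tuple from the sort and rb_put() it first. */
    static void *
    rb_get(RewindBuf *rb)
    {
        if (rb->readpos >= rb->ntuples)
            return NULL;
        return rb->tuples[rb->readpos++];
    }

    /*
     * Mark: the merge join promises never to rewind past this point,
     * so all older history can be discarded.  This is the "truncate
     * the history" step; once the buffer has spilled to disk, the
     * freed space could be recycled rather than moved.
     */
    static void
    rb_mark(RewindBuf *rb)
    {
        memmove(rb->tuples, rb->tuples + rb->readpos,
                (rb->ntuples - rb->readpos) * sizeof(void *));
        rb->ntuples -= rb->readpos;
        rb->readpos = 0;
    }

    /* Restore: rewind the read position back to the last Mark. */
    static void
    rb_restore(RewindBuf *rb)
    {
        rb->readpos = 0;
    }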
I saw no replies to this.

---------------------------------------------------------------------------

Simon Riggs wrote:
> Merge Joins require us to potentially Mark and Restore positions in the
> tuples arriving from executor sub-nodes.
>
> This currently means that if the tuples arrive from a Sort node, as they
> often do in an MJ, the sort node will be instructed to prepare a random
> access version of the sort result. That requires a full final merge of
> the output, so as to allow rewinding the input when a Restore operation
> is called.
>
> An MJ doesn't actually need random access; it just needs to be able to
> rewind. The question is: how far does it need to rewind? In many cases,
> the Restore operation moves back a small number of tuples, with a unique
> inner scan requiring a rewind of just one tuple.
>
> It would certainly be cheaper, in most cases, for the Sort node to
> maintain a variable-size rewind buffer, where the history of prior
> tuples is truncated each time we perform a Mark operation. This could be
> implemented as a modified Tuplestore that could then be trimmed down
> each time a Mark operation took place. If the tuplestore has not yet
> spilled to disk, this could be a trivial operation.
>
> Doing that would almost completely remove the overhead of the final
> merge step in the sort. The final merge often doubles elapsed time in
> cases where the sort is larger than work_mem, which it often is.
>
> Implementing the variable mark/restore buffer as a dumb Tuplestore would
> mean that the space usage of the Sort could in the worst case go as high
> as 2x total space. The worst case is where the inner scan is all a
> single value. The best case is where the inner scan is sufficiently
> unique over all its values that it never writes back to disk at all.
>
> So a further refinement of this idea would be to simply defer the final
> merge operation for the sort until the history required for the Mark
> operation exceeded, say, 10% of the sort size. That would then be
> sufficient to improve performance for most common cases, without risking
> massive space overflow for large and highly non-unique data. There's no
> problem with running the final merge slightly later than before;
> everything's still there to allow it. Reusing space in the tuplestore is
> also straightforward, since that's exactly what the final merge already
> does, so some rework of that code should be sufficient.
>
> A separate, but related, idea is to avoid mark/restores completely when
> the outer scan is provably unique.
>
> I'm not intending to implement this idea just yet, but it seemed worth
> recording since it occurred to me - and discussing it as a TODO item.
>
> Comments?

-- 
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Sat, 2007-01-06 at 17:06 -0500, Bruce Momjian wrote:
> I saw no replies to this.

Me neither.

I take it that means it's a very good idea and we should add a TODO.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Bruce Momjian wrote:
> I saw no replies to this.

Sounds like a good idea to me. (further comments below)

> Simon Riggs wrote:
>> Implementing the variable mark/restore buffer as a dumb Tuplestore would
>> mean that the space usage of the Sort could in the worst case go as high
>> as 2x total space. The worst case is where the inner scan is all a
>> single value. The best case is where the inner scan is sufficiently
>> unique over all its values that it never writes back to disk at all.
>>
>> So a further refinement of this idea would be to simply defer the final
>> merge operation for the sort until the history required for the Mark
>> operation exceeded, say, 10% of the sort size. That would then be
>> sufficient to improve performance for most common cases, without risking
>> massive space overflow for large and highly non-unique data. There's no
>> problem with running the final merge slightly later than before;
>> everything's still there to allow it. Reusing space in the tuplestore is
>> also straightforward, since that's exactly what the final merge already
>> does, so some rework of that code should be sufficient.

Should definitely be done by reusing the space in the tuplestore; we
don't want to use double the space we do now in the degenerate case.

>> A separate, but related, idea is to avoid mark/restores completely when
>> the outer scan is provably unique.

We probably wouldn't need to try to avoid the mark/restore completely if
the buffering scheme has low enough overhead when restore is not called.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, Jan 08, 2007 at 10:37:25AM +0000, Heikki Linnakangas wrote:
> > Simon Riggs wrote:
> >> Implementing the variable mark/restore buffer as a dumb Tuplestore would
> >> mean that the space usage of the Sort could in the worst case go as high
> >> as 2x total space. The worst case is where the inner scan is all a
> >> single value. The best case is where the inner scan is sufficiently
> >> unique over all its values that it never writes back to disk at all.
> >>
> >> So a further refinement of this idea would be to simply defer the final
> >> merge operation for the sort until the history required for the Mark
> >> operation exceeded, say, 10% of the sort size. That would then be
> >> sufficient to improve performance for most common cases, without risking
> >> massive space overflow for large and highly non-unique data. There's no
> >> problem with running the final merge slightly later than before;
> >> everything's still there to allow it. Reusing space in the tuplestore is
> >> also straightforward, since that's exactly what the final merge already
> >> does, so some rework of that code should be sufficient.
>
> Should definitely be done by reusing the space in the tuplestore; we
> don't want to use double the space we do now in the degenerate case.

Another idea comes to mind, which would apply to all sorts needing
random access. Rather than completely copying every tuple to build a
result set you can step through randomly, why not just build a list of
tuple locations as the sort returns results? That would allow seeking to
any position in the result set with minimal overhead.

-- 
Jim Nasby   jim@nasby.net
EnterpriseDB   http://enterprisedb.com   512.569.9461 (cell)
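A sketch of what such a position list might look like. All names here
are hypothetical, and the approach only works while the sort's own
tapes stay intact (i.e. while their space is not being recycled).
Error checking is omitted:

    #include <stdlib.h>

    typedef struct TupPos
    {
        int     tapenum;    /* which sort tape holds the tuple */
        long    offset;     /* byte offset of the tuple on that tape */
    } TupPos;

    typedef struct PosIndex
    {
        TupPos *pos;        /* one entry per tuple returned so far */
        long    ntuples;
        long    capacity;
    } PosIndex;

    /* Record the location of the tuple the sort just handed back. */
    static void
    posindex_note(PosIndex *idx, int tapenum, long offset)
    {
        if (idx->ntuples == idx->capacity)
        {
            idx->capacity = idx->capacity ? idx->capacity * 2 : 1024;
            idx->pos = realloc(idx->pos,
                               idx->capacity * sizeof(TupPos));
        }
        idx->pos[idx->ntuples].tapenum = tapenum;
        idx->pos[idx->ntuples].offset = offset;
        idx->ntuples++;
    }

    /* Random access: find where tuple number n lives; the caller
     * then seeks to that tape/offset and rereads the tuple. */
    static TupPos
    posindex_lookup(const PosIndex *idx, long n)
    {
        return idx->pos[n];
    }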
Simon Riggs wrote:
> On Sat, 2007-01-06 at 17:06 -0500, Bruce Momjian wrote:
> > I saw no replies to this.
>
> Me neither.
>
> I take it that means it's a very good idea and we should add a TODO.

Added to TODO:

* Improve merge join performance by allowing mark/restore of
  tuple sources

  http://archives.postgresql.org/pgsql-hackers/2007-01/msg00096.php

-- 
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
"Simon Riggs" <simon@2ndquadrant.com> writes: > Merge Joins require us to potentially Mark and Restore positions in the > tuples arriving from executor sub-nodes. I came across an old note to myself suggesting that we handle this by interposing a Materialize node, and then teaching Material that if it's told EXEC_FLAG_MARK but not EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD, it need keep data only as far back as the Mark position. So the structural requirements are mostly in place already, it's just a matter of figuring out a nice way to implement the "drop older parts of the tuplestore" business. regards, tom lane
On Wed, 2007-01-10 at 10:10 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > Merge Joins require us to potentially Mark and Restore positions in the
> > tuples arriving from executor sub-nodes.
>
> I came across an old note to myself suggesting that we handle this by
> interposing a Materialize node, and then teaching Material that if it's
> told EXEC_FLAG_MARK but not EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD, it
> need keep data only as far back as the Mark position. So the structural
> requirements are mostly in place already, it's just a matter of figuring
> out a nice way to implement the "drop older parts of the tuplestore"
> business.

Same idea, I guess.

Presumably we'd need to teach the Materialize node to pass straight
through when the node does not receive any of EXEC_FLAG_MARK,
EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD.

The Materialize node would have to communicate with the Sort node so it
could indicate when it had passed its max size limit, so the Sort could
complete the final merge in-situ without wasting more space. Would it be
ugly to have the Materialize poke into the SortState?

Anyway, not just yet.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > Presumably we'd need to teach the Materialize node to pass straight > through when the node does not receive any of EXEC_FLAG_MARK, > EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD. It does that already. > The Materialize node would have to communicate with the Sort node so it > could indicate when it had passed its max size limit, so the Sort could > complete the final merge in-situ without wasting more space. Would it be > ugly to have the Materialize poke into the SortState? I don't think this is workable; tuplesort is not designed to change from on-the-fly merge to not-on-the-fly on-the-fly. IIRC it's throwing away data as it goes in the first case, and you can't magically get it back. Changing this seems like a case of adding 90% more complexity to buy 10% more performance. It's already true that the planner avoids mergejoin when there are lots of duplicate inner tuples, so I do not think we need put lots of effort into performance improvements for the case of large distances back to the mark. Teaching Material how to handle a small mark distance cheaply should be sufficient. regards, tom lane
On Wed, 2007-01-10 at 10:46 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > Presumably we'd need to teach the Materialize node to pass straight
> > through when the node does not receive any of EXEC_FLAG_MARK,
> > EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD.
>
> It does that already.
>
> > The Materialize node would have to communicate with the Sort node so it
> > could indicate when it had passed its max size limit, so the Sort could
> > complete the final merge in-situ without wasting more space. Would it be
> > ugly to have the Materialize poke into the SortState?
>
> I don't think this is workable; tuplesort is not designed to change from
> on-the-fly merge to not-on-the-fly on-the-fly. IIRC it's throwing away
> data as it goes in the first case, and you can't magically get it back.

It would have required a full re-sort and then a scroll through to the
point it had got to, which would have been kind of expensive, but
possible.

> Changing this seems like a case of adding 90% more complexity to buy 10%
> more performance. It's already true that the planner avoids mergejoin
> when there are lots of duplicate inner tuples, so I do not think we need
> put lots of effort into performance improvements for the case of large
> distances back to the mark. Teaching Material how to handle a small mark
> distance cheaply should be sufficient.

OK, I'm happier with that anyway. Sounds straightforward now.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com