Thread: nested transactions
I am going to work on nested transactions for 7.4. My goal is to first implement nested transactions: BEGIN;SELECT ...BEGIN;UPDATE;COMMIT;DELETE;COMMIT; and later savepoints (Oracle): BEGIN;SELECT ...SAVEPOINT t1;UPDATE;SAVEPOINT t2;DELETE;ROLLBACK TO SAVEPOINT t2;COMMIT; I assume people want both. As an implementation, I will hack xact.c to create a transaction status stack so when you do a BEGIN inside a transaction, it saves the transaction status, the transaction block status, and perhaps the command counter. A COMMIT restores these values. I also plan to modify the on commit/abort actions. On a subtransaction commit, little has to be done, but on an ABORT, you must execute any abort actions required by that subtransaction _and_ remove any on commit actions for the subtransaction. There will need to be some code reorganization because some on commit/abort activity assumes only one transaction can be in process. A stack will need to be added in those cases. And finally, I must abort tuple changes made by the aborted subtransaction. One way of doing that is to keep all relation id's modified by the transaction, and do a sequential scan of the tables on abort, changing the transaction id's to a fixed aborted transaction id. However, this could be slow. (We could store tids if only a few rows are updated by a subtransaction. That would speed it up considerably.) Another idea is to use new transaction id's for the subtransactions, and update the transaction id status in pg_clog for the subtransactions, so that there is no transaction id renumbering required. One problem with this is the requirement of updating all the clog transaction id statuses atomically. One way to do that would be to do parent/child dependency in clog so that if a child is looked up and it is marked as "in process", a check could be done against the parent. Once the outer transaction is committed/aborted, those subtransactions could be updated so there would be no need to reference the parent any longer. This would increase the clog size per transaction from 2 bits to 4 bytes (two bits for status, 30 bits for offset to parent). -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I am going to work on nested transactions for 7.4. > [some details] This is, of course, barely scratching the surface of what will need to be done. I assume you've abandoned the notion of a fast release cycle for 7.4? 'Cause if you start on this, we ain't releasing any time soon ... regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I am going to work on nested transactions for 7.4. > > [some details] > > This is, of course, barely scratching the surface of what will need to > be done. > > I assume you've abandoned the notion of a fast release cycle for 7.4? > 'Cause if you start on this, we ain't releasing any time soon ... Abandoned because of the delay in Win32 (end of Dec), PITR (not being worked on), and mostly because very few wanted a short release cycle. I will keep the transaction changes private to my tree, so if I can't get it done, I will just keep it for the next release. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > I am going to work on nested transactions for 7.4. If you're going to do a lot of reworking of how transactions are handled, maybe this is a good time to beg for cursors that stay open across commits. It looks like the JDBC driver is moving to using cursors with ResultSet.CLOSE_CURSORS_AT_COMMIT, for the advantage of not having to fetch the entire result immediately and hold it in memory. If this were implemented, the same could be done for ResultSet.HOLD_CURSORS_OVER_COMMIT, which I think a lot of JDBC code needs. Thanks, Scott
On Friday 22 November 2002 04:36 pm, Scott Lamb wrote: > Bruce Momjian wrote: > > I am going to work on nested transactions for 7.4. > > If you're going to do a lot of reworking of how transactions are > handled, maybe this is a good time to beg for cursors that stay open > across commits. It looks like the JDBC driver is moving to using cursors > with ResultSet.CLOSE_CURSORS_AT_COMMIT, for the advantage of not having > to fetch the entire result immediately and hold it in memory. If this > were implemented, the same could be done for > ResultSet.HOLD_CURSORS_OVER_COMMIT, which I think a lot of JDBC code needs. > I agree.It is my favorite features - and if you set savepoint I think that stay first solution (begin; ... ; begin; ...; begin; ...;comit; ...;commit;...; commit; Thanks Haris Peco
On Fri, 22 Nov 2002 00:32:46 -0500 (EST), Bruce Momjian <pgman@candle.pha.pa.us> wrote: >I am going to work on nested transactions for 7.4. > [...] >And finally, I must abort tuple changes made by the aborted >subtransaction. One way of doing that is to keep all relation id's >modified by the transaction, and do a sequential scan of the tables on >abort, changing the transaction id's to a fixed aborted transaction id. >However, this could be slow. (We could store tids if only a few rows >are updated by a subtransaction. That would speed it up considerably.) Depends on your definition of "few". I don't expect problems for up to several thousand tids. If there are more modified tuples, we could first reduce the list to page numbers, before finally falling back to table scans. >Another idea is to use new transaction id's for the subtransactions, and >[...] >would increase the clog size per transaction from 2 bits to 4 bytes >(two bits for status, 30 bits for offset to parent). Nice idea, this 30 bit offset. But one could argue that increased clog size even hurts users who don't use nested transactions at all. If parent/child dependency is kept separate from status bits (in pg_subtransxxxx files), additional I/O cost is only paid if subtransactions are actually used. New status bits (XMIN_IS_SUB, XMAX_IS_SUB) in tuple headers can avoid unnecessary parent xid lookups. I also thought of subtransaction xids in tuple headers as short lived information. Under certain conditions they can be replaced with the parent xid as soon as the parent transaction has finished. I proposed this to be done on the next tuple access just like we set committed/aborted flags now, though I'm not sure anymore that it is safe to do this. Old pg_subtrans files can be removed by VACUUM. One more difference between the two proposals: The former (locally remember modified tuples) can be used for recovery after a failed command. The latter (subtrans tree) can only help, if we give a new xid to each command, which I'm sure we don't want to do. ServusManfred
From: "Bruce Momjian" <pgman@candle.pha.pa.us> > And finally, I must abort tuple changes made by the aborted > subtransaction. One way of doing that is to keep all relation id's > modified by the transaction, and do a sequential scan of the tables on > abort, changing the transaction id's to a fixed aborted transaction id. > However, this could be slow. (We could store tids if only a few rows > are updated by a subtransaction. That would speed it up considerably.) Are you sure you don't want to use the log for this? It does mean that the log can grow without bound for long-lived transactions, but it's very straightforward and fast. Ken Hirsch
Manfred Koizar wrote: > On Fri, 22 Nov 2002 00:32:46 -0500 (EST), Bruce Momjian > <pgman@candle.pha.pa.us> wrote: > >I am going to work on nested transactions for 7.4. > > [...] > >And finally, I must abort tuple changes made by the aborted > >subtransaction. One way of doing that is to keep all relation id's > >modified by the transaction, and do a sequential scan of the tables on > >abort, changing the transaction id's to a fixed aborted transaction id. > >However, this could be slow. (We could store tids if only a few rows > >are updated by a subtransaction. That would speed it up considerably.) > > Depends on your definition of "few". I don't expect problems for up > to several thousand tids. If there are more modified tuples, we could > first reduce the list to page numbers, before finally falling back to > table scans. Yes, and the key point is that those are kept only in the backend local memory, so clearly thousands are possible. The outer transaction takes care of all the ACID issues. > >Another idea is to use new transaction id's for the subtransactions, and > >[...] > >would increase the clog size per transaction from 2 bits to 4 bytes > >(two bits for status, 30 bits for offset to parent). > > Nice idea, this 30 bit offset. But one could argue that increased > clog size even hurts users who don't use nested transactions at all. > If parent/child dependency is kept separate from status bits (in > pg_subtransxxxx files), additional I/O cost is only paid if > subtransactions are actually used. New status bits (XMIN_IS_SUB, > XMAX_IS_SUB) in tuple headers can avoid unnecessary parent xid > lookups. > > I also thought of subtransaction xids in tuple headers as short lived > information. Under certain conditions they can be replaced with the > parent xid as soon as the parent transaction has finished. I proposed > this to be done on the next tuple access just like we set > committed/aborted flags now, though I'm not sure anymore that it is > safe to do this. > > Old pg_subtrans files can be removed by VACUUM. > > One more difference between the two proposals: The former (locally > remember modified tuples) can be used for recovery after a failed > command. The latter (subtrans tree) can only help, if we give a new > xid to each command, which I'm sure we don't want to do. The interesting issue is that if we could set the commit/abort bits all at the same time, we could have the parent/child dependency local to the backend --- other backends don't need to know the parent, only the status of the (subtransaction's) xid, and they need to see all those xid's committed at the same time. You could store the backend slot id in pg_clog rather than the parent xid and look up the status of the outer xid for that backend slot. That would allow you to use 2 bytes, with a max of 16k backends. The problem is that on a crash, the pg_clog points to invalid slots --- it would probably have to be cleaned up on startup. But still, you have an interesting idea of just setting the bit to be "I am a child". The trick is allowing backends to figure out who's child you are. We could store this somehow in shared memory, but that is finite and there can be lots of xid's for a backend using subtransactions. I still think there must be a clean way, but I haven't figured it out yet. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Ken Hirsch wrote: > From: "Bruce Momjian" <pgman@candle.pha.pa.us> > > And finally, I must abort tuple changes made by the aborted > > subtransaction. One way of doing that is to keep all relation id's > > modified by the transaction, and do a sequential scan of the tables on > > abort, changing the transaction id's to a fixed aborted transaction id. > > However, this could be slow. (We could store tids if only a few rows > > are updated by a subtransaction. That would speed it up considerably.) > > Are you sure you don't want to use the log for this? It does mean that the > log can grow without bound for long-lived transactions, but it's very > straightforward and fast. I don't think we want to have unlimited log file growth for long running transactions/subtransactions. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Is there going to be a way to use transactions inside transactions of transactions? In other words: BEGIN; BEGIN; BEGIN; BEGIN; COMMIT; COMMIT; COMMIT; COMMIT; Is there a way to have some sort of recursive solution with every transaction but the first one being a child transaction? Is there a way to implement that without too much extra effort? I just curious how that could be done. Hans
On Wed, 27 Nov 2002 22:47:33 -0500 (EST), Bruce Momjian <pgman@candle.pha.pa.us> wrote: >The interesting issue is that if we could set the commit/abort bits all >at the same time, we could have the parent/child dependency local to the >backend --- other backends don't need to know the parent, only the >status of the (subtransaction's) xid, and they need to see all those >xid's committed at the same time. You mean the commit/abort bit in the tuple headers? Yes, this would be interesting, but I see no way how this could be done. If it could, there would be no need for pg_clog. Reading your paragraph above one more time I think you mean the bits in pg_clog: Each subtransaction gets its own xid. On ROLLBACK the abort bits of the aborted (sub)transaction and all its children are set in pg_clog immediately. This operation does not have to be atomic. On subtransaction COMMIT nothing happens to pg_clog, the status is only changed locally, the subtransaction still looks "in progress" to other backends. Only when the main transaction commits, we set the commit bits of the main transaction and all its non-aborted children in pg_clog. This action has to be atomic. Right? AFAICS the problem lies in updating several pg_clog bits at once. How can this be done without holding a potentially long lasting lock? >You could store the backend slot id in pg_clog rather than the parent >xid and look up the status of the outer xid for that backend slot. That >would allow you to use 2 bytes, with a max of 16k backends. The problem >is that on a crash, the pg_clog points to invalid slots --- it would >probably have to be cleaned up on startup. Again I would try to keep pg_clog compact and store the backend slots in another file, thus not slowing down instances where subtransactions are nor used. Apart from this minor detail I don't see, how this is supposed to work. Could you elaborate? >But still, you have an interesting idea of just setting the bit to be "I >am a child". The idea was to set subtransaction bits in the tuple header. Here is yet another different idea: Let the currently unused fourth state in pg_clog indicate a committed subtransaction. There are two bits per transaction, commit and abort, with the following meaning: a c0 0 transaction in progress, the owning backend knows whether it is a main- or a sub-transaction, other backendsdon't care1 0 aborted, nobody cares whether main- or sub-transaction0 1 committed main-transaction (*)1 1 committedsub-transaction, have to look for parent in pg_subtrans If we allow the 1/1 state to be replaced with 0/1 or 1/0 (on the fly as a side effect of a visibility check, or by vacuum, or by COMMIT/ROLLBACK), this could save a lot of parent lookups without having to touch the xids in the tuple headers. So (*) should read: committed main-transaction or committed sub-transaction having a committed parent. >The trick is allowing backends to figure out who's child >you are. We could store this somehow in shared memory, but that is >finite and there can be lots of xid's for a backend using >subtransactions. The subtrans dependencies have to be visible to all backends. Store them to disk just like pg_clog. In older proposals I spoke of a pg_subtrans "table" containing (parent, child) pairs. This was only meant as a concept, not as a real SQL table subject to MVCC. An efficient(?) implementation could be an array of parent xids, indexed by child xid. Most of it can be stolen from the clog code. One more argument for pg_subtrans being visible to all backends: If an UPDATE is about to change a tuple touched by another active transaction, it waits for the other transaction to commit or abort. We must always wait for the main transaction, not the subtrans. >I still think there must be a clean way, I hope so ... > but I haven't figured it out yet. Are we getting nearer? ServusManfred
Hans-J�rgen Sch�nig wrote: > Is there going to be a way to use transactions inside transactions of > transactions? > In other words: > > BEGIN; > BEGIN; > BEGIN; > BEGIN; > > COMMIT; > COMMIT; > COMMIT; > COMMIT; > > Is there a way to have some sort of recursive solution with every > transaction but the first one being a child transaction? > Is there a way to implement that without too much extra effort? > I just curious how that could be done. Sure, nesting will be unlimited. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Manfred Koizar wrote: > On Wed, 27 Nov 2002 22:47:33 -0500 (EST), Bruce Momjian > <pgman@candle.pha.pa.us> wrote: > >The interesting issue is that if we could set the commit/abort bits all > >at the same time, we could have the parent/child dependency local to the > >backend --- other backends don't need to know the parent, only the > >status of the (subtransaction's) xid, and they need to see all those > >xid's committed at the same time. > > You mean the commit/abort bit in the tuple headers? Yes, this would > be interesting, but I see no way how this could be done. If it could, > there would be no need for pg_clog. > > Reading your paragraph above one more time I think you mean the bits > in pg_clog: Each subtransaction gets its own xid. On ROLLBACK the Right. > abort bits of the aborted (sub)transaction and all its children are > set in pg_clog immediately. This operation does not have to be > atomic. On subtransaction COMMIT nothing happens to pg_clog, the Right, going from RUNNING to ABORTED doesn't have to be atomic because both tuples are invisible. > status is only changed locally, the subtransaction still looks "in > progress" to other backends. Only when the main transaction commits, > we set the commit bits of the main transaction and all its non-aborted > children in pg_clog. This action has to be atomic. Right? Right. We can't have some backends looking at part of the transaction as committed while at the same time other backends see the transaction as in process. > AFAICS the problem lies in updating several pg_clog bits at once. How > can this be done without holding a potentially long lasting lock? Yes, locking is one possible solution, but no one likes that. One hack lock idea would be to create a subtransaction-only lock, so if you see the special 4-th xact state (about to be committed as part of a subtransaction) you have to wait on that lock (held by the backend twiddling the xact bits), then look again. That basically would serialize all the bit-twiddling for subtransactions. I am sure I am going to get a "yuck" from the audience on that one, but I am not sure how long that bit twiddling could take. Does xact twiddle every cause I/O? I think it could, which would be a pretty big performance problem. It would serialize the subtransaction commits _and_ block anyone trying to get the status of those subtransactions. We would not use the the 4th xid status during the transaction, only while we were twiddling the bits on commit. > >You could store the backend slot id in pg_clog rather than the parent > >xid and look up the status of the outer xid for that backend slot. That > >would allow you to use 2 bytes, with a max of 16k backends. The problem > >is that on a crash, the pg_clog points to invalid slots --- it would > >probably have to be cleaned up on startup. > > Again I would try to keep pg_clog compact and store the backend slots > in another file, thus not slowing down instances where subtransactions > are nor used. Apart from this minor detail I don't see, how this is > supposed to work. Could you elaborate? The trick is that when that 4th status is set, backends looking up the status all need to point to a central location that can be set for all of them at once, hence the original idea of putting the parent xid in the clog file. We don't _need_ to do that, but we do need a way to _point_ to a central location where the status can be looked up. > >But still, you have an interesting idea of just setting the bit to be "I > >am a child". > > The idea was to set subtransaction bits in the tuple header. Here is > yet another different idea: Let the currently unused fourth state in > pg_clog indicate a committed subtransaction. There are two bits per > transaction, commit and abort, with the following meaning: > > a c > 0 0 transaction in progress, the owning backend knows whether it is > a main- or a sub-transaction, other backends don't care > 1 0 aborted, nobody cares whether main- or sub-transaction > 0 1 committed main-transaction (*) > 1 1 committed sub-transaction, have to look for parent in > pg_subtrans > > If we allow the 1/1 state to be replaced with 0/1 or 1/0 (on the fly > as a side effect of a visibility check, or by vacuum, or by > COMMIT/ROLLBACK), this could save a lot of parent lookups without > having to touch the xids in the tuple headers. Yes, you could do that, but we can easily just set the clog bits atomically, and it will not be needed --- the tuple bits really don't help us, I think. > So (*) should read: committed main-transaction or committed > sub-transaction having a committed parent. > > >The trick is allowing backends to figure out who's child > >you are. We could store this somehow in shared memory, but that is > >finite and there can be lots of xid's for a backend using > >subtransactions. > > The subtrans dependencies have to be visible to all backends. Store > them to disk just like pg_clog. In older proposals I spoke of a > pg_subtrans "table" containing (parent, child) pairs. This was only > meant as a concept, not as a real SQL table subject to MVCC. An > efficient(?) implementation could be an array of parent xids, indexed > by child xid. Most of it can be stolen from the clog code. OK, we put it in a file. And how do we efficiently clean it up? Remember, it is only to be used for a _brief_ period of time. I think a file system solution is doable if we can figure out a way not to create a file for every xid. Do we spin through the files (one per outer transaction) looking for a matching xid when we see that 4th xact state? Maybe we write the xid's to a file in a special directory in sorted order, and backends can do a btree search of each file in that directory looking for the xid, and then knowing the master xid, look up that status, and once all the children xid's are updated, you delete the file. > One more argument for pg_subtrans being visible to all backends: If > an UPDATE is about to change a tuple touched by another active > transaction, it waits for the other transaction to commit or abort. > We must always wait for the main transaction, not the subtrans. Yes, but again, the xid status of subtransactions is only update just before commit of the main transaction, so there is little value to having those visible. Let's keep going! -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
pgman wrote: > > AFAICS the problem lies in updating several pg_clog bits at once. How > > can this be done without holding a potentially long lasting lock? > > Yes, locking is one possible solution, but no one likes that. One hack > lock idea would be to create a subtransaction-only lock, so if you see > the special 4-th xact state (about to be committed as part of a > subtransaction) you have to wait on that lock (held by the backend > twiddling the xact bits), then look again. That basically would > serialize all the bit-twiddling for subtransactions. I am sure I am > going to get a "yuck" from the audience on that one, but I am not sure > how long that bit twiddling could take. Does xact twiddle every cause > I/O? I think it could, which would be a pretty big performance problem. > It would serialize the subtransaction commits _and_ block anyone trying > to get the status of those subtransactions. We would not use the the > 4th xid status during the transaction, only while we were twiddling the > bits on commit. Let me correct this. Transaction state readers _don't_ have to block while the subtransaction is twiddling bits. Logic would be: Set all aborted subtransaction status bitsGrab subxact lockSet subxact global status bit to in progressSet all subtransactionxids to SUBXACT_TO_COMMITSet subxact global status bit to committed (commit happens here)Set all SUBXACT_TO_COMMITxids to COMMITTEDRelease subxact lock Any transaction looking up a subtransaction that has an SUBXACT_TO_COMMIT state has to consult the global subxact status bit, which is a global variable in shared memory. What this basically does is to funnel all subxid lookups into a single global subxid status bit. In fact, even the outer transaction has to be set to SUBXACT_TO_COMMIT so it commits at the same time as the subtransactions. On crash recovery, all SUBXACT_TO_COMMIT have to be cleaned up, somehow, perhaps using WAL. The only downside to this approach is that subtransaction bit twiddling is serialized. This is an interesting idea becuase it has overhead only to backends using subtransactions. It does kill our "multiple commits at the same time" that we can do with normal transactions. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Yes, locking is one possible solution, but no one likes that. One hack > lock idea would be to create a subtransaction-only lock, so if you see > the special 4-th xact state (about to be committed as part of a > subtransaction) you have to wait on that lock (held by the backend > twiddling the xact bits), then look again. That basically would > serialize all the bit-twiddling for subtransactions. I am sure I am > going to get a "yuck" from the audience on that one, You sure are. > but I am not sure > how long that bit twiddling could take. Does xact twiddle every cause > I/O? Yes, if the page of pg_clog you need to touch is not currently in a buffer. With a large transaction you might have hundreds of subtransactions, which could take an unpleasantly long time to mark all committed. What's worse, I think the above proposal requires a *single* lock for this purpose (if there's more than one, how shall the requestor know which one to block on?) --- so you are serializing all transaction commits that have subtransactions, with only one able to go through at a time. That will really, really not do; the performance will be way worse than the chaining idea we discussed before. > You could store the backend slot id in pg_clog rather than the parent > xid and look up the status of the outer xid for that backend slot. That > would allow you to use 2 bytes, with a max of 16k backends. This is also a bad idea, because backend slot ids are not stable (by the time you look in PG_PROC, the slot may be occupied by a new, unrelated backend process). > But still, you have an interesting idea of just setting the bit to be "I > am a child". That bit alone doesn't help; you need to know *whose* child. AFAICS, the objection to putting parent xact IDs into pg_clog is basically a performance issue: bigger clog means more I/O. This is surely true; but the alternatives proposed so far are worse. regards, tom lane
I should add that I am not prepared to overhaul the pg_clog file format as part of adding subtransactions for 7.4. I can do the tid/sequential scan method for abort, or the single-lock method described. --------------------------------------------------------------------------- Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Yes, locking is one possible solution, but no one likes that. One hack > > lock idea would be to create a subtransaction-only lock, so if you see > > the special 4-th xact state (about to be committed as part of a > > subtransaction) you have to wait on that lock (held by the backend > > twiddling the xact bits), then look again. That basically would > > serialize all the bit-twiddling for subtransactions. I am sure I am > > going to get a "yuck" from the audience on that one, > > You sure are. > > > but I am not sure > > how long that bit twiddling could take. Does xact twiddle every cause > > I/O? > > Yes, if the page of pg_clog you need to touch is not currently in a > buffer. With a large transaction you might have hundreds of > subtransactions, which could take an unpleasantly long time to mark > all committed. > > What's worse, I think the above proposal requires a *single* lock for > this purpose (if there's more than one, how shall the requestor know > which one to block on?) --- so you are serializing all transaction > commits that have subtransactions, with only one able to go through at > a time. That will really, really not do; the performance will be way > worse than the chaining idea we discussed before. > > > You could store the backend slot id in pg_clog rather than the parent > > xid and look up the status of the outer xid for that backend slot. That > > would allow you to use 2 bytes, with a max of 16k backends. > > This is also a bad idea, because backend slot ids are not stable (by the > time you look in PG_PROC, the slot may be occupied by a new, unrelated > backend process). > > > But still, you have an interesting idea of just setting the bit to be "I > > am a child". > > That bit alone doesn't help; you need to know *whose* child. > > AFAICS, the objection to putting parent xact IDs into pg_clog is > basically a performance issue: bigger clog means more I/O. This is > surely true; but the alternatives proposed so far are worse. > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I should add that I am not prepared to overhaul the pg_clog file format > as part of adding subtransactions for 7.4. I can do the tid/sequential scan > method for abort, or the single-lock method described. If you think that changing the pg_clog file format would be harder than either of those other ideas, I think you're very badly mistaken. pg_clog is touched only by one rather simple module. I think the other methods will be completely unacceptable from a performance point of view. They could maybe work if subtransactions were a seldom-used feature; but the people who want to use 'em are mostly talking about a subtransaction for *every* command. If you design your implementation on the assumption that subtransactions are infrequent, it will be unusably slow. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I should add that I am not prepared to overhaul the pg_clog file format > > as part of adding subtransactions for 7.4. I can do the tid/sequential scan > > method for abort, or the single-lock method described. > > If you think that changing the pg_clog file format would be harder than > either of those other ideas, I think you're very badly mistaken. > pg_clog is touched only by one rather simple module. Agreed, the clog changes would be the simple solution. However, I am not sure I can make that level of changes. In fact, the complexity of having multiple transactions per backend is going to be tough for me too. Also, I should point out that balooning pg_clog by 16x is going to mean we are perhaps 4-8x more likely to need extra pages to mark all subtransactions. Isn't there some other way we can link these subtransactions together rather than mucking with pg_clog, as we only need the linkage while we mark them all committed? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Also, I should point out that balooning pg_clog by 16x is going to mean > we are perhaps 4-8x more likely to need extra pages to mark all > subtransactions. So? The critical point is that we don't need to serialize the pg_clog operations if we do it that way. Also, we can certainly expand the number of pg_clog pages held in memory by some amount. Right now it's only 4, IIRC. We could make it 64 and probably no one would even notice. > Isn't there some other way we can link these subtransactions together > rather than mucking with pg_clog, as we only need the linkage while we > mark them all committed? You *cannot* expect to do it all in shared memory; you will be blown out of the water by the first long transaction that comes along, if you try. So the question is not whether we put the status into a file, it is only what representation we choose. Manfred suggested a separate log file ("pg_subclog" or some such) but I really don't see any operational advantage to that. You still end up with 4 bytes per transaction, you're just assuming that putting them in a different file makes it better. I don't see how. regards, tom lane
Tom Lane wrote: > > Isn't there some other way we can link these subtransactions together > > rather than mucking with pg_clog, as we only need the linkage while we > > mark them all committed? > > You *cannot* expect to do it all in shared memory; you will be blown out > of the water by the first long transaction that comes along, if you try. > So the question is not whether we put the status into a file, it is only > what representation we choose. > > Manfred suggested a separate log file ("pg_subclog" or some such) but > I really don't see any operational advantage to that. You still end up > with 4 bytes per transaction, you're just assuming that putting them > in a different file makes it better. I don't see how. It only becomes better if we can throw away that file (or contents) when the transaction completes and we have marked all the subtransactions as completed. We can't compress pg_clog if we store the parent info in there. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > It only becomes better if we can throw away that file (or contents) when > the transaction completes and we have marked all the subtransactions as > completed. We can't compress pg_clog if we store the parent info in > there. But we already have a recycling mechanism for pg_clog. AFAICS, creating a parallel log file with a separate recycling mechanism is a study in wasted effort. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > It only becomes better if we can throw away that file (or contents) when > > the transaction completes and we have marked all the subtransactions as > > completed. We can't compress pg_clog if we store the parent info in > > there. > > But we already have a recycling mechanism for pg_clog. AFAICS, > creating a parallel log file with a separate recycling mechanism is > a study in wasted effort. But that recycling requires the vacuum of every database in the system. Do people do that frequently enough? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> But we already have a recycling mechanism for pg_clog. AFAICS, >> creating a parallel log file with a separate recycling mechanism is >> a study in wasted effort. > But that recycling requires the vacuum of every database in the system. > Do people do that frequently enough? Once the auto vacuum code is in there, they won't have any choice ;-) In any case, I saw no part of your proposal that removed the need for vacuum, so what's your point? regards, tom lane
On Thu, 28 Nov 2002 21:46:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Manfred suggested a separate log file ("pg_subclog" or some such) but >I really don't see any operational advantage to that. You still end up >with 4 bytes per transaction, you're just assuming that putting them >in a different file makes it better. I don't see how. There are two points: 1) If your site/instance/application/whatever... does not use nested transactions or does use them only occasionally, you don't have to pay the additional I/O cost. 2) If we update a subtransaction's pg_clog bits as soon as the status of the main transaction is known, pg_subtrans is only visited once per subtransaction, while pg_clog has to be looked up once per tuple. Things might look different however, if we wrap every command into a subtransaction... ServusManfred
On Friday 29 November 2002 00:56, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> But we already have a recycling mechanism for pg_clog. AFAICS, > >> creating a parallel log file with a separate recycling mechanism is > >> a study in wasted effort. > > > > But that recycling requires the vacuum of every database in the system. > > Do people do that frequently enough? > > Once the auto vacuum code is in there, they won't have any choice ;-) OK, I know postgres needs to be vacuumed every so often (I think its to guarantee safe XID wraparound?) I think the AVD should do something to guarnatee this is hapening. Since I am working on AVD, what are the criterea for this? From the above I assume it also pertains to pg_clog recycling (which is related to XID wraparound?), but I know nothing about that. Right now AVD only performs vacuum analyze on specific tables as it deems they need it, it does not perform vacuum on entire databases at any point yet.
Manfred Koizar <mkoi-pg@aon.at> writes: > 1) If your site/instance/application/whatever... does not use nested > transactions or does use them only occasionally, you don't have to pay > the additional I/O cost. As I already said to Bruce, designing this facility on the assumption that it will be seldom-used is a recipe for failure. Everybody and his brother wants commands that don't abort the whole transaction. As soon as this facility exists, you can bet that the standard mode of operation will become "one subtransaction per interactive command". If you don't design it to support that load, you may as well not bother to build it at all. > 2) If we update a subtransaction's pg_clog bits as soon as the status > of the main transaction is known, pg_subtrans is only visited once per > subtransaction, while pg_clog has to be looked up once per tuple. How you figure that? It seems to me the visit rate is exactly the same, you've just divided it into two files. Having to touch two files instead of one seems if anything worse. regards, tom lane
"Matthew T. O'Connor" <matthew@zeut.net> writes: > Right now AVD only performs vacuum analyze on specific tables as it deems they > need it, it does not perform vacuum on entire databases at any point yet. See http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/routine-vacuuming.html However I think that only talks about XID wraparound and datfrozenxid. pg_clog recycling is driven off the oldest datvacuumxid in pg_database; the AVD should think about launching a database-wide vacuum whenever age(datvacuumxid) exceeds a million or two. regards, tom lane
Manfred Koizar wrote: > One more argument for pg_subtrans being visible to all backends: If > an UPDATE is about to change a tuple touched by another active > transaction, it waits for the other transaction to commit or abort. > We must always wait for the main transaction, not the subtrans. This issue kills the idea that we can get away with providing lookup to the other backends _only_ while we are twiddling the clog bits. Other transactions are going to need to know if the XID they see on the tuple is owned by an active backend. This means we have to provide child/master xid lookup during the transaction, meaning we may as well use pg_clog or separate file, especially if we can get autovacuum for 7.4. It kills the idea that somehow locking would work. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Thu, 28 Nov 2002 12:59:21 -0500 (EST), Bruce Momjian <pgman@candle.pha.pa.us> wrote: >Yes, locking is one possible solution, but no one likes that. One hack >lock idea would be to create a subtransaction-only lock, [...] > >> [...] without >> having to touch the xids in the tuple headers. > >Yes, you could do that, but we can easily just set the clog bits >atomically, From what I read above I don't think we can *easily* set more than one transaction's bits atomically. > and it will not be needed --- the tuple bits really don't >help us, I think. Yes, this is what I said, or at least tried to say. I just wanted to make clear how this new approach (use the fourth status) differs from older proposals (replace subtransaction ids in tuple headers). >OK, we put it in a file. And how do we efficiently clean it up? >Remember, it is only to be used for a _brief_ period of time. I think a >file system solution is doable if we can figure out a way not to create >a file for every xid. I don't want to create one file for every transaction, but rather a huge (sparse) array of parent xids. This array is divided into manageable chunks, represented by files, "pg_subtrans_NNNN". These files are only created when necessary. At any time only a tiny part of the whole array is kept in shared buffers. This concept is similar or almost equal to pg_clog, which is an array of doublebits. >Maybe we write the xid's to a file in a special directory in sorted >order, and backends can do a btree search of each file in that directory >looking for the xid, and then knowing the master xid, look up that >status, and once all the children xid's are updated, you delete the >file. Yes, dense arrays or btrees are other possible implementations. But for simplicity I'd do it pg_clog style. >Yes, but again, the xid status of subtransactions is only update just >before commit of the main transaction, so there is little value to >having those visible. Having them visible solves the atomicity problem without requiring long locks. Updating the status of a single (main or sub) transaction is atomic, just like it is now. Here is what is to be done for some operations: BEGIN main transaction:Get a new xid (no change to current behaviour).pg_clog[xid] is still 00, meaning active.pg_subtrans[xid]is still 0, meaning no parent. BEGIN subtransaction:Push current transaction info onto local stack.Get a new xid.Record parent xid in pg_subtrans[xid].pg_clog[xid]is still 00. ROLLBACK subtransaction:Set pg_clog[xid] to 10 (aborted).Optionally set clog bits for subsubtransactions to 10.Pop transactioninfo from stack. COMMIT subtransaction:Set pg_clog[xid] to 11 (committed subtrans).Don't touch clog bits for subsubtransactions!Pop transactioninfo from stack. ROLLBACK main transaction:Set pg_clog[xid] to 10 (aborted).Optionally set clog bits for subtransactions to 10. COMMIT main transaction:Set pg_clog[xid] to 01 (committed).Optionally set clog bits for subtransactions from 11 to 01.Don'ttouch clog bits for aborted subtransactions! Visibility check by other transactions: If a tuple is visited and its XMIN/XMAX_IS_COMMITTED/ABORTED flags are not yet set, pg_clog has to be consulted to find out the status of the inserting/deleting transaction xid. If pg_clog[xid] is ... 00: transaction still active 10: aborted 01: committed 11: committed subtransaction, have to check parent Only in this last case do we have to get parentxid from pg_subtrans. Now we look at pg_clog[parentxid]. If we find ... 00: parent still active, so xid is considered active, too 10: parent aborted, so xid is considered aborted, optionally set pg_clog[xid] = 10 01: parent committed, so xid is considered committed, optionally set pg_clog[xid] = 01 11: recursively check grandparent(s) ... For brevity the following operations are not covered in detail: . Visibility checks for tuples inserted/deleted by a (sub)transaction belonging to the current transaction tree (have to check local transaction stack whenever we look at a xid or switch to a parent xid) . HeapTupleSatisfiesUpdate (sometimes has to wait for parent transaction) The trick here is, that subtransaction status is immediately updated in pg_clog on commit/abort. Main transaction commit is atomic (just set its commit bit). Status 11 is short-lived, it is replaced with the final status by one or more of - COMMIT/ROLLBACK of the main transaction- a later visibility check (as a side effect)- VACUUM pg_subtrans cleanup: A pg_subtrans_NNNN file covers a known range of transaction ids. As soon as none of these transactions has a pg_clog status of 11, the pg_subtrans_NNNN file can be removed. VACUUM can do this, and it won't even have to check the heap. ServusManfred
Manfred Koizar <mkoi-pg@aon.at> writes: > Visibility check by other transactions: If a tuple is visited and its > XMIN/XMAX_IS_COMMITTED/ABORTED flags are not yet set, pg_clog has to > be consulted to find out the status of the inserting/deleting > transaction xid. If pg_clog[xid] is ... > 00: transaction still active > 10: aborted > 01: committed > 11: committed subtransaction, have to check parent > Only in this last case do we have to get parentxid from pg_subtrans. Unfortunately this discussion is wrong. User-level visibility checks will usually have to fetch the parentxid in case 01 as well, because even if the parent is committed, it might not be visible in our snapshot. Snapshots will record only topmost-parent XIDs (because that's what we can find in the PG_PROC array, and anything else would create atomicity problems anyway). So we must chase to the topmost parent before testing visibility. This means that the parentxid will need to be fetched in enough cases that it's quite dubious that pushing it to a different file saves I/O. Also, using a 11 state doubles the amount of pg_clog I/O needed to commit a collection of subtransactions. You have to write 11 as the state of each commitable subtransaction, then commit the parent (write 01 as its state), then go back and change the state of each subtransaction to 01. (Whether this last bit is done as part of parent transaction commit, or during later inspections of the state of the subtransaction, doesn't change the argument.) I think it would be preferable to use only three states: active, aborted, committed. The parent commit protocol is (1) write 10 as state of each aborted subtransaction (this should be done as soon as the subtransaction is known aborted, rather than delaying to parent commit); (2) write 01 as state of parent (this is the atomic commit); (3) write 01 as state of each committed subtransaction. Readers who see 00 must check the parent state; if the parent is committed then they have to go back and recheck the child state (to see if it became "aborted" after they looked). This halves the write traffic during a commit, at the cost of additional read traffic when subtransaction state is checked in a narrow window after the time of parent transaction commit. I believe it nets out to be faster. regards, tom lane
On Fri, 29 Nov 2002 13:33:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Unfortunately this discussion is wrong. User-level visibility checks >will usually have to fetch the parentxid in case 01 as well, because >even if the parent is committed, it might not be visible in our >snapshot. Or we don't allow a subtransaction's status to be updated from 11 to 01 until we know, that the main transaction is visible to all active transactions. Didn't check whether this is expensive to find out. At least it should be doable by VACCUM. >Snapshots will record only topmost-parent XIDs (because >that's what we can find in the PG_PROC array, and anything else would >create atomicity problems anyway). So we must chase to the topmost >parent before testing visibility. BTW, I think this *forces* us to replace the sub xid with the respective main xid in a tuple header, when we set XMIN/MAX_IS_COMMITTED. Otherwise we'd have to look for the main xid, whenever a tuple is touched. >Also, using a 11 state doubles the amount of pg_clog I/O needed to >commit a collection of subtransactions. Is a pg_clog page written out to disk each time a bit is changed? I'd expect some locality. >I think it would be preferable to use only three states: active, >aborted, committed. The parent commit protocol is (1) write 10 as state >of each aborted subtransaction (this should be done as soon as the >subtransaction is known aborted, rather than delaying to parent commit); >(2) write 01 as state of parent (this is the atomic commit); (3) write >01 as state of each committed subtransaction. Readers who see 00 must >check the parent state; if the parent is committed then they have to go >back and recheck the child state (to see if it became "aborted" after >they looked). Nice idea! This saves the fourth status for future uses (for example, Firebird uses it for two phase commit). OTOH for reasons you mentioned above there's no chance to save parent xid lookups, if we go this way. >This halves the write traffic during a commit, at the >cost of additional read traffic when subtransaction state is checked in >a narrow window after the time of parent transaction commit. I believe >it nets out to be faster. Maybe. The whole point of my approach is: If we can limit the active range of transactions requiring parent xid lookups to a small fraction of the range needing pg_clog lookups, then it makes sense to store status bits and parent xids in different files. Otherwise keeping them together in one file clearly is faster. ServusManfred
Manfred Koizar wrote: > On Fri, 29 Nov 2002 13:33:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> > wrote: > >Unfortunately this discussion is wrong. User-level visibility checks > >will usually have to fetch the parentxid in case 01 as well, because > >even if the parent is committed, it might not be visible in our > >snapshot. > > Or we don't allow a subtransaction's status to be updated from 11 to > 01 until we know, that the main transaction is visible to all active > transactions. Didn't check whether this is expensive to find out. At > least it should be doable by VACCUM. > > >Snapshots will record only topmost-parent XIDs (because > >that's what we can find in the PG_PROC array, and anything else would > >create atomicity problems anyway). So we must chase to the topmost > >parent before testing visibility. > > BTW, I think this *forces* us to replace the sub xid with the > respective main xid in a tuple header, when we set > XMIN/MAX_IS_COMMITTED. Otherwise we'd have to look for the main xid, > whenever a tuple is touched. Sorry, I don't follow this. As far as I know, we will set the subxid on the tuple so we can independently mark the xact as aborted without revisiting all the tuples. Once it is committed/rolled back, I see no need to lookup the parent, and in fact we could clear the clog parent xid offset so there is no way to access the parent anymore. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Manfred Koizar <mkoi-pg@aon.at> writes: > Maybe. The whole point of my approach is: If we can limit the active > range of transactions requiring parent xid lookups to a small fraction > of the range needing pg_clog lookups, then it makes sense to store > status bits and parent xids in different files. Otherwise keeping > them together in one file clearly is faster. Hmm ... I'm not sure that that's possible. But wait a moment. The child xid is by definition always greater than (newer than) its parent. So if we consult pg_clog and find the transaction marked committed, *and* the xid is before the window of XIDs in our snapshot, then even if it's not a top-level xid, the parent must be before our window too. Therefore we can conclude the transaction is visible in our snapshot. So indeed there is a good-size range of xids for which we'll never need to chase the parent link: everything before the RecentGlobalXmin computed by GetSnapshotData. (We do have to set subtransactions to committed during parent commit to make this true; we can't update them lazily. But I think that's okay.) Maybe you're right --- we could probably truncate pg_subtrans faster than pg_clog, and we could definitely expect to keep less of it in memory than pg_clog. regards, tom lane
I am concerned this is getting beyond my capabilities for 7.4 --- anyone want to help? --------------------------------------------------------------------------- Tom Lane wrote: > Manfred Koizar <mkoi-pg@aon.at> writes: > > Maybe. The whole point of my approach is: If we can limit the active > > range of transactions requiring parent xid lookups to a small fraction > > of the range needing pg_clog lookups, then it makes sense to store > > status bits and parent xids in different files. Otherwise keeping > > them together in one file clearly is faster. > > Hmm ... I'm not sure that that's possible. > > But wait a moment. The child xid is by definition always greater than > (newer than) its parent. So if we consult pg_clog and find the > transaction marked committed, *and* the xid is before the window of XIDs > in our snapshot, then even if it's not a top-level xid, the parent must > be before our window too. Therefore we can conclude the transaction is > visible in our snapshot. So indeed there is a good-size range of xids > for which we'll never need to chase the parent link: everything before > the RecentGlobalXmin computed by GetSnapshotData. (We do have to set > subtransactions to committed during parent commit to make this true; > we can't update them lazily. But I think that's okay.) > > Maybe you're right --- we could probably truncate pg_subtrans faster > than pg_clog, and we could definitely expect to keep less of it in > memory than pg_clog. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
[Sorry for the delay. I'm a bit busy these days.] On Fri, 29 Nov 2002 15:57:17 -0500 (EST), Bruce Momjian <pgman@candle.pha.pa.us> wrote: >> BTW, I think this *forces* us to replace the sub xid with the >> respective main xid in a tuple header, when we set >> XMIN/MAX_IS_COMMITTED. Otherwise we'd have to look for the main xid, >> whenever a tuple is touched. > >Sorry, I don't follow this. Probably because we've mixed up several proposals. I'll try to pick them apart below. >As far as I know, we will set the subxid on >the tuple so we can independently mark the xact as aborted without >revisiting all the tuples. Yes. >Once it is committed/rolled back, These cases are completely different. If a (main or sub-) transaction is rolled back, its effects are invisible to all transactions; this status is immediately effective and final. OTOH a subtransaction commit is only tentative. It becomes effective when the main transaction commits. (And the subtransaction's status turns to aborted, when the main transaction aborts.) >I see no >need to lookup the parent, and in fact we could clear the clog parent >xid offset so there is no way to access the parent anymore. While a subtransaction is seen as "tentatively committed" other transactions have to look up its parent to find out its effective status. Proposal A was: Never show "tentatively committed" to outside transactions. This would require neither any new flags in tuple headers or in pg_clog nor a globally visible pg_subtrans structure. But it only works, if we can commit a main transaction and all its subtransactions atomically, which is only possible if we hold a long lasting lock. Did we agree that we don't want this? All other solutions require a parent xid lookup at least during the time span while a subtransaction is marked "tentatively committed" and not yet known to be "finally committed". IIRC we have three proposals how the "tentatively committed" status can be shown to outside transactions: (B) Two flags in the tuple header (one for xmin, one for xmax) telling us "the xid is a subtransaction". I don't like this very much, because it's not in Normal Form: "is a subtransaction" is NOT a property of a tuple. OTOH we can declare it a denormalization for performance reasons (we don't have to look up the parend xid, if the flag is not set.) (C) Explicitly use the fourth possible status in pg_clog for "tentatively committed". (Performance hack: replace with "finally committed" as soon as the xid is visible to all active transactions.) (D) Only one kind of "committed" in pg_clog; always look for a parent in pg_subtrans; for performance reasons integrate pg_subtrans into pc_clog. Tom brought up the snapshot visibility problem which applies to B, C, and D. While each of these proposals can be implemented (relatively) straight forward, the Black Art is: When and how can we modify the stored state to avoid repeated parent xid lookups? We'll find out ... ServusManfred
Manfred Koizar wrote: > [Sorry for the delay. I'm a bit busy these days.] > > On Fri, 29 Nov 2002 15:57:17 -0500 (EST), Bruce Momjian > <pgman@candle.pha.pa.us> wrote: > >> BTW, I think this *forces* us to replace the sub xid with the > >> respective main xid in a tuple header, when we set > >> XMIN/MAX_IS_COMMITTED. Otherwise we'd have to look for the main xid, > >> whenever a tuple is touched. > > > >Sorry, I don't follow this. > > Probably because we've mixed up several proposals. I'll try to pick > them apart below. OK. > These cases are completely different. If a (main or sub-) transaction > is rolled back, its effects are invisible to all transactions; this > status is immediately effective and final. OTOH a subtransaction > commit is only tentative. It becomes effective when the main > transaction commits. (And the subtransaction's status turns to > aborted, when the main transaction aborts.) Right. > >I see no > >need to lookup the parent, and in fact we could clear the clog parent > >xid offset so there is no way to access the parent anymore. > > While a subtransaction is seen as "tentatively committed" other > transactions have to look up its parent to find out its effective > status. Right. And we need those lookups to parent from the start of the subtransaction until the commit/abort of the main transaction. If it aborts, we can shorten that, but if they are all commit, we have to wait, and they have to be visible because other backends have to know if the "Running" status of the transaction is still associated with an active transaction, and we can only stamp one xid on a backend because shared memory is limited. > Proposal A was: Never show "tentatively committed" to outside > transactions. This would require neither any new flags in tuple > headers or in pg_clog nor a globally visible pg_subtrans structure. > But it only works, if we can commit a main transaction and all its > subtransactions atomically, which is only possible if we hold a long > lasting lock. Did we agree that we don't want this? Again, we still need the lookup to main transaction for other backend lookups, so this idea is dead, and we don't want locking. > All other solutions require a parent xid lookup at least during the > time span while a subtransaction is marked "tentatively committed" and > not yet known to be "finally committed". IIRC we have three proposals > how the "tentatively committed" status can be shown to outside > transactions: Yes. > (B) Two flags in the tuple header (one for xmin, one for xmax) telling > us "the xid is a subtransaction". I don't like this very much, > because it's not in Normal Form: "is a subtransaction" is NOT a > property of a tuple. OTOH we can declare it a denormalization for > performance reasons (we don't have to look up the parend xid, if the > flag is not set.) I see no reason to do that when we have that 4th state available in pg_clog. They are going to lookup the xid status anyway, so why not check that "is subtransaction" status at that point too. Of course, we can't mark "IS COMMITTED" on the tuple until the main transaction commits, but that is simple logic. > (C) Explicitly use the fourth possible status in pg_clog for > "tentatively committed". (Performance hack: replace with "finally > committed" as soon as the xid is visible to all active transactions.) Yes, I think this is the only way to go. If we need that 4th state later, we can refactor the code, but for our purposes now, it is useful. > (D) Only one kind of "committed" in pg_clog; always look for a parent > in pg_subtrans; for performance reasons integrate pg_subtrans into > pc_clog. Seems that 4th state makes this an easy optimization, causing zero overhead for backends that _don't_ use subtransactions, except for backends looking up the status of other backends with subtransactions. > Tom brought up the snapshot visibility problem which applies to B, C, > and D. > > While each of these proposals can be implemented (relatively) straight > forward, the Black Art is: When and how can we modify the stored > state to avoid repeated parent xid lookups? We'll find out ... I think there is now general agreement that we want a separate table to store parent xids for subtransactions that is only looked up when that 4th clog state is set, and once the main transaction commits, all those 4th state clog entries can be cleaned up to simple commit. We can also expire the pg_subtrans table for any xids less than the lowest running backend xid, which is pretty significant optimization. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > I am going to work on nested transactions for 7.4. > > My goal is to first implement nested transactions: > > BEGIN; > SELECT ... > BEGIN; > UPDATE; > COMMIT; > DELETE; > COMMIT; > > and later savepoints (Oracle): > > > BEGIN; > SELECT ... > SAVEPOINT t1; > UPDATE; > SAVEPOINT t2; > DELETE; > ROLLBACK TO SAVEPOINT t2; > COMMIT; > > I assume people want both. Yep. My question is: how do you see cursors working with nested transactions? Right now you can't do cursors outside of transactions. Subtransactions would complicate things a bit: BEGIN; DECLARE CURSOR x ... BEGIN (is cursor x visible here? What are the implications of using it if it is?) ... COMMIT; ... COMMIT; Would we only allow cursors within the innermost transactions? If we allow them anywhere else, why retain the requirement that they be used within transactions at all? -- Kevin Brown kevin@sysexperts.com
Kevin Brown wrote: > My question is: how do you see cursors working with nested > transactions? > > Right now you can't do cursors outside of transactions. > Subtransactions would complicate things a bit: > > BEGIN; > DECLARE CURSOR x ... > BEGIN > (is cursor x visible here? What are the implications of using it if > it is?) > ... > COMMIT; > ... > COMMIT; > > > Would we only allow cursors within the innermost transactions? If we > allow them anywhere else, why retain the requirement that they be used > within transactions at all? I talked to Tom and he feels it will be too hard to rollback a subtransaction that affects cursors so we will disable use of cursors in subtransactions, at least in the first implementation. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073