Thread: Nested transactions: low level stuff
Here is what I've put together from various messages posted
November/December last year.

. Subtrans trees
. Transaction states
. Tuple visibility
. HeapTupleSatisfiesUpdate
. Shortcuts
. Still missing
. Objections and suggestions

Subtrans trees
--------------
A subtrans tree holds information about sub-transaction to parent
transaction relationships.  The nodes are transaction ids; edges denote
the fact that the child node represents a sub-transaction of the
transaction represented by the parent node.  The root of a subtrans
tree represents a main transaction.

A subtrans tree is navigable from child to parent, i.e. if you have a
sub-transaction xid you can find its parent efficiently, but there is
no fast and easy way to enumerate all children of a given parent xid.

The subtrans dependencies have to be visible to all backends.  At least
two possible implementations have been discussed.

Together with clog:  Store a 30 bit parent transaction offset with the
two state bits.  Offset 0 means the transaction is a main transaction.

In a separate data structure:  pg_subtrans is a huge (possibly sparsely
populated) array of parent xids, indexed by child xid.  This array is
divided into manageable chunks, represented by files "pg_subtrans_NNNN".
These files are only created when necessary.  At any time only a tiny
part of the whole array is kept in shared buffers.  This concept is
similar, or almost equal, to pg_clog, which is an array of double bits.
Most of its implementation can be stolen from the clog code.

We believe that parent xid lookups are far less frequent than
transaction state lookups and therefore favour the second
implementation.

Transaction states
------------------
Let the currently unused fourth state in pg_clog indicate a committed
subtransaction.  In pg_clog there are two bits per transaction, abort
(a) and commit (c), with the following meaning:

 a c
 0 0  transaction in progress; the owning backend knows whether it is a
      main or a sub-transaction, other backends don't care
 1 0  aborted; nobody cares whether main or sub-transaction
 0 1  committed main transaction, or - with shortcut 2 - a
      sub-transaction that's known committed to all active transactions
 1 1  committed sub-transaction; have to look for parent in pg_subtrans

Updating the state of a single (main or sub) transaction is atomic,
just like it is now.  This solves the atomicity problem without
requiring long locks.  Here is what is to be done for some operations:

BEGIN main transaction:
. Get a new xid (no change to current behaviour).
. pg_clog[xid] is still 00, meaning active.
. pg_subtrans[xid] is still 0, meaning no parent.

BEGIN subtransaction:
. Push current transaction info onto local stack.
. Get a new xid.
. Record parent xid in pg_subtrans[xid].
. pg_clog[xid] is still 00.

ROLLBACK subtransaction:
. Set pg_clog[xid] to 10 (aborted).
. Optionally set clog bits for subsubtransactions to 10 (*).
. Pop transaction info from stack.

COMMIT subtransaction:
. Set pg_clog[xid] to 11 (committed subtrans).
. Don't touch clog bits for subsubtransactions!
. Pop transaction info from stack.

ROLLBACK main transaction:
. Set pg_clog[xid] to 10 (aborted).
. Optionally set clog bits for subtransactions to 10 (*).

COMMIT main transaction:
. Set pg_clog[xid] to 01 (committed).
. Don't touch clog bits for subtransactions!

(*) This can only be done if there is a transaction-local data
structure holding information about all subtransactions, or we have a
subtrans tree implementation that is navigable from parent to child.
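To make the two-bit encoding and the child-to-parent lookup concrete,
here is a small self-contained C sketch.  It is illustrative only: the
in-memory arrays stand in for pg_clog and pg_subtrans, and the names
are invented for the example rather than taken from the real clog/slru
code.

/*
 * Illustrative sketch only: the in-memory arrays below stand in for
 * pg_clog and pg_subtrans; they are not the real implementation.
 */
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* The four two-bit states described above: high bit = abort (a),
 * low bit = commit (c). */
typedef enum ClogStatus
{
    CLOG_IN_PROGRESS  = 0,      /* 00: still running              */
    CLOG_COMMITTED    = 1,      /* 01: committed main transaction */
    CLOG_ABORTED      = 2,      /* 10: aborted                    */
    CLOG_SUBCOMMITTED = 3       /* 11: committed subtransaction   */
} ClogStatus;

#define MAX_XID 4096            /* toy limit, just for the sketch */

static ClogStatus    clog[MAX_XID];     /* stands in for pg_clog      */
static TransactionId subtrans[MAX_XID]; /* stands in for pg_subtrans; */
                                        /* 0 means "no parent"        */

/* BEGIN subtransaction: record the parent, leave the clog state at 00. */
static void
subtrans_begin(TransactionId child, TransactionId parent)
{
    subtrans[child] = parent;
    clog[child] = CLOG_IN_PROGRESS;
}

/* Follow child -> parent links until we reach the main transaction. */
static TransactionId
subtrans_top_parent(TransactionId xid)
{
    while (subtrans[xid] != InvalidTransactionId)
        xid = subtrans[xid];
    return xid;
}

With this encoding, changing a single transaction's state is one store
into its clog slot, which is the atomicity property the proposal relies
on.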
Anyway, whether this optional marking of subsubtransactions is done or
not does not influence the consistency of this proposal.

Tuple visibility
----------------
Visibility check by other transactions:  If a tuple is visited and its
XMIN/XMAX_IS_COMMITTED/ABORTED flags are not yet set, pg_clog has to be
consulted to find out the status of the inserting/deleting transaction
xid.  If pg_clog[xid] is ...

  00: transaction still active
  10: aborted
  01: committed
  11: committed subtransaction, have to check parent

Only in this last case do we have to get parentxid from pg_subtrans.
Now we look at pg_clog[parentxid].  If we find ...

  00: parent still active, so xid is considered active, too
  10: parent aborted, so xid is considered aborted
  01: parent committed, so xid is considered committed
  11: recursively check grandparent(s) ...

If a visibility check sees a transaction as committed, it checks
whether the main-transaction xid is visible in the current snapshot.
Fortunately we get the main-transaction xid as a by-product of looking
up the parentxid states.

If we set XMIN/MAX_IS_COMMITTED in a tuple header, we have to replace a
sub-transaction xid in xmin or xmax respectively with the
main-transaction xid at the same time.  Otherwise we'd have to look for
the main xid whenever a tuple is touched.

XXX "at the same time" may need a bit more thought regarding locking.

HeapTupleSatisfiesUpdate
------------------------
If an UPDATE is about to change a tuple touched by another active
transaction, it waits for the other transaction to commit or abort.  We
must always wait for the main transaction, not the subtransaction.

Shortcuts
---------
Under certain conditions the 1/1 state can be replaced with 0/1 or 1/0.
This could save a lot of parent lookups.

Shortcut 1:  As soon as a transaction is rolled back, all its
subtransactions are considered aborted, too.  So their pg_clog bits can
be set to 10 and there is no need to look up their parent any more.
This can be done
. immediately on ROLLBACK (if the implementation has a list of
  committed subtransactions handy),
. as a side effect of a later visibility check,
. by VACUUM.

Shortcut 2:  If a committed subtransaction has only committed
(grand)parents up to the committed main transaction, and the main
transaction is visible to all running transactions, i.e. the main
transaction's xid is before RecentGlobalXmin (*), then the
subtransaction's pg_clog bits can be set to 01 and further parent xid
lookups are not needed any more.  This can be done
. as a side effect of a later visibility check,
. by VACUUM.

(*) AFAICS the statements "main transaction's xid is before
RecentGlobalXmin" and "subtransaction's xid is before RecentGlobalXmin"
are equivalent.

The point here is:  pg_subtrans is only queried if we find 11 in
pg_clog; we try to replace 11 as soon as possible, but only if it is
safe to do so.  Status 11 is short-lived; it is replaced with the final
status by one or more of
- ROLLBACK of the parent transaction
- a later visibility check (as a side effect)
- VACUUM

pg_subtrans cleanup:  A pg_subtrans_NNNN file covers a known range of
transaction ids.  As soon as none of these transactions has a pg_clog
status of 11, the pg_subtrans_NNNN file can be removed.  VACUUM can do
this, and it won't even have to check the heap.

Still missing
-------------
. Visibility checks for tuples inserted/deleted by a (sub)transaction
  belonging to the current transaction tree (have to check the local
  transaction stack whenever we look at an xid or switch to a parent
  xid)
. WAL
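Building on the toy clog[] / subtrans[] arrays and the ClogStatus enum
from the sketch above, here is an equally illustrative version of the
parent-chasing status lookup described under "Tuple visibility".  It is
not the real tqual.c code; the struct and function names are invented
for the example.

/*
 * Illustrative only; relies on the clog[]/subtrans[] toy arrays and the
 * ClogStatus enum from the previous sketch.
 */
typedef struct XidStatusResult
{
    ClogStatus    status;    /* effective state: 00, 01 or 10   */
    TransactionId main_xid;  /* topmost ancestor we stopped at  */
} XidStatusResult;

static XidStatusResult
resolve_xid_status(TransactionId xid)
{
    XidStatusResult r;

    /* Chase parents only while we keep finding 11 (committed subxact). */
    while (clog[xid] == CLOG_SUBCOMMITTED)
        xid = subtrans[xid];    /* the pg_subtrans lookup */

    r.status = clog[xid];       /* 00 active, 01 committed, 10 aborted  */
    r.main_xid = xid;           /* by-product: normally the main xid,   */
                                /* needed for the snapshot check        */
    return r;
}

Note that the loop is entered at all only when the first clog lookup
returns 11, which is exactly the property shortcuts 1 and 2 try to
exploit by rewriting 11 to a final state as early as possible.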
Objections and suggestions
--------------------------
Tom:
> Also, using a 11 state doubles the amount of pg_clog I/O needed to
> commit a collection of subtransactions.  You have to write 11 as the
> state of each commitable subtransaction, then commit the parent (write
> 01 as its state), then go back and change the state of each
> subtransaction to 01.  (Whether this last bit is done as part of parent
> transaction commit, or during later inspections of the state of the
> subtransaction, doesn't change the argument.)

Yes, there may be more pg_clog I/O.  But I don't think that it is
doubled.  I'd expect some locality; after all, one page holds the
status bits for 32K transactions.

Tom:
> I think it would be preferable to use only three states: active,
> aborted, committed.  The parent commit protocol is (1) write 10 as state
> of each aborted subtransaction (this should be done as soon as the
> subtransaction is known aborted, rather than delaying to parent commit);
> (2) write 01 as state of parent (this is the atomic commit); (3) write
> 01 as state of each committed subtransaction.  Readers who see 00 must
> check the parent state; if the parent is committed then they have to go
> back and recheck the child state (to see if it became "aborted" after
> they looked).  This halves the write traffic during a commit, at the
> cost of additional read traffic when subtransaction state is checked in
> a narrow window after the time of parent transaction commit.  I believe
> it nets out to be faster.

Pro: Saves the fourth state for something different.
Con: Does a pg_subtrans lookup even for main transactions (at least
     until xid < RecentGlobalXmin).
Con: Needs subtrans tree navigation from parent to child.

It's noticeably past midnight as I write this ... will reconsider that
approach after I have had some sleep.

Tom:
> But wait a moment.  The child xid is by definition always greater than
> (newer than) its parent.  So if we consult pg_clog and find the
> transaction marked committed, *and* the xid is before the window of XIDs
> in our snapshot, then even if it's not a top-level xid, the parent must
> be before our window too.  Therefore we can conclude the transaction is
> visible in our snapshot.  So indeed there is a good-size range of xids
> for which we'll never need to chase the parent link: everything before
> the RecentGlobalXmin computed by GetSnapshotData.

I think this has been integrated into this proposal (cf. shortcut 2).

> (We do have to set subtransactions to committed during parent commit
> to make this true; we can't update them lazily.  But I think that's
> okay.)

This is not clear to me ... I'll try again later ...

------------------------------

That's all for now, folks.  Feel free to comment.  Sorry for the long
post.  Would you prefer this kind of stuff on a web page and just a
short note with the URL to the list?

Servus
Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> If we set XMIN/MAX_IS_COMMITTED in a tuple header, we have to replace
> a sub-transaction xid in xmin or xmax respectively with the
> main-transaction xid at the same time.  Otherwise we'd have to look
> for the main xid, whenever a tuple is touched.

This worries me --- it changes a safe operation (OR'ing in a commit
bit) into an unsafe one that requires exclusive lock on the page
containing the tuple.  I'm also concerned that we'd now need a WAL
entry to record the xid change (are we dependent on this change
occurring for correctness?  or is it only performance?)

Perhaps it would be better to leave the tuple-commit bit unset until we
have been able to change the clog state to 01 ("committed to
everyone").

> Tom:
>> I think it would be preferable to use only three states: active,
>> aborted, committed.

> Con: Needs subtrans tree navigation from parent to child.

But only in the backend owning the transaction; there's no need for
shared state that allows it.

> Sorry for the long post.  Would you prefer such kind of stuff on a web
> page and just a short note with the URL to the list?

No.  This way it gets into the list archives.

			regards, tom lane
On Wed, 19 Mar 2003 11:18:38 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Manfred Koizar <mkoi-pg@aon.at> writes:
>> If we set XMIN/MAX_IS_COMMITTED in a tuple header, we have to replace
>> a sub-transaction xid in xmin or xmax respectively with the
>> main-transaction xid at the same time.  Otherwise we'd have to look
>> for the main xid, whenever a tuple is touched.
>
> This worries me --- it changes a safe operation (OR'ing in a commit bit)
> into an unsafe one that requires exclusive lock on the page containing
> the tuple.

[Only talking about xmin here, but everything refers to xmax as well.]

I was hoping we could set xmin atomically without holding a lock.  If
we can, we first set xmin to the main xid.  The new state is still
consistent; now it looks as if the change has been made directly by the
main transaction and not by one of its subtransactions, which is ok
after the main transaction has committed (we are only talking about
cases where all interesting transactions have committed).  As a second
step we update the commit bit which is as safe as it is now.

I see no concurrency problems.  If two or more backends visit the same
tuple, they either write the same value to the same position which
doesn't hurt, or one sees the other's changes which is a good thing.

So this boils down to whether setting the value of a properly aligned
32 bit variable in shared memory is an atomic operation on all
supported platforms.  I don't know enough about compilers to answer
this question.

> I'm also concerned that we'd now need a WAL entry to record
> the xid change

If the sequence is "first update xmin, then set the commit bit", we
never have an inconsistent state.  And if the change is lost, it can be
redone by the next backend visiting the tuple.  So I think we don't
need a WAL entry.

> (are we dependent on this change occurring for correctness?
> or is it only performance?)

The latter.

> Perhaps it would be better to leave the tuple-commit bit unset until we
> have been able to change the clog state to 01 ("committed to everyone").

At least we can fall back to this, if we can't find out how to update
the xid safely.

Servus
Manfred
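For concreteness, here is a minimal sketch of the two-step update
described above, using an invented stand-in for the tuple header (not
the real HeapTupleHeaderData) and assuming that an aligned 32-bit store
is atomic.  As Vadim points out further down the thread, even with
atomic stores a multi-CPU machine may make the two stores visible in
the opposite order unless a write memory barrier is used, so treat this
purely as an illustration of the intended ordering.

/* Illustrative sketch only: ToyTupleHeader and XMIN_COMMITTED are
 * made-up stand-ins, not the real tuple header or infomask bits. */
#include <stdint.h>

typedef uint32_t TransactionId;

#define XMIN_COMMITTED 0x0100       /* invented hint-bit value */

typedef struct ToyTupleHeader
{
    TransactionId xmin;             /* inserting xid */
    uint16_t      infomask;         /* hint bits     */
} ToyTupleHeader;

static void
set_xmin_committed(ToyTupleHeader *tup, TransactionId main_xid)
{
    /*
     * Step 1: overwrite the subtransaction xid with the main xid.  The
     * tuple is still consistent; it merely looks as if the main
     * transaction had inserted it directly.
     */
    tup->xmin = main_xid;

    /*
     * Step 2: set the hint bit, which is exactly as safe (or unsafe) as
     * setting it is today.  On weakly ordered hardware a write barrier
     * would be needed between the two steps.
     */
    tup->infomask |= XMIN_COMMITTED;
}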
Manfred Koizar <mkoi-pg@aon.at> writes:
> If the sequence is "first update xmin, then set the commit bit", we
> never have an inconsistent state.  And if the change is lost, it can
> be redone by the next backend visiting the tuple.

Not if the subtransaction log state has been removed as no longer
needed.  I think a WAL entry will be essential.  (An alternative might
be to keep subtransaction state as long as we keep pg_clog state, but
that's pretty unpleasant too.)

I think we'd be a lot better off to design this so that we don't need
to alter heap tuple xmin values...

			regards, tom lane
> I see no concurrency problems.  If two or more backends visit the same
> tuple, they either write the same value to the same position which
> doesn't hurt, or one sees the other's changes which is a good thing.

AFAIR, on multi-CPU platforms it's possible that the second transaction
could see the COMMITTED state but still the old (subtrans id) value in
xmin: it's not guaranteed that changes made on CPU1 (V1 was changed
first, then V2 was changed) will appear in the same order on CPU2 (V2
may come first, then V1).

Vadim
On Wed, 19 Mar 2003 13:00:07 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Manfred Koizar <mkoi-pg@aon.at> writes:
>> And if the change is lost, it can
>> be redone by the next backend visiting the tuple.
>
> Not if the subtransaction log state has been removed as no longer
> needed.

But this problem is not triggered by a tuple that has its xmin changed
by a visitor and then loses that change again.  We'd have the same
problems with tuples that have never been visited (*).  So we must make
sure that pg_subtrans segments are not discarded as long as they are
needed.

(*) I guess your argument is: VACUUM makes sure that all tuples have
been visited before it discards pg_subtrans segments.

With my 4-state proposal VACUUM can decide whether a pg_subtrans
segment is still needed by only looking at pg_clog.

> I think a WAL entry will be essential.

I'm still in doubt, but it's moot (see below).

> I think we'd be a lot better off to design this so that we don't need to
> alter heap tuple xmin values...

If Vadim remembers correctly, we cannot safely change xmin unless we
want to grab a write lock.  Ok, we'll not change xmin and we'll not set
the commit bit before xmin is visible to all, if xmin is a
subtransaction.  We can always add this performance hack later, if
someone finds a safe implementation ...

Servus
Manfred
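The pg_subtrans cleanup rule ("a segment can be removed as soon as no
xid it covers is still in state 11") is simple enough to sketch.
Reusing the toy clog[] array and ClogStatus enum from the first sketch,
and with invented segment boundaries, it might look like this.
Illustrative only; not the actual VACUUM or segment-truncation code.

#include <stdbool.h>

/* Reuses the toy clog[] array and CLOG_SUBCOMMITTED from the first
 * sketch above. */
static bool
subtrans_segment_still_needed(TransactionId seg_start,
                              TransactionId seg_end)
{
    TransactionId xid;

    for (xid = seg_start; xid <= seg_end; xid++)
    {
        /* Any xid still marked 11 may yet need its parent looked up. */
        if (clog[xid] == CLOG_SUBCOMMITTED)
            return true;
    }
    return false;       /* VACUUM may remove the pg_subtrans_NNNN file */
}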
Sorry I have a basic question.
Was there any consensus we would introduce nested transactions
(or savepoints) in the way currently discussed ?

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Sorry I have a basic question.
> Was there any consensus we would introduce nested transactions
> (or savepoints) in the way currently discussed ?

I think we are a long way from saying we can or will actually do it.
Error handling and resource management (eg locks) are a couple of other
huge cans of worms that have yet to be opened.  But certainly a solid
design for the transaction logging and tuple validity checking is a
necessary step.

My feeling is that the right way to proceed is to nail down a paper
design for each of the major aspects of the problem, before anyone
actually spends any time coding.  There would be little point in
implementing subtransaction logging if we don't know how to do the
other things.

			regards, tom lane
Tom Lane wrote:
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > Sorry I have a basic question.
> > Was there any consensus we would introduce nested transactions
> > (or savepoints) in the way currently discussed ?
>
> I think we are a long way from saying we can or will actually do it.
> Error handling and resource management (eg locks) are a couple of other
> huge cans of worms that have yet to be opened.  But certainly a solid
> design for the transaction logging and tuple validity checking is a
> necessary step.

Is the way to undo data rejected already ?

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue wrote:
> Is the way to undo data rejected already ?

You mean abort subtransactions?  Each subtransaction gets its own
transaction id, so we just mark that as aborted --- there is no undo of
tuples, though I had originally suggested that approach years ago.

--
Bruce Momjian
Bruce Momjian wrote:
> You mean abort subtransactions?  Each subtransaction gets its own
> transaction id, so we just mark that as aborted --- there is no undo of
> tuples, though I had originally suggested that approach years ago.

Vadim planned to implement the savepoints functionality
using UNDO mechanism.  AFAIR it was never denied explicitly.

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue wrote:
> Vadim planned to implement the savepoints functionality
> using UNDO mechanism.  AFAIR it was never denied explicitly.

If you go to the TODO.detail/transactions archive, there was discussion
of using UNDO, and most felt that there were too many problems of
having to manage the undo system, and that assigning a separate
transaction id to every subtransaction was cleaner and more closely
matched our existing system.  It also has zero time for undo, which is
the case we have for main transactions now.

--
Bruce Momjian
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Bruce Momjian wrote:
>> You mean abort subtransactions?  Each subtransaction gets its own
>> transaction id, so we just mark that as aborted --- there is no undo of
>> tuples, though I had originally suggested that approach years ago.

> Vadim planned to implement the savepoints functionality
> using UNDO mechanism.  AFAIR it was never denied explicitly.

Given all the flak we got about WAL growth during the time we had that
code enabled, I think there's no chance that UNDO will be the preferred
path.  It's not workable with big transactions.

There are other problems besides WAL bloat, too.  I realized while I
was working on the btree code a few weeks ago that it's fundamentally
unfriendly to UNDO, because there are some operations you'd want to
UNDO (viz, insertion of a leaf item pointing at a heap tuple) and some
you would not (viz, splitting of index pages and subsequent insertion
of items into upper tree levels).  But the same WAL entry might include
both kinds of operation.  This could be got round, perhaps, but that
code is overcomplicated already ...

			regards, tom lane
Tom Lane wrote:
> Given all the flak we got about WAL growth during the time we had that
> code enabled, I think there's no chance that UNDO will be the preferred
> path.  It's not workable with big transactions.
>
> There are other problems besides WAL bloat, too.  I realized while I
> was working on the btree code a few weeks ago that it's fundamentally
> unfriendly to UNDO, because there are some operations you'd want to
> UNDO (viz, insertion of a leaf item pointing at a heap tuple) and some
> you would not (viz, splitting of index pages and subsequent insertion
> of items into upper tree levels).  But the same WAL entry might include
> both kinds of operation.  This could be got round, perhaps, but that
> code is overcomplicated already ...

I assumed the UNDO would have had to be in a separate place, or allow
compression of the WAL file to keep needed UNDO stuff but get rid of
unneeded stuff --- it was all quite complicated.

--
Bruce Momjian
Bruce Momjian wrote:
> If you go to the TODO.detail/transactions archive, there was discussion
> of using UNDO, and most felt that there were too many problems of
> having to manage the undo system,

This is closely related to the basics of PostgreSQL.
Please don't decide it implicitly.

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue wrote:
> This is closely related to the basics of PostgreSQL.
> Please don't decide it implicitly.

We took a vote and UNDO lost --- do you want to do another vote?

--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> There are other problems besides WAL bloat, too.

> I assumed the UNDO would have had to be in a separate place, or allow
> compression of the WAL file to keep needed UNDO stuff but get rid of
> unneeded stuff --- it was all quite complicated.

There is a mechanism in the XLOG code to distinguish "in transaction"
from "out of transaction" WAL entries.  The thing I had not realized
before working on the btree code is that that mechanism is inadequate
for UNDO.

			regards, tom lane
Bruce Momjian wrote:
> We took a vote and UNDO lost --- do you want to do another vote?

Sorry I missed the vote.  Where is it ?

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
> Given all the flak we got about WAL growth during the time we had that
> code enabled, I think there's no chance that UNDO will be the preferred
> path.  It's not workable with big transactions.

Somehow it's working in other DB systems.

> There are other problems besides WAL bloat, too.  I realized while I
> was working on the btree code a few weeks ago that it's fundamentally
> unfriendly to UNDO, because there are some operations you'd want to
> UNDO (viz, insertion of a leaf item pointing at a heap tuple) and some
> you would not (viz, splitting of index pages and subsequent insertion
> of items into upper tree levels).  But the same WAL entry might include
> both kinds of operation.  This could be got round, perhaps, but that
> code is overcomplicated already ...

Each access-method requires specific UNDO code (like REDO).  Once
again, it works in other DB-es.

Vadim
Hiroshi Inoue wrote:
> Sorry I missed the vote.  Where is it ?

I can't find the vote in the archive.  As I remember, Vadim and a few
others liked UNDO, while more liked the current approach.

--
Bruce Momjian
"Vadim Mikheev" <vmikheev@reveredata.com> writes: >> Given all the flak we got about WAL growth during the time we had that >> code enabled, I think there's no chance that UNDO will be the preferred >> path. It's not workable with big transactions. > Somehow it's working in other DB systems. Isn't limited UNDO segment size one of the most-hated management problems for Oracle databases? I don't see why we should want to duplicate one of their worst problems. regards, tom lane
> >> Given all the flak we got about WAL growth during the time we had that
> >> code enabled, I think there's no chance that UNDO will be the preferred
> >> path.  It's not workable with big transactions.
>
> > Somehow it's working in other DB systems.
>
> Isn't limited UNDO segment size one of the most-hated management
> problems for Oracle databases?  I don't see why we should want to
> duplicate one of their worst problems.

How is it different from disk-space appetite of our non-overwriting
smgr?!  Before transaction commits you have to keep old data somewhere
anyway.  Let's not limit size of UNDO segments and that's it.

Vadim
On Wed, Mar 19, 2003 at 09:38:21PM -0500, Tom Lane wrote:
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > Sorry I have a basic question.
> > Was there any consensus we would introduce nested transactions
> > (or savepoints) in the way currently discussed ?
>
> I think we are a long way from saying we can or will actually do it.
> Error handling and resource management (eg locks) are a couple of other
> huge cans of worms that have yet to be opened.

Why do you say lock management is a can of worms?  Locks are released
at xact end by means of LockReleaseAll(), and this doesn't release all
the locks the backend holds: only the locks that have the same xid that
the committing transaction has (the mechanism has to be corrected for
xact abort though).

This is exactly what is needed for nested transactions: the ending
subtransaction should release the locks it has, but the parent
transaction should keep the ones it had.

What's more, there's provision in LockAcquire() (storage/lmgr/lock.c
line 602) for making it possible that a backend holds a lock more than
once in different XIDs.

I know I'm missing something, but what is it?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"No es bueno caminar con un hombre muerto"
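A toy sketch of the behaviour Alvaro describes -- releasing only the
locks tagged with the ending (sub)transaction's xid while leaving the
parent's locks alone.  The list structure and names are invented for
the example; this is not the actual LockReleaseAll() code in
storage/lmgr/lock.c.

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

typedef struct HeldLock
{
    TransactionId    xid;       /* (sub)transaction that took the lock */
    int              lock_id;   /* stand-in for the lock's identity    */
    struct HeldLock *next;
} HeldLock;

/*
 * Release only the locks taken under ending_xid; locks taken under
 * other xids (for instance the parent transaction) stay held.
 */
static HeldLock *
release_locks_for_xid(HeldLock *held, TransactionId ending_xid)
{
    HeldLock **prev = &held;

    while (*prev != NULL)
    {
        if ((*prev)->xid == ending_xid)
        {
            HeldLock *victim = *prev;

            *prev = victim->next;       /* unlink from the list */
            /* ... release the underlying lock manager lock here ... */
            free(victim);
        }
        else
            prev = &(*prev)->next;
    }
    return held;
}

Whether it is correct to do this on subtransaction commit (rather than
only on abort) is exactly the question Tom raises in the next message.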
Tom mentioned to me that if you deadlock in a subtransaction, you have
to release the locks of the subtransaction.  That could be tricky.

--
Bruce Momjian
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> Why do you say lock management is a can of worms?  Locks are released
> at xact end by means of LockReleaseAll(), and this doesn't release all
> the locks the backend holds: only the locks that have the same xid that
> the committing transaction has (the mechanism has to be corrected for
> xact abort though).

Well, I'm not sure about this.  Is it okay to release a child xact's
locks when the child commits?  Usually the point of holding a lock till
commit is to ensure that other xacts will see your change as committed
when they get in.  But they won't see the child's work as committed if
its parent is still open.  On the other hand, an aborted child's locks
had better get released (consider case where child is aborting to get
out of a deadlock condition).  This needs careful thought.

There are indeed some first-cut provisions in the lock code for
multiple transactions owned by a backend, but it'd be dangerous to
assume that they are either correct or complete.  The only case that's
tested is for VACUUM to hold a lock across two transactions --- and
this lock will not be held in the face of an error, so it's not an
accurate representation of nested xacts anyway.

Also see LW locks, which have no such management infrastructure ...

			regards, tom lane
On Thu, Mar 20, 2003 at 01:40:44AM -0500, Tom Lane wrote:
> There are indeed some first-cut provisions in the lock code for
> multiple transactions owned by a backend, but it'd be dangerous to
> assume that they are either correct or complete.  The only case that's
> tested is for VACUUM to hold a lock across two transactions --- and
> this lock will not be held in the face of an error, so it's not an
> accurate representation of nested xacts anyway.

Well, the only way to see if they are right or wrong is testing them.
I will be trying to completely understand the transaction/block states
so I can implement the needed state machinery for nested transactions;
with this we can play with locks and the rest of the resources.  I
think this transaction state game is the easiest part of nested
transactions.

> Also see LW locks, which have no such management infrastructure ...

You can't end a transaction holding one of those; doing so is a
programming error.  The only way it is allowed is when elog(ERROR) is
called.  For that I propose that held_lwlocks is replaced with

typedef struct held_lwlocks
{
	TransactionId	xid[MAX_SIMUL_LWLOCKS];
	LWLockId	lid[MAX_SIMUL_LWLOCKS];
	int		num_locks_held;
} held_lwlocks;

and LWReleaseAll() modified appropriately.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Nunca confiaré en un traidor.  Ni siquiera si el traidor lo he creado yo"
(Barón Vladimir Harkonnen)
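As a rough illustration of what the "LWReleaseAll() modified
appropriately" part could look like with the struct proposed above: the
function name, the tracking variable and the details below are
invented, and it assumes a release primitive along the lines of
LWLockRelease(LWLockId).  A sketch, not actual code.

/* Sketch only; assumes the held_lwlocks struct proposed above and a
 * release primitive like LWLockRelease(LWLockId).  Names are invented. */
static held_lwlocks lw_held;

static void
LWLockReleaseForXid(TransactionId ending_xid)
{
    int     i = 0;

    while (i < lw_held.num_locks_held)
    {
        if (lw_held.xid[i] == ending_xid)
        {
            LWLockRelease(lw_held.lid[i]);

            /* Compact the arrays by moving the last entry down;
             * release order is ignored in this sketch. */
            lw_held.num_locks_held--;
            lw_held.xid[i] = lw_held.xid[lw_held.num_locks_held];
            lw_held.lid[i] = lw_held.lid[lw_held.num_locks_held];
        }
        else
            i++;
    }
}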
Bruce Momjian wrote:
> I can't find the vote in the archive.  As I remember, Vadim and a few
> others liked UNDO, while more liked the current approach.

As far as I remember there was no such vote or decision.  Note that I'm
not particularly on the UNDO side, but I don't think that the currently
discussed way is much better than UNDO.  Please make the advantages and
disadvantages clear and let me understand the meaning of this thread.

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
If there was no official vote, the conclusion came from the discussion
that almost everyone wanted subtransactions without UNDO.

I don't want to rehash it.  If you want a vote, let's vote.

Who wants subtransactions with UNDO and who wants it with a separate
transaction id for every subtransaction?

--
Bruce Momjian
I found the discussion, all 146 messages:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&threadm=200105231527.f4NFRvG07790%40candle.pha.pa.us&rnum=1&prev=/groups%3Fhl%3Den%26lr%3D%26ie%3DISO-8859-1%26q%3Dvote%2Bsubtransaction%26btnG%3DGoogle%2BSearch%26meta%3Dgroup%253Dcomp.databases.postgresql.*

The discussion was more general and dealt with vacuum and undo.  The
basic discussion for subtransactions is whether you want to add UNDO
logic just to do subtransactions, or whether it is easier to deal with
multiple transaction ids per backend.  Most thought the second option
was much easier.

In fact, I had proposed a simpler UNDO capability that revisited tuples
and set their XID to a fixed aborted XID to clean up aborted
subtransactions, but most now like the multiple XID solution.

--
Bruce Momjian
> If there was no official vote, the conclusion came from the discussion
> that almost everyone wanted subtransactions without UNDO.
>
> I don't want to rehash it.  If you want a vote, let's vote.
>
> Who wants subtransactions with UNDO and who wants it with a separate
> transaction id for every subtransaction?

Don't mess up things, Bruce - UNDO is not for subtransactions only!
UNDO would allow immediate storage cleanup and vacuum would
not be required anymore.  Subtransactions/savepoints would be just a
"by-effect" of UNDO.  (And, btw, how would you implement "implicit"
savepoints with the "separate subtrans id" approach?)

But do we need any voting, actually?  Is there anybody who wants to, and
is ready to, implement UNDO functionality?  No?  Then there is nothing
to vote about.  (Though I personally consider "subtrans id-s" as
"messing up a messy transaction system".  Messing up is always easier
than re-designing.)

Vadim
Vadim Mikheev wrote:
> Don't mess up things, Bruce - UNDO is not for subtransactions only!
> UNDO would allow immediate storage cleanup and vacuum would
> not be required anymore.

Yes, Vadim is right.  The UNDO discussion was about much more than
subtransactions; it was actually a discussion comparing UNDO and the
free-space map approach.  It would help with subtransactions, but it is
only tangentially related.  I think the issue was:

	Do we want UNDO or FSM?

We chose FSM, but UNDO is still an option.

	Do we want UNDO just for subtransactions?

That was pretty easily defeated, though I made an argument that you
could do UNDO pretty cheaply when you have WAL ensuring crash recovery.

--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Do we want UNDO just for subtransactions?
> That was pretty easily defeated, though I made an argument that you
> could do UNDO pretty cheaply when you have WAL ensuring crash recovery.

That argument was what got us into the early-7.1 WAL bloat problems.
I don't think it's "pretty cheap" to have to hold the entire WAL for
the length of your longest-running transactions.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Do we want UNDO just for subtransactions?
> > That was pretty easily defeated, though I made an argument that you
> > could do UNDO pretty cheaply when you have WAL ensuring crash recovery.
>
> That argument was what got us into the early-7.1 WAL bloat problems.
> I don't think it's "pretty cheap" to have to hold the entire WAL for
> the length of your longest-running transactions.

With my idea, you wouldn't have to keep WAL around.  Each backend would
keep a list of tids or the relid (if lots of rows are changed) in local
memory and UNDO on subtransaction abort.

--
Bruce Momjian
> Who wants subtransactions with UNDO and who wants it with a separate
> transaction id for every subtransaction?

I think there is at least one special case that would largely profit
from UNDO (or some other identical mechanism), namely an insert that
causes a constraint violation.

The standard way to program an "insert or update" is to do the one
command that will succeed in more cases first, then on failure do the
other.  In PG this currently has to be done inefficiently by first
doing the update and then doing the insert.

If we had implicit subtransactions, I think people would start using
the standard approach.  It would probably not be too nice if that would
always leave dead tuples and index entries around.

Andreas
> In fact, I had proposed a simpler UNDO capability that revisited tuples
> and set their XID to a fixed aborted XID to clean up aborted
> subtransactions, but most now like the multiple XID solution.

I think that for the implicit subtransactions we will want (with error
codes coming), using a different xid for every command inside a
transaction is not so sexy, no ?

Andreas