Thread: Two-phase commit issues
I've started to look seriously at Heikki's patch for two-phase commit. There are a few issues that probably deserve discussion: * The major missing issue that I've come across so far is that subtransaction and multixact state isn't preserved across a crash. Assuming that we want to store only top-level XIDs in the shared-memory list of prepared XIDs (which I think is important), it is essential that crash restart rebuild the pg_subxact status for prepared transactions. The subxacts of a prepared xact have to be seen as still running, and they won't be unless the subxact links are there. Since subxact.c is designed to wipe all its state on restart, we need to recreate those entries. Fortunately this doesn't seem hard: the state file for a prepared xact will include all of its subxact XIDs, and we can just do SubTransSetParent() on them while rereading the state file. (AFAICS it's sufficient to make each subxact link directly to the top XID, even if there was a more complex hierarchy originally.) Similarly, we've got to reconstruct MultiXactIds that any prepared xacts are members of, else row-level locks taken out by prepared xacts won't be enforced correctly. I think this can be handled if we add to the state files a list of all MultiXactIds that each prepared xact belongs to, and then during restart forcibly recreate those MultiXactIds. (They would only be rebuilt with prepared XIDs, not any ordinary XIDs that might originally have been members.) This seems to require some new code in multixact.c, but not anything fundamentally difficult --- Alvaro, do you see any likely problems in this stuff? * The patch is designed to dump state files into WAL as well as onto disk. Why? Wouldn't it be better just to write and fsync the state file before reporting successful prepare? That would get rid of the need for checkpoint-time fsyncs. * I'm inclined to think that the "gid" identifiers for prepared transactions ought to be SQL identifiers (names), not string literals. 
Was there a particular reason for making them strings? * What are we going to do with GUC variables? My feeling is that the only sane answer is that PREPARE is the same as COMMIT as far as local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect on GUC state. Otherwise it's really unclear what to do. Consider

    SET myvar = foo;
    BEGIN;
    SET myvar = bar;
    PREPARE gid;
    SHOW myvar; -- what do you see ... foo or bar?
    SET myvar = baz; -- is this even legal?
    ROLLBACK PREPARED gid;
    SHOW myvar; -- now what do you see ... foo or baz?

Since local GUC changes aren't going to be saved/restored across a crash anyway, I can't see a point in doing anything really complex. * There are some fairly ugly cases associated with creation and deletion of temporary tables as well. I think we might want to just decree that you can't PREPARE a transaction that included creating or dropping a temp table. Does anyone have much of a problem with that? regards, tom lane
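The subtransaction point in the message above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the parent map, `toy_set_parent`, `toy_recover_prepared` and `toy_topmost` are hypothetical stand-ins invented for illustration, not PostgreSQL code. The only thing taken from the message is the idea of relinking every subxact XID from the state file directly to the top-level XID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t ToyTransactionId;

#define TOY_INVALID_XID ((ToyTransactionId) 0)
#define TOY_SLOTS 1024              /* toy capacity, not a PostgreSQL constant */

/* Toy stand-in for the subtransaction map: parent XID indexed by child XID. */
static ToyTransactionId toy_parent[TOY_SLOTS];

/* Model of "wipe all state on restart". */
static void toy_reset(void)
{
    memset(toy_parent, 0, sizeof(toy_parent));
}

/* Hypothetical analogue of SubTransSetParent(): record child -> parent. */
static void toy_set_parent(ToyTransactionId child, ToyTransactionId parent)
{
    assert(child < TOY_SLOTS);
    toy_parent[child] = parent;
}

/*
 * While re-reading one prepared transaction's state file, link every listed
 * subxact straight to the top-level XID. The original nesting is not kept.
 */
static void toy_recover_prepared(ToyTransactionId top_xid,
                                 const ToyTransactionId *subxids,
                                 size_t nsubxids)
{
    for (size_t i = 0; i < nsubxids; i++)
        toy_set_parent(subxids[i], top_xid);
}

/* Follow parent links until an XID with no recorded parent is reached. */
static ToyTransactionId toy_topmost(ToyTransactionId xid)
{
    assert(xid < TOY_SLOTS);
    while (toy_parent[xid] != TOY_INVALID_XID)
        xid = toy_parent[xid];
    return xid;
}
```

Calling `toy_reset()` and then `toy_recover_prepared(100, subs, n)` makes `toy_topmost()` resolve each listed subxact to 100, which is all a visibility check needs in order to see the subxact as part of a still-running prepared transaction.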
Hi, One thing I would suggest is to start a global transaction at BEGIN, not at PREPARE. That is the way to be compliant with XA. Thanks Joe On 5/18/05 2:15 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > I've started to look seriously at Heikki's patch for two-phase commit. > There are a few issues that probably deserve discussion: > [...]
On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote: > I've started to look seriously at Heikki's patch for two-phase commit. Hum. I started a few days ago doing some reviewing, with the intention of correcting some things here and there in order to present it all to you later, with a pre-filter to get some bugs out. > There are a few issues that probably deserve discussion: > > * The major missing issue that I've come across so far is that > subtransaction and multixact state isn't preserved across a crash. [...] > (AFAICS it's sufficient to make each subxact link directly to the top > XID, even if there was a more complex hierarchy originally.) Right, we don't care about the hierarchy; we know all those subXids were committed. > Similarly, we've got to reconstruct MultiXactIds that any prepared > xacts are members of, else row-level locks taken out by prepared xacts > won't be enforced correctly. I think this can be handled if we add to > the state files a list of all MultiXactIds that each prepared xact > belongs to, and then during restart forcibly recreate those > MultiXactIds. (They would only be rebuilt with prepared XIDs, not any > ordinary XIDs that might originally have been members.) This seems to > require some new code in multixact.c, but not anything fundamentally > difficult --- Alvaro, do you see any likely problems in this stuff? I'm not sure if it affects in any way that a Xid=1, which participates in a MultiXactId is seen as not prepared when Xid=2 prepares, which also participates in the same MultiXactId; if Xid=1 is prepared later, the MultiXactId needs to be restored with both Xids as participants. > * The patch is designed to dump state files into WAL as well as onto > disk. Why? Wouldn't it be better just to write and fsync the state > file before reporting successful prepare? That would get rid of the > need for checkpoint-time fsyncs. I made the same observation. 
> * I'm inclined to think that the "gid" identifiers for prepared > transactions ought to be SQL identifiers (names), not string literals. > Was there a particular reason for making them strings? Ditto. > * There are some fairly ugly cases associated with creation and deletion > of temporary tables as well. I think we might want to just decree that > you can't PREPARE a transaction that included creating or dropping a > temp table. Does anyone have much of a problem with that? Does this affect any of the other things that use the direct-fsync-no-WAL path in the smgr? -- Alvaro Herrera (<alvherre[a]surnet.cl>) "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
Alvaro Herrera <alvherre@surnet.cl> writes: > On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote: >> Similarly, we've got to reconstruct MultiXactIds that any prepared >> xacts are members of, else row-level locks taken out by prepared xacts >> won't be enforced correctly. I think this can be handled if we add to >> the state files a list of all MultiXactIds that each prepared xact >> belongs to, and then during restart forcibly recreate those >> MultiXactIds. (They would only be rebuilt with prepared XIDs, not any >> ordinary XIDs that might originally have been members.) > I'm not sure if it affects in any way that a Xid=1, which participates > in a MultiXactId is seen as not prepared when Xid=2 prepares, which also > participates in the same MultiXactId; if Xid=1 is prepared later, the > MultiXactId needs to be restored with both Xids as participants. What I had in mind was that each prepared xact's state file would just list the MultiXactIds it belongs to. While re-reading the state files after a crash, we'd construct the opposite lists (ie, which xacts belong to each MultiXactId) and then write appropriate entries into pg_multixact at completion of the re-read. I don't think it matters what order the xacts got prepared in. >> * There are some fairly ugly cases associated with creation and deletion >> of temporary tables as well. I think we might want to just decree that >> you can't PREPARE a transaction that included creating or dropping a >> temp table. Does anyone have much of a problem with that? > Does this affect any of the other things that use the direct-fsync-no-WAL > path in the smgr? I don't think so. It's not fsync that is at issue, really --- what I'm concerned about is operations that occur at commit time. For instance, consider

    CREATE TEMP TABLE foo (...) ON COMMIT DELETE ROWS;
    BEGIN;
    DROP TABLE foo;
    PREPARE gid;

foo has an entry in the backend's on-commit-actions list, which the DROP marks for deletion at commit.
It is unclear what to do with that entry at prepare. We can't really leave it active since the table can't be touched afterwards (the DROP took an exclusive lock, which we no longer own). But simply forgetting it is not good either; what if the prepared xact is later rolled back? Worse, what if some other backend does that rollback? If it was us doing the ROLLBACK PREPARED, we could at least in theory resurrect the ON COMMIT item, but there's no communication path to tell us to do so when someone else does the rollback. More generally, anything like this implies that a transaction that is no longer ours is holding locks on our temp tables. This is Really Bad. (Consider what happens if our backend tries to exit --- it'll want to delete those temp tables.) I am more than half tempted to put some kind of test into LockPersistAll to reject attempts to persist any lock of any kind on a temp table. I suppose the ideal solution would be something like what I was just suggesting for GUC: as far as temp tables go, a PREPARE is a COMMIT, and none of the resources associated with the temp table get assigned to the prepared xact. I am not sure how do-able that is, though. regards, tom lane
On Wed, May 18, 2005 at 07:29:38PM -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@surnet.cl> writes: > > On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote: > >> Similarly, we've got to reconstruct MultiXactIds that any prepared > >> xacts are members of, else row-level locks taken out by prepared xacts > >> won't be enforced correctly. I think this can be handled if we add to > >> the state files a list of all MultiXactIds that each prepared xact > >> belongs to, and then during restart forcibly recreate those > >> MultiXactIds. (They would only be rebuilt with prepared XIDs, not any > >> ordinary XIDs that might originally have been members.) > > > I'm not sure if it affects in any way that a Xid=1, which participates > > in a MultiXactId is seen as not prepared when Xid=2 prepares, which also > > participates in the same MultiXactId; if Xid=1 is prepared later, the > > MultiXactId needs to be restored with both Xids as participants. > > What I had in mind was that each prepared xact's state file would just > list the MultiXactIds it belongs to. Hm, this assumes the transaction knows what MultiXactIds it belongs to. This is not true, is it? I'm not sure how to find that out. > >> * There are some fairly ugly cases associated with creation and deletion > >> of temporary tables as well. I think we might want to just decree that > >> you can't PREPARE a transaction that included creating or dropping a > >> temp table. Does anyone have much of a problem with that? > > > Does this affect any of the other things that use the direct-fsync-no-WAL > > path in the smgr? > > I don't think so. It's not fsync that is at issue, really --- what I'm > concerned about is operations that occur at commit time. I think I confused the issue with using local buffers instead of shared buffers, for example where btree creation skips WAL. But certainly it doesn't have anything to do with it. > For instance, consider > CREATE TEMP TABLE foo (...) 
ON COMMIT DELETE ROWS; > BEGIN; > DROP TABLE foo; > PREPARE gid; > foo has an entry in the backend's on-commit-actions list, which the DROP > marks for deletion at commit. It is unclear what to do with that entry > at prepare. We can't really leave it active since the table can't be > touched afterwards (the DROP took an exclusive lock, which we no longer > own). But simply forgetting it is not good either; what if the prepared > xact is later rolled back? Worse, what if some other backend does that > rollback? If it was us doing the ROLLBACK PREPARED, we could at least > in theory resurrect the ON COMMIT item, but there's no communication > path to tell us to do so when someone else does the rollback. Hmm. I think not being able to use temp tables is an important restriction. I can see the implementation issue though. > More generally, anything like this implies that a transaction that is no > longer ours is holding locks on our temp tables. This is Really Bad. > (Consider what happens if our backend tries to exit --- it'll want to > delete those temp tables.) I am more than half tempted to put some kind > of test into LockPersistAll to reject attempts to persist any lock of > any kind on a temp table. Maybe the restriction could be lighter -- what if the prepared transaction inserts tuples on a temp table, for example. It's inconsistent, I think, that a temp table could have tuples on it that suddenly appear when some other backend commits my prepared transaction. > I suppose the ideal solution would be something like what I was just > suggesting for GUC: as far as temp tables go, a PREPARE is a COMMIT, > and none of the resources associated with the temp table get assigned > to the prepared xact. I am not sure how do-able that is, though. That'd require labelling tuples in temp tables with a different Xid, than non-temp tables, no? It'd get strange very quickly. 
-- Alvaro Herrera (<alvherre[a]surnet.cl>) "La persona que no quería pecar / estaba obligada a sentarse en duras y empinadas sillas / desprovistas, por cierto de blandos atenuantes" (Patricio Vogel)
Alvaro Herrera <alvherre@surnet.cl> writes: > On Wed, May 18, 2005 at 07:29:38PM -0400, Tom Lane wrote: >> What I had in mind was that each prepared xact's state file would just >> list the MultiXactIds it belongs to. > Hm, this assumes the transaction knows what MultiXactIds it belongs to. > This is not true, is it? I'm not sure how to find that out. [ thinks about that for a bit... ] I had been thinking we could just track it locally in each backend, but that won't do for the case where someone adds you to a MultiXactId without your knowledge. Seems like we'd have to actually scan the contents of pg_multixact? Yech. > Maybe the restriction could be lighter -- what if the prepared > transaction inserts tuples on a temp table, for example. It's > inconsistent, I think, that a temp table could have tuples on it that > suddenly appear when some other backend commits my prepared transaction. Yeah, there are all sorts of interesting problems there :-(. I think we'd be best off to punt for the moment. I think we could enforce that a transaction being PREPAREd hasn't touched any temp tables at all, by checking that it holds no locks on such tables. regards, tom lane
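The check proposed at the end of the message above (refuse PREPARE if the transaction holds any lock on a temp table) could look roughly like the following. This is a minimal sketch, assuming a simplified, hypothetical list of held locks; `ToyHeldLock` and `toy_can_prepare` are invented names and do not correspond to the real lock manager structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one lock held by the transaction. */
typedef struct ToyHeldLock
{
    unsigned int relation_id;   /* which table is locked */
    bool         is_temp_rel;   /* true if that table is a temp table */
} ToyHeldLock;

/*
 * Return true if PREPARE may proceed, i.e. none of the locks held by the
 * transaction is on a temp table. Any lock at all on a temp table is taken
 * as evidence that the transaction touched it.
 */
static bool toy_can_prepare(const ToyHeldLock *locks, size_t nlocks)
{
    assert(locks != NULL || nlocks == 0);

    for (size_t i = 0; i < nlocks; i++)
    {
        if (locks[i].is_temp_rel)
            return false;
    }
    return true;
}
```

Keying the test on held locks rather than on a list of performed operations has the advantage, in this toy model at least, that no separate bookkeeping of "touched" tables is needed.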
On Wed, 18 May 2005, Tom Lane wrote: > * The major missing issue that I've come across so far is that > subtransaction and multixact state isn't preserved across a crash. > Assuming that we want to store only top-level XIDs in the shared-memory > list of prepared XIDs (which I think is important), it is essential that > crash restart rebuild the pg_subxact status for prepared transactions. > The subxacts of a prepared xact have to be seen as still running, and > they won't be unless the subxact links are there. Since subxact.c is > designed to wipe all its state on restart, we need to recreate those > entries. Fortunately this doesn't seem hard: the state file for a > prepared xact will include all of its subxact XIDs, and we can just > do SubTransSetParent() on them while rereading the state file. (AFAICS > it's sufficient to make each subxact link directly to the top XID, even > if there was a more complex hierarchy originally.) Similarly, we've got > to reconstruct MultiXactIds that any prepared xacts are members of, else > row-level locks taken out by prepared xacts won't be enforced correctly. > I think this can be handled if we add to the state files a list of all > MultiXactIds that each prepared xact belongs to, and then during restart > forcibly recreate those MultiXactIds. (They would only be rebuilt with > prepared XIDs, not any ordinary XIDs that might originally have been > members.) This seems to require some new code in multixact.c, but not > anything fundamentally difficult --- Alvaro, do you see any likely > problems in this stuff? The subtransaction part is in fact there already, and it's done just like you described. RecoverPreparedTransactions function reads the subxids from the state file and calls SubTransSetParent for them. As Alvaro pointed out elsewhere, the multixacts are harder because a backend doesn't know which multixactids it belongs to. 
AFAICS, the most straightforward solution is to xlog every CreateMultixact call, so that the multixact slru files can be completely reconstructed on recovery. > * The patch is designed to dump state files into WAL as well as onto > disk. Why? Wouldn't it be better just to write and fsync the state > file before reporting successful prepare? That would get rid of the > need for checkpoint-time fsyncs. Performance and correctness. There mustn't be a valid state file on the disk before the WAL entries of that transaction are on disk. Otherwise, the recovery might recover a transaction that in fact aborted right after it wrote the state file. If we fsync the WAL prepare record first, and state file second, a crash in between would make it impossible to recover the transaction though the WAL says it's prepared. WAL logging the state file completely saves us one fsync. The state files are usually small, say < 1 kB, so the tradeoff to write it twice and save one fsync is probably well worth it. Third, we have to cater for PITR. I haven't given it much thought, but if we want to do log shipping and PITR, I believe we must have everything in the WAL. > * I'm inclined to think that the "gid" identifiers for prepared > transactions ought to be SQL identifiers (names), not string literals. > Was there a particular reason for making them strings? Sure. No Reason. While you're at it, do you think it's possible to make it unlimited size? I couldn't think of a simple way. > * What are we going to do with GUC variables? My feeling is that > the only sane answer is that PREPARE is the same as COMMIT as far as > local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect > on GUC state. Otherwise it's really unclear what to do. Consider > SET myvar = foo; > BEGIN; > SET myvar = bar; > PREPARE gid; > SHOW myvar; -- what do you see ... foo or bar? > SET myvar = baz; -- is this even legal? > ROLLBACK PREPARED gid; > SHOW myvar; -- now what do you see ... foo or baz? 
> Since local GUC changes aren't going to be saved/restored across a > crash anyway, I can't see a point in doing anything really complex. > > * There are some fairly ugly cases associated with creation and deletion > of temporary tables as well. I think we might want to just decree that > you can't PREPARE a transaction that included creating or dropping a > temp table. Does anyone have much of a problem with that? I think the safest way to handle the GUC case as well is to just refuse to prepare a transaction that has changed local GUC variables. Another possibility is to rethink the contract of PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED. If PREPARE TRANSACTION would put the backend into a state where you can't do anything other than COMMIT/ROLLBACK the prepared transaction, we could do more sensible things with GUC and temp tables. That would have complications of its own though. What would happen if another backend then tries to COMMIT/ROLLBACK the transaction the original backend is still tied to? - Heikki
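The crash-ordering argument in the message above (state file versus WAL prepare record) can be written down as a small decision table. This is a sketch of the argument as stated there, not of the patch itself; `ToyCrashOutcome` and `toy_classify` are hypothetical names, and the three booleans are a deliberately crude model of what is durable on disk at the moment of the crash.

```c
#include <stdbool.h>

/* What crash recovery finds for one transaction, in this toy model. */
typedef enum ToyCrashOutcome
{
    TOY_NOTHING_TO_RECOVER,     /* no prepare record, no state file */
    TOY_RECOVERABLE,            /* prepare is durable and state is available */
    TOY_PHANTOM_PREPARED_XACT,  /* state file exists but WAL never said "prepared" */
    TOY_STATE_LOST              /* WAL says "prepared" but the state is gone */
} ToyCrashOutcome;

/*
 * wal_has_prepare:    the WAL prepare record reached disk before the crash
 * state_file_on_disk: a valid state file reached disk before the crash
 * wal_embeds_state:   the prepare record itself carries the state file contents
 */
static ToyCrashOutcome toy_classify(bool wal_has_prepare,
                                    bool state_file_on_disk,
                                    bool wal_embeds_state)
{
    if (!wal_has_prepare)
    {
        /* State file fsynced first, crash before the WAL record was flushed. */
        return state_file_on_disk ? TOY_PHANTOM_PREPARED_XACT
                                  : TOY_NOTHING_TO_RECOVER;
    }

    /* The prepare record is durable: the state must come from somewhere. */
    if (state_file_on_disk || wal_embeds_state)
        return TOY_RECOVERABLE;

    /* WAL record fsynced first, crash before the state file was flushed. */
    return TOY_STATE_LOST;
}
```

With `wal_embeds_state` set, the two bad outcomes need only one flush to rule out: once the prepare record is durable, the state can always be rebuilt from it.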
Heikki Linnakangas <hlinnaka@iki.fi> writes: > As Alvaro pointed out elsewhere, the multixacts are harder because a > backend doesn't know which multixactids it belongs to. AFAICS, the most > straightforward solution is to xlog every CreateMultixact call, so that > the multixact slru files can be completely reconstructed on recovery. I realized this morning that in fact it *can't* know that, since even after a particular transaction commits it's still possible for others to add it to new multixacts. In the case of a prepared xact it would continue to get added to new multixacts indefinitely :-(. So the idea of recording info about this in the state files is clearly a loser. I think we will indeed have to start xlogging multixact operations. > Third, we have to cater for PITR. I haven't given it much thought, but if > we want to do log shipping and PITR, I believe we must have everything in > the WAL. Hmm. All your other arguments for WAL-logging a prepare are bogus, but this one seems real. (It's also a reason why multixact stuff needs to be xlogged, I guess.) >> * I'm inclined to think that the "gid" identifiers for prepared >> transactions ought to be SQL identifiers (names), not string literals. >> Was there a particular reason for making them strings? > Sure. No Reason. While you're at it, do you think it's possible to make it > unlimited size? I couldn't think of a simple way. Actually, one reason for wanting them to be identifiers is so that there's a principled reason for saying what the max length is ;-) > I think the safest way to handle the GUC case as well is to just refuse to > prepare a transaction that has changed local GUC variables. That seems unnecessarily restrictive. > Another possibility is to rethink the contract of PREPARE TRANSACTION and > COMMIT/ROLLBACK PREPARED. 
If PREPARE TRANSACTION would put the backend to > a state where you can't do anything else than COMMIT/ROLLBACK the prepared > transaction, we could do more sensible things with GUC and temp tables. > That would have complications of its own though. What would happen if > another backend then tries to COMMIT/ROLLBACK the transaction the original > backend is still tied to? Yeah, I do not think this is a useful answer. Allowing the commit to happen somewhere else and restricting what a prepared xact can do with temp tables seems much more useful in practice. regards, tom lane
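The conclusion reached above, that multixact creation has to be logged so membership can be rebuilt by replay rather than from per-transaction state files, can be pictured with a toy replay loop. This is only a sketch: `ToyMultiCreateRecord`, `ToyMultiTable`, `toy_replay` and `toy_is_member` are hypothetical, and real WAL records and SLRU pages look nothing like these fixed-size arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MULTIS  16      /* toy capacity */
#define TOY_MAX_MEMBERS 8       /* toy capacity */

/* One logged "multixact created" event: the new id and its member XIDs. */
typedef struct ToyMultiCreateRecord
{
    uint32_t multi_id;
    size_t   nmembers;
    uint32_t members[TOY_MAX_MEMBERS];
} ToyMultiCreateRecord;

/* In-memory table rebuilt during recovery. */
typedef struct ToyMultiTable
{
    ToyMultiCreateRecord entries[TOY_MAX_MULTIS];
    size_t               nentries;
} ToyMultiTable;

/* Rebuild the table from scratch by applying the logged records in order. */
static void toy_replay(ToyMultiTable *table,
                       const ToyMultiCreateRecord *records,
                       size_t nrecords)
{
    table->nentries = 0;
    for (size_t i = 0; i < nrecords; i++)
    {
        assert(table->nentries < TOY_MAX_MULTIS);
        assert(records[i].nmembers <= TOY_MAX_MEMBERS);
        table->entries[table->nentries++] = records[i];
    }
}

/* Is xid a member of the given multixact after replay? */
static bool toy_is_member(const ToyMultiTable *table,
                          uint32_t multi_id, uint32_t xid)
{
    for (size_t i = 0; i < table->nentries; i++)
    {
        if (table->entries[i].multi_id != multi_id)
            continue;
        for (size_t j = 0; j < table->entries[i].nmembers; j++)
        {
            if (table->entries[i].members[j] == xid)
                return true;
        }
    }
    return false;
}
```

Because every creation is logged, a prepared transaction that gets added to a new multixact long after it prepared is still found after replay, which is the case a list stored in its own state file could not cover.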
Tom Lane wrote: > I've started to look seriously at Heikki's patch for two-phase commit. > There are a few issues that probably deserve discussion: > > * The major missing issue that I've come across so far is that > subtransaction and multixact state isn't preserved across a crash. I am a little confused by this. How does two-phase commit add extra requirements on crash recovery? I understand a crashed server might be involved in a two-phase commit, but doesn't the transaction just roll back?
--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I am a little confused by this. How does two-phase commit add extra > requirements on crash recovery? Uh, that's more or less the entire *POINT*. Once an open transaction is prepared, it's supposed to survive a server crash. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I am a little confused by this. How does two-phase commit add extra > > requirements on crash recovery? > > Uh, that's more or less the entire *POINT*. Once an open transaction is > prepared, it's supposed to survive a server crash. Wow. This is much more than I thought we were going to do. I thought if something failed after the prepare we were just going to inform the administrator and give up. Because you are writing a state file to disk, it seems you are trying to recover from a crash and roll forward. In what cases would we actually fail to recover from a crash after a PREPARE? -- Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> Uh, that's more or less the entire *POINT*. Once an open transaction is >> prepared, it's supposed to survive a server crash. > Wow. This is much more than I thought we were going to do. If we tried to claim that anything less was two-phase commit, we'd be laughed off the face of the planet ... regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> Uh, that's more or less the entire *POINT*. Once an open transaction is > >> prepared, it's supposed to survive a server crash. > > > Wow. This is much more than I thought we were going to do. > > If we tried to claim that anything less was two-phase commit, we'd be > laughed off the face of the planet ... Well, based on past discussions, our TODO has:

    * Add two-phase commit

      This will involve adding a way to respond to commit failure by
      either taking the server into offline/read-only mode or notifying
      the administrator

As I remember, you said two-phase wasn't 100% reliable and we just needed a way to report failures. -- Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes: > As I remember, you said two-phase wasn't 100% reliable and we just > needed a way to report failures. [ Shrug... ] I remain of the opinion that 2PC is a solution in search of a problem, because it does not solve the single point of failure issue (just moves same from the database to the 2PC controller). But some people want it anyway, and they aren't going to be satisfied that we are an "enterprise grade" database until we can check off this particular bullet point. As long as the implementation doesn't impose any significant costs when not being used (which AFAICS Heikki's method doesn't), I think we gotta hold our noses and do it. regards, tom lane
Exactly. A 2PC expects every participant that makes it to the prepare-to-commit phase to survive a server restart, controller or otherwise. Anything less is not 2PC. Jordan Henderson On Fri, 2005-05-20 at 12:07 -0400, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I am a little confused by this. How does two-phase commit add extra > > requirements on crash recovery? > > Uh, that's more or less the entire *POINT*. Once an open transaction is > prepared, it's supposed to survive a server crash. > > regards, tom lane
Tom Lane wrote: > [ Shrug... ] I remain of the opinion that 2PC is a solution in search > of a problem, because it does not solve the single point of failure > issue (just moves same from the database to the 2PC controller). > But some people want it anyway, and they aren't going to be satisfied > that we are an "enterprise grade" database until we can check off this > particular bullet point. As long as the implementation doesn't impose > any significant costs when not being used (which AFAICS Heikki's method > doesn't), I think we gotta hold our noses and do it. I thought the primary reason for having 2PC is to be able to participate in a heterogeneous transaction, e.g. with a non-Postgres database/other types of resource managers? 2PC is mostly about how to make these cross-RM transactions [appear] atomic. Redundancy is not covered by the 2PC protocol. -- dave
Tom, > > [ Shrug... ] I remain of the opinion that 2PC is a solution in search > > of a problem, because it does not solve the single point of failure > > issue (just moves same from the database to the 2PC controller). > > But some people want it anyway, and they aren't going to be satisfied > > that we are an "enterprise grade" database until we can check off this > > particular bullet point. As long as the implementation doesn't impose > > any significant costs when not being used (which AFAICS Heikki's method > > doesn't), I think we gotta hold our noses and do it. 2PC is a key to supporting 3rd-party replication tools, like C-JDBC. And is useful for some other use cases, like slow-WAN-based financial transactions. We know you don't like it, Tom. ;-) -- Josh Berkus Aglio Database Solutions San Francisco
On Friday 20 May 2005 18:14, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > As I remember, you said two-phase wasn't 100% reliable and we just > > needed a way to report failures. > > [ Shrug... ] I remain of the opinion that 2PC is a solution in search > of a problem, because it does not solve the single point of failure > issue (just moves same from the database to the 2PC controller). You're right. 2PC to coordinate replicas of the same data is not that interesting. It is however most interesting when coordinating updates to different objects such as (i) a central database server and a local staging area or (ii) a database server and transactional queues in a workflow-style app. > But some people want it anyway, and they aren't going to be satisfied > that we are an "enterprise grade" database until we can check off this > particular bullet point. As long as the implementation doesn't impose > any significant costs when not being used (which AFAICS Heikki's method > doesn't), I think we gotta hold our noses and do it. It is definitely on the checklist if you're shopping for a database to go with your buzzword-compliant J2EE app server. :-) -- Jose Orlando Pereira
On Thu, 19 May 2005, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>
>>> * I'm inclined to think that the "gid" identifiers for prepared
>>> transactions ought to be SQL identifiers (names), not string literals.
>>> Was there a particular reason for making them strings?
>
>> Sure. No Reason. While you're at it, do you think it's possible to make it
>> unlimited size? I couldn't think of a simple way.
>
> Actually, one reason for wanting them to be identifiers is so that
> there's a principled reason for saying what the max length is ;-)

I took a closer look at the JTA spec and saw that the Xid, which is translated to a gid in the jdbc driver, consists of a format identifier (32-bit int), a branch qualifier (max 64 bytes) and a global transaction identifier (max 64 bytes).

That means that gid needs to hold 132 raw bytes minimum.

Also, it would be nice if the driver could send the gid as a bytea, without converting it to a string. Similar to using parameter markers and parse / bind messages with regular queries. That would require a change in the FE/BE protocol, right?

The branch qualifier and global transaction id structure comes from the OSI CCR specification. Anyone here that knows more about OSI CCR?

- Heikki
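The 132-byte figure above is just 4 + 64 + 64. A minimal Python sketch of that arithmetic follows, for illustration only. The function name `pack_xid`, the constant names, and the byte layout (big-endian format id, then gtrid, then bqual) are assumptions made for this example; they are not taken from the patch or from the JDBC driver.

```python
import struct

# Limits as quoted from the JTA spec in the message above.
FORMAT_ID_BYTES = 4    # 32-bit format identifier
MAX_GTRID_BYTES = 64   # global transaction identifier
MAX_BQUAL_BYTES = 64   # branch qualifier


def pack_xid(format_id: int, gtrid: bytes, bqual: bytes) -> bytes:
    """Concatenate the three Xid parts into one raw byte string.

    Hypothetical layout: signed big-endian 32-bit format id, gtrid, bqual.
    A reversible encoding would also have to record the gtrid and bqual
    lengths; this sketch only demonstrates the size bound.
    """
    if len(gtrid) > MAX_GTRID_BYTES:
        raise ValueError("gtrid longer than 64 bytes")
    if len(bqual) > MAX_BQUAL_BYTES:
        raise ValueError("bqual longer than 64 bytes")
    return struct.pack(">i", format_id) + gtrid + bqual


# Worst case: both variable-length parts at their maximum size.
largest = pack_xid(1, b"g" * MAX_GTRID_BYTES, b"b" * MAX_BQUAL_BYTES)
print(len(largest))        # 132 raw bytes
print(len(largest.hex()))  # 264 characters if hex-encoded into a text gid
```

The last line shows why the text-versus-bytea question matters: if the gid has to travel as a string, a simple hex encoding doubles the worst-case length.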
On Saturday 21 May 2005 03:37, Josh Berkus wrote:
> 2PC is a key to supporting 3rd-party replication tools, like C-JDBC.

I don't think C-JDBC requires 2PC for replication. Mixed up acronyms maybe? :)

--
Jose Orlando Pereira
On Tue, 7 Jun 2005, Alvaro Herrera wrote:
> On Sat, May 21, 2005 at 06:57:24PM +0300, Heikki Linnakangas wrote:
>
> Heikki,
>
>> I took a closer look at the JTA spec and saw that the Xid, which is
>> translated to a gid in the jdbc driver, consists of a format identifier
>> (32-bit int), a branch qualifier (max 64 bytes) and a global transaction
>> identifier (max 64 bytes).
>>
>> That means that gid needs to hold 132 raw bytes minimum.
>>
>> Also, it would be nice if the driver could send the gid as a bytea,
>> without converting it to a string. Similar to using parameter markers
>> and parse / bind messages with regular queries. That would require a
>> change in the FE/BE protocol, right?
>>
>> The branch qualifier and global transaction id structure comes from
>> the OSI CCR specification. Anyone here that knows more about OSI CCR?
>
> I think I'm going to try to do this by hacking the lexer some (this has
> the added benefit of me learning a little about lexers). Do you have an
> URL to those specs you mention? How authoritative they are, I mean,
> they are not the SQL spec, right?

The JTA spec:
http://java.sun.com/products/jta/

Relevant X/Open XA documents:
http://www.opengroup.org/bookstore/catalog/tp.htm

See especially page 19 of "Distributed Transaction Processing: The XA Specification"; it contains the xa.h header file that specifies the format of the transaction identifier. It matches with the format in the JTA spec, but the JTA spec also mentions the OCI CCR format which I haven't been able to find:
http://java.sun.com/products/jta/jta-1_0_1B-doc/javax/transaction/xa/Xid.html

In addition to those two, I bumped into RFC2371. It basically allows any format.

I don't have access to the SQL spec, so I can't comment on that. I'd regard the XA spec as the most authoritative standard in the field.

- Heikki
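To make the identifier format concrete, here is a rough Python sketch of a round-trippable encoding in the spirit of the xa.h layout: a format id, explicit gtrid and bqual lengths, and a shared data area of at most 128 bytes. The six-byte header, the names `encode_xid` and `decode_xid`, and the byte order are all assumptions made for this illustration; neither xa.h nor the JTA spec prescribes this wire format.

```python
import struct

XIDDATASIZE = 128   # assumed size of the shared data area (64 + 64)
MAX_PART = 64       # assumed per-part limit for gtrid and bqual

# Hypothetical header: signed 32-bit format id, gtrid length, bqual length.
_HEADER = struct.Struct(">iBB")


def encode_xid(format_id: int, gtrid: bytes, bqual: bytes) -> bytes:
    """Serialize an Xid so that it can be split apart again later."""
    if len(gtrid) > MAX_PART or len(bqual) > MAX_PART:
        raise ValueError("gtrid and bqual are limited to 64 bytes each")
    return _HEADER.pack(format_id, len(gtrid), len(bqual)) + gtrid + bqual


def decode_xid(raw: bytes):
    """Inverse of encode_xid: returns (format_id, gtrid, bqual)."""
    format_id, glen, blen = _HEADER.unpack_from(raw)
    body = raw[_HEADER.size:]
    if len(body) != glen + blen or len(body) > XIDDATASIZE:
        raise ValueError("length fields do not match the data area")
    return format_id, body[:glen], body[glen:]


raw = encode_xid(4660, b"global-txn-1", b"branch-7")
print(decode_xid(raw) == (4660, b"global-txn-1", b"branch-7"))
```

Recording the two lengths is what lets the receiver find the boundary between the global transaction id and the branch qualifier, which plain concatenation would lose.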
On 6/11/05, Heikki Linnakangas wrote:
>
> It matches with the format in the JTA spec, but the JTA spec also mentions
> the OCI CCR format

The "OSI" CCR format, which appears to refer to ISO/IEC 9805-1.

ISO/IEC 9805-1:1998
15-12-1998
Information technology - Open Systems Interconnection - Protocol for the Commitment, Concurrency and Recovery service element: Protocol specification

This standard is to be applied by reference from other specifications. Specifies a use of the ACSE, Presentation and Session services to carry the CCR semantics. Specifies the static and dynamic conformance requirements for systems implementing these procedures. Specifies the protocol elements that support the following functional units:
- static commitment;
- dynamic commitment;
- read only;
- one-phase commitment;
- cancel; and overlapped recovery.

Unfortunately that standard is not included in my university's subscription to ISO standards, so I can't tell you what it says about the format.

Jochem
On Sat, 11 Jun 2005, Jochem van Dieten wrote:
> The "OSI" CCR format, which appears to refer to ISO/IEC 9805-1.
>
> ISO/IEC 9805-1:1998
> 15-12-1998
> Information technology - Open Systems Interconnection - Protocol for
> the Commitment, Concurrency and Recovery service element: Protocol
> specification
>
> This standard is to be applied by reference from other specifications.
> Specifies a use of the ACSE, Presentation and Session services to
> carry the CCR semantics. Specifies the static and dynamic conformance
> requirements for systems implementing these procedures. Specifies the
> protocol elements that support the following functional units: -
> static commitment; - dynamic commitment; - read only; - one-phase
> commitment; - cancel; and overlapped recovery.
>
>
> Unfortunately that standard is not included in my university's
> subscription to ISO standards so I can't tell you what it says about
> the format.

Great, thanks anyway! Anyone here with access to the content?

- Heikki