Thread: 2-phase commit
Hi,

As the 7.4 beta rolls on, I thought now would be a good time to start talking about the future.

I have a potential need in the future for distributed transactions (XA). To get that from Postgres, I'd need two-phase commit, I think. There is someone working on such a project (<http://snaga.org/pgsql/>), but last time it was discussed here, it received a rather lukewarm reception (see, e.g., the thread starting at <http://archives.postgresql.org/pgsql-hackers/2003-06/msg00752.php>).

While at OSCON, I had a discussion with Joe Conway, Bruce Momjian, and Greg Sabino Mullane about 2PC. Various people expressed various opinions on the topic, but I think we agreed on the following. The relevant folks can correct me if I'm wrong:

Two-phase commit has theoretical problems, but it is implemented in several "enterprise" RDBMS. 2PC is something needed by certain kinds of clients (especially those with transaction managers), so if PostgreSQL doesn't have it, PostgreSQL just won't get supported in that arena. Someone is already working on 2PC, but may feel unwanted due to the reactions last heard on the topic, and may not continue working unless he gets some support. What is a necessary condition for such support is to get some idea of what compromises 2PC might impose, and thereafter to try to determine which such compromises, if any, are acceptable ones.

I think the idea here is that, while in most cases a "pretty-good" implementation of a desirable feature might get included in the source on the grounds that it can always be improved upon later, something like 2PC has the potential to do great harm to an otherwise reliable transaction manager. So the arguments about what to do need to be aired in advance.

I (perhaps foolishly) volunteered to undertake to collect the arguments in various directions, on the grounds that I can contribute no code, but have skin made of asbestos. I thought I'd try to collect some information about what people think the problems and potentially acceptable compromises are, to see if there is some way to understand what can and cannot be contemplated for 2PC. I'll include in any such outline the remarks found in the -hackers thread referenced above. Any objections?

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
In an attempt to throw the authorities off his trail, andrew@libertyrms.info (Andrew Sullivan) transmitted:
> As the 7.4 beta rolls on, I thought now would be a good time to start
> talking about the future.
>
> I have a potential need in the future for distributed transactions
> (XA). To get that from Postgres, I'd need two-phase commit, I think.
> There is someone working on such a project
> (<http://snaga.org/pgsql/>), but last time it was discussed here, it
> received a rather lukewarm reception (see, e.g., the thread starting
> at
> <http://archives.postgresql.org/pgsql-hackers/2003-06/msg00752.php>).

Interesting/positive news on this front; the XA specification documents are now all available in PDF form "freely", from the Open Group, where they used to be fairly pricey.

<http://www.opengroup.org/publications/catalog/tp.htm>

Another notable XA documentation source is here...

<http://www.middleware.net/tuxedo/resources/XA_Documentation.html>

Two interesting implications of XA support would be that there could be some "congruence of interests" that would arise regarding two vendors:

- XA is essentially based on the API of BEA Tuxedo. I'm told they include a simple database system with Tuxedo, but nothing particularly wonderful. (Who thinks of BEA as a DBMS vendor???) They might have interest in bundling something better...

- The main Tuxedo reseller that I am aware of is PeopleSoft, who use it for their "high traffic" clients. Anyone that has seen news lately knows that they and Oracle aren't exactly "best pals" these days; having another DB option could be helpful to them...

-- 
(format nil "~S@~S" "aa454" "freenet.carleton.ca")
http://www3.sympatico.ca/cbbrowne/tpmonitor.html
"In order to make an apple pie from scratch, you must first create the
universe."  -- Carl Sagan, Cosmos
On Tue, Aug 26, 2003 at 08:04:13PM -0400, Christopher Browne wrote: > > Interesting/positive news on this front; the XA specification > documents are now all available in PDF form "freely", from the Open > Group, where they used to be fairly pricey. A step in the right direction, but AFAIC it's too little, too late. The impression I get, at least, is that it's as good as dead now: Java may use it, but it hides the details anyway so it might as well not be there--the Java way is to standardize the API but nothing that goes "on the wire". Lots of proprietary middleware uses XA, but from what I hear there are enough subtle differences to make mixing-and-matching of products risky at best--the proprietary way is to bundle products that will work at least marginally together, and relegate standards to a bullshit point in the PowerPoint presentations. "Based on industry standard" means about the same as "based on a true story." Then there's the fact that the necessary followup standards never got anywhere, and the fact that XA doesn't cope with threading really well. Don't get me wrong, XA support may well be a good thing. But at this stage, personally I'd go for a good 2PC implementation first and worry about supporting XA later. Jeroen
I haven't seen any comment on this email. From our previous discussion of 2-phase commit, there was concern that the failure modes of 2-phase commit were not solvable. However, I think multi-master replication is going to have similar non-solvable failure modes, yet people still want multi-master replication. We have had several requests for 2-phase commit in the past month. I think we should encourage the Japanese group to continue on their 2-phase commit patch to be included in 7.5. Yes, it will have non-solvable failure modes, but let's discuss them and find an appropriate way to deal with the failures. --------------------------------------------------------------------------- Andrew Sullivan wrote: > Hi, > > As the 7.4 beta rolls on, I thought now would be a good time to start > talking about the future. > > I have a potential need in the future for distributed transactions > (XA). To get that from Postgres, I'd need two-phase commit, I think. > There is someone working on such a project > (<http://snaga.org/pgsql/>), but last time it was discussed here, it > received a rather lukewarm reception (see, e.g., the thread starting > at > <http://archives.postgresql.org/pgsql-hackers/2003-06/msg00752.php>). > > While at OSCON, I had a discussion with Joe Conway, Bruce Momjian, > and Greg Sabino Mullane about 2PC. Various people expressed various > opinions on the topic, but I think we agreed on the following. The > relevant folks can correct me if I'm wrong: > > Two-phase commit has theoretical problems, but it is implemented in > several "enterprise" RDBMS. 2PC is something needed by certain kinds > of clients (especially those with transaction managers), so if > PostgreSQL doesn't have it, PostgreSQL just won't get supported in > that arena. Someone is already working on 2PC, but may feel unwanted > due to the reactions last heard on the topic, and may not continue > working unless he gets some support. What is a necessary condition > for such support is to get some idea of what compromises 2PC might > impose, and thereafter to try to determine which such compromises, if > any, are acceptable ones. > > I think the idea here is that, while in most cases a "pretty-good" > implementation of a desirable feature might get included in the > source on the grounds that it can always be improved upon later, > something like 2PC has the potential to do great harm to an otherwise > reliable transaction manager. So the arguments about what to do need > to be aired in advance. > > I (perhaps foolishly) volunteered to undertake to collect the > arguments in various directions, on the grounds that I can contribute > no code, but have skin made of asbestos. I thought I'd try to > collect some information about what people think the problems and > potentially acceptable compromises are, to see if there is some way > to understand what can and cannot be contemplated for 2PC. I'll > include in any such outline the remarks found in the -hackers thread > referenced above. Any objections? > > A > > -- > ---- > Andrew Sullivan 204-4141 Yonge Street > Liberty RMS Toronto, Ontario Canada > <andrew@libertyrms.info> M2P 2A8 > +1 416 646 3304 x110 > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > I haven't seen any comment on this email. > > From our previous discussion of 2-phase commit, there was concern that > the failure modes of 2-phase commit were not solvable. However, I think > multi-master replication is going to have similar non-solvable failure > modes, yet people still want multi-master replication. > > We have had several requests for 2-phase commit in the past month. I > think we should encourage the Japanese group to continue on their > 2-phase commit patch to be included in 7.5. Yes, it will have > non-solvable failure modes, but let's discuss them and find an > appropriate way to deal with the failures. FWIW, Oracle 8's manual for the recovery of a distributed tx where the coordinator never comes back on line is: https://www.ifi.uni-klu.ac.at/Public/Documentation/oracle/product/8.0.3/doc/server803/A54643_01/ch_intro.htm#7783 "If a database must be recovered to a point in the past, Oracle's recovery facilities allow database administrators at other sites to return their databases to the earlier point in time also. This ensures that the global database remains consistent." So it seems, for Oracle 8 at least, PITR is the method of recovery for cohorts after unrecoverable coordinator failure. Ugly and yet probably a prerequisite. Mike Mascari mascarm@mascari.com
Mike Mascari wrote: > Bruce Momjian wrote: > > I haven't seen any comment on this email. > > > > From our previous discussion of 2-phase commit, there was concern that > > the failure modes of 2-phase commit were not solvable. However, I think > > multi-master replication is going to have similar non-solvable failure > > modes, yet people still want multi-master replication. > > > > We have had several requests for 2-phase commit in the past month. I > > think we should encourage the Japanese group to continue on their > > 2-phase commit patch to be included in 7.5. Yes, it will have > > non-solvable failure modes, but let's discuss them and find an > > appropriate way to deal with the failures. > > FWIW, Oracle 8's manual for the recovery of a distributed tx where the > coordinator never comes back on line is: > > https://www.ifi.uni-klu.ac.at/Public/Documentation/oracle/product/8.0.3/doc/server803/A54643_01/ch_intro.htm#7783 > > "If a database must be recovered to a point in the past, Oracle's > recovery facilities allow database administrators at other sites to > return their databases to the earlier point in time also. This ensures > that the global database remains consistent." > > So it seems, for Oracle 8 at least, PITR is the method of recovery for > cohorts after unrecoverable coordinator failure. Yep, I assume PITR would be the solution for most failure cases --- very ugly of course. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > From our previous discussion of 2-phase commit, there was concern that > the failure modes of 2-phase commit were not solvable. However, I think > multi-master replication is going to have similar non-solvable failure > modes, yet people still want multi-master replication. No. The real problem with 2PC in my mind is that its failure modes occur *after* you have promised commit to one or more parties. In multi-master, if you fail you know it before you have told the client his data is committed. regards, tom lane
On Tue, Sep 09, 2003 at 08:38:41PM -0400, Bruce Momjian wrote: > > Yep, I assume PITR would be the solution for most failure cases --- very > ugly of course. Anything can be broken in some way, if bad luck is willing to work hard enough. In at least one, ah, competing company I know of, employees are allowed by the legal people to say "assured" but not "guaranteed" for precisely this reason. First thing is an acceptable failure mode, then you try to narrow its chances of occurring. And if worst comes to worst, one example of an acceptable failure mode is "when in danger or doubt, run in circles, scream and shout." Jeroen
> > From our previous discussion of 2-phase commit, there was concern that
> > the failure modes of 2-phase commit were not solvable. However, I think
> > multi-master replication is going to have similar non-solvable failure
> > modes, yet people still want multi-master replication.
>
> No. The real problem with 2PC in my mind is that its failure modes
> occur *after* you have promised commit to one or more parties. In
> multi-master, if you fail you know it before you have told the client
> his data is committed.

Hmm? The application cannot take the first-phase commit as its commit info. It needs to wait for the second-phase commit. The second phase is only finished when all coservers have reported back. 2PC is synchronous.

The problems with 2PC arise when, after the second-phase commit was sent to all servers and before all report back, one of them becomes unreachable/down ... (did it receive and do the 2nd commit or not?) Such a transaction must stay open until the coserver is reachable again or an administrator has committed/aborted it.

It is multi-master replication that usually has an asynchronous mode for performance, and there the trouble starts.

Andreas
Zeugswetter Andreas SB SD wrote:
> > > From our previous discussion of 2-phase commit, there was concern that
> > > the failure modes of 2-phase commit were not solvable. However, I think
> > > multi-master replication is going to have similar non-solvable failure
> > > modes, yet people still want multi-master replication.
> >
> > No. The real problem with 2PC in my mind is that its failure modes
> > occur *after* you have promised commit to one or more parties. In
> > multi-master, if you fail you know it before you have told the client
> > his data is committed.
>
> Hmm? The application cannot take the first-phase commit as its commit info.
> It needs to wait for the second-phase commit. The second phase is only
> finished when all coservers have reported back. 2PC is synchronous.
>
> The problems with 2PC arise when, after the second-phase commit was sent to
> all servers and before all report back, one of them becomes unreachable/down
> ... (did it receive and do the 2nd commit or not?) Such a transaction must
> stay open until the coserver is reachable again or an administrator has
> committed/aborted it.
>
> It is multi-master replication that usually has an asynchronous mode for
> performance, and there the trouble starts.

Let me diagram this so we can see the issues.  Normal operation is:

        Master                  Slave
        ------                  -----
        commit ready-->
                                <--OK
        commit done--->
                                <--OK
        completed

One possible failure is:

        Master                  Slave
        ------                  -----
        commit ready-->
                                <--OK
        commit done--->
        dies here               stuck waiting

Another possible failure is:

        Master                  Slave
        ------                  -----
        commit ready-->
                                <--OK
        dies here               stuck waiting

Are these the issues?  Can't we just add GUC timeouts to cause the commit to fail, and the slave to stop waiting?

I suppose a problem is:

        Master                  Slave
        ------                  -----
        commit ready-->
                                <--OK
        sleep                   stuck waiting, times out
        commit done

Could we allow slaves to check if the backend is still alive, perhaps by asking the postmaster, similar to what we do with the cancel signal --- that way, the slave would never time out and always wait if the master was alive.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
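To make the failure windows in the diagrams above concrete, here is a minimal sketch of the exchange (illustrative Python, not PostgreSQL code; the Master/Slave classes and message names are invented for this discussion):

    # Minimal, illustrative sketch of the message exchange diagrammed above.
    # This is NOT PostgreSQL code; the classes and message names are invented
    # purely to show where the two failure windows fall.

    import enum


    class Msg(enum.Enum):
        COMMIT_READY = "commit ready"   # phase 1: "prepare"
        OK = "OK"
        COMMIT_DONE = "commit done"     # phase 2: "commit"


    class Slave:
        def __init__(self):
            self.prepared = set()
            self.committed = set()

        def handle(self, msg, xid):
            if msg is Msg.COMMIT_READY:
                # Phase 1: the slave may still refuse here (constraint failure,
                # out of disk space, ...).  Once it answers OK it must be able
                # to commit later, no matter what.
                self.prepared.add(xid)
                return Msg.OK
            if msg is Msg.COMMIT_DONE:
                # Phase 2: after voting OK, the slave has no choice but to commit.
                self.prepared.discard(xid)
                self.committed.add(xid)
                return Msg.OK
            raise ValueError(msg)


    class Master:
        def __init__(self, slave):
            self.slave = slave

        def two_phase_commit(self, xid):
            # Window A: if the master dies after this call but before sending
            # COMMIT_DONE, the slave is stuck holding a prepared transaction.
            if self.slave.handle(Msg.COMMIT_READY, xid) is not Msg.OK:
                return "aborted"
            # Window B: if the master dies (or the link drops) after sending
            # COMMIT_DONE but before seeing the reply, it cannot tell whether
            # the slave committed.  A naive timeout on either side can make
            # the two ends disagree, which is the problem discussed downthread.
            self.slave.handle(Msg.COMMIT_DONE, xid)
            return "committed"


    if __name__ == "__main__":
        s = Slave()
        print(Master(s).two_phase_commit(xid=42), s.committed)

The two commented windows are exactly the cases diagrammed above: a slave left holding a prepared transaction, and a master that cannot tell whether its final message was acted on.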
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Could we allow slaves to check if the backend is still alive, perhaps by
> asking the postmaster, similar to what we do with the cancel signal ---
> that way, the slave would never time out and always wait if the master
> was alive.

You're not considering the possibility of a transient communication failure.  The fact that you cannot currently contact the other guy is not proof that he's not still alive.

Example:

        Master                  Slave
        ------                  -----
        commit ready-->
                                <--OK
        commit done->XX

where "->XX" means the message gets lost due to network failure.  Now what?  The slave cannot abort; he promised he could commit, and he does not know whether the master has committed or not.  The master does not know the slave's state either; maybe he got the second message, and maybe he didn't.  Both sides are forced to keep information about the open transaction indefinitely.  Timing out on either side could yield the wrong result.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Could we allow slaves to check if the backend is still alive, perhaps by
> > asking the postmaster, similar to what we do with the cancel signal ---
> > that way, the slave would never time out and always wait if the master
> > was alive.
>
> You're not considering the possibility of a transient communication
> failure.  The fact that you cannot currently contact the other guy
> is not proof that he's not still alive.
>
> Example:
>
>         Master                  Slave
>         ------                  -----
>         commit ready-->
>                                 <--OK
>         commit done->XX
>
> where "->XX" means the message gets lost due to network failure.  Now
> what?  The slave cannot abort; he promised he could commit, and he does
> not know whether the master has committed or not.  The master does not
> know the slave's state either; maybe he got the second message, and
> maybe he didn't.  Both sides are forced to keep information about the
> open transaction indefinitely.  Timing out on either side could yield
> the wrong result.

Can't the master re-send the request after a timeout?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Fri, 26 Sep 2003, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Could we allow slaves to check if the backend is still alive, perhaps by
> > asking the postmaster, similar to what we do with the cancel signal ---
> > that way, the slave would never time out and always wait if the master
> > was alive.
>
> You're not considering the possibility of a transient communication
> failure.  The fact that you cannot currently contact the other guy
> is not proof that he's not still alive.
>
> Example:
>
>         Master                  Slave
>         ------                  -----
>         commit ready-->
>                                 <--OK
>         commit done->XX
>
> where "->XX" means the message gets lost due to network failure.  Now
> what?

'k, but isn't a lot of that a "retry" issue? We're talking TCP here, not UDP, which I *thought* was designed for transient network problems ... ?

I would think that any implementation would have a timeout/retry GUC variable associated with it ... 'if no answer in x seconds, retry up to y times' ...

If we are talking two computers sitting next to each other on a switch, you'd expect those to be low ... but if you were talking about two separate geographical locations (and yes, I realize you are adding lag to the mix with waiting for responses), you'd expect those #s to rise ...
On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote: ... > if we are talking two computers sitting next to each other on a switch, > you'd expect those to be low ... but if you were talking about two > seperate geographical locations (and yes, I realize you are adding lag to > the mix with waiting for responses), you'd expect those #s to rise ... Which I thought was the whole point of using a group communication protocol such as spread in postgresql-r. It seemed solved there... Cheers, Patrick
Patrick Welche wrote: > On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote: > ... > > if we are talking two computers sitting next to each other on a switch, > > you'd expect those to be low ... but if you were talking about two > > seperate geographical locations (and yes, I realize you are adding lag to > > the mix with waiting for responses), you'd expect those #s to rise ... > > Which I thought was the whole point of using a group communication protocol > such as spread in postgresql-r. It seemed solved there... Right, but I think we want to try to do two-phase commit without spread. Spread seems overkill for this usage. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> You're not considering the possibility of a transient communication >> failure. > Can't the master re-send the request after a timeout? Not "it can", but "it has to". The master *must* keep hold of that request forever (or until the slave responds, or until we reconfigure the system not to consider that slave valid anymore). Similarly, the slave cannot forget the maybe-committed transaction on pain of not being a valid slave anymore. You can make this work, but the resource costs are steep. For instance, in Postgres, you don't get to truncate the WAL log, for what could be a really really long time --- more disk space than you wanted to spend on WAL anyway. The locks held by the maybe-committed transaction are another potentially unpleasant problem; you can't release them, no matter what else they are blocking. regards, tom lane
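A sketch of the bookkeeping being described here, under the assumption of a small invented on-disk journal (this is not PostgreSQL's WAL; it only illustrates the cost pointed out above -- the prepared record must be made durable before the vote is sent, and can only be dropped once the final outcome is known, however long that takes):

    # Illustrative only -- not PostgreSQL's WAL.  A participant that votes "OK"
    # must remember the in-doubt transaction across crashes, so the record is
    # forced to disk *before* the vote is sent and removed only after the
    # coordinator's final decision has been applied.

    import json, os


    class InDoubtJournal:
        def __init__(self, path="indoubt.journal"):
            self.path = path

        def _load(self):
            if not os.path.exists(self.path):
                return {}
            with open(self.path) as f:
                return json.load(f)

        def _store(self, entries):
            tmp = self.path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(entries, f)
                f.flush()
                os.fsync(f.fileno())    # make it durable before promising anything
            os.replace(tmp, self.path)

        def prepare(self, xid, locks):
            entries = self._load()
            entries[str(xid)] = {"state": "prepared", "locks": locks}
            self._store(entries)        # only now may we answer "OK"

        def resolve(self, xid, outcome):
            entries = self._load()
            entries.pop(str(xid), None) # locks and WAL can be released/truncated now
            self._store(entries)
            return outcome

        def recover(self):
            # After a crash, every entry still here is an in-doubt transaction
            # whose locks must be re-taken and whose fate must be asked of the
            # coordinator -- possibly much later.
            return list(self._load())


    if __name__ == "__main__":
        j = InDoubtJournal()
        j.prepare(42, locks=["accounts"])
        print("in doubt after crash:", j.recover())
        j.resolve(42, "commit")
        print("in doubt after resolve:", j.recover())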
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> You're not considering the possibility of a transient communication > >> failure. > > > Can't the master re-send the request after a timeout? > > Not "it can", but "it has to". The master *must* keep hold of that > request forever (or until the slave responds, or until we reconfigure > the system not to consider that slave valid anymore). Similarly, the > slave cannot forget the maybe-committed transaction on pain of not being > a valid slave anymore. You can make this work, but the resource costs > are steep. For instance, in Postgres, you don't get to truncate the WAL > log, for what could be a really really long time --- more disk space > than you wanted to spend on WAL anyway. The locks held by the > maybe-committed transaction are another potentially unpleasant problem; > you can't release them, no matter what else they are blocking. I think we would need a configurable timeout to say a slave is no longer valid, like 60 seconds, and then let everyone release. We can let the administrator decide how long he wants to try to keep two hosts communicating. I don't see this as much different from multi-master replication problems. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Fri, 26 Sep 2003, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> You're not considering the possibility of a transient communication
> >> failure.
>
> > Can't the master re-send the request after a timeout?
>
> Not "it can", but "it has to".  The master *must* keep hold of that
> request forever (or until the slave responds, or until we reconfigure
> the system not to consider that slave valid anymore).  Similarly, the
> slave cannot forget the maybe-committed transaction on pain of not being
> a valid slave anymore.

Hrmmmm ... is there no way of having part of the protocol be a message sent back saying whether it's a valid/invalid slave? I.e. the slave has an uncommitted transaction, never hears back from the master to actually do the commit, so after x secs * y retries any messages it does try to send to the master have a bit flag set to 'invalid'?
pgman@candle.pha.pa.us (Bruce Momjian) writes: > Patrick Welche wrote: >> On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote: >> ... >> > if we are talking two computers sitting next to each other on a switch, >> > you'd expect those to be low ... but if you were talking about two >> > seperate geographical locations (and yes, I realize you are adding lag to >> > the mix with waiting for responses), you'd expect those #s to rise ... >> >> Which I thought was the whole point of using a group communication >> protocol such as spread in postgresql-r. It seemed solved there... > > Right, but I think we want to try to do two-phase commit without > spread. Spread seems overkill for this usage. Is there some big demerit to _having_ that "overkill"? If there is no major price to pay, then I don't see why it isn't reasonable to simply say "Sure, we'll use that!" After all, PostgreSQL is set up to do _everything_ inside transactions, even though there are some actions you might take that don't forcibly need to be transactional. That's overkill, and nobody (well, barring fans of Certain Other Databases) complains that it's overkill. -- let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];; <http://dev6.int.libertyrms.com/> Christopher Browne (416) 646 3304 x124 (land)
On Fri, Sep 26, 2003 at 01:34:28PM -0400, Tom Lane wrote:
>
> Example:
>
>         Master                  Slave
>         ------                  -----
>         commit ready-->
>                                 <--OK
>         commit done->XX
>
> maybe he didn't.  Both sides are forced to keep information about the
> open transaction indefinitely.  Timing out on either side could yield
> the wrong result.

If I understand the complaints, I think there are two big issues.

The first problem is the restart/rejoin problem. When a 2PC member goes away, it is supposed to come back with all its former locks and everything in place, so that it can know what to do. This is also extremely tricky, but I think the answer is sort of easy. A member which re-joins without crashing (that is, it has open transactions, &c.) just has to complete its transactions with the other member(s). If other members have processed new transactions since the member left, the member is kicked out. It's not allowed to join without being re-initialised. A member which crashes is just a special case of this.

This is not elegant, not nice, &c. But I don't think anyone can really guarantee that a crashed member will start up correctly (it crashed, after all; maybe there's a bug). So this is the safest approach, and I don't think it's a big deal. It's not cheap, of course, and there may be problems arising from the conditions I describe below. But I think they can be handled (see the section on "compromises", below) intelligently.

The second, stickier problem is just as Tom describes. When the master has sent "commit done" and that message doesn't make it to the other host(s), you might have to wait forever. Of course, that's not acceptable. But I can think of some options for how to decide to handle this. Note that these may not guarantee no loss of data. That's not a compromise one is usually willing to make; but just because I don't want to accept that compromise doesn't mean it is unacceptable to everyone.

Some possible compromises
=========================

1. One machine always wins. One could decide to pick one machine that, in case of some sort of failure, always wins. You need some sort of heartbeat system which checks for the other member(s) of the cluster. In the event of failure, whatever is on the "winner" machine is deemed to be correct, and everyone else has to lose. If the point of your 2PC is to provide synchronous access to high loads of read-only clients, this would probably be a good solution, since only one machine would ever see data changes.

2. Quorum rule. One could decide on a quorum of machines, and the group which has quorum wins. (Naturally, this has to be an absolute majority.) The quorum can continue to process queries, and the folks who left the room have to re-sync to join.

3. Fail to read-only status and let the DBA sort it out.

4. Mark the contentious rows as "bad" and let the DBA sort it out. This option is not dissimilar to what Access/SQL Server disconnected multi-master replication does. It's not elegant, but it might be a good answer for the cases where 2PC gets used.

Note that none of these can guarantee that some apparently committed data will not later be lost. To real database hounds, that will sound like apostasy, but I suspect it is the sort of trade-off that real products make all the time. You have to have a way of collecting the "yeah, we told you it was committed, but we lied" data and being able to track it; and that has to be enough. The real security-of-data work is going to have to be done by ultra-reliable hardware, good maintenance practices, &c. Then when losses are down in the .001% range from this sort of mistake, no one will care.

This is not, by the way, the fully-formed set of suggestions I said I'd deliver when I started the thread; but since it came up again today, I thought I'd respond with what I had so far.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
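As one way of picturing compromise 2 above -- and nothing more than that; the membership model and names are invented purely for illustration -- a tiny sketch of the quorum decision:

    # Tiny illustration of the "quorum rule" compromise described above.
    # Invented for this discussion; not a proposal for actual PostgreSQL code.

    def surviving_partition_may_continue(reachable_members, total_members):
        """A partition may keep processing only if it holds an absolute majority."""
        return len(reachable_members) > total_members // 2


    if __name__ == "__main__":
        cluster = {"A", "B", "C", "D", "E"}

        # Network splits into {A, B, C} and {D, E}: only the first keeps quorum;
        # the other side must stop and later re-sync before it may rejoin.
        for partition in ({"A", "B", "C"}, {"D", "E"}):
            verdict = ("continues"
                       if surviving_partition_may_continue(partition, len(cluster))
                       else "must re-sync")
            print(sorted(partition), "->", verdict)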
On Fri, Sep 26, 2003 at 02:05:36PM -0400, Tom Lane wrote: > a valid slave anymore. You can make this work, but the resource costs > are steep. For instance, in Postgres, you don't get to truncate the WAL But people who want 2PC are more than willing to pay all that cost. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Fri, 26 Sep 2003, Christopher Browne wrote: > pgman@candle.pha.pa.us (Bruce Momjian) writes: > > Patrick Welche wrote: > >> On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote: > >> ... > >> > if we are talking two computers sitting next to each other on a switch, > >> > you'd expect those to be low ... but if you were talking about two > >> > seperate geographical locations (and yes, I realize you are adding lag to > >> > the mix with waiting for responses), you'd expect those #s to rise ... > >> > >> Which I thought was the whole point of using a group communication > >> protocol such as spread in postgresql-r. It seemed solved there... > > > > Right, but I think we want to try to do two-phase commit without > > spread. Spread seems overkill for this usage. > > Is there some big demerit to _having_ that "overkill"? If there is no > major price to pay, then I don't see why it isn't reasonable to simply > say "Sure, we'll use that!" Reliance on a third party library to be installed to provide the functionality ... if it were meant as an "add on" instead of "standard feature", then sure ...
On Fri, 2003-09-26 at 13:58, Bruce Momjian wrote: > Patrick Welche wrote: > > On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote: > > ... > > > if we are talking two computers sitting next to each other on a switch, > > > you'd expect those to be low ... but if you were talking about two > > > seperate geographical locations (and yes, I realize you are adding lag to > > > the mix with waiting for responses), you'd expect those #s to rise ... > > > > Which I thought was the whole point of using a group communication protocol > > such as spread in postgresql-r. It seemed solved there... > > Right, but I think we want to try to do two-phase commit without spread. > Spread seems overkill for this usage. Out of curiosity, how does one use spread to accomplish 2PC? Isn't the logic the Application Server would need to follow rather different with a group communication based control than with XA / 2PC style communication? How does one map to the other?
> The first problem is the restart/rejoin problem.  When a 2PC member
> goes away, it is supposed to come back with all its former locks and
> everything in place, so that it can know what to do.  This is also
> extremely tricky, but I think the answer is sort of easy.  A member
> which re-joins without crashing (that is, it has open transactions,

I think you may be confusing 2PC with replication. PostgreSQL's 2PC implementation should follow enough of the XA rules to play nice in a mixed environment where something else is managing the transactions (application servers are becoming more common all the time).

As far as inter-PostgreSQL replication / queries are concerned, we can choose whatever semantics we like -- just realize that they are two different problems.
Marc G. Fournier wrote: > On Fri, 26 Sep 2003, Tom Lane wrote: > >>Bruce Momjian <pgman@candle.pha.pa.us> writes: >> >>>Tom Lane wrote: >>> >>>>You're not considering the possibility of a transient communication >>>>failure. >> >>>Can't the master re-send the request after a timeout? >> >>Not "it can", but "it has to". The master *must* keep hold of that >>request forever (or until the slave responds, or until we reconfigure >>the system not to consider that slave valid anymore). Similarly, the >>slave cannot forget the maybe-committed transaction on pain of not being >>a valid slave anymore. > > Hrmmmm ... is there no way of having part of the protocol being a message > sent back that its a valid/invalid slave? ie. slave has an uncommitted > transaction, never hears back from master to actually do the commit, so > after x-secs * y-retries any messages it does try to send to the master > have a bit flag set to 'invalid'? If I understand Andrew Sullivan's request, the purpose for integration of 2-PC into PostgreSQL, is more for distributed query than replication via an XA interface: http://sybooks.sybase.com/onlinebooks/group-xsarc/xsg1111e/xatuxedo/@ebt-link;pt=61?target=%25N%13_446_START_RESTART_N%25 If that is the desire (XA-compatibility) then PostgreSQL might be talking to an Oracle database or a BEA Tuxedo TPM acting as the coordinator. So PostgreSQL won't have an opportunity to modify the protocol in any meaningful way if it wishes to interoperate with XA-based transaction managers. If it is being used only amongst other PostgreSQL backends for replication, then why not use one of the optimistic replication protocols: http://www.inf.ethz.ch/personal/alonso/PAPERS/commit-fast.pdf Mike Mascari mascarm@mascari.com
On Fri, 26 Sep 2003, Christopher Browne wrote: > pgman@candle.pha.pa.us (Bruce Momjian) writes: > > Patrick Welche wrote: > >> On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote: > >> ... > >> > if we are talking two computers sitting next to each other on a switch, > >> > you'd expect those to be low ... but if you were talking about two > >> > seperate geographical locations (and yes, I realize you are adding lag to > >> > the mix with waiting for responses), you'd expect those #s to rise ... > >> > >> Which I thought was the whole point of using a group communication > >> protocol such as spread in postgresql-r. It seemed solved there... > > > > Right, but I think we want to try to do two-phase commit without > > spread. Spread seems overkill for this usage. > > Is there some big demerit to _having_ that "overkill"? If there is no > major price to pay, then I don't see why it isn't reasonable to simply > say "Sure, we'll use that!" I recall Darren Johnson (who is working on replication with spread) saying that it required a lot of bandwidth in real world scenarios. Gavin
> Not "it can", but "it has to". The master *must* keep hold of that > request forever (or until the slave responds, or until we reconfigure > the system not to consider that slave valid anymore). Similarly, the > slave cannot forget the maybe-committed transaction on pain of not being > a valid slave anymore. You can make this work, but the resource costs > are steep. For instance, in Postgres, you don't get to truncate the WAL > log, for what could be a really really long time --- more disk space > than you wanted to spend on WAL anyway. The locks held by the > maybe-committed transaction are another potentially unpleasant problem; > you can't release them, no matter what else they are blocking. So, after 'n' seconds of waiting, we abandon the slave and the slave abandons the master. Such a condition is probably a fairly serious failure anyway, and something that an admin would need to expect. The admin would also need to expect to allocate a heap of disk space for WAL. Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: >> ... You can make this work, but the resource costs >> are steep. > So, after 'n' seconds of waiting, we abandon the slave and the slave > abandons the master. [itch...] But you surely cannot guarantee that the slave and the master time out at exactly the same femtosecond. What happens when the comm link comes back online just when one has timed out and the other not? (Hint: in either order, it ain't good. Double plus ungood if, say, the comm link manages to deliver the master's "commit confirm" message a little bit after the master has timed out and decided to abort after all.) In my book, timeout-based solutions to this kind of problem are certain disasters. regards, tom lane
On Saturday 27 September 2003 06:59, Tom Lane wrote: > Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > >> ... You can make this work, but the resource costs > >> are steep. > > > > So, after 'n' seconds of waiting, we abandon the slave and the slave > > abandons the master. > > [itch...] But you surely cannot guarantee that the slave and the master > time out at exactly the same femtosecond. What happens when the comm > link comes back online just when one has timed out and the other not? > (Hint: in either order, it ain't good. Double plus ungood if, say, the > comm link manages to deliver the master's "commit confirm" message a > little bit after the master has timed out and decided to abort after all.) > > In my book, timeout-based solutions to this kind of problem are certain > disasters. I might be (well, am actually) a bit out of my depth here, but surely what happens is if you have machines A,B,C and *any* of them thinks machine C has a problem then it does. If C can still communicate with the others then it is told to reinitialise/go away/start the sirens. If C can't communicate then it's all a bit academic. Granted, if you have intermittent problems on a link and set your timeouts badly then you'll have a very brittle system, but if A thinks C has died, you can't just reverse that decision. -- Richard Huxton Archonet Ltd
On Sat, 27 Sep 2003, Tom Lane wrote:
> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> >> ... You can make this work, but the resource costs
> >> are steep.
>
> > So, after 'n' seconds of waiting, we abandon the slave and the slave
> > abandons the master.
>
> [itch...] But you surely cannot guarantee that the slave and the master
> time out at exactly the same femtosecond.  What happens when the comm
> link comes back online just when one has timed out and the other not?
> (Hint: in either order, it ain't good.

I think it was Andrew that suggested it ... when the slave times out, it should "trigger" a READ ONLY mode on the slave, so that when/if the master tries to start to talk to it, it can't ...

As for the master itself, it should be smart enough that if it times out, it knows to actually abandon the slave and not continue to try ...
Richard Huxton wrote: > > [itch...] But you surely cannot guarantee that the slave and the master > > time out at exactly the same femtosecond. What happens when the comm > > link comes back online just when one has timed out and the other not? > > (Hint: in either order, it ain't good. Double plus ungood if, say, the > > comm link manages to deliver the master's "commit confirm" message a > > little bit after the master has timed out and decided to abort after all.) > > > > In my book, timeout-based solutions to this kind of problem are certain > > disasters. > > I might be (well, am actually) a bit out of my depth here, but surely what > happens is if you have machines A,B,C and *any* of them thinks machine C has > a problem then it does. If C can still communicate with the others then it is > told to reinitialise/go away/start the sirens. If C can't communicate then > it's all a bit academic. > > Granted, if you have intermittent problems on a link and set your timeouts > badly then you'll have a very brittle system, but if A thinks C has died, you > can't just reverse that decision. I have been thinking it might be time to start allowing external programs to be called when certain events occur that require administrative attention --- this would be a good case for that. Administrators could configure shell scripts to be run when the network connection fails or servers drop off the network, alerting them to the problem. Throwing things into the server logs isn't _active_ enough. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Saturday 27 September 2003 20:17, Bruce Momjian wrote:
> Richard Huxton wrote:
> I have been thinking it might be time to start allowing external
> programs to be called when certain events occur that require
> administrative attention --- this would be a good case for that.
> Administrators could configure shell scripts to be run when the network
> connection fails or servers drop off the network, alerting them to the
> problem.  Throwing things into the server logs isn't _active_ enough.

I would say calling events from external libraries would be a good extension. That could allow for extending postgresql in novel ways, e.g. calling a log-record copy event after a WAL record is written, for near-real-time replication... :-)

Shridhar
On Saturday 27 September 2003 15:47, Bruce Momjian wrote: > Richard Huxton wrote: [snip] > > I might be (well, am actually) a bit out of my depth here, but surely > > what happens is if you have machines A,B,C and *any* of them thinks > > machine C has a problem then it does. If C can still communicate with the > > others then it is told to reinitialise/go away/start the sirens. If C > > can't communicate then it's all a bit academic. > > [snip] > > I have been thinking it might be time to start allowing external > programs to be called when certain events occur that require > administrative attention --- this would be a good case for that. > Administrators could configure shell scripts to be run when the network > connection fails or servers drop off the network, alerting them to the > problem. Throwing things into the server logs isn't _active_ enough. Actually, from the discussion I'd assumed there was some sort of plug-in "policy daemon" that was making decisions when things went wrong. Given the different scenarios 2 phase-commit will be used in, one size is unlikely to fit all. The idea of a more general system is _very_ interesting. I know Wietse Venema has decided to provide an external "policy" interface for his Postfix mailserver, precisely because he wants to keep the core system fairly clean. -- Richard Huxton Archonet Ltd
On Sat, 27 Sep 2003, Bruce Momjian wrote:
> I have been thinking it might be time to start allowing external
> programs to be called when certain events occur that require
> administrative attention --- this would be a good case for that.
> Administrators could configure shell scripts to be run when the network
> connection fails or servers drop off the network, alerting them to the
> problem.  Throwing things into the server logs isn't _active_ enough.

Actually, apparently you can do this now ... there is apparently a "mail module" for PostgreSQL that you can use to have the database send emails out ...
Marc G. Fournier wrote:
> On Sat, 27 Sep 2003, Bruce Momjian wrote:
> > I have been thinking it might be time to start allowing external
> > programs to be called when certain events occur that require
> > administrative attention --- this would be a good case for that.
> > Administrators could configure shell scripts to be run when the network
> > connection fails or servers drop off the network, alerting them to the
> > problem.  Throwing things into the server logs isn't _active_ enough.
>
> Actually, apparently you can do this now ... there is apparently a "mail
> module" for PostgreSQL that you can use to have the database send emails
> out ...

The only part that needs to be added is the ability to call an external program when some event occurs, like a database write failure.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
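For what that might look like from the administrator's side, a minimal sketch (pure illustration: the event names, the configuration format, and the dispatcher are all invented; nothing like this exists in the backend) of mapping named events to external commands:

    # Illustrative sketch of "call an external program when some event occurs".
    # The event names and configuration format are invented for this discussion.

    import shlex
    import subprocess

    # Imagined administrator configuration: event name -> command line to run.
    EVENT_HOOKS = {
        "write_failure":    "/usr/local/bin/page-dba.sh write_failure",
        "peer_unreachable": "logger -t postgres '2PC peer unreachable'",
    }


    def fire_event(name, detail=""):
        """Run the configured hook for `name`, if any, passing detail as an argument."""
        cmd = EVENT_HOOKS.get(name)
        if cmd is None:
            return  # no hook configured; fall back to ordinary server logging
        try:
            # Don't let a slow or broken hook wedge the caller.
            subprocess.run(shlex.split(cmd) + [detail], timeout=10, check=False)
        except (OSError, subprocess.TimeoutExpired) as err:
            print(f"event hook for {name!r} failed: {err}")


    if __name__ == "__main__":
        fire_event("peer_unreachable", "slave host db2 timed out after 60s")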
> -----Original Message-----
> From: Tom Lane
>
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> You're not considering the possibility of a transient communication
> >> failure.
>
> > Can't the master re-send the request after a timeout?
>
> Not "it can", but "it has to".

Why? Mainly the coordinator (slave), not the participant (master), has the responsibility to resolve the in-doubt transaction.

regards,
Hiroshi Inoue
Bruce Momjian wrote: > Marc G. Fournier wrote: > > > > > > On Sat, 27 Sep 2003, Bruce Momjian wrote: > > > > > I have been thinking it might be time to start allowing external > > > programs to be called when certain events occur that require > > > administrative attention --- this would be a good case for that. > > > Administrators could configure shell scripts to be run when the network > > > connection fails or servers drop off the network, alerting them to the > > > problem. Throwing things into the server logs isn't _active_ enough. > > > > Actually, apparently you can do this now ... there is apparently a "mail > > module" for PostgreSQL that you can use to have the database send email's > > out ... > > The only part that needs to be added is the ability to call an external > program when some even occurs, like a database write failure. Actually, all that's really necessary is the ability to call a stored procedure when some event occurs. The stored procedure can take it from there, and since it can be written in C it can do anything the postgres user can do (for good or for ill, of course). -- Kevin Brown kevin@sysexperts.com
Kevin Brown wrote: > Bruce Momjian wrote: > > Marc G. Fournier wrote: > > > > > > > > > On Sat, 27 Sep 2003, Bruce Momjian wrote: > > > > > > > I have been thinking it might be time to start allowing external > > > > programs to be called when certain events occur that require > > > > administrative attention --- this would be a good case for that. > > > > Administrators could configure shell scripts to be run when the network > > > > connection fails or servers drop off the network, alerting them to the > > > > problem. Throwing things into the server logs isn't _active_ enough. > > > > > > Actually, apparently you can do this now ... there is apparently a "mail > > > module" for PostgreSQL that you can use to have the database send email's > > > out ... > > > > The only part that needs to be added is the ability to call an external > > program when some even occurs, like a database write failure. > > Actually, all that's really necessary is the ability to call a stored > procedure when some event occurs. The stored procedure can take it from > there, and since it can be written in C it can do anything the postgres > user can do (for good or for ill, of course). But the postmaster doesn't connect to any database, and in a serious failure, might not be able to start one. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > Kevin Brown wrote: > > Actually, all that's really necessary is the ability to call a stored > > procedure when some event occurs. The stored procedure can take it from > > there, and since it can be written in C it can do anything the postgres > > user can do (for good or for ill, of course). > > But the postmaster doesn't connect to any database, and in a serious > failure, might not be able to start one. Ah, true. But I figured that in the context of 2PC and replication that most of the associated failures were likely to occur in an active backend or something equivalent, where a stored procedure was likely to be accessible. But yes, you certainly want to account for failures where the database itself is unavailable. So I guess my original comment isn't strictly true. :-) -- Kevin Brown kevin@sysexperts.com
> > Actually, all that's really necessary is the ability to call a stored
> > procedure when some event occurs.  The stored procedure can take it from
> > there, and since it can be written in C it can do anything the postgres
> > user can do (for good or for ill, of course).
>
> But the postmaster doesn't connect to any database, and in a serious
> failure, might not be able to start one.

In the event of a catastrophe, the 'nothing is running' scenario is one that standard monitoring software should pick up easily enough, and one that PostgreSQL cannot help with anyway (normally this is admin error).

Something simple, much like pg_locks, with transaction state (idle, waiting on local lock, waiting on 3rd party, etc.), time the transaction started, and time of last status change would be plenty. The monitoring software folks (Big Brother, etc. etc.) can write jobs to query those elements and create the appropriate SNMP events when, say, waiting on 3rd party for > N minutes (log at 1, trouble ticket at 2, SysAdmin page at 5, escalate to VP pager at 20 minutes or whatever corporate policy is).

An alternative is to package an SNMP daemon (much like the stats daemon) into the backend to generate SNMP events -- but I think this is overkill if views are available.
Hiroshi Inoue wrote:
> > -----Original Message-----
> > From: Tom Lane
> >
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Tom Lane wrote:
> > >> You're not considering the possibility of a transient communication
> > >> failure.
> >
> > > Can't the master re-send the request after a timeout?
> >
> > Not "it can", but "it has to".
>
> Why? Mainly the coordinator (slave), not the participant (master),
> has the responsibility to resolve the in-doubt transaction.

As far as I see, it's the above point which prevents the advance of this topic, and the issue must be solved ASAP. As opposed to your answer, Not "it can", but "it has to", my answer is Yes "it can", but "it doesn't have to".

The simplest scenario (though there could be variations) is:

[At participant (master)'s side]
  Because the commit operation is done, it does nothing.

[At coordinator (slave)'s side]
  1) After a while,
  2) re-establish the communication path to the participant (master)'s TM.
  3) Resend the "commit request" to the participant's TM.
  1)2)3) would be repeated until the coordinator receives
  the "commit ok" message from the participant.

If there's no objection from you, I would assume I'm right. Please don't dodge my question this time.

regards,
Hiroshi Inoue
	http://www.geocities.jp/inocchichichi/psqlodbc/
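A bare-bones sketch of the resend loop described above (illustrative only; the transport, the retry policy, and every name here are invented -- the point is just that the coordinator keeps resending its already-durable decision until it hears "commit ok"):

    # Illustrative coordinator-side retry loop for phase 2, as described above.
    # Everything here (send_commit_request, the backoff policy) is invented.

    import time


    def send_commit_request(participant, xid):
        """Stand-in for the real transport: returns True on 'commit ok'."""
        # Simulate a flaky link that only succeeds on the third attempt.
        send_commit_request.attempts = getattr(send_commit_request, "attempts", 0) + 1
        return send_commit_request.attempts >= 3


    def resolve_in_doubt(participant, xid, base_delay=0.1, max_delay=5.0):
        # The decision to commit was already logged durably before phase 2 began,
        # so it is always safe (and required) to keep resending the same decision.
        delay = base_delay
        while True:
            if send_commit_request(participant, xid):
                return "commit ok"      # participant acknowledged; forget the entry
            time.sleep(delay)           # wait, re-establish the path, and retry
            delay = min(delay * 2, max_delay)


    if __name__ == "__main__":
        print(resolve_in_doubt("master-TM", xid=42))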
On Mon, 29 Sep 2003, Hiroshi Inoue wrote:
> The simplest scenario (though there could be variations) is:
>
> [At participant (master)'s side]
>   Because the commit operation is done, it does nothing.
>
> [At coordinator (slave)'s side]
>   1) After a while,
>   2) re-establish the communication path to the participant (master)'s TM.
>   3) Resend the "commit request" to the participant's TM.
>   1)2)3) would be repeated until the coordinator receives
>   the "commit ok" message from the participant.
>
> If there's no objection from you, I would assume I'm right.

'K, but what happens if the slave never gets a 'commit ok'? Does the slave keep trying ad nauseam?
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > The simplest senario(though there could be varations) is > [At participant(master)'s side] > Because the commit operations is done, does nothing. > [At coordinator(slave)' side] > 1) After a while > 2) re-establish the communication path between the > partcipant(master)'s TM. > 3) resend the "commit requeset" to the participant's TM. > 1)2)3) would be repeated until the coordinator receives > the "commit ok" message from the partcipant. [ scratches head ] I think you are using the terms "master" and "slave" oppositely than I would. But in any case, this is not an answer to the concern I had. You're assuming that the "coordinator(slave)" side is willing to resend a request indefinitely, and also that the "participant(master)" side is willing to retain per-transaction commit state indefinitely so that it can correctly answer belated questions from the other side. What I was complaining about was that I don't think either side can afford to remember per-transaction state indefinitely. 2PC in the abstract is a useless academic abstraction --- where the rubber meets the road is defining how you cope with failures in the commit protocol. regards, tom lane
Tom Lane wrote:
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > The simplest scenario (though there could be variations) is:
>
> > [At participant (master)'s side]
> >   Because the commit operation is done, it does nothing.
>
> > [At coordinator (slave)'s side]
> >   1) After a while,
> >   2) re-establish the communication path to the participant (master)'s TM.
> >   3) Resend the "commit request" to the participant's TM.
> >   1)2)3) would be repeated until the coordinator receives
> >   the "commit ok" message from the participant.
>
> [ scratches head ] I think you are using the terms "master" and "slave"
> oppositely than I would.

Oops, my mistake, sorry.
But is it the 2-phase commit protocol in the first place?

regards,
Hiroshi Inoue
	http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue wrote:
> Tom Lane wrote:
> > Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > > The simplest scenario (though there could be variations) is:
> >
> > > [At participant (master)'s side]
> > >   Because the commit operation is done, it does nothing.
> >
> > > [At coordinator (slave)'s side]
> > >   1) After a while,
> > >   2) re-establish the communication path to the participant (master)'s TM.
> > >   3) Resend the "commit request" to the participant's TM.
> > >   1)2)3) would be repeated until the coordinator receives
> > >   the "commit ok" message from the participant.
> >
> > [ scratches head ] I think you are using the terms "master" and "slave"
> > oppositely than I would.
>
> Oops, my mistake, sorry.
> But is it the 2-phase commit protocol in the first place?

That is, in your example below:

        Master                  Slave
        ------                  -----
        commit ready-->
                                <--OK
        commit done->XX

is the "commit done" message needed?

regards,
Hiroshi Inoue
	http://www.geocities.jp/inocchichichi/psqlodbc/
Tom Lane wrote:
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > The simplest scenario (though there could be variations) is:
>
> > [At participant (master)'s side]
> >   Because the commit operation is done, it does nothing.
>
> > [At coordinator (slave)'s side]
> >   1) After a while,
> >   2) re-establish the communication path to the participant (master)'s TM.
> >   3) Resend the "commit request" to the participant's TM.
> >   1)2)3) would be repeated until the coordinator receives
> >   the "commit ok" message from the participant.
>
> [ scratches head ] I think you are using the terms "master" and "slave"
> oppositely than I would.  But in any case, this is not an answer to the
> concern I had.  You're assuming that the "coordinator (slave)" side is
> willing to resend a request indefinitely, and also that the
> "participant (master)" side is willing to retain per-transaction commit
> state indefinitely so that it can correctly answer belated questions
> from the other side.  What I was complaining about was that I don't
> think either side can afford to remember per-transaction state
> indefinitely.

OK, maybe I understand your complaint. Basically such a situation can occur when either side is down. Especially when the coordinator (master) is down, the participants are troubled. In such cases the XA interface, for example, allows heuristic commit on the participants. In case one or more participants are down, the coordinator may have to remember per-transaction state indefinitely. Is that a big problem?

regards,
Hiroshi Inoue
	http://www.geocities.jp/inocchichichi/psqlodbc/
I seem to have misunderstood the problem completely. I apologize to you all (especially Tom) for disturbing this thread. I wonder if there might be such a nice solution when some of the systems or communications are dead. And as many people already mentioned, there's not so much allowance if we only adopt an XA-based protocol.

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/

Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > The simplest senario(though there could be varations) is > > > [At participant(master)'s side] > > Because the commit operations is done, does nothing. > > > [At coordinator(slave)' side] > > 1) After a while > > 2) re-establish the communication path between the > > partcipant(master)'s TM. > > 3) resend the "commit requeset" to the participant's TM. > > 1)2)3) would be repeated until the coordinator receives > > the "commit ok" message from the partcipant. > > [ scratches head ] I think you are using the terms "master" and "slave" > oppositely than I would. But in any case, this is not an answer to the > concern I had. You're assuming that the "coordinator(slave)" side is > willing to resend a request indefinitely, and also that the > "participant(master)" side is willing to retain per-transaction commit > state indefinitely so that it can correctly answer belated questions > from the other side. What I was complaining about was that I don't > think either side can afford to remember per-transaction state > indefinitely. 2PC in the abstract is a useless academic abstraction --- > where the rubber meets the road is defining how you cope with failures > in the commit protocol. > > regards, tom lane
> > > > The simplest senario(though there could be varations) is > > > > > > > [At participant(master)'s side] > > > > Because the commit operations is done, does nothing. > > > > > > > [At coordinator(slave)' side] > > > > 1) After a while > > > > 2) re-establish the communication path between the > > > > partcipant(master)'s TM. > > > > 3) resend the "commit requeset" to the participant's TM. > > > > 1)2)3) would be repeated until the coordinator receives > > > > the "commit ok" message from the partcipant. > > > > > > [ scratches head ] I think you are using the terms "master" and "slave" > > > oppositely than I would. > > > > Oops my mistake, sorry. > > But is it 2-phase commit protocol in the first place ? > > That is, in your exmaple below
>
> Example:
>
> Master            Slave
> ------            -----
> commit ready-->

This is the commit for phase 1. This commit is allowed to return all sorts of errors, like violated deferred checks, out of disk space, ...

>                   <--OK
> commit done->XX

This is commit for phase 2; the slave *must* answer with "success" in all but hardware failure cases. (Note that the master could instead send rollback, e.g. because some other slave aborted.)

> is the "commit done" message needed ?

So, yes this is needed.

Andreas
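To make the exchange Andreas is describing concrete, here is a minimal sketch of the coordinator's side of the two phases (plain Python; the prepare/commit/rollback method names are invented for illustration and do not reflect any actual patch):

    class Participant:
        """Stand-in for one database taking part in the global transaction."""
        def __init__(self, name, can_prepare=True):
            self.name = name
            self.can_prepare = can_prepare

        def prepare(self, xid):
            # Phase 1 ("commit ready"): may still be refused for ordinary
            # reasons -- violated deferred checks, out of disk space, ...
            return self.can_prepare

        def commit(self, xid):
            # Phase 2 ("commit done"): once prepare() said yes, this must
            # succeed in all but hardware-failure cases.
            print("%s: committed %s" % (self.name, xid))

        def rollback(self, xid):
            print("%s: rolled back %s" % (self.name, xid))


    def two_phase_commit(participants, xid):
        votes = [p.prepare(xid) for p in participants]    # phase 1
        if all(votes):
            for p in participants:                        # phase 2: commit
                p.commit(xid)
            return "committed"
        for p in participants:                            # phase 2: rollback,
            p.rollback(xid)                               # some vote was "no"
        return "aborted"


    if __name__ == "__main__":
        print(two_phase_commit([Participant("A"), Participant("B")], 123))

The "commit done" message Hiroshi asked about is exactly the second loop: without it the slaves would be left sitting on prepared transactions with no way to learn the global outcome.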
On Mon, 29 Sep 2003, Hiroshi Inoue wrote: > > > Hiroshi Inoue wrote: > > > > Tom Lane wrote: > > > > > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > > > The simplest senario(though there could be varations) is > > > > > > > [At participant(master)'s side] > > > > Because the commit operations is done, does nothing. > > > > > > > [At coordinator(slave)' side] > > > > 1) After a while > > > > 2) re-establish the communication path between the > > > > partcipant(master)'s TM. > > > > 3) resend the "commit requeset" to the participant's TM. > > > > 1)2)3) would be repeated until the coordinator receives > > > > the "commit ok" message from the partcipant. > > > > > > [ scratches head ] I think you are using the terms "master" and "slave" > > > oppositely than I would. > > > > Oops my mistake, sorry. > > But is it 2-phase commit protocol in the first place ? > > That is, in your exmaple below > > Example: > > Master Slave > ------ ----- > commit ready--> > <--OK > commit done->XX > > is the "commit done" message needed ? Of course ... how else will the Slave commit? From my understanding, the concept is that the master sends a commit ready to the slave, but the OK back is that "OK, I'm ready to commit whenever you are", at which point the master does its commit and tells the slave to do its ...
Marc G. Fournier wrote: > > Master Slave > > ------ ----- > > commit ready--> > > <--OK > > commit done->XX > > > > is the "commit done" message needed ? > > Of course ... how else will the Slave commit? From my understanding, the > concept is that the master sends a commit ready to the slave, but the OK > back is that "OK, I'm ready to commit whenever you are", at which point > the master does its commit and tells the slave to do its ... Or the slave could reject the request. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane wrote: > > [At participant(master)'s side] > > Because the commit operations is done, does nothing. > > > [At coordinator(slave)' side] > > 1) After a while > > 2) re-establish the communication path between the > > partcipant(master)'s TM. > > 3) resend the "commit requeset" to the participant's TM. > > 1)2)3) would be repeated until the coordinator receives > > the "commit ok" message from the partcipant. > > [ scratches head ] I think you are using the terms "master" and "slave" > oppositely than I would. But in any case, this is not an answer to the > concern I had. You're assuming that the "coordinator(slave)" side is > willing to resend a request indefinitely, and also that the > "participant(master)" side is willing to retain per-transaction commit > state indefinitely so that it can correctly answer belated questions > from the other side. What I was complaining about was that I don't > think either side can afford to remember per-transaction state > indefinitely. 2PC in the abstract is a useless academic abstraction --- > where the rubber meets the road is defining how you cope with failures > in the commit protocol.

I don't think there is any way to handle cases where the master or slave just disappears. The other machine isn't under the server's control, so it has no way of knowing. I think we have to allow the administrator to set a timeout, or ask to wait indefinitely, and allow them to call an external program to record the event or notify administrators. Multi-master replication has the same issues.

My original point was that multi-master replication has the same limitations, but people still want it. Same for two-phase commit --- it has the same limitations, but people want it.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, 29 Sep 2003, Bruce Momjian wrote: > Marc G. Fournier wrote: > > > Master Slave > > > ------ ----- > > > commit ready--> > > > <--OK > > > commit done->XX > > > > > > is the "commit done" message needed ? > > > > Of course ... how else will the Slave commit? From my understanding, the > > concept is that the master sends a commit ready to the slave, but the OK > > back is that "OK, I'm ready to commit whenever you are", at which point > > the master does its commit and tells the slave to do its ... > > Or the slave could reject the request. Huh? The slave has that option?? In what circumstance?
Tom Lane wrote: > Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: >>> ... You can make this work, but the resource costs >>> are steep. > >> So, after 'n' seconds of waiting, we abandon the slave and the slave >> abandons the master. > > [itch...] But you surely cannot guarantee that the slave and the master > time out at exactly the same femtosecond. What happens when the comm > link comes back online just when one has timed out and the other not? > (Hint: in either order, it ain't good. Double plus ungood if, say, the > comm link manages to deliver the master's "commit confirm" message a > little bit after the master has timed out and decided to abort after all.) > > In my book, timeout-based solutions to this kind of problem are certain > disasters. > > regards, tom lane What do commercial databases do about 2PC or other multi-master solutions? You've done a good job of convincing me that it's unreliable no matter what (through your posts on this topic over a long time). However, I would think that something like Oracle or DB2 have some kind of answer for multi-master, and I'm curious what it is. If they don't, is it reasonable to make a test case that leaves their database inconsistent or hanging? I can (probably) get access to a SQL Server system to run some tests, if someone is interested. regards, jeff davis
> -----Original Message----- > From: Zeugswetter Andreas SB SD [mailto:ZeugswetterA@spardat.at] > > > > Example: > > > > Master Slave > > ------ ----- > > commit ready--> > > This is the commit for phase 1. This commit is allowed to return all > sorts of errors, like violated deferred checks, out of diskspace, ... > > > <--OK > > commit done->XX > > This is commit for phase 2, the slave *must* answer with "success" > in all but hardware failure cases. (Note that instead the > master could > instead send rollback, e.g. because some other slave aborted) > > > is the "commit done" message needed ? > > So, yes this is needed Thanks. I misunderstood that the "commit done" message is the last response from the participant to the coordinator. I missed the "OK" message before it. Where were my eyes ? regards, Hiroshi Inoue
> I don't think there is any way to handle cases where the master or slave > just disappears. The other machine isn't under the server's control, so > it has no way of it knowing. I think we have to allow the administrator > to set a timeout, or ask to wait indefinately, and allow them to call an > external program to record the event or notify administrators. > Multi-master replication has the same issues.

It needs to wait indefinitely; a timeout is not acceptable since it leads to inconsistent data. Human (or monitoring software) intervention is needed if they can't reach each other in a reasonable time.

I think this needs to be kept dumb. Different sorts of use cases will simply need different answers to resolve in-doubt transactions. What is needed is an interface that allows listing and commit/rollback of in-doubt transactions (preferably from a newly started client, or a direct command for the postmaster).

Andreas
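A rough sketch of the sort of interface Andreas is asking for -- nothing here reflects how PostgreSQL would actually store prepared transactions; the registry, its methods, and the gid format are all invented for illustration:

    class InDoubtRegistry:
        """Prepared-but-undecided transactions, keyed by global transaction id."""

        def __init__(self):
            self._gids = {}                    # gid -> descriptive info

        def record_prepare(self, gid, info):
            self._gids[gid] = info

        def list_in_doubt(self):
            # What a newly started client (or a direct postmaster command)
            # would show the administrator or the monitoring software.
            return sorted(self._gids.items())

        def resolve(self, gid, commit):
            # Manual resolution: the human decides, the database just obeys.
            info = self._gids.pop(gid)
            print("%s in-doubt transaction %s (%s)"
                  % ("COMMIT" if commit else "ROLLBACK", gid, info))


    if __name__ == "__main__":
        reg = InDoubtRegistry()
        reg.record_prepare("gid-42", "prepared 2003-09-29, coordinator unreachable")
        print(reg.list_in_doubt())
        reg.resolve("gid-42", commit=True)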
> > > Master Slave > > > ------ ----- > > > commit ready--> > > > <--OK > > > commit done->XX > > > > > > is the "commit done" message needed ? > > > > Of course ... how else will the Slave commit? From my > understanding, the > > concept is that the master sends a commit ready to the > slave, but the OK > > back is that "OK, I'm ready to commit whenever you are", at > which point > > the master does its commit and tells the slave to do its ... > > Or the slave could reject the request.

At this point only because of a hardware error. In case of network problems the "commit done" either did not reach the slave or the "success" answer did not reach the master. That is what it's all about. Phase 2 is supposed to be low overhead and very fast to allow keeping the time window for failure (that produces in-doubt transactions) as short as possible.

Andreas
Marc G. Fournier wrote: > > > > is the "commit done" message needed ? > > > > > > Of course ... how else will the Slave commit? From my understanding, the > > > concept is that the master sends a commit ready to the slave, but the OK > > > back is that "OK, I'm ready to commit whenever you are", at which point > > > the master does its commit and tells the slave to do its ... > > > > Or the slave could reject the request. > > Huh? The slave has that option?? In what circumstance? I thought the slave could reject if someone local already had the row locked. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > But is it 2-phase commit protocol in the first place ? > That is, in your exmaple below > Example: > Master Slave > ------ ----- > commit ready--> > <--OK > commit done->XX > is the "commit done" message needed ? Absolutely --- otherwise, we'd not be having this whole discussion. The problem is that the slave is holding ready to commit but doesn't know whether he should or not ... or alternatively, he did commit but the master didn't get the acknowledgement. It's not that big a deal for the master to remember past committed transactions until it knows all slaves have acknowledged committing them; you only need a bit or so per transaction. It's a much bigger deal if the slave has to hold the transaction ready-to-commit for a long time. That transaction is holding locks, and also the sheer volume of log data is way bigger. (For comparison, we recycle pg_xlog details about a transaction much sooner than we recycle pg_clog.) I think you really want some way for the slave to decide it can time out and abort the transaction after all ... but I don't see how you do that without breaking the 2PC protocol. regards, tom lane
> > > Or the slave could reject the request. > > > > Huh? The slave has that option?? In what circumstance? > > I thought the slave could reject if someone local already had the row > locked. No, not at all. The slave would need to reject phase 1 "commit ready" for this. Andreas
On Sat, Sep 27, 2003 at 09:13:27AM -0300, Marc G. Fournier wrote: > > I think it was Andrew that suggested it ... when the slave timesout, it > should "trigger" a READ ONLY mode on the slave, so that when/if the master > tries to start to talk to it, it can't ... > > As for the master itself, it should be smart enough that if it times out, > it knows to actually abandom the slave and not continue to try ... Yes, but now we're talking as though this is master-slave replication. Actually, "master" and "slave" are only useful terms in a transaction for 2PC. So every machine is both a master and a slave. It seems that one way out is just to fall back to "read only" as soon as a single failure happens. That's the least graceful but maybe safest approach to failure, analogous to what fsck does to your root filesystem at boot time. Of course, since there's no "read only" mode at the moment, this is all pretty hand-wavy on my part :-/ A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Zeugswetter Andreas SB SD wrote: > > > > > Or the slave could reject the request. > > > > > > Huh? The slave has that option?? In what circumstance? > > > > I thought the slave could reject if someone local already had the row > > locked. > > No, not at all. The slave would need to reject phase 1 "commit ready" > for this. Oh, yea, thanks. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Marc G. Fournier wrote: > >>> Or the slave could reject the request. > >> > >> Huh? The slave has that option?? In what circumstance? > > > I thought the slave could reject if someone local already had the row > > locked. > > All normal reasons for transaction failure are supposed to be checked > for before the slave responds that it's ready to commit. Otherwise it's > supposed to say it can't commit. > > Basically the weak spot of 2PC is that it assumes there are no possible > reasons for failure after "ready to commit" is sent. You can make that > approximately true, with sufficient investment of resources, but it's > definitely not a pleasant assumption. Yep. There is no full solution. I think it is like running with fsync off --- if the OS crashes, you have to clean up --- if you fail on a 2-phase commit, you have to clean up. Multi-master will be the same. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Sun, Sep 28, 2003 at 11:58:24AM -0700, Kevin Brown wrote: > > But the postmaster doesn't connect to any database, and in a serious > > failure, might not be able to start one. > > Ah, true. But I figured that in the context of 2PC and replication that > most of the associated failures were likely to occur in an active > backend or something equivalent, where a stored procedure was likely to > be accessible.

As you go on to note, that's not always a possibility. For instance, server C crashes and can't come back because, say, its WAL is scrambled. All it will currently be able to do is scream at you in the logs, which won't solve all the problems one has with 2PC (among other problems).

A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Andrew Sullivan wrote: > On Sat, Sep 27, 2003 at 09:13:27AM -0300, Marc G. Fournier wrote: > > > > I think it was Andrew that suggested it ... when the slave timesout, it > > should "trigger" a READ ONLY mode on the slave, so that when/if the master > > tries to start to talk to it, it can't ... > > > > As for the master itself, it should be smart enough that if it times out, > > it knows to actually abandom the slave and not continue to try ... > > Yes, but now we're talking as though this is master-slave > replication. Actually, "master" and "slave" are only useful terms in > a transaction for 2PC. So every machine is both a master and a > slave. > > It seems that one way out is just to fall back to "read only" as soon > as a single failure happens. That's the least graceful but maybe > safest approach to failure, analogous to what fsck does to your root > filesystem at boot time. Of course, since there's no "read only" > mode at the moment, this is all pretty hand-wavy on my part :-/ Yes, but that affects all users, not just the transaction we were working on. I think we have to get beyond the idea that this can be made failure-proof, and just outline the behaviors for failure, and it has to be configurable by the administrator. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Mon, Sep 29, 2003 at 11:14:30AM -0300, Marc G. Fournier wrote: > > > > Or the slave could reject the request. > > Huh? The slave has that option?? In what circumstance? In every circumstance where a stand-alone machine would have it. Machine A may not yet know about conflicting transactions on machine B. This is why 2PC is hard ;-) A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Marc G. Fournier wrote: >>> Or the slave could reject the request. >> >> Huh? The slave has that option?? In what circumstance? > I thought the slave could reject if someone local already had the row > locked. All normal reasons for transaction failure are supposed to be checked for before the slave responds that it's ready to commit. Otherwise it's supposed to say it can't commit. Basically the weak spot of 2PC is that it assumes there are no possible reasons for failure after "ready to commit" is sent. You can make that approximately true, with sufficient investment of resources, but it's definitely not a pleasant assumption. regards, tom lane
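In other words, the participant front-loads every check it can and makes its promise durable before voting yes. A small sketch of that discipline -- the helper names here are hypothetical, not PostgreSQL internals:

    class PreparedXact:
        """Toy stand-in for a participant-side transaction."""
        state = "ACTIVE"
        def run_deferred_constraint_checks(self): pass   # e.g. deferred FK checks
        def flush_to_stable_storage(self): pass          # promise must survive a crash
        def mark_committed(self): self.state = "COMMITTED"


    def prepare(xact):
        """Phase 1 on the slave: vote yes only if nothing 'normal' can fail later."""
        try:
            xact.run_deferred_constraint_checks()
            xact.flush_to_stable_storage()
        except Exception:
            return False                 # still free to refuse at this point
        xact.state = "PREPARED"          # locks stay held from here on
        return True


    def commit(xact):
        """Phase 2: after a yes vote, only hardware failure may stop this."""
        assert xact.state == "PREPARED"
        xact.mark_committed()            # cheap, and must not error out


    if __name__ == "__main__":
        x = PreparedXact()
        if prepare(x):
            commit(x)
        print(x.state)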
On Sat, Sep 27, 2003 at 08:36:36AM +0000, Jeff wrote: > > What do commercial databases do about 2PC or other multi-master solutions? > You've done a good job of convincing me that it's unreliable no matter what > (through your posts on this topic over a long time). However, I would think > that something like Oracle or DB2 have some kind of answer for > multi-master, and I'm curious what it is. If they don't, is it reasonable > to make a test case that leaves their database inconsistent or hanging? Most real replication systems are not doing 2PC. For me, 2PC-based replication is not real interesting anyway, because the point of multi-master replication is often at least partly speed, and 2PC is nothing if not a good way to make sure that every database is at least as slow as the slowest node. But 2PC is important for application-server-based, XA-type work, and for heterogenous databases. Both of those would be real nice features to support. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Fri, Sep 26, 2003 at 05:15:37PM -0400, Rod Taylor wrote: > > The first problem is the restart/rejoin problem. When a 2PC member > > goes away, it is supposed to come back with all its former locks and > > everything in place, so that it can know what to do. This is also > > extremely tricky, but I think the answer is sort of easy. A member > > which re-joins without crashing (that is, it has open transactions, > > I think you may be confusing 2PC with replication. No, I'm not. One needs to decide how to handle the situation where a slave database in a 2PC transaction goes away and comes back, for whatever reasons that may happen. Since the idea here is to come up with ways of handling the failure of 2PC in some cases, we need something which notices that members are not playing nice. > PostgreSQLs 2PC implementation should follow enough of the XA rules to > play nice in a mixed environment where something else is managing the > transactions (application servers are becoming more common all the > time). I agree. But we still need to decide how to handle cases where things go away, and if there are some transaction managers that don't fit that model, then we should not accept such managers. Of course, what such managers do is important data in deciding what sorts of compromises are acceptable. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
> > It seems that one way out is just to fall back to "read only" as soon > > as a single failure happens. That's the least graceful but maybe > > safest approach to failure, analogous to what fsck does to your root > > filesystem at boot time. Of course, since there's no "read only" > > mode at the moment, this is all pretty hand-wavy on my part :-/ > > Yes, but that affects all users, not just the transaction we were > working on. I think we have to get beyond the idea that this can be made > failure-proof, and just outline the behaviors for failure, and it has to > be configurable by the administrator. Yes, but holding locks on the affected rows IS appropriate until the administrator issues something like: ALTER SYSTEM ABORT GLOBAL TRANSACTION 123;
Tom Lane writes: > No. The real problem with 2PC in my mind is that its failure modes > occur *after* you have promised commit to one or more parties. In > multi-master, if you fail you know it before you have told the client > his data is committed.

I have a book here which claims that the solution to the problems of 2-phase commit is 3-phase commit, which goes something like this:

coordinator            participant
-----------            -----------
INITIAL                INITIAL
   prepare -->
WAIT
   <-- vote commit
                       READY
(all voted commit)
   prepare-to-commit -->
PRE-COMMIT
   <-- ready-to-commit
                       PRE-COMMIT
   global-commit -->
COMMIT                 COMMIT

If the coordinator fails and all participants are in state READY, they can safely decide to abort after some timeout. If some participant is already in state PRE-COMMIT, it becomes the new coordinator and sends the global-commit message.

Details are left as an exercise. :-)

-- Peter Eisentraut peter_e@gmx.net
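For what it's worth, the termination rule in that description can be written down in a few lines; the state names follow the diagram above, while the function name and the way states are collected are made up for illustration:

    INITIAL, READY, PRE_COMMIT, COMMIT = "INITIAL", "READY", "PRE-COMMIT", "COMMIT"

    def decide_after_coordinator_failure(participant_states):
        # If any participant reached PRE-COMMIT (or COMMIT), the global
        # decision was commit; that participant becomes the new coordinator
        # and sends global-commit to the rest.
        if any(s in (PRE_COMMIT, COMMIT) for s in participant_states):
            return "commit"
        # Otherwise everyone is at best READY: nobody was told to
        # pre-commit, so after a timeout they may safely abort.
        return "abort"

    if __name__ == "__main__":
        print(decide_after_coordinator_failure([READY, READY]))       # abort
        print(decide_after_coordinator_failure([READY, PRE_COMMIT]))  # commit

This termination rule only covers coordinator failure while the remaining participants can still reach one another; a participant that simply goes silent still ends up on the administrator's desk, as the next message shows.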
> No, I'm not. One needs to decide how to handle the situation where a > slave database in a 2PC transaction goes away and comes back, for > whatever reasons that may happen. Since the idea here is to come up > with ways of handling the failure of 2PC in some cases, we need > something which notices that members are not playing nice.

Yes, you're right. The part about the member reinitializing led me to believe that you were thinking replication (read it as copying data from the source location to bring it back up to speed -- which is not what you intended).
Peter Eisentraut wrote:
>Tom Lane writes:
>>No. The real problem with 2PC in my mind is that its failure modes occur *after* you have promised commit to one or more parties. In multi-master, if you fail you know it before you have told the client his data is committed.
>
>I have a book here which claims that the solution to the problems of 2-phase commit is 3-phase commit, which goes something like this:
>
>coordinator            participant
>-----------            -----------
>INITIAL                INITIAL
>   prepare -->
>WAIT
>   <-- vote commit
>                       READY
>(all voted commit)
>   prepare-to-commit -->
>PRE-COMMIT
>   <-- ready-to-commit
>                       PRE-COMMIT
>   global-commit -->
>COMMIT                 COMMIT
>
>If the coordinator fails and all participants are in state READY, they can safely decide to abort after some timeout. If some participant is already in state PRE-COMMIT, it becomes the new coordinator and sends the global-commit message.
>
>Details are left as an exercise. :-)

Ok. Let's assume one coordinator, two participants. Global commit is sent to both by the coordinator. One replies with ok, the other one remains silent.

What should the coordinator do? It can't fail the transaction - the first participant has committed its part. It can't complete the transaction, because the ok from the 2nd participant is still outstanding.

I think Bruce is right: it's an admin decision. If a timeout expires, a user-supplied app should be called, with a safe default (database shutdown?).

-- Manfred
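A sketch of the behaviour Manfred and Bruce are converging on: in phase 2 the coordinator retries for a while, and when the timeout expires it does not invent an answer -- it records the in-doubt transaction and runs an administrator-supplied program. All names here (send_commit_done, on_stuck, the timeout values) are invented for illustration:

    import time

    def drive_phase_two(participants, xid,
                        retry_interval=1.0, timeout=30.0,
                        on_stuck=lambda p, xid: print("notify admin:", p, xid)):
        """The decision is already COMMIT and can no longer be taken back."""
        pending = set(participants)
        deadline = time.time() + timeout
        while pending and time.time() < deadline:
            for p in list(pending):
                if p.send_commit_done(xid):     # True once the ack arrives
                    pending.discard(p)
            if pending:
                time.sleep(retry_interval)
        for p in pending:
            # Can't abort (another participant has already committed) and
            # can't block forever: leave it in doubt and escalate.
            on_stuck(p, xid)
        return not pending

Whether the default on_stuck shuts the database down, flips it read-only, or just notifies somebody is exactly the policy knob Bruce wants to hand to the administrator.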
> -----Original Message----- > From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] > Sent: Monday, September 29, 2003 7:10 AM > To: Marc G. Fournier > Cc: Hiroshi Inoue; Tom Lane; 'Zeugswetter Andreas SB SD'; > 'Andrew Sullivan'; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] 2-phase commit > > > Marc G. Fournier wrote: > > > Master Slave > > > ------ ----- > > > commit ready--> > > > <--OK > > > commit done->XX > > > > > > is the "commit done" message needed ? > > > > Of course ... how else will the Slave commit? From my > understanding, > > the concept is that the master sends a commit ready to the > slave, but > > the OK back is that "OK, I'm ready to commit whenever you are", at > > which point the master does its commit and tells the slave > to do its > > ... > > Or the slave could reject the request. > Here is a BSD-like licensed transaction monitor: http://tyrex.sourceforge.net/tpmonitor.html The stuff that eventually became Tuxedo and Encina was open source from MIT (not sure what came of it). You used to be able to download the source code for their transaction monitor that worked on the IBM RS/2. This is the Transaction Internet Protocol: http://www.ietf.org/html.charters/OLD/tip-charter.html It should be considered very seriously as a general solution to the problem. I mention this, because a transaction monitor is the next logical step in managing database activity. Two phase commit is a subset of transaction processing. Interesting discussion: http://www.developer.com/db/article.php/10920_2246481_2 http://www.developer.com/java/data/article.php/10932_3066301_4 Article worth a look (win32 specific, but talks about developing a transaction monitor): http://archive.devx.com/free/mgznarch/vcdj/1998/octmag98/dtc1.asp Some simple background for those who have not spent much time looking into it: http://www.geocities.com/rajesh_purohit/db/twophasecommit.html
Manfred Spraul writes: > Ok. Lets assume one coordinator, two partitipants. > Global commit send to both by coordinator. One replies with ok, the > other one remains silent. > What should the coordinator do? It can't fail the transaction - the > first partitipant has commited its part. It can't complete the > transaction, because the ok from the 2nd partitipant is still outstanding. If a participant doesn't reply in an orderly fashion (say, after timeout), it just gets kicked out of the whole mechanism. That isn't the interesting part. The interesting part is what happens when the coordinator fails. -- Peter Eisentraut peter_e@gmx.net
On Mon, 2003-09-29 at 15:55, Peter Eisentraut wrote: > Manfred Spraul writes: > > > Ok. Lets assume one coordinator, two partitipants. > > Global commit send to both by coordinator. One replies with ok, the > > other one remains silent. > > What should the coordinator do? It can't fail the transaction - the > > first partitipant has commited its part. It can't complete the > > transaction, because the ok from the 2nd partitipant is still outstanding. > > If a participant doesn't reply in an orderly fashion (say, after timeout), > it just gets kicked out of the whole mechanism. That isn't the > interesting part. The interesting part is what happens when the > coordinator fails. The hot-standby coordinator picks up where the first one left off. Just like when the participant fails the hot-standby for that participant steps up to the plate. For the application server side in Java, I believe the standard is OTS (Object Transaction Service).
On Mon, Sep 29, 2003 at 12:59:55PM -0400, Bruce Momjian wrote: > working on. I think we have to get beyond the idea that this can be made > failure-proof, and just outline the behaviors for failure, and it has to > be configurable by the administrator. Exactly. There are plenty of cases where graceless failure is acceptable to someone as the right answer to the compromise. Of course, this is not to pretend they're not compromises. There's a world of difference between saying, "This is not safe, but if you want to do it, here are some potential failure modes," and, "Hey, you can use this even though it can't roll back 100% of the time, because your application should check that." Any comparison with any actual application I have had to use is strictly coincidental. ;-) A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Commercial systems use:

Mainframe: CICS
UNIX: Tuxedo, Encina
Win32: MTS
DEC/COMPAQ/HP: ACMS

Probably lots of others that I have never heard about.
On Mon, Sep 29, 2003 at 12:48:30PM -0400, Andrew Sullivan wrote: > In every circumstance where a stand-alone machine would have it. Oops. Wrong stage. Never mind. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
DCorbit@connx.com ("Dann Corbit") writes: > Tuxedo Note that this is probably the only one of the lot that is _really_ worth looking at in a serious way, as the XA standard was essentially based on Tuxedo. (Irrelevant Aside: BEA had releases of CICS running on both Unix and Windows NT, so it isn't quite fair to call that "mainframe" code...) There might be some value in looking at how Berkeley DB supports XA, as there actually support for using Berkeley DB as an XA resource manager. <http://www.sleepycat.com/docs/ref/xa/xa_intro.html> While it would obviously be exceedingly inappropriate to copy any of SleepyCat's software, there is some very useful background information there on "care and feeding" which can give an idea of how a TP monitor might be used and configured. -- "cbbrowne","@","libertyrms.info" <http://dev6.int.libertyrms.com/> Christopher Browne (416) 646 3304 x124 (land)
A really nice overview of how various transaction managers are modeled: http://www.ti5.tu-harburg.de/Lecture/99ws/TP/06-OverviewOfTPSystemsAndProducts/sld001.htm
Marc G. Fournier wrote: > > On Sat, 27 Sep 2003, Bruce Momjian wrote: > > >>I have been thinking it might be time to start allowing external >>programs to be called when certain events occur that require >>administrative attention --- this would be a good case for that. >>Administrators could configure shell scripts to be run when the network >>connection fails or servers drop off the network, alerting them to the >>problem. Throwing things into the server logs isn't _active_ enough. > > > Actually, apparently you can do this now ... there is apparently a "mail > module" for PostgreSQL that you can use to have the database send email's > out ...

I guess something such as

CREATE TRIGGER my_trig ON BEGIN / COMMIT EXECUTE ...

would be nice. I think this can be used for many purposes (not necessarily 2PC), if a trigger could handle database events and not just events on tables: ON BEGIN, ON COMMIT, ON CREATE TABLE, ... We could have used that so often in the past in countless applications.

Regards, Hans

-- Cybertec Geschwinde u Schoenig Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria Tel: +43/2952/30706 or +43/660/816 40 77 www.cybertec.at, www.postgresql.at, kernel.cybertec.at
Andrew Sullivan wrote: > On Sat, Sep 27, 2003 at 09:13:27AM -0300, Marc G. Fournier wrote: > > > > I think it was Andrew that suggested it ... when the slave timesout, it > > should "trigger" a READ ONLY mode on the slave, so that when/if the master > > tries to start to talk to it, it can't ... > > > > As for the master itself, it should be smart enough that if it times out, > > it knows to actually abandom the slave and not continue to try ... > > Yes, but now we're talking as though this is master-slave > replication. Actually, "master" and "slave" are only useful terms in > a transaction for 2PC. So every machine is both a master and a > slave. > > It seems that one way out is just to fall back to "read only" as soon > as a single failure happens. That's the least graceful but maybe > safest approach to failure, analogous to what fsck does to your root > filesystem at boot time. Of course, since there's no "read only" > mode at the moment, this is all pretty hand-wavy on my part :-/ OK, I think we came to the conclusion that we want 2-phase commit, but want some way to mark a server as offline/read-only, or notify an administrator. Can we communicate this to the Japanese guys working on 2-phase commit so they can start working toward including in 7.5? Added to TODO: * Add two-phase commit to all distributed transactions with offline/readonly server status or administrator notification for failure -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Wed, Oct 08, 2003 at 05:43:49PM -0400, Bruce Momjian wrote: > > OK, I think we came to the conclusion that we want 2-phase commit, but > want some way to mark a server as offline/read-only, or notify an

That sounds to me like the conclusion, to the extent there was one, yes. I'd still like to hear from those who continue to have strong objections on the grounds of the impossibility of a guaranteed recovery method. Does the proposal of allowing dbas to run that risk, provided there's a mechanism to tell them about it, satisfy the objection (assuming, of course, 2PC can be turned off)?

A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Andrew Sullivan writes: > Does the proposal of allowing dbas to run that risk, provided there's a > mechanism to tell them about it, satisfy the objection (assuming, of > course, 2PC can be turned off)?

Why would you spend time on implementing a mechanism whose ultimate benefit is supposed to be increasing reliability and performance, when you already realize that it will have to lock up at the slightest sight of trouble? There are better mechanisms out there that you can use instead.

-- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut wrote: > Andrew Sullivan writes: > > > Does the proposal of allowing dbas to run that risk, provided there's a > > mechanism to tell them about it, satisfy the objection (assuming, of > > course, 2PC can be turned off)? > > Why would you spent time on implementing a mechanism whose ultimate > benefit is supposed to be increasing reliability and performance, when you > already realize that it will have to lock up at the slightest sight of > trouble? There are better mechanisms out there that you can use instead. If you want cross-server transactions, what other methods are there that are more reliable? It seems network unreliability is going to be a problem no matter what method you use. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Thu, Oct 09, 2003 at 04:22:13PM +0200, Peter Eisentraut wrote: > Why would you spent time on implementing a mechanism whose ultimate > benefit is supposed to be increasing reliability and performance, when you > already realize that it will have to lock up at the slightest sight of > trouble? There are better mechanisms out there that you can use instead. "The slightest sign of trouble" seems to me to be overstating the matter rather. It cannot recover in the case where the first phase of commit has happened everywhere, and then the master crashes. We are talking, after all, about a pretty exotic feature in the first place. I presume that anyone who is using it is also using it on machines which have ultra-high-reliable, the cpu can catch on fire and the box stays up sort of hardware. I'll grant you that running a pair of B0b'5 C0mpu73r5 Ultra kewl sooper fa5t overclocked specials with serial ATA with the write cache enabled is a recipe for data loss. But that's a disaster no matter what. But you cannot have XA-like stuff without 2PC. You can't easily have heterogenous systems without 2PC. And folks have already generously volunteered to work on this problem; I think that they deserve support, assuming we can come up with some idea of what kinds of compromises are acceptable ones. There's no question that 2PC requires some unpleasant compromises. But if you want someone to be able to add a Postgres member to a heterogenous cluster, you're going to need to be able to accept some compromises, because the DBA (or, more likely, his management) already has. I'm not sure that 2PC is actually intended to increase reliability or performance, by the way. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Bruce Momjian writes: > If you want cross-server transactions, what other methods are there that > are more reliable? 3-phase commit -- Peter Eisentraut peter_e@gmx.net
> > Why would you spent time on implementing a mechanism whose ultimate > > benefit is supposed to be increasing reliability and performance, when you > > already realize that it will have to lock up at the slightest sight of > > trouble? There are better mechanisms out there that you can use instead. > > If you want cross-server transactions, what other methods are there that > are more reliable? It seems network unreliability is going to be a > problem no matter what method you use. And unless you have 2-phase (or 3-phase) commit, all other methods are going to be worse, since their time window for possible critical failure is going to be substantially larger. (extending 2-phase to 3-phase should not be too difficult) A lot of use cases for 2PC are not for manipulating the same data on more than one server (replication), but different data that needs to be manipulated in an all or nothing transaction. In this scenario it is not about reliability but about physically locating data (e.g. in LA vs New York) where it is needed most often. Andreas
Bruce Momjian wrote: > Peter Eisentraut wrote: > >>Andrew Sullivan writes: >> >>>Does the proposal of allowing dbas to run that risk, provided there's a >>>mechanism to tell them about it, satisfy the objection (assuming, of >>>course, 2PC can be turned off)? >> >>Why would you spent time on implementing a mechanism whose ultimate >>benefit is supposed to be increasing reliability and performance, when you >>already realize that it will have to lock up at the slightest sight of >>trouble? There are better mechanisms out there that you can use instead. > > If you want cross-server transactions, what other methods are there that > are more reliable? It seems network unreliability is going to be a > problem no matter what method you use.

What is the stated goal of distributed transactions in PostgreSQL?

1) XA-compatibility/interoperability, or
2) Robustness in the face of network failure

The implementation chosen depends upon the answer, does it not? Is there an implementation (e.g. 3PC) that can simulate 2PC behavior for interoperability purposes and satisfy both requirements?

Mike Mascari mascarm@mascari.com
Peter Eisentraut wrote: > Bruce Momjian writes: > > > If you want cross-server transactions, what other methods are there that > > are more reliable? > > 3-phase commit

OK, how is that going to make things safer, or does it just shrink the failure window?

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Thu, 2003-10-09 at 11:14, Peter Eisentraut wrote: > Bruce Momjian writes: > > > If you want cross-server transactions, what other methods are there that > > are more reliable? > > 3-phase commit

How about a real-world example of a transaction manager that has actually implemented 3PC? But yes, the ability for the participants to talk to each other in the event the controller is unavailable seems an obvious fix.
On Thu, Oct 09, 2003 at 11:22:05AM -0400, Mike Mascari wrote: > The implementation choosen depends upon the answer, does it not? Is > there an implementation (e.g. 3PC) that can simulate 2PC behavior for > interoperability purposes and satisfy both requirements? I don't know. What I know is that someone showed up working on 2PC, and got a frosty reception. I'm trying to learn what criteria would make the work acceptable. For my purposes, the feature would be really nice, so I'd hate to see the opportunity lost. If someone has an idea even how 3PC might be implemented, I'd be happy to hear it. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Thu, 2003-10-09 at 12:07, Andrew Sullivan wrote: > On Thu, Oct 09, 2003 at 11:22:05AM -0400, Mike Mascari wrote: > > The implementation choosen depends upon the answer, does it not? Is > > there an implementation (e.g. 3PC) that can simulate 2PC behavior for > > interoperability purposes and satisfy both requirements? > > I don't know. What I know is that someone showed up working on 2PC, > and got a frosty reception. I'm trying to learn what criteria would > make the work acceptable. For my purposes, the feature would be > really nice, so I'd hate to see the opportunity lost. If someone has > an idea even how 3PC might be implemented, I'd be happy to hear it.

Can you elaborate on "your purposes"? Do they fall into the "XA-compatibility" bit or the "Robustness in the face of network failure"? On the likely chance that 50% fall into 1 and the other into 2, can we accept a solution that doesn't address both?

Robert Treat -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
On Thu, Oct 09, 2003 at 02:17:28PM -0400, Robert Treat wrote: > Can you elaborate on "your purposes"? Do they fall into the > "XA-compatibility" bit or the "Robustness in the face of network > failure"? Yes. I don't think that 2PC is a solution for robustness in face of network failure. It's too slow, to begin with. Some sort of multi-master system is very desirable for network failures, &c., but I don't think anybody does active/hot standby with 2PC any more; the performance is too bad. I'm interested in the ability to use it for XA(ish) compatibility and heterogenous database support. Arguments with people-who-think-Gartner-reports-are-good-guides-for-what-to-do would be a lot easier if I had that, to begin with. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
> Yes. I don't think that 2PC is a solution for robustness in face of > network failure. It's too slow, to begin with. Some sort of > multi-master system is very desirable for network failures, &c., but > I don't think anybody does active/hot standby with 2PC any more; the > performance is too bad. I'm tired of this kind of "2PC is too slow" arguments. I think Satoshi, the only guy who made a trial implementation of 2PC for PostgreSQL, has already showed that 2PC is not that slow. -- Tatsuo Ishii
Tatsuo Ishii wrote: > > Yes. I don't think that 2PC is a solution for robustness in face of > > network failure. It's too slow, to begin with. Some sort of > > multi-master system is very desirable for network failures, &c., but > > I don't think anybody does active/hot standby with 2PC any more; the > > performance is too bad. > > I'm tired of this kind of "2PC is too slow" arguments. I think > Satoshi, the only guy who made a trial implementation of 2PC for > PostgreSQL, has already showed that 2PC is not that slow. Agreed. Let's get it into 7.5 and see it in action. If we need to adjust it, we can, but right now, we need something for distributed transactions, and this seems like the logical direction. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Fri, 10 Oct 2003, Tatsuo Ishii wrote: > > Yes. I don't think that 2PC is a solution for robustness in face of > > network failure. It's too slow, to begin with. Some sort of > > multi-master system is very desirable for network failures, &c., but > > I don't think anybody does active/hot standby with 2PC any more; the > > performance is too bad. > > I'm tired of this kind of "2PC is too slow" arguments. I think > Satoshi, the only guy who made a trial implementation of 2PC for > PostgreSQL, has already showed that 2PC is not that slow. Where does Satoshi's implementation sit right now? Will it patch to v7.4? Can it provide us with a base to work from, or is it complete?
The world rejoiced as t-ishii@sra.co.jp (Tatsuo Ishii) wrote: > I'm tired of this kind of "2PC is too slow" arguments. I think > Satoshi, the only guy who made a trial implementation of 2PC for > PostgreSQL, has already showed that 2PC is not that slow.

I'm tired of it for a different reason, namely that there are "use cases" where speed is not _relevant_. The REAL problem that is taking place is that people are talking past each other.

- Some say, "It's too slow; no point in doing it." The fact that it may be too slow _for them_ means they probably shouldn't use it. I somehow doubt that there are VastlyFaster alternatives waiting in the wings.

- The other problem that gets pointed out: "2PC is inherently fragile, and prone to deadlock." Again, those that _need_ to use 2PC will forcibly need to address those concerns in the way they manage their systems. Those that can't afford the fragility are not 'customers' for use of 2PC. And, pointing back to the speed controversy, it is not at all obvious that there is any other alternative for handling distributed processing that _totally addresses_ the concerns about fragility.

Those that can't afford these costs associated with 2PC will simply Not Use It. Probably in much the same way that most people _aren't_ using replication. And most people _aren't_ using PL/R. And most people _aren't_ using any number of the contributed things.

If 2PC gets implemented, that simply means that there will be another module that some will be interested in, and which many people won't bother using. Which shouldn't seem to be a particularly big deal.

-- "aa454","@","freenet.carleton.ca" http://www.ntlug.org/~cbbrowne/ The way to a man's heart is with a broadsword.
I was wondering whether we need to keep WAL online for 2PC, or whether only something like clog is sufficient. What if:

1. phase 1 commit must pass the slave xid that will be used for the 2nd phase (it needs to return some sort of identification anyway)
2. the coordinator must keep a list of slave xid's along with the corresponding (commit/rollback) info

Is that not sufficient ? Why would WAL be needed in the first place ? This is not replication; the slave has its own WAL anyway.

I also don't buy the argument about the lockup. If today somebody connects with psql, starts a transaction, modifies something and then never commits or aborts, there is also no automatism built in that will eventually kill it automatically. 2PC will simply need to have means for the administrator to rollback/commit an in-doubt transaction manually.

Andreas
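A sketch of the bookkeeping Andreas is proposing: per global transaction the coordinator makes one small record durable -- the slave xids returned by phase 1 plus the final decision -- and drops it once every slave has acknowledged phase 2. The file name, format, and function names below are illustrative only, not a proposal for an on-disk layout:

    import json, os, tempfile

    STATE_FILE = "coordinator-2pc-state.json"    # small clog-like record, not WAL

    def persist(state):
        # Write-and-rename so a crash leaves either the old or the new copy.
        fd, tmp = tempfile.mkstemp(dir=".")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, STATE_FILE)

    def record_phase1(state, gid, slave_xids):
        # slave_xids: {slave name: xid it handed back from "commit ready"}
        state[gid] = {"slaves": slave_xids, "decision": None, "acked": []}
        persist(state)

    def record_decision(state, gid, decision):
        state[gid]["decision"] = decision        # "commit" or "rollback"
        persist(state)

    def record_ack(state, gid, slave):
        state[gid]["acked"].append(slave)
        if set(state[gid]["acked"]) == set(state[gid]["slaves"]):
            del state[gid]                       # everyone answered: forget it
        persist(state)

This is essentially the "bit or so per transaction" Tom mentioned earlier, kept only until the last acknowledgement arrives; the slaves' own WAL already covers durability of the data itself.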
On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote: > Satoshi, the only guy who made a trial implementation of 2PC for > PostgreSQL, has already showed that 2PC is not that slow. If someone has a fast implementation, so much the better. I'm not opposed to fast implementations! A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Thu, Oct 09, 2003 at 11:53:46PM -0400, Christopher Browne wrote: > > If 2PC gets implemented, that simply means that there will be another > module that some will be interested in, and which many people won't > bother using. Which shouldn't seem to be a particularly big deal. I think the reason this is controversial, however, is that while PL/R (e.g.) doesn't make big changes to the internals, 2PC certainly will touch the fundamentals. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Andrew Sullivan <andrew@libertyrms.info> wrote: > On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote: > > Satoshi, the only guy who made a trial implementation of 2PC for > > PostgreSQL, has already showed that 2PC is not that slow. > > If someone has a fast implementation, so much the better. I'm not > opposed to fast implementations! The pgbench results of my experimental 2PC implementation and plain postgresql are available. PostgreSQL 7.3 http://snaga.org/pgsql/pgbench/pgbench-REL7_3.log Experimental 2PC in PostgreSQL 7.3 http://snaga.org/pgsql/pgbench/pgbench-TPC0_0_2.log I can't see a grave overhead from this comparison. > > A > > -- > ---- > Andrew Sullivan 204-4141 Yonge Street > Afilias Canada Toronto, Ontario Canada > <andrew@libertyrms.info> M2P 2A8 > +1 416 646 3304 x110 > > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend > -- NAGAYASU Satoshi <snaga@snaga.org>
> -----Original Message----- > From: Satoshi Nagayasu [mailto:pgsql@snaga.org] > Sent: Friday, October 10, 2003 12:26 PM > To: Andrew Sullivan > Cc: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] 2-phase commit > > Andrew Sullivan <andrew@libertyrms.info> wrote: > > On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote: > > > Satoshi, the only guy who made a trial implementation of 2PC for > > > PostgreSQL, has already showed that 2PC is not that slow. > > > > If someone has a fast implementation, so much the better. I'm not > > opposed to fast implementations! > > The pgbench results of my experimental 2PC implementation > and plain postgresql are available. > > PostgreSQL 7.3 > http://snaga.org/pgsql/pgbench/pgbench-REL7_3.log > > Experimental 2PC in PostgreSQL 7.3 > http://snaga.org/pgsql/pgbench/pgbench-TPC0_0_2.log > > I can't see a grave overhead from this comparison. 2PC is absolutely essential when you have to have both parts of the transaction complete for a logical unit of work. For a project that needs it, if you don't have it you will be forced to go to another tool, or perform lots of custom programming to work around it. If you have 2PC and it is ten times slower than without it, you will still need it for projects requiring that capability. Now, a good model to start with is a very good idea. So some discussion and analysis is a good thing. From the looks of it, Satoshi Nagayasu has done a very good job. Having a functional 2PC would be a huge feather in the cap of PostgreSQL. IMO-YMMV
Martha Stewart called it a Good Thing when DCorbit@connx.com ("Dann Corbit") wrote: >> I can't see a grave overhead from this comparison. > > 2PC is absolutely essential when you have to have both parts of the > transaction complete for a logical unit of work. For a project that > needs it, if you don't have it you will be forced to go to another > tool, or perform lots of custom programming to work around it. > > If you have 2PC and it is ten times slower than without it, you will > still need it for projects requiring that capability.

Just so. I would be completely unsurprised if an attempt to use 2PC to support generalized "multimaster replication" would involve 10-fold slowdowns as compared to having all the activity take place on one database. Which would imply that 2PC is not a tool that may be appropriately used to naively do replication. But that should not come as any grand surprise. To each tool the right job, and to each job the right tool...

There seems to be enough room for there to be evidence both of 2PC being useful for improving performance, and for it to cut performance:

- TPC benchmarks often specify the inclusion of Tuxedo as a component; the combination of vendors would surely NOT put it on the list if it were not an aid to performance;

- There is also indication that there can be a cost, notably in the form of the concerns of deadlock, but it should also be obvious that slow network links would lead to _hideous_ increases in latency.

As you say, even if there is a substantial cost, it's still worthwhile if a project needs it.

> Now, a good model to start with is a very good idea. So some > discussion and analysis is a good thing. From the looks of it, > Satoshi Nagayasu has done a very good job. Having a functional 2PC > would be a huge feather in the cap of PostgreSQL.

It would seem so. I look forward to seeing how this progresses.

-- wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','acm.org'). http://cbbrowne.com/info/linuxdistributions.html "XFS might (or might not) come out before the year 3000. As far as kernel patches go, SGI are brilliant. As far as graphics, especially OpenGL, go, SGI is untouchable. As far as filing systems go, a concussed doormouse in a tarpit would move faster." -- jd on Slashdot
Why not apply the effort to something already done and compatibly licensed?

This:
http://dog.intalio.com/ots.html

Appears to be a Berkeley style licensed:
http://dog.intalio.com/license.html

Transaction monitor.

"Overview
The OpenORB Transaction Service is a very scalable transaction monitor which also provides several extensions like XA management, a management interface to control all transaction processes and a high reliable recovery system.

By coordinating OpenORB and OpenORB Transaction Service, you provide a reliable and powerful foundation for building large scalable distributed applications.

Datasheet
The OpenORB Transaction Service is a fully compliant implementation of the OMG Transaction Service specification. The OpenORB Transaction Service features are :
Management of distributed transactions with a two phase commit protocol
Sub Transactions management ( nested transactions )
Propagation of the transaction context between CORBA objects
Management of distributed transactions propagation through databases with the XA protocol
Automatic logs to be able to make recovery in case of failures
Can be used as a transaction initiator or subordinate
High-performance, multiple thread architecture
Developed with POA
Provides a management interface to control all transactions
Full support of JTA
JDBC pooling and automatic resource enlistment

Download
To download the OpenORB Transaction Service, do one of the following :
CVS : you can use CVS to grab the sources directly.
FTP : you get either a CVS snapshot or a prebuilt version
To use one of these possibilities, go to the Download Services page.

ChangeLog
August 15th 2001. Version 1.2.0.
Changed the transaction client side to support late binding to the transaction monitor.
Bug fixed in the transactional client interceptor. This bug was due to a change in the OpenORB behavior concerning the slot

To get previous change log, please refer to the CHANGELOG file available within this service distribution."
Here is a sourceforge version of the same thing http://openorb.sourceforge.net/ > -----Original Message----- > From: Dann Corbit > Sent: Friday, October 10, 2003 9:38 PM > To: Christopher Browne; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] 2-phase commit > > > Why not apply the effort to something already done and > compatibly licensed? > > This: > http://dog.intalio.com/ots.html > > Appears to be a Berkeley style licensed: > http://dog.intalio.com/license.html > > Transaction monitor. > > "Overview > The OpenORB Transaction Service is a very scalable > transaction monitor which also provides several extensions > like XA management, a management interface to control all > transaction processes and a high reliable recovery system. > > By coordinating OpenORB and OpenORB Transaction Service, you > provide a reliable and powerful foundation for building large > scalable distributed applications. > > Datasheet > The OpenORB Transaction Service is a fully compliant > implementation of the OMG Transaction Service specification. > The OpenORB Transaction Service features are : > Management of distributed transactions with a two phase > commit protocol > Sub Transactions management ( nested transactions ) > Propagation of the transaction context between CORBA objects > Management of distributed transactions propagation through > databases with the XA protocol > Automatic logs to be able to make recovery in case of failures > Can be used as a transaction initiator or subordinate > High-performance, multiple thread architecture > Developed with POA > Provides a management interface to control all transactions > Full support of JTA > JDBC pooling and automatic resource enlistment > > > Download > To download the OpenORB Transaction Service, do one of the > following : > CVS : you can use CVS to grab the sources directly. > FTP : you get either a CVS snapshot or a prebuilt version > To use one of these possibilities, go to the Download Services page. > > ChangeLog > August 15th 2001. Version 1.2.0. > Changed the transaction client side to support late binding > to the transaction monitor. > Bug fixed in the transactional client interceptor. This bug > was due to a change in the OpenORB behavior concerning the slot > > > To get previous change log, please refer to the CHANGELOG > file available within this service distribution." > > ---------------------------(end of > broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > http://www.postgresql.org/docs/faqs/FAQ.html
On Fri, Oct 10, 2003 at 09:37:53PM -0700, Dann Corbit wrote: > Why not apply the effort to something already done and compatibly > licensed? > > This: > http://dog.intalio.com/ots.html > > Appears to be a Berkeley style licensed: > http://dog.intalio.com/license.html > > Transaction monitor. I'd say this is complementary, not an alternative to 2PC implementation issues. The transaction monitor lives on the other side of the problem. 2PC is needed in the database _so that_ the transaction monitor can do its job. That said, having a 3-tier model is probably a good idea if distributed transaction management is what we want. :-) Jeroen
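To make that division of labour concrete, here is a minimal sketch of
the loop a transaction monitor runs on top of its resource managers. The
Participant interface is invented for illustration; it is not the XA API
and not anything in PostgreSQL. It only names the two hooks ("promise to
commit", then "really commit" or "forget it") that the database has to
expose so the monitor can do its job.

import java.util.List;

// Hypothetical sketch of the transaction monitor's half of 2PC.
// "Participant" stands in for whatever a resource manager (a database,
// a message queue, ...) exposes; none of these names are real APIs.
interface Participant {
    boolean prepare(String gid);   // phase 1: promise to commit, vote yes/no
    void commit(String gid);       // phase 2a: make the promise permanent
    void rollback(String gid);     // phase 2b: release the promise
}

class CoordinatorSketch {
    // Returns true if the global transaction committed.
    static boolean runTwoPhaseCommit(String gid, List<Participant> participants) {
        // Phase 1: ask everyone to prepare; any "no" vote aborts them all.
        for (Participant p : participants) {
            if (!p.prepare(gid)) {
                // Simplified: a real monitor only rolls back the ones
                // that actually prepared.
                for (Participant q : participants) q.rollback(gid);
                return false;
            }
        }
        // A real coordinator durably logs its decision here, before
        // phase 2, so it can finish the protocol after a crash.

        // Phase 2: every vote was yes, so tell them all to commit.
        for (Participant p : participants) p.commit(gid);
        return true;
    }
}

The database's part is everything hiding behind prepare(): surviving a
crash while in the promised state, which is exactly what the rest of
this thread is about.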
> -----Original Message----- > From: Jeroen T. Vermeulen [mailto:jtv@xs4all.nl] > Sent: Saturday, October 11, 2003 5:36 AM > To: Dann Corbit > Cc: Christopher Browne; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] 2-phase commit > > > On Fri, Oct 10, 2003 at 09:37:53PM -0700, Dann Corbit wrote: > > Why not apply the effort to something already done and compatibly > > licensed? > > > > This: > > http://dog.intalio.com/ots.html > > > > Appears to be a Berkeley style licensed: > > http://dog.intalio.com/license.html > > > > Transaction monitor. > > I'd say this is complementary, not an alternative to 2PC > implementation issues. My notion is that the specification has been created that describes how the system should operate, what the API's are, etc. I think that most of the work is involved in that area. The notion is that if you program to this spec, it will already have been well thought out and it should be standards based when completed. > The transaction monitor lives on the other side of the > problem. 2PC is needed in the database _so that_ the > transaction monitor can do its job. Theoretically, if any database in the chain supports 2PC, you could make all connected systems 2PC compliant by using the one functional system as a persistent store. But you are right. PostgreSQL still would need the "I promise to commit when you ask" method if it is to really support it. I think another way it could be handled is with nested transactions. Just have the promise phase be an inner transaction commit but have an outer transaction bracket that one for the actual commit. > That said, having a 3-tier model is probably a good idea if > distributed transaction management is what we want. :-) In real life, I think it is _always_ done this way.
> I think another way it could be handled is with nested transactions. > Just have the promise phase be an inner transaction commit but have an > outer transaction bracket that one for the actual commit. Not really. In the event of a crash, most 2PC systems will expect the participant to come back in the same state it crashed in. Our nested-transaction implementation (like our standard transaction implementation) aborts all transactions on crash.
On Monday 13 October 2003 20:11, Rod Taylor wrote: > > I think another way it could be handled is with nested transactions. > > Just have the promise phase be an inner transaction commit but have an > > outer transaction bracket that one for the actual commit. > > Not really. In the event of a crash, most 2PC systems will expect the > participant to come back in the same state it crashed in. > Yes, this is correct. There are certain phases of the protocol in which the transaction state must be re-instated from the log file after a crash of the DB server. The re-instatement must occur prior to any connections being accepted by the server. Additionally, the coordinator must be fully recoverable as well. The coordinator may, depending on the phase of the commit/abort, contact child servers after it crashes. The requirement is that during log replay, the transaction structures might have to be fully reconstructed and remain in-place after log replay has completed, until the disposition of the (sub)transaction is settled by the coordinator. All dependent on the phase of course. > Our nested-transaction implementation (like our standard transaction > implementation) aborts all transactions on crash. Jordan Henderson
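For what it's worth, the coordinator-side half of what Jordan describes
is usually summarized as "presumed abort": after replaying its own log,
anything the coordinator durably decided to commit must be pushed
through to the child servers, and anything else is rolled back. The
sketch below is hypothetical; the types are invented and it is not code
from either patch, it only records that rule.

import java.util.Map;

// Hypothetical sketch of coordinator recovery after a crash.  GidState
// and Participant are invented types; in a real system this runs during
// log replay, before any new work is accepted.
class CoordinatorRecoverySketch {
    enum GidState { PREPARING, COMMIT_DECIDED, ABORT_DECIDED }

    interface Participant {
        void commit(String gid);
        void rollback(String gid);
    }

    static void recover(Map<String, GidState> replayedState,
                        Iterable<Participant> childServers) {
        for (Map.Entry<String, GidState> e : replayedState.entrySet()) {
            String gid = e.getKey();
            switch (e.getValue()) {
                case COMMIT_DECIDED:
                    // The decision was logged, so it must be driven to
                    // completion, however long the children were down.
                    for (Participant p : childServers) p.commit(gid);
                    break;
                case PREPARING:
                case ABORT_DECIDED:
                    // Presumed abort: anything not known to be committed
                    // is rolled back everywhere.
                    for (Participant p : childServers) p.rollback(gid);
                    break;
            }
        }
    }
}

The participant's side is the mirror image: after its own replay it must
still be holding the locks and the prepared state for every GID it
promised, until one of those two calls arrives.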
Bruce Momjian wrote:
> Tatsuo Ishii wrote:
>> > Yes. I don't think that 2PC is a solution for robustness in face of
>> > network failure. It's too slow, to begin with. Some sort of
>> > multi-master system is very desirable for network failures, &c., but
>> > I don't think anybody does active/hot standby with 2PC any more; the
>> > performance is too bad.
>>
>> I'm tired of this kind of "2PC is too slow" arguments. I think
>> Satoshi, the only guy who made a trial implementation of 2PC for
>> PostgreSQL, has already showed that 2PC is not that slow.
>
> Agreed. Let's get it into 7.5 and see it in action. If we need to
> adjust it, we can, but right now, we need something for distributed
> transactions, and this seems like the logical direction.
>

Are you guys kidding or what?

2PC is not too slow in normal operations when everything is purring like
little kittens and you're just wasting your excess bandwidth on it. The
point is that it behaves horribly, like a dirty backstreet cat, at the
time when things go wrong ... basically it's a neat thing to have, but
from the second you need it, it becomes useless.

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
> 2PC is not too slow in normal operations when everything is purring
> like little kittens and you're just wasting your excess bandwidth on
> it. The point is that it behaves horribly, like a dirty backstreet
> cat, at the time when things go wrong ... basically it's a neat thing
> to have, but from the second you need it, it becomes useless.

I can't see anyone being forced to use it once it is (maybe) supported.
Like many tools, it will produce an "ouch!" when used untrained or
incorrectly.

Peter
>>I'm tired of this kind of "2PC is too slow" arguments. I think
>>Satoshi, the only guy who made a trial implementation of 2PC for
>>PostgreSQL, has already showed that 2PC is not that slow.
>
>
> Where does Satoshi's implementation sit right now? Will it patch to v7.4?
> Can it provide us with a base to work from, or is it complete?

It is not ready yet. You can find it at ...

	http://snaga.org/pgsql/

It is based on 7.3

* the 2-phase commit protocol (precommit and commit)
* the multi-master replication using 2PC
* distributed transaction (distributed query)

current work
* restarting (from 2nd phase) when the session is disconnected in 2nd phase (XLOG stuffs)
* XA compliance

future work
* hot failover and recovery in PostgreSQL cluster
* data partitioning on different servers

I compiled it a while ago. Seems to be pretty nice :).

	Hans

-- 
Cybertec Geschwinde u Schoenig
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/2952/30706 or +43/660/816 40 77
www.cybertec.at, www.postgresql.at, kernel.cybertec.at
On Thu, 9 Oct 2003, Bruce Momjian wrote: > Agreed. Let's get it into 7.5 and see it in action. If we need to > adjust it, we can, but right now, we need something for distributed > transactions, and this seems like the logical direction. I've started working on two-phase commits last week, and the very basic stuff is now working. Still a lot of bugs though. I posted the stuff I've put together to patches-list. I'd appreciate any comments. - Heikki
>>Why would you spent time on implementing a mechanism whose ultimate
>>benefit is supposed to be increasing reliability and performance, when you
>>already realize that it will have to lock up at the slightest sight of
>>trouble? There are better mechanisms out there that you can use instead.
>
>
> If you want cross-server transactions, what other methods are there that
> are more reliable? It seems network unreliability is going to be a
> problem no matter what method you use.
>

I guess we need something like PITR to make this work, because otherwise
I cannot see a way to get in sync again. Maybe I should call the desired
mechanism "entire cluster back to transaction X" recovery. Did anybody
hear about PITR recently?

How else would you recover from any kind of problem? No matter what you
are doing, network reliability will be a problem, so we have to live
with it. Having some way of "going back to something consistent" is
necessary anyway.

People might argue now that committed transactions might be lost. If
people knew which ones, it's ok. 90% of all people will understand that
in case of a crash something evil might happen.

	Hans

-- 
Cybertec Geschwinde u Schoenig
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/2952/30706 or +43/660/816 40 77
www.cybertec.at, www.postgresql.at, kernel.cybertec.at
Satoshi, can you get this ready for inclusion in 7.5? We need a formal
proposal of how it will work from the user's perspective (new
commands?), and how it will internally work. It seems Heikki Linnakangas
has also started working on this and perhaps he can help.

Ideally, we should have this proposal when we start 7.5 development in a
few weeks.

I know some people have concerns about 2-phase commit, from a
performance perspective and from a network failure perspective, but I
think there are enough people who want it that we should see how this
can be implemented with the proper safeguards.

---------------------------------------------------------------------------

Satoshi Nagayasu wrote:
> 
> Andrew Sullivan <andrew@libertyrms.info> wrote:
> > On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote:
> > > Satoshi, the only guy who made a trial implementation of 2PC for
> > > PostgreSQL, has already showed that 2PC is not that slow.
> > 
> > If someone has a fast implementation, so much the better. I'm not
> > opposed to fast implementations!
> 
> The pgbench results of my experimental 2PC implementation
> and plain postgresql are available.
> 
> PostgreSQL 7.3
>   http://snaga.org/pgsql/pgbench/pgbench-REL7_3.log
> 
> Experimental 2PC in PostgreSQL 7.3
>   http://snaga.org/pgsql/pgbench/pgbench-TPC0_0_2.log
> 
> I can't see a grave overhead from this comparison.
> 
> > A
> > 
> > -- 
> > ----
> > Andrew Sullivan                         204-4141 Yonge Street
> > Afilias Canada                          Toronto, Ontario Canada
> > <andrew@libertyrms.info>                              M2P 2A8
> >                                          +1 416 646 3304 x110
> > 
> > 
> > ---------------------------(end of broadcast)---------------------------
> > TIP 8: explain analyze is your friend
> 
> 
> -- 
> NAGAYASU Satoshi <snaga@snaga.org>
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
>                http://archives.postgresql.org

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce, Ok, I will write my proposal. BTW, my 2PC work is now suspended because of my master thesis. My master thesis will (must) be finished in next few months. To finish 2PC work, I feel 2 or 3 months are needed after that. Bruce Momjian wrote: > Satoshi, can you get this ready for inclusion in 7.5? We need a formal > proposal of how it will work from the user's perspective (new > commands?), and how it will internally work. It seem Heikki Linnakangas > has also started working on this and perhaps he can help. > > Ideally, we should have this proposal when we start 7.5 development in a > few weeks. > > I know some people have concerns about 2-phase commit, from a > performance perspective and from a network failure perspective, but I > think there are enough people who want it that we should see how this > can be implemented with the proper safeguards. > > --------------------------------------------------------------------------- > > Satoshi Nagayasu wrote: > >>Andrew Sullivan <andrew@libertyrms.info> wrote: >> >>>On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote: >>> >>>>Satoshi, the only guy who made a trial implementation of 2PC for >>>>PostgreSQL, has already showed that 2PC is not that slow. >>> >>>If someone has a fast implementation, so much the better. I'm not >>>opposed to fast implementations! >> >>The pgbench results of my experimental 2PC implementation >>and plain postgresql are available. >> >>PostgreSQL 7.3 >> http://snaga.org/pgsql/pgbench/pgbench-REL7_3.log >> >>Experimental 2PC in PostgreSQL 7.3 >> http://snaga.org/pgsql/pgbench/pgbench-TPC0_0_2.log >> >>I can't see a grave overhead from this comparison. >> >> >>>A >>> >>>-- >>>---- >>>Andrew Sullivan 204-4141 Yonge Street >>>Afilias Canada Toronto, Ontario Canada >>><andrew@libertyrms.info> M2P 2A8 >>> +1 416 646 3304 x110 >>> >>> >>>---------------------------(end of broadcast)--------------------------- >>>TIP 8: explain analyze is your friend >>> >> >> >>-- >>NAGAYASU Satoshi <snaga@snaga.org> >> >> >>---------------------------(end of broadcast)--------------------------- >>TIP 6: Have you searched our list archives? >> >> http://archives.postgresql.org >> > > -- NAGAYASU Satoshi <snaga@snaga.org>
Satoshi Nagayasu wrote:
> Bruce,
>
> Ok, I will write my proposal.
>
> BTW, my 2PC work is now suspended because of my master thesis.
> My master thesis will (must) be finished in next few months.
>
> To finish 2PC work, I feel 2 or 3 months are needed after that.

Oh, OK, that is helpful. Perhaps Heikki Linnakangas could help too.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Fri, 10 Oct 2003, Heikki Linnakangas wrote:

> On Thu, 9 Oct 2003, Bruce Momjian wrote:
>
> > Agreed. Let's get it into 7.5 and see it in action. If we need to
> > adjust it, we can, but right now, we need something for distributed
> > transactions, and this seems like the logical direction.
>
> I've started working on two-phase commits last week, and the very
> basic stuff is now working. Still a lot of bugs though.

I have done more work on my 2PC commit patch. I still need to work out
notifications and CREATE statements, but otherwise I'm quite happy with
it now.

I received no feedback on the first version, so I'll try to clarify how
it works a bit.

The patch is against the current cvs tip. I'll post it to the
patches-list, and you can also grab it from here:

http://www.hut.fi/~hlinnaka/twophase2.diff

The patch introduces three new commands, PREPCOMMIT, COMMITPREPARED and
ABORTPREPARED. PREPCOMMIT is called in place of COMMIT, to put the
active transaction block into prepared state. PREPCOMMIT takes a string
argument that becomes the Global Transaction Identifier (GID) for the
transaction. The GID is used as a handle to the
COMMITPREPARED/ABORTPREPARED commands to finish the 2nd phase commit.
After the PREPCOMMIT command finishes, the transaction is no longer
associated with any specific backend.

COMMITPREPARED/ABORTPREPARED commands are used to finish the prepared
transaction. They can be issued from any backend.

There's also a new system view, pg_prepared_xacts, that shows all
prepared transactions.

Here's a little step-by-step tutorial to trying out the patch:

---------
1. apply patch, patch -p0 < twophase2.diff
2. compile
3. create a new database system with initdb.
4. run postmaster
5. psql template1
6. CREATE TABLE foobar (a integer);
7. INSERT INTO foobar values (1);
8. BEGIN; UPDATE foobar SET a = 2 WHERE a = 1;
9. SELECT * FROM foobar;
10. PREPCOMMIT 'foobar_update1';

The transaction is now in prepared state, and it's no longer associated
with this backend, as you can see by issuing:

11. SELECT * FROM foobar;
12. SELECT * FROM pg_prepared_xacts;

Let's commit it then.

13. COMMITPREPARED 'foobar_update1';
14. SELECT * FROM pg_prepared_xacts;
15. SELECT * FROM foobar;

Next repeat steps 8-15 but try killing postmaster somewhere after step
9, and observe that the transaction is not lost. Also try doing another
update with a different backend, and see that the locks held by the
prepared transaction survive the crash.
--------

I also took a look at Satoshi's patches. The main difference is that his
implementation made modifications to the BE/FE protocol, while my
implementation works at the statement level. His patches don't handle
shutdowns or broken connections yet, but that was on his TODO list. When
I started working on 2PC, I didn't know about Satoshi's patches,
otherwise I probably would have taken them as a starting point.

The next step is going to be writing 2PC support to the JDBC driver
using the new backend commands. An XA interface would be very nice too,
but I'm personally not that interested in that. Any volunteers?

Please comment! I'd like to know what you guys think about this. Am I
heading in the right direction?

Some people have expressed concerns about performance issues with 2PC in
general. Please note that this patch doesn't change the traditional
commit routines, so it won't affect your performance if you don't use
2PC.

- Heikki
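Until the driver work happens, a client can already play coordinator by
hand over plain JDBC, simply issuing the new commands as ordinary
statements. A sketch, using the syntax described above; the URLs,
credentials and the foobar table are placeholders, and the error
handling is the bare minimum:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of a client driving the PREPCOMMIT / COMMITPREPARED commands
// from the patch across two databases.  The URLs, credentials and the
// foobar table are placeholders.
// (Older drivers may need Class.forName("org.postgresql.Driver") first.)
public class TwoDbCommitSketch {
    public static void main(String[] args) throws Exception {
        String gid = "demo_xfer_001";
        Connection a = DriverManager.getConnection("jdbc:postgresql://hostA/db", "user", "pass");
        Connection b = DriverManager.getConnection("jdbc:postgresql://hostB/db", "user", "pass");
        a.setAutoCommit(false);
        b.setAutoCommit(false);
        Statement sa = a.createStatement();
        Statement sb = b.createStatement();
        try {
            sa.executeUpdate("UPDATE foobar SET a = a - 1 WHERE a > 0");
            sb.executeUpdate("UPDATE foobar SET a = a + 1");

            // Phase 1: both backends promise to commit under the same GID.
            sa.execute("PREPCOMMIT '" + gid + "'");
            sb.execute("PREPCOMMIT '" + gid + "'");

            // Phase 2: finish.  After PREPCOMMIT the transactions are no
            // longer tied to these connections, so this could equally be
            // done later, from some other backend.
            sa.execute("COMMITPREPARED '" + gid + "'");
            sb.execute("COMMITPREPARED '" + gid + "'");
        } catch (Exception e) {
            // Best effort: abort whatever was already prepared.
            try { sa.execute("ABORTPREPARED '" + gid + "'"); } catch (Exception ignore) {}
            try { sb.execute("ABORTPREPARED '" + gid + "'"); } catch (Exception ignore) {}
            throw e;
        } finally {
            a.close();
            b.close();
        }
    }
}

The interesting part is what you cannot see: between the two PREPCOMMITs
and the two COMMITPREPAREDs, either server can be killed and, per the
tutorial above, the prepared half is still there when it comes back.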
Of course I have no time to work on it :(, but in my opinion an XA
interface and support in the JDBC driver are absolutely necessary. I
think that 2PC will generally be used more for supporting 2PC
transactions between the DB and JMS than for 2PC across two DBs.

Glad to see some progress on 2PC with Postgres though.

Later

Rob

>
> The next step is going to be writing 2PC support to the JDBC driver using
> the new backend commands. XA interface would be very nice too, but I'm
> personally not that interested in that. Any volunteers?
>
> Please comment! I'd like to know what you guys think about this. Am I
> heading into the right direction?
>
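Agreed that the XA/JTA side is where most consumers (app servers, JMS
brokers behind a transaction manager) would come in. As a rough
illustration of how thin that layer could be, here is a sketch of the
SQL each XA verb would boil down to with the new commands. It
deliberately does not implement javax.transaction.xa.XAResource (no Xid
mapping, no flags, no XAException handling), and the "gid" column used
in recover() is a guess at what pg_prepared_xacts exposes; adjust to
whatever the patch actually provides.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Sketch of how an XA-style driver object might map the standard verbs
// onto the new backend commands.  Not a real XAResource implementation;
// it only shows the SQL each XA call would come down to.
class PgXaMappingSketch {
    private final Connection conn;

    PgXaMappingSketch(Connection conn) { this.conn = conn; }

    void prepare(String gid) throws Exception {    // XA prepare
        run("PREPCOMMIT '" + gid + "'");
    }

    void commit(String gid) throws Exception {     // XA commit (two-phase)
        run("COMMITPREPARED '" + gid + "'");
    }

    void rollback(String gid) throws Exception {   // XA rollback
        run("ABORTPREPARED '" + gid + "'");
    }

    // XA recover: the transaction manager asks which transactions are
    // still in doubt after a crash, then decides each one's fate.
    // Assumes the view exposes the GID in a column called "gid".
    List<String> recover() throws Exception {
        List<String> gids = new ArrayList<String>();
        Statement st = conn.createStatement();
        try {
            ResultSet rs = st.executeQuery("SELECT gid FROM pg_prepared_xacts");
            while (rs.next()) gids.add(rs.getString(1));
        } finally {
            st.close();
        }
        return gids;
    }

    private void run(String sql) throws Exception {
        Statement st = conn.createStatement();
        try { st.execute(sql); } finally { st.close(); }
    }
}

For the DB-and-JMS case, a JTA transaction manager would call prepare()
on both this and the JMS provider's resource, and only then commit() on
each: the same two-phase dance, just spelled in XA.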