Thread: 2-phase commit

2-phase commit

From
Andrew Sullivan
Date:
Hi,

As the 7.4 beta rolls on, I thought now would be a good time to start
talking about the future.  

I have a potential need in the future for distributed transactions
(XA).  To get that from Postgres, I'd need two-phase commit, I think. 
There is someone working on such a project
(<http://snaga.org/pgsql/>), but last time it was discussed here, it
received a rather lukewarm reception (see, e.g., the thread starting
at
<http://archives.postgresql.org/pgsql-hackers/2003-06/msg00752.php>).

While at OSCON, I had a discussion with Joe Conway, Bruce Momjian,
and Greg Sabino Mullane about 2PC.  Various people expressed various
opinions on the topic, but I think we agreed on the following.  The
relevant folks can correct me if I'm wrong:

Two-phase commit has theoretical problems, but it is implemented in
several "enterprise" RDBMS.  2PC is something needed by certain kinds
of clients (especially those with transaction managers), so if
PostgreSQL doesn't have it, PostgreSQL just won't get supported in
that arena.  Someone is already working on 2PC, but may feel unwanted
due to the reactions last heard on the topic, and may not continue
working unless he gets some support.  What is a necessary condition
for such support is to get some idea of what compromises 2PC might
impose, and thereafter to try to determine which such compromises, if
any, are acceptable ones.

I think the idea here is that, while in most cases a "pretty-good"
implementation of a desirable feature might get included in the
source on the grounds that it can always be improved upon later,
something like 2PC has the potential to do great harm to an otherwise
reliable transaction manager.  So the arguments about what to do need
to be aired in advance. 

I (perhaps foolishly) volunteered to undertake to collect the
arguments in various directions, on the grounds that I can contribute
no code, but have skin made of asbestos.  I thought I'd try to
collect some information about what people think the problems and
potentially acceptable compromises are, to see if there is some way
to understand what can and cannot be contemplated for 2PC.  I'll
include in any such outline the remarks found in the -hackers thread
referenced above.  Any objections?

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Christopher Browne
Date:
In an attempt to throw the authorities off his trail, andrew@libertyrms.info (Andrew Sullivan) transmitted:
> As the 7.4 beta rolls on, I thought now would be a good time to start
> talking about the future.  
>
> I have a potential need in the future for distributed transactions
> (XA).  To get that from Postgres, I'd need two-phase commit, I think. 
> There is someone working on such a project
> (<http://snaga.org/pgsql/>), but last time it was discussed here, it
> received a rather lukewarm reception (see, e.g., the thread starting
> at
> <http://archives.postgresql.org/pgsql-hackers/2003-06/msg00752.php>).

Interesting/positive news on this front; the XA specification
documents are now all available in PDF form "freely", from the Open
Group, where they used to be fairly pricey.

<http://www.opengroup.org/publications/catalog/tp.htm>

Another notable XA documentation source is here...
<http://www.middleware.net/tuxedo/resources/XA_Documentation.html>

Two interesting implications of XA support would be that there could
be some "congruence of interests" that would arise regarding two
vendors:

- XA is essentially based on the API of BEA Tuxedo.  I'm told they
include a simple database system with Tuxedo, but nothing particularly
wonderful.  (Who thinks of BEA as a DBMS vendor???)  They might have
interest in bundling something better...

- The main Tuxedo reseller that I am aware of is PeopleSoft, who use
it for their "high traffic" clients.  Anyone that has seen news lately
knows that they and Oracle aren't exactly "best pals" these days;
having another DB option could be helpful to them...
-- 
(format nil "~S@~S" "aa454" "freenet.carleton.ca")
http://www3.sympatico.ca/cbbrowne/tpmonitor.html
"In order to make an apple pie from scratch, you must first create the
universe."  -- Carl Sagan, Cosmos


Re: 2-phase commit

From
"Jeroen T. Vermeulen"
Date:
On Tue, Aug 26, 2003 at 08:04:13PM -0400, Christopher Browne wrote:
> 
> Interesting/positive news on this front; the XA specification
> documents are now all available in PDF form "freely", from the Open
> Group, where they used to be fairly pricey.
A step in the right direction, but AFAIC it's too little, too late.
The impression I get, at least, is that it's as good as dead now: Java
may use it, but it hides the details anyway so it might as well not be
there--the Java way is to standardize the API but nothing that goes "on
the wire".  

Lots of proprietary middleware uses XA, but from what I hear there are
enough subtle differences to make mixing-and-matching of products risky
at best--the proprietary way is to bundle products that will work at
least marginally together, and relegate standards to a bullshit point
in the PowerPoint presentations.  "Based on industry standard" means
about the same as "based on a true story."

Then there's the fact that the necessary followup standards never got 
anywhere, and the fact that XA doesn't cope with threading really well.

Don't get me wrong, XA support may well be a good thing.  But at this
stage, personally I'd go for a good 2PC implementation first and worry 
about supporting XA later.


Jeroen



Re: 2-phase commit

From
Bruce Momjian
Date:
I haven't seen any comment on this email.

From our previous discussion of 2-phase commit, there was concern that
the failure modes of 2-phase commit were not solvable.  However, I think
multi-master replication is going to have similar non-solvable failure
modes, yet people still want multi-master replication.

We have had several requests for 2-phase commit in the past month.  I
think we should encourage the Japanese group to continue on their
2-phase commit patch to be included in 7.5.  Yes, it will have
non-solvable failure modes, but let's discuss them and find an
appropriate way to deal with the failures.

---------------------------------------------------------------------------

Andrew Sullivan wrote:
> Hi,
> 
> As the 7.4 beta rolls on, I thought now would be a good time to start
> talking about the future.  
> 
> I have a potential need in the future for distributed transactions
> (XA).  To get that from Postgres, I'd need two-phase commit, I think. 
> There is someone working on such a project
> (<http://snaga.org/pgsql/>), but last time it was discussed here, it
> received a rather lukewarm reception (see, e.g., the thread starting
> at
> <http://archives.postgresql.org/pgsql-hackers/2003-06/msg00752.php>).
> 
> While at OSCON, I had a discussion with Joe Conway, Bruce Momjian,
> and Greg Sabino Mullane about 2PC.  Various people expressed various
> opinions on the topic, but I think we agreed on the following.  The
> relevant folks can correct me if I'm wrong:
> 
> Two-phase commit has theoretical problems, but it is implemented in
> several "enterprise" RDBMS.  2PC is something needed by certain kinds
> of clients (especially those with transaction managers), so if
> PostgreSQL doesn't have it, PostgreSQL just won't get supported in
> that arena.  Someone is already working on 2PC, but may feel unwanted
> due to the reactions last heard on the topic, and may not continue
> working unless he gets some support.  What is a necessary condition
> for such support is to get some idea of what compromises 2PC might
> impose, and thereafter to try to determine which such compromises, if
> any, are acceptable ones.
> 
> I think the idea here is that, while in most cases a "pretty-good"
> implementation of a desirable feature might get included in the
> source on the grounds that it can always be improved upon later,
> something like 2PC has the potential to do great harm to an otherwise
> reliable transaction manager.  So the arguments about what to do need
> to be aired in advance. 
> 
> I (perhaps foolishly) volunteered to undertake to collect the
> arguments in various directions, on the grounds that I can contribute
> no code, but have skin made of asbestos.  I thought I'd try to
> collect some information about what people think the problems and
> potentially acceptable compromises are, to see if there is some way
> to understand what can and cannot be contemplated for 2PC.  I'll
> include in any such outline the remarks found in the -hackers thread
> referenced above.  Any objections?
> 
> A

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Mike Mascari
Date:
Bruce Momjian wrote:
> I haven't seen any comment on this email.
> 
> From our previous discussion of 2-phase commit, there was concern that
> the failure modes of 2-phase commit were not solvable.  However, I think
> multi-master replication is going to have similar non-solvable failure
> modes, yet people still want multi-master replication.
> 
> We have had several requests for 2-phase commit in the past month.  I
> think we should encourage the Japanese group to continue on their
> 2-phase commit patch to be included in 7.5.  Yes, it will have
> non-solvable failure modes, but let's discuss them and find an
> appropriate way to deal with the failures.

FWIW, Oracle 8's manual for the recovery of a distributed tx where the
coordinator never comes back on line is:

https://www.ifi.uni-klu.ac.at/Public/Documentation/oracle/product/8.0.3/doc/server803/A54643_01/ch_intro.htm#7783

"If a database must be recovered to a point in the past, Oracle's
recovery facilities allow database administrators at other sites to
return their databases to the earlier point in time also. This ensures
that the global database remains consistent."

So it seems, for Oracle 8 at least, PITR is the method of recovery for
cohorts after unrecoverable coordinator failure.

Ugly and yet probably a prerequisite.

Mike Mascari
mascarm@mascari.com









Re: 2-phase commit

From
Bruce Momjian
Date:
Mike Mascari wrote:
> Bruce Momjian wrote:
> > I haven't seen any comment on this email.
> > 
> > From our previous discussion of 2-phase commit, there was concern that
> > the failure modes of 2-phase commit were not solvable.  However, I think
> > multi-master replication is going to have similar non-solvable failure
> > modes, yet people still want multi-master replication.
> > 
> > We have had several requests for 2-phase commit in the past month.  I
> > think we should encourage the Japanese group to continue on their
> > 2-phase commit patch to be included in 7.5.  Yes, it will have
> > non-solvable failure modes, but let's discuss them and find an
> > appropriate way to deal with the failures.
> 
> FWIW, Oracle 8's manual for the recovery of a distributed tx where the
> coordinator never comes back on line is:
> 
> https://www.ifi.uni-klu.ac.at/Public/Documentation/oracle/product/8.0.3/doc/server803/A54643_01/ch_intro.htm#7783
> 
> "If a database must be recovered to a point in the past, Oracle's
> recovery facilities allow database administrators at other sites to
> return their databases to the earlier point in time also. This ensures
> that the global database remains consistent."
> 
> So it seems, for Oracle 8 at least, PITR is the method of recovery for
> cohorts after unrecoverable coordinator failure.

Yep, I assume PITR would be the solution for most failure cases --- very
ugly of course.

 


Re: 2-phase commit

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> From our previous discussion of 2-phase commit, there was concern that
> the failure modes of 2-phase commit were not solvable.  However, I think
> multi-master replication is going to have similar non-solvable failure
> modes, yet people still want multi-master replication.

No.  The real problem with 2PC in my mind is that its failure modes
occur *after* you have promised commit to one or more parties.  In
multi-master, if you fail you know it before you have told the client
his data is committed.
        regards, tom lane


Re: 2-phase commit

From
"Jeroen T. Vermeulen"
Date:
On Tue, Sep 09, 2003 at 08:38:41PM -0400, Bruce Momjian wrote:
> 
> Yep, I assume PITR would be the solution for most failure cases --- very
> ugly of course.

Anything can be broken in some way, if bad luck is willing to work hard
enough.  In at least one, ah, competing company I know of, employees are
allowed by the legal people to say "assured" but not "guaranteed" for
precisely this reason.

First thing is an acceptable failure mode, then you try to narrow its
chances of occurring.  And if worst comes to worst, one example of an
acceptable failure mode is "when in danger or doubt, run in circles,
scream and shout."


Jeroen



Re: 2-phase commit

From
"Zeugswetter Andreas SB SD"
Date:
> > From our previous discussion of 2-phase commit, there was concern that
> > the failure modes of 2-phase commit were not solvable.  However, I think
> > multi-master replication is going to have similar non-solvable failure
> > modes, yet people still want multi-master replication.
>
> No.  The real problem with 2PC in my mind is that its failure modes
> occur *after* you have promised commit to one or more parties.  In
> multi-master, if you fail you know it before you have told the client
> his data is committed.

Hmm?  The application cannot take the first-phase commit as its commit info.
It needs to wait for the second-phase commit. The second phase is only
finished when all coservers have reported back. 2PC is synchronous.

The problem with 2PC arises when, after the second-phase commit has been sent
to all servers and before all of them have reported back, one of them becomes
unreachable or goes down (did it receive and perform the second commit or
not?). Such a transaction must stay open until the coserver is reachable
again or an administrator has committed or aborted it.

It is multi master replication that usually has an asynchronous mode for
performance, and there the trouble starts.
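
The synchronous flow described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
Participant class, its method names and the in-memory "network" are all
hypothetical, and nothing here reflects the actual patch or any PostgreSQL
API. The point it shows is that the coordinator reports success only after
every co-server has finished the second phase.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical in-memory stand-in for one co-server."""

    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "active"

    def prepare(self):
        # Phase 1: promise to commit if asked, or refuse.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return Vote.YES if self.will_vote_yes else Vote.NO

    def commit(self):
        # Phase 2: only legal after a successful prepare.
        assert self.state == "prepared"
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True only after every participant has finished phase 2."""
    votes = [p.prepare() for p in participants]
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return True  # only now may the application treat the data as committed
    for p in participants:
        p.abort()
    return False
```

The sketch deliberately has no failure handling: every call returns, so it
shows only the normal case and the unanimous-vote rule.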

Andreas


Re: 2-phase commit

From
Bruce Momjian
Date:
Zeugswetter Andreas SB SD wrote:
> 
> > > From our previous discussion of 2-phase commit, there was concern that
> > > the failure modes of 2-phase commit were not solvable.  However, I think
> > > multi-master replication is going to have similar non-solvable failure
> > > modes, yet people still want multi-master replication.
> > 
> > No.  The real problem with 2PC in my mind is that its failure modes
> > occur *after* you have promised commit to one or more parties.  In
> > multi-master, if you fail you know it before you have told the client
> > his data is committed.
> 
> Hmm ? The appl cannot take the first phase commit as its commit info. It 
> needs to wait for the second phase commit. The second phase is only finished
> when all coservers have reported back. 2PC is synchronous.
> 
> The problems with 2PC are when after second phase commit was sent to all
> servers and before all report back one of them becomes unreachable/down ...
> (did it receive and do the 2nd commit or not) Such a transaction must stay
> open until the coserver is reachable again or an administrator committed/aborted it. 
> 
> It is multi master replication that usually has an asynchronous mode for
> performance, and there the trouble starts.

Let me diagram this so we can see the issues.  Normal operation is:

    Master        Slave
    ------        -----
    commit ready-->
            <--OK
    commit done--->
            <--OK
    completed

One possible failure is:

    Master        Slave
    ------        -----
    commit ready-->
            <--OK
    commit done--->
            dies here
    stuck waiting

Another possible failure is:

    Master        Slave
    ------        -----
    commit ready-->
            <--OK
    dies here        stuck waiting

Are these the issues?  Can't we just add GUC timeouts to cause the
commit to fail, and the slave to stop waiting?  I suppose a problem is:

    Master        Slave
    ------        -----
    commit ready-->
            <--OK
    sleep        stuck waiting, times out
    commit done

Could we allow slaves to check if the backend is still alive, perhaps by
asking the postmaster, similar to what we do with the cancel signal ---
that way, the slave would never time out and always wait if the master
was alive.

 


Re: 2-phase commit

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Could we allow slaves to check if the backend is still alive, perhaps by
> asking the postmaster, similar to what we do with the cancel signal ---
> that way, the slave would never time out and always wait if the master
> was alive.

You're not considering the possibility of a transient communication
failure.  The fact that you cannot currently contact the other guy
is not proof that he's not still alive.

Example:

    Master        Slave
    ------        -----
    commit ready-->
            <--OK
    commit done->XX

where "->XX" means the message gets lost due to network failure.  Now
what?  The slave cannot abort; he promised he could commit, and he does
not know whether the master has committed or not.  The master does not
know the slave's state either; maybe he got the second message, and
maybe he didn't.  Both sides are forced to keep information about the 
open transaction indefinitely.  Timing out on either side could yield
the wrong result.
        regards, tom lane
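
The lost-message case above can be played through with a small Python
sketch. Everything in it is hypothetical (the Node class, the state names,
the deliver_commit_done flag); it is a toy model of the exchange, not of any
real implementation. It shows why a slave that aborts on timeout can end up
disagreeing with a master that has already committed.

```python
class Node:
    """Hypothetical node holding the state of a single transaction."""

    def __init__(self, name):
        self.name = name
        self.state = "active"


def run_exchange(deliver_commit_done):
    master, slave = Node("master"), Node("slave")
    # "commit ready" / "OK": the slave promises it can commit.
    slave.state = "prepared"
    # The master has its OK, so it commits locally...
    master.state = "committed"
    # ...and sends "commit done", which may be lost on the wire.
    if deliver_commit_done:
        slave.state = "committed"
    return master, slave


def slave_times_out(slave):
    # A unilateral abort after a timeout: the risky shortcut.
    if slave.state == "prepared":
        slave.state = "aborted"


master, slave = run_exchange(deliver_commit_done=False)
slave_times_out(slave)
print(master.state, slave.state)
```

With the message dropped, the slave stays in "prepared" until something
resolves it; forcing a timeout leaves the two nodes with different outcomes
for the same transaction.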


Re: 2-phase commit

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Could we allow slaves to check if the backend is still alive, perhaps by
> > asking the postmaster, similar to what we do with the cancel signal ---
> > that way, the slave would never time out and always wait if the master
> > was alive.
> 
> You're not considering the possibility of a transient communication
> failure.  The fact that you cannot currently contact the other guy
> is not proof that he's not still alive.
> 
> Example:
> 
>     Master        Slave
>     ------        -----
>     commit ready-->
>             <--OK
>     commit done->XX
> 
> where "->XX" means the message gets lost due to network failure.  Now
> what?  The slave cannot abort; he promised he could commit, and he does
> not know whether the master has committed or not.  The master does not
> know the slave's state either; maybe he got the second message, and
> maybe he didn't.  Both sides are forced to keep information about the 
> open transaction indefinitely.  Timing out on either side could yield
> the wrong result.

Can't the master re-send the request after a timeout?

 


Re: 2-phase commit

From
"Marc G. Fournier"
Date:

On Fri, 26 Sep 2003, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Could we allow slaves to check if the backend is still alive, perhaps by
> > asking the postmaster, similar to what we do with the cancel signal ---
> > that way, the slave would never time out and always wait if the master
> > was alive.
>
> You're not considering the possibility of a transient communication
> failure.  The fact that you cannot currently contact the other guy
> is not proof that he's not still alive.
>
> Example:
>
>     Master        Slave
>     ------        -----
>     commit ready-->
>             <--OK
>     commit done->XX
>
> where "->XX" means the message gets lost due to network failure.  Now
> what?

'k, but isn't a lot of that a "retry" issue?  we're talking TCP here, not
UDP, which I *thought* was designed for transient network problems ... ?
I would think that any implementation would have a timeout/retry GUC
variable associated with it ... 'if no answer in x seconds, retry up to y
times' ...

if we are talking two computers sitting next to each other on a switch,
you'd expect those to be low ... but if you were talking about two
separate geographical locations (and yes, I realize you are adding lag to
the mix with waiting for responses), you'd expect those #s to rise ...



Re: 2-phase commit

From
Patrick Welche
Date:
On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote:
... 
> if we are talking two computers sitting next to each other on a switch,
> you'd expect those to be low ... but if you were talking about two
> separate geographical locations (and yes, I realize you are adding lag to
> the mix with waiting for responses), you'd expect those #s to rise ...

Which I thought was the whole point of using a group communication protocol
such as spread in postgresql-r. It seemed solved there...

Cheers,

Patrick


Re: 2-phase commit

From
Bruce Momjian
Date:
Patrick Welche wrote:
> On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote:
> ... 
> > if we are talking two computers sitting next to each other on a switch,
> > you'd expect those to be low ... but if you were talking about two
> > separate geographical locations (and yes, I realize you are adding lag to
> > the mix with waiting for responses), you'd expect those #s to rise ...
> 
> Which I thought was the whole point of using a group communication protocol
> such as spread in postgresql-r. It seemed solved there...

Right, but I think we want to try to do two-phase commit without spread.
Spread seems overkill for this usage.

 


Re: 2-phase commit

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> You're not considering the possibility of a transient communication
>> failure.

> Can't the master re-send the request after a timeout?

Not "it can", but "it has to".  The master *must* keep hold of that
request forever (or until the slave responds, or until we reconfigure
the system not to consider that slave valid anymore).  Similarly, the
slave cannot forget the maybe-committed transaction on pain of not being
a valid slave anymore.  You can make this work, but the resource costs
are steep.  For instance, in Postgres, you don't get to truncate the WAL
log, for what could be a really really long time --- more disk space
than you wanted to spend on WAL anyway.  The locks held by the
maybe-committed transaction are another potentially unpleasant problem;
you can't release them, no matter what else they are blocking.
        regards, tom lane
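
The "it has to" rule above amounts to a retry loop that holds on to the
request until it is acknowledged. A rough Python sketch follows, with
hypothetical names throughout (resend_until_acked, make_flaky_link); the
max_attempts parameter is an assumption standing in for "until we
reconfigure the system not to consider that slave valid anymore".

```python
import time


def resend_until_acked(send, max_attempts=None, delay=0.0):
    """Call send() until it returns True; return the number of attempts.

    max_attempts=None models "forever"; a finite value models an
    administrator declaring the slave invalid after that many tries.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        if send():
            return attempts
        time.sleep(delay)
    raise TimeoutError("slave never acknowledged; it must be declared invalid")


def make_flaky_link(drops):
    """Return a send() that loses the first `drops` messages."""
    remaining = {"drops": drops}

    def send():
        if remaining["drops"] > 0:
            remaining["drops"] -= 1
            return False
        return True

    return send
```

The sketch leaves out the expensive part: while the loop is running, the
master still has to retain the request, and in Postgres terms the WAL and
the locks of the maybe-committed transaction, for as long as it takes.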


Re: 2-phase commit

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> You're not considering the possibility of a transient communication
> >> failure.
> 
> > Can't the master re-send the request after a timeout?
> 
> Not "it can", but "it has to".  The master *must* keep hold of that
> request forever (or until the slave responds, or until we reconfigure
> the system not to consider that slave valid anymore).  Similarly, the
> slave cannot forget the maybe-committed transaction on pain of not being
> a valid slave anymore.  You can make this work, but the resource costs
> are steep.  For instance, in Postgres, you don't get to truncate the WAL
> log, for what could be a really really long time --- more disk space
> than you wanted to spend on WAL anyway.  The locks held by the
> maybe-committed transaction are another potentially unpleasant problem;
> you can't release them, no matter what else they are blocking.

I think we would need a configurable timeout to say a slave is no longer
valid, like 60 seconds, and then let everyone release.  We can let the
administrator decide how long he wants to try to keep two hosts
communicating.  I don't see this as much different from multi-master
replication problems.

 


Re: 2-phase commit

From
"Marc G. Fournier"
Date:

On Fri, 26 Sep 2003, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> You're not considering the possibility of a transient communication
> >> failure.
>
> > Can't the master re-send the request after a timeout?
>
> Not "it can", but "it has to".  The master *must* keep hold of that
> request forever (or until the slave responds, or until we reconfigure
> the system not to consider that slave valid anymore).  Similarly, the
> slave cannot forget the maybe-committed transaction on pain of not being
> a valid slave anymore.

Hrmmmm ... is there no way of having part of the protocol being a message
sent back that it's a valid/invalid slave?  i.e. slave has an uncommitted
transaction, never hears back from master to actually do the commit, so
after x-secs * y-retries any messages it does try to send to the master
have a bit flag set to 'invalid'?



Re: 2-phase commit

From
Christopher Browne
Date:
pgman@candle.pha.pa.us (Bruce Momjian) writes:
> Patrick Welche wrote:
>> On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote:
>> ... 
>> > if we are talking two computers sitting next to each other on a switch,
>> > you'd expect those to be low ... but if you were talking about two
>> > separate geographical locations (and yes, I realize you are adding lag to
>> > the mix with waiting for responses), you'd expect those #s to rise ...
>> 
>> Which I thought was the whole point of using a group communication
>> protocol such as spread in postgresql-r. It seemed solved there...
>
> Right, but I think we want to try to do two-phase commit without
> spread.  Spread seems overkill for this usage.

Is there some big demerit to _having_ that "overkill"?  If there is no
major price to pay, then I don't see why it isn't reasonable to simply
say "Sure, we'll use that!"

After all, PostgreSQL is set up to do _everything_ inside
transactions, even though there are some actions you might take that
don't forcibly need to be transactional.  That's overkill, and nobody
(well, barring fans of Certain Other Databases) complains that it's
overkill.
-- 
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Fri, Sep 26, 2003 at 01:34:28PM -0400, Tom Lane wrote:
> 
> Example:
> 
>     Master        Slave
>     ------        -----
>     commit ready-->
>             <--OK
>     commit done->XX

> maybe he didn't.  Both sides are forced to keep information about the 
> open transaction indefinitely.  Timing out on either side could yield
> the wrong result.

If I understand the complaints, I think there are two big issues.

The first problem is the restart/rejoin problem.  When a 2PC member
goes away, it is supposed to come back with all its former locks and
everything in place, so that it can know what to do.  This is also
extremely tricky, but I think the answer is sort of easy.  A member
which re-joins without crashing (that is, it has open transactions,
&c.) just has to complete its transactions with the other
member(s).  If other members have processed new transactions since
the member left, the member is kicked out.  It's not allowed to join
without being re-initialised.  A member which crashes is just a
special case of this.  This is not elegant, not nice, &c.  But I
don't think anyone can really guarantee that a crashed member will
start up correctly (it crashed, after all; maybe there's a bug).  So
this is the safest approach, and I don't think it's a big deal.  It's
not cheap, of course, and there may be problems arising from the
conditions I describe below.  But I think they can be handled (see
the section on "compromises", below) intelligently.
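
The rejoin rule in the paragraph above could be expressed roughly as
follows. This is a sketch under assumptions: the two "last transaction"
counters are hypothetical, as is the function name, and treating a crashed
member as always needing re-initialisation is one reading of "a special
case of this".

```python
def rejoin_action(member_last_txn, cluster_last_txn, crashed=False):
    """Decide what a returning member may do.

    Both *_last_txn values are hypothetical counters of the last
    transaction each side has processed.
    """
    # The cluster moved on without the member, or the member crashed:
    # it may not rejoin until it has been re-initialised.
    if crashed or member_last_txn != cluster_last_txn:
        return "reinitialise"
    # Nothing new happened while the member was away.
    return "complete open transactions"
```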

The second, stickier problem is just as Tom describes.  When the
master has sent "commit done" and that message doesn't make it to the other
host(s), you might have to wait forever.  Of course, that's not
acceptable.

But I can think of some options of how to decide to handle this. 
Note that these may not guarantee no loss of data.  That's not a
compromise one is usually willing to make; but just because I don't
want to accept that compromise doesn't mean it is unacceptable to
everyone.

Some possible compromises
=========================

1.    One machine always wins.  One could decide to pick one
machine that, in case of some sort of failure, always wins.  You need
some sort of heartbeat system which checks for the other member(s) of
the cluster.  In the event of failure, whatever is on the "winner"
machine is deemed to be correct, and everyone else has to lose.  If
the point of your 2PC is to provide synchronous access to high loads
of read-only clients, this would probably be a good solution, since
only one machine would ever see data changes.

2.    Quorum rule.  One could decide on a quorum of machines, and
the group which has quorum wins.  (Naturally, this has to be an
absolute majority.)  The quorum can continue to process queries, and
the folks who left the room have to re-sync to join.

3.    Fail to read-only status and let the DBA sort it out.  

4.    Mark the contentious rows as "bad" and let the DBA sort it
out.  This option is not dissimilar to what Access/SQL Server
disconnected multi-master replication does.  It's not elegant, but it
might be a good answer for the cases where 2PC gets used.
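
As a small illustration of option 2, here is a sketch of the "absolute
majority" test.  The function name is hypothetical, and it assumes that
every member knows the configured cluster size and counts itself among
the reachable members.

```python
def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """True if this partition holds an absolute majority of the cluster."""
    if not 0 <= reachable_members <= cluster_size:
        raise ValueError("reachable_members must be between 0 and cluster_size")
    # Strictly more than half, so two disjoint partitions can never both win.
    return 2 * reachable_members > cluster_size


for size, reachable in [(3, 2), (4, 2), (5, 3)]:
    print(size, reachable, has_quorum(reachable, size))
```

An even split (2 of 4) deliberately fails the test: neither half may
continue, which is why the majority has to be absolute.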

Note that none of these can guarantee that some apparently committed
data will not later be lost.  To real database hounds, that will
sound like apostasy, but I suspect it is the sort of trade-off that
real products make all the time.  You have to have a way of
collecting the "yeah, we told you it was committed, but we lied" data
and being able to track it; and that has to be enough.  The real
security-of-data work is going to have to be done by ultra-reliable
hardware, good maintenance practices, &c.  Then when losses are down
in the .001% range from this sort of mistake, no one will care.

This is not, by the way, the fully-formed set of suggestions I said
I'd deliver when I started the thread; but since it came up again
today, I thought I'd respond with what I had so far.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Andrew Sullivan
Date:
On Fri, Sep 26, 2003 at 02:05:36PM -0400, Tom Lane wrote:
> a valid slave anymore.  You can make this work, but the resource costs
> are steep.  For instance, in Postgres, you don't get to truncate the WAL

But people who want 2PC are more than willing to pay all that cost. 

A
-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
"Marc G. Fournier"
Date:

On Fri, 26 Sep 2003, Christopher Browne wrote:

> pgman@candle.pha.pa.us (Bruce Momjian) writes:
> > Patrick Welche wrote:
> >> On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote:
> >> ...
> >> > if we are talking two computers sitting next to each other on a switch,
> >> > you'd expect those to be low ... but if you were talking about two
> >> > seperate geographical locations (and yes, I realize you are adding lag to
> >> > the mix with waiting for responses), you'd expect those #s to rise ...
> >>
> >> Which I thought was the whole point of using a group communication
> >> protocol such as spread in postgresql-r. It seemed solved there...
> >
> > Right, but I think we want to try to do two-phase commit without
> > spread.  Spread seems overkill for this usage.
>
> Is there some big demerit to _having_ that "overkill"?  If there is no
> major price to pay, then I don't see why it isn't reasonable to simply
> say "Sure, we'll use that!"

Reliance on a third party library to be installed to provide the
functionality ... if it were meant as an "add on" instead of "standard
feature", then sure ...


Re: 2-phase commit

From
Rod Taylor
Date:
On Fri, 2003-09-26 at 13:58, Bruce Momjian wrote:
> Patrick Welche wrote:
> > On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote:
> > ...
> > > if we are talking two computers sitting next to each other on a switch,
> > > you'd expect those to be low ... but if you were talking about two
> > > seperate geographical locations (and yes, I realize you are adding lag to
> > > the mix with waiting for responses), you'd expect those #s to rise ...
> >
> > Which I thought was the whole point of using a group communication protocol
> > such as spread in postgresql-r. It seemed solved there...
>
> Right, but I think we want to try to do two-phase commit without spread.
> Spread seems overkill for this usage.

Out of curiosity, how does one use spread to accomplish 2PC? Isn't the
logic the Application Server would need to follow rather different with
a group communication based control than with XA / 2PC style
communication? How does one map to the other?

Re: 2-phase commit

From
Rod Taylor
Date:
> The first problem is the restart/rejoin problem.  When a 2PC member
> goes away, it is supposed to come back with all its former locks and
> everything in place, so that it can know what to do.  This is also
> extremely tricky, but I think the answer is sort of easy.  A member
> which re-joins without crashing (that is, it has open transactions,

I think you may be confusing 2PC with replication.

PostgreSQL's 2PC implementation should follow enough of the XA rules to
play nice in a mixed environment where something else is managing the
transactions (application servers are becoming more common all the
time).

As far as inter-PostgreSQL replication / queries are concerned we can
choose whatever semantics we like -- just realize that they are 2
different problems.

Re: 2-phase commit

From
Mike Mascari
Date:
Marc G. Fournier wrote:

> On Fri, 26 Sep 2003, Tom Lane wrote:
> 
>>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>
>>>Tom Lane wrote:
>>>
>>>>You're not considering the possibility of a transient communication
>>>>failure.
>>
>>>Can't the master re-send the request after a timeout?
>>
>>Not "it can", but "it has to".  The master *must* keep hold of that
>>request forever (or until the slave responds, or until we reconfigure
>>the system not to consider that slave valid anymore).  Similarly, the
>>slave cannot forget the maybe-committed transaction on pain of not being
>>a valid slave anymore.
> 
> Hrmmmm ... is there no way of having part of the protocol being a message
> sent back that its a valid/invalid slave?  ie. slave has an uncommitted
> transaction, never hears back from master to actually do the commit, so
> after x-secs * y-retries any messages it does try to send to the master
> have a bit flag set to 'invalid'?

If I understand Andrew Sullivan's request, the purpose of integrating
2-PC into PostgreSQL is more for distributed queries via an XA
interface than for replication:


http://sybooks.sybase.com/onlinebooks/group-xsarc/xsg1111e/xatuxedo/@ebt-link;pt=61?target=%25N%13_446_START_RESTART_N%25

If that is the desire (XA-compatibility) then PostgreSQL might be
talking to an Oracle database or a BEA Tuxedo TPM acting as the
coordinator. So PostgreSQL won't have an opportunity to modify the
protocol in any meaningful way if it wishes to interoperate with
XA-based transaction managers.

If it is being used only amongst other PostgreSQL backends for
replication, then why not use one of the optimistic replication protocols:

http://www.inf.ethz.ch/personal/alonso/PAPERS/commit-fast.pdf

Mike Mascari
mascarm@mascari.com




Re: 2-phase commit

From
Gavin Sherry
Date:
On Fri, 26 Sep 2003, Christopher Browne wrote:

> pgman@candle.pha.pa.us (Bruce Momjian) writes:
> > Patrick Welche wrote:
> >> On Fri, Sep 26, 2003 at 02:49:30PM -0300, Marc G. Fournier wrote:
> >> ...
> >> > if we are talking two computers sitting next to each other on a switch,
> >> > you'd expect those to be low ... but if you were talking about two
> >> > seperate geographical locations (and yes, I realize you are adding lag to
> >> > the mix with waiting for responses), you'd expect those #s to rise ...
> >>
> >> Which I thought was the whole point of using a group communication
> >> protocol such as spread in postgresql-r. It seemed solved there...
> >
> > Right, but I think we want to try to do two-phase commit without
> > spread.  Spread seems overkill for this usage.
>
> Is there some big demerit to _having_ that "overkill"?  If there is no
> major price to pay, then I don't see why it isn't reasonable to simply
> say "Sure, we'll use that!"

I recall Darren Johnson (who is working on replication with spread) saying
that it required a lot of bandwidth in real world scenarios.

Gavin


Re: 2-phase commit

From
Christopher Kings-Lynne
Date:
> Not "it can", but "it has to".  The master *must* keep hold of that
> request forever (or until the slave responds, or until we reconfigure
> the system not to consider that slave valid anymore).  Similarly, the
> slave cannot forget the maybe-committed transaction on pain of not being
> a valid slave anymore.  You can make this work, but the resource costs
> are steep.  For instance, in Postgres, you don't get to truncate the WAL
> log, for what could be a really really long time --- more disk space
> than you wanted to spend on WAL anyway.  The locks held by the
> maybe-committed transaction are another potentially unpleasant problem;
> you can't release them, no matter what else they are blocking.

So, after 'n' seconds of waiting, we abandon the slave and the slave
abandons the master.

Such a condition is probably a fairly serious failure anyway, and
something that an admin would need to expect.  The admin would also need
to expect to allocate a heap of disk space for WAL.

Chris




Re: 2-phase commit

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> ... You can make this work, but the resource costs
>> are steep.

> So, after 'n' seconds of waiting, we abandon the slave and the slave
> abandons the master.

[itch...]  But you surely cannot guarantee that the slave and the master
time out at exactly the same femtosecond.  What happens when the comm
link comes back online just when one has timed out and the other not?
(Hint: in either order, it ain't good.  Double plus ungood if, say, the
comm link manages to deliver the master's "commit confirm" message a
little bit after the master has timed out and decided to abort after all.)

In my book, timeout-based solutions to this kind of problem are certain
disasters.
        regards, tom lane
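
The race described above can be shown with a deliberately simplified
model.  It assumes that each side runs its own independent timer, that
the master's "commit confirm" is delayed by the link outage, and that
the slave's acknowledgement travels back instantly.  The function and
parameter names are hypothetical.

```python
def decisions(master_timeout: float, slave_timeout: float, delivered_at: float):
    """Return (master_decision, slave_decision) for one in-doubt transaction.

    delivered_at is the moment the delayed "commit confirm" finally
    reaches the slave; the timeouts are the moments each side gives up.
    """
    # The slave commits if the confirm arrives before its own timer fires.
    slave = "commit" if delivered_at <= slave_timeout else "abort"
    # The master only sees an acknowledgement if the slave accepted the
    # confirm, and only if that happened before the master's timer fired.
    acked_in_time = slave == "commit" and delivered_at <= master_timeout
    master = "commit" if acked_in_time else "abort"
    return master, slave


print(decisions(master_timeout=30, slave_timeout=30, delivered_at=10))
print(decisions(master_timeout=30, slave_timeout=31, delivered_at=30.5))
```

In the second call the confirm lands half a second after the master has
given up but before the slave has, so the two sides disagree about
whether the transaction committed.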


Re: 2-phase commit

From
Richard Huxton
Date:
On Saturday 27 September 2003 06:59, Tom Lane wrote:
> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> >> ... You can make this work, but the resource costs
> >> are steep.
> >
> > So, after 'n' seconds of waiting, we abandon the slave and the slave
> > abandons the master.
>
> [itch...]  But you surely cannot guarantee that the slave and the master
> time out at exactly the same femtosecond.  What happens when the comm
> link comes back online just when one has timed out and the other not?
> (Hint: in either order, it ain't good.  Double plus ungood if, say, the
> comm link manages to deliver the master's "commit confirm" message a
> little bit after the master has timed out and decided to abort after all.)
>
> In my book, timeout-based solutions to this kind of problem are certain
> disasters.

I might be (well, am actually) a bit out of my depth here, but surely what 
happens is if you have machines A,B,C and *any* of them thinks machine C has 
a problem then it does. If C can still communicate with the others then it is 
told to reinitialise/go away/start the sirens. If C can't communicate then 
it's all a bit academic.

Granted, if you have intermittent problems on a link and set your timeouts 
badly then you'll have a very brittle system, but if A thinks C has died, you 
can't just reverse that decision.

--  Richard Huxton Archonet Ltd


Re: 2-phase commit

From
"Marc G. Fournier"
Date:

On Sat, 27 Sep 2003, Tom Lane wrote:

> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> >> ... You can make this work, but the resource costs
> >> are steep.
>
> > So, after 'n' seconds of waiting, we abandon the slave and the slave
> > abandons the master.
>
> [itch...]  But you surely cannot guarantee that the slave and the master
> time out at exactly the same femtosecond.  What happens when the comm
> link comes back online just when one has timed out and the other not?
> (Hint: in either order, it ain't good.

I think it was Andrew that suggested it ... when the slave times out, it
should "trigger" a READ ONLY mode on the slave, so that when/if the master
tries to start to talk to it, it can't ...

As for the master itself, it should be smart enough that if it times out,
it knows to actually abandon the slave and not continue to try ...


Re: 2-phase commit

From
Bruce Momjian
Date:
Richard Huxton wrote:
> > [itch...]  But you surely cannot guarantee that the slave and the master
> > time out at exactly the same femtosecond.  What happens when the comm
> > link comes back online just when one has timed out and the other not?
> > (Hint: in either order, it ain't good.  Double plus ungood if, say, the
> > comm link manages to deliver the master's "commit confirm" message a
> > little bit after the master has timed out and decided to abort after all.)
> >
> > In my book, timeout-based solutions to this kind of problem are certain
> > disasters.
> 
> I might be (well, am actually) a bit out of my depth here, but surely what 
> happens is if you have machines A,B,C and *any* of them thinks machine C has 
> a problem then it does. If C can still communicate with the others then it is 
> told to reinitialise/go away/start the sirens. If C can't communicate then 
> it's all a bit academic.
> 
> Granted, if you have intermittent problems on a link and set your timeouts 
> badly then you'll have a very brittle system, but if A thinks C has died, you 
> can't just reverse that decision.

I have been thinking it might be time to start allowing external
programs to be called when certain events occur that require
administrative attention --- this would be a good case for that. 
Administrators could configure shell scripts to be run when the network
connection fails or servers drop off the network, alerting them to the
problem.  Throwing things into the server logs isn't _active_ enough.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Shridhar Daithankar
Date:
On Saturday 27 September 2003 20:17, Bruce Momjian wrote:
> Richard Huxton wrote:
> I have been thinking it might be time to start allowing external
> programs to be called when certain events occur that require
> administrative attention --- this would be a good case for that.
> Administrators could configure shell scripts to be run when the network
> connection fails or servers drop off the network, alerting them to the
> problem.  Throwing things into the server logs isn't _active_ enough.

I would say calling events from external libraries would be a good extension. 
That could allow for extending PostgreSQL in novel ways, e.g. calling a 
log-record copy event after a WAL record is written, for near-real-time 
replication..:-)
Shridhar



Re: 2-phase commit

From
Richard Huxton
Date:
On Saturday 27 September 2003 15:47, Bruce Momjian wrote:
> Richard Huxton wrote:
[snip]
> > I might be (well, am actually) a bit out of my depth here, but surely
> > what happens is if you have machines A,B,C and *any* of them thinks
> > machine C has a problem then it does. If C can still communicate with the
> > others then it is told to reinitialise/go away/start the sirens. If C
> > can't communicate then it's all a bit academic.
> >
[snip]
>
> I have been thinking it might be time to start allowing external
> programs to be called when certain events occur that require
> administrative attention --- this would be a good case for that.
> Administrators could configure shell scripts to be run when the network
> connection fails or servers drop off the network, alerting them to the
> problem.  Throwing things into the server logs isn't _active_ enough.

Actually, from the discussion I'd assumed there was some sort of plug-in 
"policy daemon" that was making decisions when things went wrong. Given the 
different scenarios 2 phase-commit will be used in, one size is unlikely to 
fit all.

The idea of a more general system is _very_ interesting. I know Wietse Venema 
has decided to provide an external "policy" interface for his Postfix 
mailserver, precisely because he wants to keep the core system fairly clean.
--  Richard Huxton Archonet Ltd


Re: 2-phase commit

From
"Marc G. Fournier"
Date:

On Sat, 27 Sep 2003, Bruce Momjian wrote:

> I have been thinking it might be time to start allowing external
> programs to be called when certain events occur that require
> administrative attention --- this would be a good case for that.
> Administrators could configure shell scripts to be run when the network
> connection fails or servers drop off the network, alerting them to the
> problem.  Throwing things into the server logs isn't _active_ enough.

Actually, apparently you can do this now ... there is apparently a "mail
module" for PostgreSQL that you can use to have the database send emails
out ...



Re: 2-phase commit

From
Bruce Momjian
Date:
Marc G. Fournier wrote:
> 
> 
> On Sat, 27 Sep 2003, Bruce Momjian wrote:
> 
> > I have been thinking it might be time to start allowing external
> > programs to be called when certain events occur that require
> > administrative attention --- this would be a good case for that.
> > Administrators could configure shell scripts to be run when the network
> > connection fails or servers drop off the network, alerting them to the
> > problem.  Throwing things into the server logs isn't _active_ enough.
> 
> Actually, apparently you can do this now ... there is apparently a "mail
> module" for PostgreSQL that you can use to have the database send email's
> out ...

The only part that needs to be added is the ability to call an external
program when some event occurs, like a database write failure.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane
> 
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> You're not considering the possibility of a transient communication
> >> failure.
> 
> > Can't the master re-send the request after a timeout?
> 
> Not "it can", but "it has to". 

Why ? Mainly the coordinator(slave), not the participant(master),
has the responsibility to resolve the in-doubt transaction.

regards,
Hiroshi Inoue



Re: 2-phase commit

From
Kevin Brown
Date:
Bruce Momjian wrote:
> Marc G. Fournier wrote:
> > 
> > 
> > On Sat, 27 Sep 2003, Bruce Momjian wrote:
> > 
> > > I have been thinking it might be time to start allowing external
> > > programs to be called when certain events occur that require
> > > administrative attention --- this would be a good case for that.
> > > Administrators could configure shell scripts to be run when the network
> > > connection fails or servers drop off the network, alerting them to the
> > > problem.  Throwing things into the server logs isn't _active_ enough.
> > 
> > Actually, apparently you can do this now ... there is apparently a "mail
> > module" for PostgreSQL that you can use to have the database send email's
> > out ...
> 
> The only part that needs to be added is the ability to call an external
> program when some even occurs, like a database write failure.

Actually, all that's really necessary is the ability to call a stored
procedure when some event occurs.  The stored procedure can take it from
there, and since it can be written in C it can do anything the postgres
user can do (for good or for ill, of course).


-- 
Kevin Brown                          kevin@sysexperts.com


Re: 2-phase commit

From
Bruce Momjian
Date:
Kevin Brown wrote:
> Bruce Momjian wrote:
> > Marc G. Fournier wrote:
> > > 
> > > 
> > > On Sat, 27 Sep 2003, Bruce Momjian wrote:
> > > 
> > > > I have been thinking it might be time to start allowing external
> > > > programs to be called when certain events occur that require
> > > > administrative attention --- this would be a good case for that.
> > > > Administrators could configure shell scripts to be run when the network
> > > > connection fails or servers drop off the network, alerting them to the
> > > > problem.  Throwing things into the server logs isn't _active_ enough.
> > > 
> > > Actually, apparently you can do this now ... there is apparently a "mail
> > > module" for PostgreSQL that you can use to have the database send email's
> > > out ...
> > 
> > The only part that needs to be added is the ability to call an external
> > program when some even occurs, like a database write failure.
> 
> Actually, all that's really necessary is the ability to call a stored
> procedure when some event occurs.  The stored procedure can take it from
> there, and since it can be written in C it can do anything the postgres
> user can do (for good or for ill, of course).

But the postmaster doesn't connect to any database, and in a serious
failure, might not be able to start one.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Kevin Brown
Date:
Bruce Momjian wrote:
> Kevin Brown wrote:
> > Actually, all that's really necessary is the ability to call a stored
> > procedure when some event occurs.  The stored procedure can take it from
> > there, and since it can be written in C it can do anything the postgres
> > user can do (for good or for ill, of course).
> 
> But the postmaster doesn't connect to any database, and in a serious
> failure, might not be able to start one.

Ah, true.  But I figured that in the context of 2PC and replication that
most of the associated failures were likely to occur in an active
backend or something equivalent, where a stored procedure was likely to
be accessible.

But yes, you certainly want to account for failures where the database
itself is unavailable.  So I guess my original comment isn't strictly
true.  :-)


-- 
Kevin Brown                          kevin@sysexperts.com


Re: 2-phase commit

From
Rod Taylor
Date:
> > Actually, all that's really necessary is the ability to call a stored
> > procedure when some event occurs.  The stored procedure can take it from
> > there, and since it can be written in C it can do anything the postgres
> > user can do (for good or for ill, of course).
>
> But the postmaster doesn't connect to any database, and in a serious
> failure, might not be able to start one.

In the event of a catastrophe, the 'nothing is running' scenario is one
that standard monitoring software should pick up on easily enough, and one
that PostgreSQL cannot help with anyway (normally this is admin error).

Something simple much like pg_locks with transaction state (idle,
waiting on local lock, waiting on 3rd party, etc.), time transaction
started, time of last status change would be plenty. The monitor
software folks (Big Brother, etc. etc.) can write jobs to query those
elements and create the appropriate SNMP events when say waiting on 3rd
party for > N minutes (log at 1, trouble ticket at 2, SysAdmin page at
5, escalate to VP Pager at 20 minutes or whatever corporate policy is).

An alternative is to package an SNMP daemon (much like the stats daemon)
into the backend to generate SNMP events -- but I think this is overkill
if views are available.
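
The escalation policy mentioned above could be expressed by a monitoring
job along these lines.  The thresholds are the ones from the example
(1, 2, 5 and 20 minutes); the names and the idea of feeding in
"minutes spent waiting on a 3rd party" from some status view are
assumptions, since no such view exists yet.

```python
from typing import List

# (minutes waited, action) pairs taken from the example policy above.
ESCALATION_STEPS = [
    (1, "log"),
    (2, "trouble ticket"),
    (5, "page SysAdmin"),
    (20, "escalate to VP pager"),
]


def actions_due(minutes_waiting: float) -> List[str]:
    """Return every escalation action whose threshold has been reached."""
    return [action for threshold, action in ESCALATION_STEPS
            if minutes_waiting >= threshold]


print(actions_due(0.5))
print(actions_due(5))
```

Keeping the policy in a plain table outside the database matches the
point being made: the server only has to expose the state, and each
site can apply whatever corporate policy it likes.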

Re: 2-phase commit

From
Hiroshi Inoue
Date:
Hiroshi Inoue wrote:
> 
> > -----Original Message-----
> > From: Tom Lane
> >
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Tom Lane wrote:
> > >> You're not considering the possibility of a transient communication
> > >> failure.
> >
> > > Can't the master re-send the request after a timeout?
> >
> > Not "it can", but "it has to".
> 
> Why ? Mainly the coordinator(slave) not the participant(master)
> has the resposibilty to resolve the in-doubt transaction.

As far as I see, it's the above point which prevents the
advance of this topic and the issue must be solved ASAP.

As opposed to your answer  Not "it can", but "it has to",
my answer is  Yes "it can", but "it doesn't have to".

The simplest scenario (though there could be variations) is

[At participant(master)'s side]
  Because the commit operation is done, does nothing.

[At coordinator(slave)'s side]
   1) After a while
   2) re-establish the communication path to the
      participant(master)'s TM.
   3) resend the "commit request" to the participant's TM.
  1)2)3) would be repeated until the coordinator receives
  the "commit ok" message from the participant.
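
The coordinator-side loop in steps 1) to 3) might look roughly like the
sketch below.  Everything here is hypothetical: FlakyParticipant merely
stands in for an unreachable participant TM, and max_attempts is an
added knob so that the sketch can terminate (passing None gives the
"repeat until commit ok" behaviour described).

```python
import time


class FlakyParticipant:
    """Hypothetical participant TM: unreachable for the first `failures` calls."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def commit(self, xid: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("link down")
        # The participant must still remember xid to answer correctly.
        return "commit ok"


def resend_until_ok(participant, xid, delay=0.0, max_attempts=None) -> int:
    """Repeat steps 1) to 3) until "commit ok"; return the attempts used."""
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        time.sleep(delay)                      # 1) after a while
        try:
            reply = participant.commit(xid)    # 2) reconnect and 3) resend
        except ConnectionError:
            continue
        if reply == "commit ok":
            return attempt
    raise TimeoutError("gave up waiting for commit ok")


print(resend_until_ok(FlakyParticipant(failures=3), "xid-1"))
```

Note that the loop only works if the participant keeps enough
per-transaction state to answer a late "commit request" correctly, and
if the coordinator is willing to keep retrying.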
 

If there's no objection from you, I would assume I'm right.
Please don't dodge my question this time.

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/


Re: 2-phase commit

From
"Marc G. Fournier"
Date:
On Mon, 29 Sep 2003, Hiroshi Inoue wrote:

> The simplest senario(though there could be varations) is
>
> [At participant(master)'s side]
>   Because the commit operations is done, does nothing.
>
> [At coordinator(slave)' side]
>    1) After a while
>    2) re-establish the communication path between the
>       partcipant(master)'s TM.
>    3) resend the "commit requeset" to the participant's TM.
>   1)2)3) would be repeated until the coordinator receives
>   the "commit ok" message from the partcipant.
>
> If there's no objection from you, I would assume I'm right.

'K, but what happens if the slave never gets a 'commit ok'?  Does the
slave keep trying ad nauseam?


Re: 2-phase commit

From
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> The simplest senario(though there could be varations) is

> [At participant(master)'s side]
>   Because the commit operations is done, does nothing.

> [At coordinator(slave)' side]
>    1) After a while
>    2) re-establish the communication path between the
>       partcipant(master)'s TM.
>    3) resend the "commit requeset" to the participant's TM.
>   1)2)3) would be repeated until the coordinator receives
>   the "commit ok" message from the partcipant.

[ scratches head ] I think you are using the terms "master" and "slave"
oppositely than I would.  But in any case, this is not an answer to the
concern I had.  You're assuming that the "coordinator(slave)" side is
willing to resend a request indefinitely, and also that the
"participant(master)" side is willing to retain per-transaction commit
state indefinitely so that it can correctly answer belated questions
from the other side.  What I was complaining about was that I don't
think either side can afford to remember per-transaction state
indefinitely.  2PC in the abstract is a useless academic abstraction ---
where the rubber meets the road is defining how you cope with failures
in the commit protocol.
        regards, tom lane


Re: 2-phase commit

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > The simplest senario(though there could be varations) is
> 
> > [At participant(master)'s side]
> >   Because the commit operations is done, does nothing.
> 
> > [At coordinator(slave)' side]
> >    1) After a while
> >    2) re-establish the communication path between the
> >       partcipant(master)'s TM.
> >    3) resend the "commit requeset" to the participant's TM.
> >   1)2)3) would be repeated until the coordinator receives
> >   the "commit ok" message from the partcipant.
> 
> [ scratches head ] I think you are using the terms "master" and "slave"
> oppositely than I would.

Oops my mistake, sorry. 
But is it 2-phase commit protocol in the first place ?

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/


Re: 2-phase commit

From
Hiroshi Inoue
Date:

Hiroshi Inoue wrote:
> 
> Tom Lane wrote:
> >
> > Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > > The simplest senario(though there could be varations) is
> >
> > > [At participant(master)'s side]
> > >   Because the commit operations is done, does nothing.
> >
> > > [At coordinator(slave)' side]
> > >    1) After a while
> > >    2) re-establish the communication path between the
> > >       partcipant(master)'s TM.
> > >    3) resend the "commit requeset" to the participant's TM.
> > >   1)2)3) would be repeated until the coordinator receives
> > >   the "commit ok" message from the partcipant.
> >
> > [ scratches head ] I think you are using the terms "master" and "slave"
> > oppositely than I would.
> 
> Oops my mistake, sorry.
> But is it 2-phase commit protocol in the first place ?

That is, in your example below

 Example:

        Master          Slave
        ------          -----
        commit ready-->
                        <--OK
        commit done->XX

is the "commit done" message needed ?

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/


Re: 2-phase commit

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > The simplest senario(though there could be varations) is
> 
> > [At participant(master)'s side]
> >   Because the commit operations is done, does nothing.
> 
> > [At coordinator(slave)' side]
> >    1) After a while
> >    2) re-establish the communication path between the
> >       partcipant(master)'s TM.
> >    3) resend the "commit requeset" to the participant's TM.
> >   1)2)3) would be repeated until the coordinator receives
> >   the "commit ok" message from the partcipant.
> 
> [ scratches head ] I think you are using the terms "master" and "slave"
> oppositely than I would.  But in any case, this is not an answer to the
> concern I had.  You're assuming that the "coordinator(slave)" side is
> willing to resend a request indefinitely, and also that the
> "participant(master)" side is willing to retain per-transaction commit
> state indefinitely so that it can correctly answer belated questions
> from the other side.  What I was complaining about was that I don't
> think either side can afford to remember per-transaction state
> indefinitely.

OK maybe I understand your complaint.
Basically such a situation can occur when either side
is down. Especially when the coordinator(master) is down,
the participants are troubled. In such cases, the XA
interface, for example, allows heuristic commit on the participants.

In case one or more participants are down, the coordinator
may have to remember per-transaction state indefinitely.
Is it a big problem ? 

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/


Re: 2-phase commit

From
Hiroshi Inoue
Date:
I seem to have misunderstood the problem completely.
I apologize to you all(especially Tom) for disturbing
this thread.

I wonder whether there can be any nice solution when
some of the systems or communications are dead.
And as many people have already mentioned, there's not
much leeway if we adopt only an XA-based protocol. 

regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/

Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > The simplest senario(though there could be varations) is
> 
> > [At participant(master)'s side]
> >   Because the commit operations is done, does nothing.
> 
> > [At coordinator(slave)' side]
> >    1) After a while
> >    2) re-establish the communication path between the
> >       partcipant(master)'s TM.
> >    3) resend the "commit requeset" to the participant's TM.
> >   1)2)3) would be repeated until the coordinator receives
> >   the "commit ok" message from the partcipant.
> 
> [ scratches head ] I think you are using the terms "master" and "slave"
> oppositely than I would.  But in any case, this is not an answer to the
> concern I had.  You're assuming that the "coordinator(slave)" side is
> willing to resend a request indefinitely, and also that the
> "participant(master)" side is willing to retain per-transaction commit
> state indefinitely so that it can correctly answer belated questions
> from the other side.  What I was complaining about was that I don't
> think either side can afford to remember per-transaction state
> indefinitely.  2PC in the abstract is a useless academic abstraction ---
> where the rubber meets the road is defining how you cope with failures
> in the commit protocol.
> 
>                         regards, tom lane


Re: 2-phase commit

From
"Zeugswetter Andreas SB SD"
Date:
> > > > The simplest senario(though there could be varations) is
> > >
> > > > [At participant(master)'s side]
> > > >   Because the commit operations is done, does nothing.
> > >
> > > > [At coordinator(slave)' side]
> > > >    1) After a while
> > > >    2) re-establish the communication path between the
> > > >       partcipant(master)'s TM.
> > > >    3) resend the "commit requeset" to the participant's TM.
> > > >   1)2)3) would be repeated until the coordinator receives
> > > >   the "commit ok" message from the partcipant.
> > >
> > > [ scratches head ] I think you are using the terms "master" and "slave"
> > > oppositely than I would.
> >
> > Oops my mistake, sorry.
> > But is it 2-phase commit protocol in the first place ?
>
> That is, in your exmaple below
>
>  Example:
>
>         Master          Slave
>         ------          -----
>         commit ready-->

This is the commit for phase 1. This commit is allowed to return all
sorts of errors, like violated deferred checks, out of disk space, ...

>                         <--OK
>         commit done->XX

This is the commit for phase 2; the slave *must* answer with "success"
in all but hardware failure cases. (Note that the master could
instead send rollback, e.g. because some other slave aborted.)

> is the "commit done" message needed ?

So, yes this is needed.

Andreas
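
The two phases described above can be sketched roughly as follows. This is only an illustration with no networking and no failure handling; the names `Slave`, `commit_ready`, `commit_done` and `run_two_phase_commit` are hypothetical and simply mirror the message names used in the example.

```python
from enum import Enum


class Vote(Enum):
    OK = "ok"
    ABORT = "abort"


class Slave:
    """Hypothetical participant; the names are illustrative only."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def commit_ready(self):
        # Phase 1: every ordinary failure (deferred checks, disk space, ...)
        # has to surface here, before the slave promises anything.
        if not self.can_commit:
            self.state = "aborted"
            return Vote.ABORT
        self.state = "prepared"
        return Vote.OK

    def commit_done(self):
        # Phase 2: a prepared slave is no longer allowed to refuse.
        assert self.state == "prepared"
        self.state = "committed"

    def rollback(self):
        # Sent by the master instead of "commit done" if any slave aborted.
        self.state = "aborted"


def run_two_phase_commit(slaves):
    # Phase 1: collect a vote from every slave.
    votes = [s.commit_ready() for s in slaves]
    if all(v is Vote.OK for v in votes):
        # Phase 2: everyone promised, so tell everyone to commit.
        for s in slaves:
            s.commit_done()
        return "committed"
    for s in slaves:
        s.rollback()
    return "aborted"
```

The hard part discussed in this thread, a lost "commit done" or a lost "success" answer, is deliberately left out of the sketch.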


Re: 2-phase commit

From
"Marc G. Fournier"
Date:

On Mon, 29 Sep 2003, Hiroshi Inoue wrote:

>
>
> Hiroshi Inoue wrote:
> >
> > Tom Lane wrote:
> > >
> > > Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > > > The simplest senario(though there could be varations) is
> > >
> > > > [At participant(master)'s side]
> > > >   Because the commit operations is done, does nothing.
> > >
> > > > [At coordinator(slave)' side]
> > > >    1) After a while
> > > >    2) re-establish the communication path between the
> > > >       partcipant(master)'s TM.
> > > >    3) resend the "commit requeset" to the participant's TM.
> > > >   1)2)3) would be repeated until the coordinator receives
> > > >   the "commit ok" message from the partcipant.
> > >
> > > [ scratches head ] I think you are using the terms "master" and "slave"
> > > oppositely than I would.
> >
> > Oops my mistake, sorry.
> > But is it 2-phase commit protocol in the first place ?
>
> That is, in your exmaple below
>
>  Example:
>
>         Master          Slave
>         ------          -----
>         commit ready-->
>                         <--OK
>         commit done->XX
>
> is the "commit done" message needed ?

Of course ... how else will the Slave commit?  From my understanding, the
concept is that the master sends a commit ready to the slave, but the OK
back is that "OK, I'm ready to commit whenever you are", at which point
the master does its commit and tells the slave to do its ...



Re: 2-phase commit

From
Bruce Momjian
Date:
Marc G. Fournier wrote:
> >         Master          Slave
> >         ------          -----
> >         commit ready-->
> >                         <--OK
> >         commit done->XX
> >
> > is the "commit done" message needed ?
> 
> Of course ... how else will the Slave commit?  From my understanding, the
> concept is that the master sends a commit ready to the slave, but the OK
> back is that "OK, I'm ready to commit whenever you are", at which point
> the master does its commit and tells the slave to do its ...

Or the slave could reject the request.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Bruce Momjian
Date:
Tom Lane wrote:
> > [At participant(master)'s side]
> >   Because the commit operations is done, does nothing.
> 
> > [At coordinator(slave)' side]
> >    1) After a while
> >    2) re-establish the communication path between the
> >       partcipant(master)'s TM.
> >    3) resend the "commit requeset" to the participant's TM.
> >   1)2)3) would be repeated until the coordinator receives
> >   the "commit ok" message from the partcipant.
> 
> [ scratches head ] I think you are using the terms "master" and "slave"
> oppositely than I would.  But in any case, this is not an answer to the
> concern I had.  You're assuming that the "coordinator(slave)" side is
> willing to resend a request indefinitely, and also that the
> "participant(master)" side is willing to retain per-transaction commit
> state indefinitely so that it can correctly answer belated questions
> from the other side.  What I was complaining about was that I don't
> think either side can afford to remember per-transaction state
> indefinitely.  2PC in the abstract is a useless academic abstraction ---
> where the rubber meets the road is defining how you cope with failures
> in the commit protocol.

I don't think there is any way to handle cases where the master or slave
just disappears.  The other machine isn't under the server's control, so
the server has no way of knowing. I think we have to allow the administrator
to set a timeout, or ask to wait indefinitely, and allow them to call an
external program to record the event or notify administrators.
Multi-master replication has the same issues.

My original point was that multi-master replication has the same
limitations, but people still want it.  Same for two-phase commit --- it
has the same limitations, but people want it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
"Marc G. Fournier"
Date:

On Mon, 29 Sep 2003, Bruce Momjian wrote:

> Marc G. Fournier wrote:
> > >         Master          Slave
> > >         ------          -----
> > >         commit ready-->
> > >                         <--OK
> > >         commit done->XX
> > >
> > > is the "commit done" message needed ?
> >
> > Of course ... how else will the Slave commit?  From my understanding, the
> > concept is that the master sends a commit ready to the slave, but the OK
> > back is that "OK, I'm ready to commit whenever you are", at which point
> > the master does its commit and tells the slave to do its ...
>
> Or the slave could reject the request.

Huh?  The slave has that option??  In what circumstance?


Re: 2-phase commit

From
Jeff
Date:
Tom Lane wrote:

> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>>> ... You can make this work, but the resource costs
>>> are steep.
> 
>> So, after 'n' seconds of waiting, we abandon the slave and the slave
>> abandons the master.
> 
> [itch...]  But you surely cannot guarantee that the slave and the master
> time out at exactly the same femtosecond.  What happens when the comm
> link comes back online just when one has timed out and the other not?
> (Hint: in either order, it ain't good.  Double plus ungood if, say, the
> comm link manages to deliver the master's "commit confirm" message a
> little bit after the master has timed out and decided to abort after all.)
> 
> In my book, timeout-based solutions to this kind of problem are certain
> disasters.
> 
> regards, tom lane

What do commercial databases do about 2PC or other multi-master solutions?
You've done a good job of convincing me that it's unreliable no matter what
(through your posts on this topic over a long time). However, I would think
that something like Oracle or DB2 have some kind of answer for
multi-master, and I'm curious what it is. If they don't, is it reasonable
to make a test case that leaves their database inconsistent or hanging?

I can (probably) get access to a SQL Server system to run some tests, if
someone is interested.
       regards,               jeff davis





Re: 2-phase commit

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Zeugswetter Andreas SB SD [mailto:ZeugswetterA@spardat.at] 
> > 
> >  Example:
> > 
> >         Master          Slave
> >         ------          -----
> >         commit ready-->
> 
> This is the commit for phase 1. This commit is allowed to return all 
> sorts of errors, like violated deferred checks, out of diskspace, ...
> 
> >                         <--OK
> >         commit done->XX
> 
> This is commit for phase 2, the slave *must* answer with "success"
> in all but hardware failure cases. (Note that instead the 
> master could 
> instead send rollback, e.g. because some other slave aborted)
> 
> > is the "commit done" message needed ?
> 
> So, yes this is needed

Thanks.
I misunderstood that the "commit done" message is the last response from
the participant to the coordinator. I missed the "OK" message before it.
Where were my eyes ?

regards,
Hiroshi Inoue



Re: 2-phase commit

From
"Zeugswetter Andreas SB SD"
Date:
> I don't think there is any way to handle cases where the master or slave
> just disappears.  The other machine isn't under the server's control, so
> it has no way of it knowing. I think we have to allow the administrator
> to set a timeout, or ask to wait indefinately, and allow them to call an
> external program to record the event or notify administrators.
> Multi-master replication has the same issues.

It needs to wait indefinitely; a timeout is not acceptable, since it leads to
inconsistent data. Human (or monitoring software) intervention is needed
if they can't reach each other in a reasonable time.

I think this needs to be kept dumb. Different sorts of use cases will simply
need different answers to resolve in-doubt transactions. What is needed is an
interface that allows listing and commit/rollback of in-doubt transactions
(preferably from a newly started client, or a direct command for the postmaster).

Andreas
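
A rough sketch of the kind of "dumb" interface meant here: list the in-doubt transactions and let an administrator commit or roll back each one by hand. The class and method names are hypothetical and do not correspond to any existing PostgreSQL command.

```python
class InDoubtRegistry:
    """Hypothetical bookkeeping for prepared-but-unresolved transactions."""

    def __init__(self):
        self._prepared = {}  # global transaction id -> free-form description

    def prepare(self, gid, description=""):
        # Called when a transaction has answered "OK" to phase 1.
        self._prepared[gid] = description

    def list_in_doubt(self):
        # What a newly started client (or the administrator) would see.
        return sorted(self._prepared)

    def resolve(self, gid, action):
        # The decision is made by a human or by monitoring software,
        # never by a timeout inside the database.
        if action not in ("commit", "rollback"):
            raise ValueError("action must be 'commit' or 'rollback'")
        if gid not in self._prepared:
            raise KeyError(gid)
        del self._prepared[gid]
        return (gid, action)
```

Keeping the policy outside the database means different use cases can resolve in-doubt transactions in different ways.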


Re: 2-phase commit

From
"Zeugswetter Andreas SB SD"
Date:
> > >         Master          Slave
> > >         ------          -----
> > >         commit ready-->
> > >                         <--OK
> > >         commit done->XX
> > >
> > > is the "commit done" message needed ?
> >
> > Of course ... how else will the Slave commit?  From my
> understanding, the
> > concept is that the master sends a commit ready to the
> slave, but the OK
> > back is that "OK, I'm ready to commit whenever you are", at
> which point
> > the master does its commit and tells the slave to do its ...
>
> Or the slave could reject the request.

At this point only because of a hardware error. In case of network
problems the "commit done" either did not reach the slave or the "success"
answer did not reach the master.

That is what it's all about. Phase 2 is supposed to be low overhead and very
fast to allow keeping the time window for failure (that produces in-doubt
transactions) as short as possible.

Andreas


Re: 2-phase commit

From
Bruce Momjian
Date:
Marc G. Fournier wrote:
> > > > is the "commit done" message needed ?
> > >
> > > Of course ... how else will the Slave commit?  From my understanding, the
> > > concept is that the master sends a commit ready to the slave, but the OK
> > > back is that "OK, I'm ready to commit whenever you are", at which point
> > > the master does its commit and tells the slave to do its ...
> >
> > Or the slave could reject the request.
> 
> Huh?  The slave has that option??  In what circumstance?

I thought the slave could reject if someone local already had the row
locked.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> But is it 2-phase commit protocol in the first place ?

> That is, in your exmaple below

>  Example:

>         Master          Slave
>         ------          -----
>         commit ready-->
>                         <--OK
>         commit done->XX

> is the "commit done" message needed ?

Absolutely --- otherwise, we'd not be having this whole discussion.  The
problem is that the slave is holding ready to commit but doesn't know
whether he should or not ... or alternatively, he did commit but the
master didn't get the acknowledgement.

It's not that big a deal for the master to remember past committed
transactions until it knows all slaves have acknowledged committing
them; you only need a bit or so per transaction.  It's a much bigger
deal if the slave has to hold the transaction ready-to-commit for a
long time.  That transaction is holding locks, and also the sheer
volume of log data is way bigger.  (For comparison, we recycle pg_xlog
details about a transaction much sooner than we recycle pg_clog.)

I think you really want some way for the slave to decide it can time out
and abort the transaction after all ... but I don't see how you do
that without breaking the 2PC protocol.
        regards, tom lane
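
The master-side bookkeeping described above might look something like the following sketch: remember each committed transaction only until every slave has acknowledged it. The names are hypothetical, and a real implementation would keep this state durably rather than in memory.

```python
class CommitLog:
    """Hypothetical master-side record of not-yet-acknowledged commits."""

    def __init__(self):
        self._pending = {}  # xid -> set of slaves that have not acknowledged

    def record_commit(self, xid, slaves):
        self._pending[xid] = set(slaves)

    def acknowledge(self, xid, slave):
        waiting = self._pending.get(xid)
        if waiting is None:
            return
        waiting.discard(slave)
        if not waiting:
            # All slaves have acknowledged; the state can be forgotten.
            del self._pending[xid]

    def outcome(self, xid):
        # Used to answer a belated question from a slave. Once the entry
        # is forgotten, the master can no longer answer from this log.
        return "committed" if xid in self._pending else "forgotten"
```

The slave side has no such cheap equivalent, since a prepared transaction keeps holding its locks and log data while it waits.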


Re: 2-phase commit

From
"Zeugswetter Andreas SB SD"
Date:
> > > Or the slave could reject the request.
> >
> > Huh?  The slave has that option??  In what circumstance?
>
> I thought the slave could reject if someone local already had the row
> locked.

No, not at all. The slave would need to reject phase 1 "commit ready"
for this.

Andreas


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Sat, Sep 27, 2003 at 09:13:27AM -0300, Marc G. Fournier wrote:
> 
> I think it was Andrew that suggested it ... when the slave timesout, it
> should "trigger" a READ ONLY mode on the slave, so that when/if the master
> tries to start to talk to it, it can't ...
> 
> As for the master itself, it should be smart enough that if it times out,
> it knows to actually abandom the slave and not continue to try ...

Yes, but now we're talking as though this is master-slave
replication.  Actually, "master" and "slave" are only useful terms in
a transaction for 2PC.  So every machine is both a master and a
slave.

It seems that one way out is just to fall back to "read only" as soon
as a single failure happens.  That's the least graceful but maybe
safest approach to failure, analogous to what fsck does to your root
filesystem at boot time.  Of course, since there's no "read only"
mode at the moment, this is all pretty hand-wavy on my part :-/

A


-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Bruce Momjian
Date:
Zeugswetter Andreas SB SD wrote:
> 
> > > > Or the slave could reject the request.
> > > 
> > > Huh?  The slave has that option??  In what circumstance?
> > 
> > I thought the slave could reject if someone local already had the row
> > locked.
> 
> No, not at all. The slave would need to reject phase 1 "commit ready"
> for this.

Oh, yea, thanks.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Marc G. Fournier wrote:
> >>> Or the slave could reject the request.
> >> 
> >> Huh?  The slave has that option??  In what circumstance?
> 
> > I thought the slave could reject if someone local already had the row
> > locked.
> 
> All normal reasons for transaction failure are supposed to be checked
> for before the slave responds that it's ready to commit.  Otherwise it's
> supposed to say it can't commit.
> 
> Basically the weak spot of 2PC is that it assumes there are no possible
> reasons for failure after "ready to commit" is sent.  You can make that
> approximately true, with sufficient investment of resources, but it's
> definitely not a pleasant assumption.

Yep.  There is no full solution.  I think it is like running with fsync
off --- if the OS crashes, you have to clean up --- if you fail on a
2-phase commit, you have to clean up.  Multi-master will be the same.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Sun, Sep 28, 2003 at 11:58:24AM -0700, Kevin Brown wrote:
> > But the postmaster doesn't connect to any database, and in a serious
> > failure, might not be able to start one.
> 
> Ah, true.  But I figured that in the context of 2PC and replication that
> most of the associated failures were likely to occur in an active
> backend or something equivalent, where a stored procedure was likely to
> be accessible.

As you go on to note, that's not always a possibility.  For instance,
server C crashes and can't come back because, say, its WAL is
scrambled.  All it will currently be able to do is scream at you in
the logs, which won't solve all the problems one has with 2PC (among
other problems).

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Bruce Momjian
Date:
Andrew Sullivan wrote:
> On Sat, Sep 27, 2003 at 09:13:27AM -0300, Marc G. Fournier wrote:
> > 
> > I think it was Andrew that suggested it ... when the slave timesout, it
> > should "trigger" a READ ONLY mode on the slave, so that when/if the master
> > tries to start to talk to it, it can't ...
> > 
> > As for the master itself, it should be smart enough that if it times out,
> > it knows to actually abandom the slave and not continue to try ...
> 
> Yes, but now we're talking as though this is master-slave
> replication.  Actually, "master" and "slave" are only useful terms in
> a transaction for 2PC.  So every machine is both a master and a
> slave.
> 
> It seems that one way out is just to fall back to "read only" as soon
> as a single failure happens.  That's the least graceful but maybe
> safest approach to failure, analogous to what fsck does to your root
> filesystem at boot time.  Of course, since there's no "read only"
> mode at the moment, this is all pretty hand-wavy on my part :-/

Yes, but that affects all users, not just the transaction we were
working on. I think we have to get beyond the idea that this can be made
failure-proof, and just outline the behaviors for failure, and it has to
be configurable by the administrator.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Mon, Sep 29, 2003 at 11:14:30AM -0300, Marc G. Fournier wrote:
> >
> > Or the slave could reject the request.
> 
> Huh?  The slave has that option??  In what circumstance?

In every circumstance where a stand-alone machine would have it. 
Machine A may not yet know about conflicting transactions on machine
B.  This is why 2PC is hard ;-)

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Marc G. Fournier wrote:
>>> Or the slave could reject the request.
>> 
>> Huh?  The slave has that option??  In what circumstance?

> I thought the slave could reject if someone local already had the row
> locked.

All normal reasons for transaction failure are supposed to be checked
for before the slave responds that it's ready to commit.  Otherwise it's
supposed to say it can't commit.

Basically the weak spot of 2PC is that it assumes there are no possible
reasons for failure after "ready to commit" is sent.  You can make that
approximately true, with sufficient investment of resources, but it's
definitely not a pleasant assumption.
        regards, tom lane


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Sat, Sep 27, 2003 at 08:36:36AM +0000, Jeff wrote:
> 
> What do commercial databases do about 2PC or other multi-master solutions?
> You've done a good job of convincing me that it's unreliable no matter what
> (through your posts on this topic over a long time). However, I would think
> that something like Oracle or DB2 have some kind of answer for
> multi-master, and I'm curious what it is. If they don't, is it reasonable
> to make a test case that leaves their database inconsistent or hanging?

Most real replication systems are not doing 2PC.  For me, 2PC-based
replication is not real interesting anyway, because the point of
multi-master replication is often at least partly speed, and 2PC is
nothing if not a good way to make sure that every database is at
least as slow as the slowest node.

But 2PC is important for application-server-based, XA-type work, and
for heterogeneous databases.  Both of those would be real nice
features to support.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Andrew Sullivan
Date:
On Fri, Sep 26, 2003 at 05:15:37PM -0400, Rod Taylor wrote:
> > The first problem is the restart/rejoin problem.  When a 2PC member
> > goes away, it is supposed to come back with all its former locks and
> > everything in place, so that it can know what to do.  This is also
> > extremely tricky, but I think the answer is sort of easy.  A member
> > which re-joins without crashing (that is, it has open transactions,
> 
> I think you may be confusing 2PC with replication.

No, I'm not.  One needs to decide how to handle the situation where a
slave database in a 2PC transaction goes away and comes back, for
whatever reasons that may happen.  Since the idea here is to come up
with ways of handling the failure of 2PC in some cases, we need
something which notices that members are not playing nice. 

> PostgreSQLs 2PC implementation should follow enough of the XA rules to
> play nice in a mixed environment where something else is managing the
> transactions (application servers are becoming more common all the
> time).

I agree.  But we still need to decide how to handle cases where
things go away, and if there are some transaction managers that don't
fit that model, then we should not accept such managers.  Of course,
what such managers do is important data in deciding what sorts of
compromises are acceptable.

A
-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Rod Taylor
Date:
> > It seems that one way out is just to fall back to "read only" as soon
> > as a single failure happens.  That's the least graceful but maybe
> > safest approach to failure, analogous to what fsck does to your root
> > filesystem at boot time.  Of course, since there's no "read only"
> > mode at the moment, this is all pretty hand-wavy on my part :-/
>
> Yes, but that affects all users, not just the transaction we were
> working on. I think we have to get beyond the idea that this can be made
> failure-proof, and just outline the behaviors for failure, and it has to
> be configurable by the administrator.

Yes, but holding locks on the affected rows IS appropriate until the
administrator issues something like:

ALTER SYSTEM ABORT GLOBAL TRANSACTION 123;

Re: 2-phase commit

From
Peter Eisentraut
Date:
Tom Lane writes:

> No.  The real problem with 2PC in my mind is that its failure modes
> occur *after* you have promised commit to one or more parties.  In
> multi-master, if you fail you know it before you have told the client
> his data is committed.

I have a book here which claims that the solution to the problems of
2-phase commit is 3-phase commit, which goes something like this:

coordinator        participant
-----------        -----------
INITIAL            INITIAL
    prepare -->
WAIT
    <-- vote commit
                   READY
(all voted commit)
    prepare-to-commit -->
PRE-COMMIT
    <-- ready-to-commit
                   PRE-COMMIT
    global-commit -->
COMMIT             COMMIT


If the coordinator fails and all participants are in state READY, they can
safely decide to abort after some timeout.  If some participant is already
in state PRE-COMMIT, it becomes the new coordinator and sends the
global-commit message.

Details are left as an exercise. :-)
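
As a small start on that exercise, the recovery rule stated above could be sketched like this. The function name is hypothetical, and any combination of states not covered by the two sentences above is reported as "undecided" rather than guessed at.

```python
def decide_after_coordinator_failure(participant_states):
    """Apply the rule from the message above to the surviving participants.

    participant_states is a list of state names such as "READY" or
    "PRE-COMMIT", one per participant that can still be reached.
    """
    if not participant_states:
        raise ValueError("no participants to consult")
    if any(state == "PRE-COMMIT" for state in participant_states):
        # A participant in PRE-COMMIT takes over as coordinator and
        # sends global-commit.
        return "global-commit"
    if all(state == "READY" for state in participant_states):
        # Nobody has been told to pre-commit, so aborting after a
        # timeout is safe.
        return "abort"
    # Assumption: other mixtures are not covered by the description above.
    return "undecided"
```

For example, `decide_after_coordinator_failure(["READY", "PRE-COMMIT"])` gives "global-commit", while two participants both in READY give "abort".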

-- 
Peter Eisentraut   peter_e@gmx.net



Re: 2-phase commit

From
Rod Taylor
Date:
> No, I'm not.  One needs to decide how to handle the situation where a
> slave database in a 2PC transaction goes away and comes back, for
> whatever reasons that may happen.  Since the idea here is to come up
> with ways of handling the failure of 2PC in some cases, we need
> something which notices that members are not playing nice.

Yes, you're right. The part about the member reinitializing led me to
believe that you were thinking of replication (I read it as copying data from
the source location to bring it back up to speed -- which is not what you
intended).



Re: 2-phase commit

From
Manfred Spraul
Date:
Peter Eisentraut wrote:

>Tom Lane writes:
>
>  
>
>>No.  The real problem with 2PC in my mind is that its failure modes
>>occur *after* you have promised commit to one or more parties.  In
>>multi-master, if you fail you know it before you have told the client
>>his data is committed.
>>    
>>
>
>I have a book here which claims that the solution to the problems of
>2-phase commit is 3-phase commit, which goes something like this:
>
>coordinator        participant
>-----------        -----------
>INITIAL            INITIAL
>    prepare -->
>WAIT
>    <-- vote commit
>            READY
>(all voted commit)
>    prepare-to-commit -->
>PRE-COMMIT
>    <-- ready-to-commit
>            PRE-COMMIT
>    global-commit -->
>COMMIT            COMMIT
>
>
>If the coordinator fails and all participants are in state READY, they can
>safely decide to abort after some timeout.  If some participant is already
>in state PRE-COMMIT, it becomes the new coordinator and sends the
>global-commit message.
>
>Details are left as an exercise. :-)
>  
>
OK. Let's assume one coordinator and two participants.
The coordinator sends global-commit to both. One replies with ok, the
other one remains silent.
What should the coordinator do? It can't fail the transaction - the
first participant has committed its part. It can't complete the
transaction, because the ok from the 2nd participant is still outstanding.
I think Bruce is right: It's an admin decision. If a timeout expires, a
user-supplied app should be called, with a safe default (database
shutdown?).

--   Manfred



Re: 2-phase commit

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
> Sent: Monday, September 29, 2003 7:10 AM
> To: Marc G. Fournier
> Cc: Hiroshi Inoue; Tom Lane; 'Zeugswetter Andreas SB SD';
> 'Andrew Sullivan'; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] 2-phase commit
>
>
> Marc G. Fournier wrote:
> > >         Master          Slave
> > >         ------          -----
> > >         commit ready-->
> > >                         <--OK
> > >         commit done->XX
> > >
> > > is the "commit done" message needed ?
> >
> > Of course ... how else will the Slave commit?  From my
> understanding,
> > the concept is that the master sends a commit ready to the
> slave, but
> > the OK back is that "OK, I'm ready to commit whenever you are", at
> > which point the master does its commit and tells the slave
> to do its
> > ...
>
> Or the slave could reject the request.
>

Here is a BSD-like licensed transaction monitor:

http://tyrex.sourceforge.net/tpmonitor.html

The stuff that eventually became Tuxedo and Encina was open source from
MIT (not sure what came of it).  You used to be able to download the
source code for their transaction monitor that worked on the IBM RS/2.

This is the Transaction Internet Protocol:
http://www.ietf.org/html.charters/OLD/tip-charter.html
It should be considered very seriously as a general solution to the
problem.

I mention this, because a transaction monitor is the next logical step
in managing database activity.
Two phase commit is a subset of transaction processing.

Interesting discussion:
http://www.developer.com/db/article.php/10920_2246481_2
http://www.developer.com/java/data/article.php/10932_3066301_4

Article worth a look (win32 specific, but talks about developing a
transaction monitor):
http://archive.devx.com/free/mgznarch/vcdj/1998/octmag98/dtc1.asp

Some simple background for those who have not spent much time looking
into it:
http://www.geocities.com/rajesh_purohit/db/twophasecommit.html



Re: 2-phase commit

From
Peter Eisentraut
Date:
Manfred Spraul writes:

> Ok. Lets assume one coordinator, two partitipants.
> Global commit send to both by coordinator. One replies with ok, the
> other one remains silent.
> What should the coordinator do? It can't fail the transaction - the
> first partitipant has commited its part. It can't complete the
> transaction, because the ok from the 2nd partitipant is still outstanding.

If a participant doesn't reply in an orderly fashion (say, after timeout),
it just gets kicked out of the whole mechanism.  That isn't the
interesting part.  The interesting part is what happens when the
coordinator fails.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: 2-phase commit

From
Rod Taylor
Date:
On Mon, 2003-09-29 at 15:55, Peter Eisentraut wrote:
> Manfred Spraul writes:
>
> > Ok. Lets assume one coordinator, two partitipants.
> > Global commit send to both by coordinator. One replies with ok, the
> > other one remains silent.
> > What should the coordinator do? It can't fail the transaction - the
> > first partitipant has commited its part. It can't complete the
> > transaction, because the ok from the 2nd partitipant is still outstanding.
>
> If a participant doesn't reply in an orderly fashion (say, after timeout),
> it just gets kicked out of the whole mechanism.  That isn't the
> interesting part.  The interesting part is what happens when the
> coordinator fails.

The hot-standby coordinator picks up where the first one left off. Just
like when the participant fails the hot-standby for that participant
steps up to the plate.

For the application server side in Java, I believe the standard is OTS
(Object Transaction Service).


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Mon, Sep 29, 2003 at 12:59:55PM -0400, Bruce Momjian wrote:
> working on. I think we have to get beyond the idea that this can be made
> failure-proof, and just outline the behaviors for failure, and it has to
> be configurable by the administrator.

Exactly.  There are plenty of cases where graceless failure is
acceptable to someone as the right answer to the compromise.  Of
course, this is not to pretend they're not compromises.  There's a
world of difference between saying, "This is not safe, but if you
want to do it, here are some potential failure modes," and, "Hey, you
can use this even though it can't roll back 100% of the time, because
your application should check that."  Any comparison with any actual
application I have had to use is strictly coincidental. ;-)

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
"Dann Corbit"
Date:
Commercial systems use:

Mainframe:
CICS

UNIX:
Tuxedo
Encina

Win32:
MTS

DEC/COMPAQ/HP:
ACMS

Probably lots of others that I have never heard about.


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Mon, Sep 29, 2003 at 12:48:30PM -0400, Andrew Sullivan wrote:
> In every circumstance where a stand-alone machine would have it. 

Oops.  Wrong stage.  Never mind.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Christopher Browne
Date:
DCorbit@connx.com ("Dann Corbit") writes:
> Tuxedo

Note that this is probably the only one of the lot that is _really_
worth looking at in a serious way, as the XA standard was essentially
based on Tuxedo.  (Irrelevant Aside: BEA had releases of CICS running
on both Unix and Windows NT, so it isn't quite fair to call that
"mainframe" code...)

There might be some value in looking at how Berkeley DB supports XA,
as there is actually support for using Berkeley DB as an XA resource
manager.

<http://www.sleepycat.com/docs/ref/xa/xa_intro.html>

While it would obviously be exceedingly inappropriate to copy any of
SleepyCat's software, there is some very useful background information
there on "care and feeding" which can give an idea of how a TP monitor
might be used and configured.
-- 
"cbbrowne","@","libertyrms.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)


Re: 2-phase commit

From
"Dann Corbit"
Date:
A really nice overview of how various transaction managers are modeled:

http://www.ti5.tu-harburg.de/Lecture/99ws/TP/06-OverviewOfTPSystemsAndProducts/sld001.htm


Re: 2-phase commit

From
Hans-Jürgen Schönig
Date:
Marc G. Fournier wrote:
> 
> On Sat, 27 Sep 2003, Bruce Momjian wrote:
> 
> 
>>I have been thinking it might be time to start allowing external
>>programs to be called when certain events occur that require
>>administrative attention --- this would be a good case for that.
>>Administrators could configure shell scripts to be run when the network
>>connection fails or servers drop off the network, alerting them to the
>>problem.  Throwing things into the server logs isn't _active_ enough.
> 
> 
> Actually, apparently you can do this now ... there is apparently a "mail
> module" for PostgreSQL that you can use to have the database send email's
> out ...
> 
> 
> 


I guess something such as

CREATE TRIGGER my_trig ON BEGIN / COMMIT EXECUTE ...


would be nice. I think this can be used for many purposes (not 
necessarily 2PC).
It would help if a trigger could handle database events and not just events on tables.

ON BEGIN
ON COMMIT
ON CREATE TABLE , ...

We could have used that so often in the past in countless applications.
Regards,
    Hans


-- 
Cybertec Geschwinde u Schoenig
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/2952/30706 or +43/660/816 40 77
www.cybertec.at, www.postgresql.at, kernel.cybertec.at




Re: 2-phase commit

From
Bruce Momjian
Date:
Andrew Sullivan wrote:
> On Sat, Sep 27, 2003 at 09:13:27AM -0300, Marc G. Fournier wrote:
> > 
> > I think it was Andrew that suggested it ... when the slave times out, it
> > should "trigger" a READ ONLY mode on the slave, so that when/if the master
> > tries to start to talk to it, it can't ...
> > 
> > As for the master itself, it should be smart enough that if it times out,
> > it knows to actually abandon the slave and not continue to try ...
> 
> Yes, but now we're talking as though this is master-slave
> replication.  Actually, "master" and "slave" are only useful terms in
> a transaction for 2PC.  So every machine is both a master and a
> slave.
> 
> It seems that one way out is just to fall back to "read only" as soon
> as a single failure happens.  That's the least graceful but maybe
> safest approach to failure, analogous to what fsck does to your root
> filesystem at boot time.  Of course, since there's no "read only"
> mode at the moment, this is all pretty hand-wavy on my part :-/

OK, I think we came to the conclusion that we want 2-phase commit, but
want some way to mark a server as offline/read-only, or notify an
administrator.  Can we communicate this to the Japanese guys working on
2-phase commit so they can start working toward including it in 7.5?


Added to TODO:

    * Add two-phase commit to all distributed transactions with
      offline/readonly server status or administrator notification
      for failure
 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Wed, Oct 08, 2003 at 05:43:49PM -0400, Bruce Momjian wrote:
> 
> OK, I think we came to the conclusion that we want 2-phase commit, but
> want some way to mark a server as offline/read-only, or notify an

That sounds to me like the conclusion, to the extent there was one,
yes.  I'd still like to hear from those who continue to have strong
objections on the grounds of the impossibility of a guaranteed
recovery method.  Does the proposal of allowing dbas to run that
risk, provided there's a mechanism to tell them about it, satisfy the
objection (assuming, of course, 2PC can be turned off)?

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Peter Eisentraut
Date:
Andrew Sullivan writes:

> Does the proposal of allowing dbas to run that risk, provided there's a
> mechanism to tell them about it, satisfy the objection (assuming, of
> course, 2PC can be turned off)?

Why would you spend time on implementing a mechanism whose ultimate
benefit is supposed to be increasing reliability and performance, when you
already realize that it will have to lock up at the slightest sign of
trouble?  There are better mechanisms out there that you can use instead.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: 2-phase commit

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Andrew Sullivan writes:
> 
> > Does the proposal of allowing dbas to run that risk, provided there's a
> > mechanism to tell them about it, satisfy the objection (assuming, of
> > course, 2PC can be turned off)?
> 
> Why would you spent time on implementing a mechanism whose ultimate
> benefit is supposed to be increasing reliability and performance, when you
> already realize that it will have to lock up at the slightest sight of
> trouble?  There are better mechanisms out there that you can use instead.

If you want cross-server transactions, what other methods are there that
are more reliable?  It seems network unreliability is going to be a
problem no matter what method you use.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Thu, Oct 09, 2003 at 04:22:13PM +0200, Peter Eisentraut wrote:
> Why would you spent time on implementing a mechanism whose ultimate
> benefit is supposed to be increasing reliability and performance, when you
> already realize that it will have to lock up at the slightest sight of
> trouble?  There are better mechanisms out there that you can use instead.

"The slightest sign of trouble" seems to me to be overstating the
matter rather.  It cannot recover in the case where the first phase
of commit has happened everywhere, and then the master crashes.  

We are talking, after all, about a pretty exotic feature in the first
place.  I presume that anyone who is using it is also using it on
machines which have ultra-high-reliable, the cpu can catch on fire
and the box stays up sort of hardware.  I'll grant you that running a
pair of B0b'5 C0mpu73r5 Ultra kewl sooper fa5t overclocked specials
with serial ATA with the write cache enabled is a recipe for data
loss.  But that's a disaster no matter what.

But you cannot have XA-like stuff without 2PC.  You can't easily have
heterogenous systems without 2PC.  And folks have already generously
volunteered to work on this problem; I think that they deserve
support, assuming we can come up with some idea of what kinds of
compromises are acceptable ones.  There's no question that 2PC
requires some unpleasant compromises.  But if you want someone to be
able to add a Postgres member to a heterogenous cluster, you're
going to need to be able to accept some compromises, because the DBA
(or, more likely, his management) already has.

I'm not sure that 2PC is actually intended to increase reliability or
performance, by the way.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Peter Eisentraut
Date:
Bruce Momjian writes:

> If you want cross-server transactions, what other methods are there that
> are more reliable?

3-phase commit

-- 
Peter Eisentraut   peter_e@gmx.net



Re: 2-phase commit

From
"Zeugswetter Andreas SB SD"
Date:
> > Why would you spent time on implementing a mechanism whose ultimate
> > benefit is supposed to be increasing reliability and performance, when you
> > already realize that it will have to lock up at the slightest sight of
> > trouble?  There are better mechanisms out there that you can use instead.
>
> If you want cross-server transactions, what other methods are there that
> are more reliable?  It seems network unreliability is going to be a
> problem no matter what method you use.

And unless you have 2-phase (or 3-phase) commit, all other methods are going
to be worse, since their time window for possible critical failure is
going to be substantially larger. (extending 2-phase to 3-phase should not be
too difficult)

A lot of use cases for 2PC are not for manipulating the same data on more than
one server (replication), but different data that needs to be manipulated in an
all or nothing transaction. In this scenario it is not about reliability but about
physically locating data (e.g. in LA vs New York) where it is needed most often.

Andreas


Re: 2-phase commit

From
Mike Mascari
Date:
Bruce Momjian wrote:

> Peter Eisentraut wrote:
> 
>>Andrew Sullivan writes:
>>
>>>Does the proposal of allowing dbas to run that risk, provided there's a
>>>mechanism to tell them about it, satisfy the objection (assuming, of
>>>course, 2PC can be turned off)?
>>
>>Why would you spent time on implementing a mechanism whose ultimate
>>benefit is supposed to be increasing reliability and performance, when you
>>already realize that it will have to lock up at the slightest sight of
>>trouble?  There are better mechanisms out there that you can use instead.
> 
> If you want cross-server transactions, what other methods are there that
> are more reliable?  It seems network unreliability is going to be a
> problem no matter what method you use.

What is the stated goal of distributed transactions in PostgreSQL?

1) XA-compatibility/interoperability

or

2) Robustness in the face of network failure

The implementation chosen depends upon the answer, does it not? Is
there an implementation (e.g. 3PC) that can simulate 2PC behavior for
interoperability purposes and satisfy both requirements?

Mike Mascari
mascarm@mascari.com


Re: 2-phase commit

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Bruce Momjian writes:
> 
> > If you want cross-server transactions, what other methods are there that
> > are more reliable?
> 
> 3-phase commit

OK, how is that going to make things safer, or does it just shrink the
failure window?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Rod Taylor
Date:
On Thu, 2003-10-09 at 11:14, Peter Eisentraut wrote:
> Bruce Momjian writes:
>
> > If you want cross-server transactions, what other methods are there that
> > are more reliable?
>
> 3-phase commit

How about a real world example of a transaction manager that has
actually implemented 3PC?

But yes, the ability for the participants to talk to each other in the
event the controller is unavailable seems an obvious fix.
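
To make the "participants talk to each other" idea concrete, here is a small sketch in Python of what is usually called a cooperative termination rule. The names (`State`, `resolve_in_doubt`) are invented for this illustration and do not come from any existing implementation. It assumes a prepared participant has lost contact with the controller and can ask some of its peers which state they are in.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    WORKING = "working"      # has not voted yet
    PREPARED = "prepared"    # voted yes, waiting for the decision
    COMMITTED = "committed"
    ABORTED = "aborted"


def resolve_in_doubt(peer_states: Iterable[State]) -> str:
    """Decide what a prepared participant may do after asking its peers.

    Returns "commit", "abort", or "blocked".
    """
    states = set(peer_states)
    if State.COMMITTED in states:
        # Somebody already saw the commit decision, so it was "commit".
        return "commit"
    if State.ABORTED in states or State.WORKING in states:
        # An aborted peer means the decision was "abort"; a peer that has
        # not voted means no commit decision can have been taken yet.
        return "abort"
    # Every reachable peer is also merely prepared (or nobody answered):
    # nothing can be concluded until the controller comes back.
    return "blocked"
```

The last branch is the window that plain 2PC cannot close on its own: if everyone is prepared and the controller is gone, the participants can only wait.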

Re: 2-phase commit

From
Andrew Sullivan
Date:
On Thu, Oct 09, 2003 at 11:22:05AM -0400, Mike Mascari wrote:
> The implementation choosen depends upon the answer, does it not? Is
> there an implementation (e.g. 3PC) that can simulate 2PC behavior for
> interoperability purposes and satisfy both requirements?

I don't know.  What I know is that someone showed up working on 2PC,
and got a frosty reception.  I'm trying to learn what criteria would
make the work acceptable.  For my purposes, the feature would be
really nice, so I'd hate to see the opportunity lost.  If someone has
an idea even how 3PC might be implemented, I'd be happy to hear it.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Robert Treat
Date:
On Thu, 2003-10-09 at 12:07, Andrew Sullivan wrote:
> On Thu, Oct 09, 2003 at 11:22:05AM -0400, Mike Mascari wrote:
> > The implementation choosen depends upon the answer, does it not? Is
> > there an implementation (e.g. 3PC) that can simulate 2PC behavior for
> > interoperability purposes and satisfy both requirements?
> 
> I don't know.  What I know is that someone showed up working on 2PC,
> and got a frosty reception.  I'm trying to learn what criteria would
> make the work acceptable.  For my purposes, the feature would be
> really nice, so I'd hate to see the opportunity lost.  If someone has
> an idea even how 3PC might be implemented, I'd be happy to hear it.
> 

Can you elaborate on "your purposes"?  Do they fall into the
"XA-compatibility" bit or the "Robustness in the face of network
failure"?  

On the likely chance that 50% fall into 1 and the other into 2, can we
accept a solution that doesn't address both?

Robert Treat
-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL



Re: 2-phase commit

From
Andrew Sullivan
Date:
On Thu, Oct 09, 2003 at 02:17:28PM -0400, Robert Treat wrote:
> Can you elaborate on "your purposes"?  Do they fall into the
> "XA-compatibility" bit or the "Robustness in the face of network
> failure"?  

Yes.  I don't think that 2PC is a solution for robustness in face of
network failure.  It's too slow, to begin with.  Some sort of
multi-master system is very desirable for network failures, &c., but
I don't think anybody does active/hot standby with 2PC any more; the
performance is too bad.

I'm interested in the ability to use it for XA(ish) compatibility and
heterogenous database support.  Arguments with
people-who-think-Gartner-reports-are-good-guides-for-what-to-do would
be a lot easier if I had that, to begin with.

A 

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Tatsuo Ishii
Date:
> Yes.  I don't think that 2PC is a solution for robustness in face of
> network failure.  It's too slow, to begin with.  Some sort of
> multi-master system is very desirable for network failures, &c., but
> I don't think anybody does active/hot standby with 2PC any more; the
> performance is too bad.

I'm tired of these "2PC is too slow" arguments. I think
Satoshi, the only guy who made a trial implementation of 2PC for
PostgreSQL, has already shown that 2PC is not that slow.
--
Tatsuo Ishii


Re: 2-phase commit

From
Bruce Momjian
Date:
Tatsuo Ishii wrote:
> > Yes.  I don't think that 2PC is a solution for robustness in face of
> > network failure.  It's too slow, to begin with.  Some sort of
> > multi-master system is very desirable for network failures, &c., but
> > I don't think anybody does active/hot standby with 2PC any more; the
> > performance is too bad.
> 
> I'm tired of this kind of "2PC is too slow" arguments. I think
> Satoshi, the only guy who made a trial implementation of 2PC for
> PostgreSQL, has already showed that 2PC is not that slow.

Agreed.  Let's get it into 7.5 and see it in action.  If we need to
adjust it, we can, but right now, we need something for distributed
transactions, and this seems like the logical direction.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
"Marc G. Fournier"
Date:

On Fri, 10 Oct 2003, Tatsuo Ishii wrote:

> > Yes.  I don't think that 2PC is a solution for robustness in face of
> > network failure.  It's too slow, to begin with.  Some sort of
> > multi-master system is very desirable for network failures, &c., but
> > I don't think anybody does active/hot standby with 2PC any more; the
> > performance is too bad.
>
> I'm tired of this kind of "2PC is too slow" arguments. I think
> Satoshi, the only guy who made a trial implementation of 2PC for
> PostgreSQL, has already showed that 2PC is not that slow.

Where does Satoshi's implementation sit right now?  Will it patch to v7.4?
Can it provide us with a base to work from, or is it complete?



Re: 2-phase commit

From
Christopher Browne
Date:
The world rejoiced as t-ishii@sra.co.jp (Tatsuo Ishii) wrote:
> I'm tired of this kind of "2PC is too slow" arguments. I think
> Satoshi, the only guy who made a trial implementation of 2PC for
> PostgreSQL, has already showed that 2PC is not that slow.

I'm tired of it for a different reason, namely that there are "use
cases" where speed is not _relevant_.  The REAL problem that is taking
place is that people are talking past each other.

- Some say, "It's too slow; no point in doing it."

  The fact that it may be too slow _for them_ means they probably
  shouldn't use it.  I somehow doubt that there are VastlyFaster
  alternatives waiting in the wings.

- The other problem that gets pointed out:  "2PC is inherently
  fragile, and prone to deadlock."

  Again, those that _need_ to use 2PC will forcibly need to address
  those concerns in the way they manage their systems.

  Those that can't afford the fragility are not 'customers' for use of
  2PC.  And, pointing back to the speed controversy, it is not at all
  obvious that there is any other alternative for handling distributed
  processing that _totally addresses_ the concerns about fragility.
 

Those that can't afford these costs associated with 2PC will simply
Not Use It.

Probably in much the same way that most people _aren't_ using
replication.  And most people _aren't_ using PL/R.  And most people
_aren't_ using any number of the contributed things.

If 2PC gets implemented, that simply means that there will be another
module that some will be interested in, and which many people won't
bother using.  Which shouldn't seem to be a particularly big deal.
-- 
"aa454","@","freenet.carleton.ca"
http://www.ntlug.org/~cbbrowne/
The way to a man's heart is with a broadsword.


Re: 2-phase commit

From
"Zeugswetter Andreas SB SD"
Date:
I was wondering whether we need to keep WAL online for 2PC,
or whether only something like clog is sufficient.

What if:
1. phase 1 commit must pass the slave xid that will be used for 2nd phase
   (it needs to return some sort of identification anyway)
2. the coordinator must keep a list of slave xid's along with
   corresponding (commit/rollback) info

Is that not sufficient ? Why would WAL be needed in the first place ?
This is not replication, the slave has its own WAL anyway.

I also don't buy the argument with the lockup. If today somebody connects
with psql, starts a transaction, modifies something, and then never commits
or aborts, there is also no built-in automatism that will eventually kill
it. 2PC will simply need to have means for the administrator
to rollback/commit an in-doubt transaction manually.
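
The bookkeeping proposed above could look roughly like the following Python sketch. It is only an illustration of the idea: `CoordinatorLog` and its methods are made-up names, the log lives in memory rather than on disk, and the xids are plain integers. It assumes phase 1 hands back the slave xid, the coordinator stores it together with the eventual commit/rollback decision, and an administrator can settle anything still undecided.

```python
from typing import Dict, List, Optional


class CoordinatorLog:
    """Hypothetical coordinator-side list of slave xids and their outcome."""

    def __init__(self) -> None:
        # global transaction id -> {"xids": {server: slave xid}, "decision": ...}
        self._entries: Dict[str, dict] = {}

    def record_prepared(self, gid: str, server: str, slave_xid: int) -> None:
        # Called when phase 1 on `server` succeeds and returns its xid.
        entry = self._entries.setdefault(gid, {"xids": {}, "decision": None})
        entry["xids"][server] = slave_xid

    def decide(self, gid: str, decision: str) -> None:
        # Called by the coordinator, or manually by an administrator.
        if decision not in ("commit", "rollback"):
            raise ValueError("decision must be 'commit' or 'rollback'")
        self._entries[gid]["decision"] = decision

    def outcome_for(self, server: str, slave_xid: int) -> Optional[str]:
        # What a recovering slave would ask: what happened to my xid?
        for entry in self._entries.values():
            if entry["xids"].get(server) == slave_xid:
                return entry["decision"]
        return None

    def in_doubt(self) -> List[str]:
        # Global transactions that are prepared but not yet decided.
        return sorted(g for g, e in self._entries.items()
                      if e["decision"] is None)
```

A slave that restarts with a prepared transaction would call something like `outcome_for("ny", 202)`; a `None` answer means the transaction is still in doubt and waits for the coordinator or the administrator.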

Andreas


Re: 2-phase commit

From
Andrew Sullivan
Date:
On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote:
> Satoshi, the only guy who made a trial implementation of 2PC for
> PostgreSQL, has already showed that 2PC is not that slow.

If someone has a fast implementation, so much the better.  I'm not
opposed to fast implementations! 

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Andrew Sullivan
Date:
On Thu, Oct 09, 2003 at 11:53:46PM -0400, Christopher Browne wrote:
> 
> If 2PC gets implemented, that simply means that there will be another
> module that some will be interested in, and which many people won't
> bother using.  Which shouldn't seem to be a particularly big deal.

I think the reason this is controversial, however, is that while PL/R
(e.g.) doesn't make big changes to the internals, 2PC certainly will
touch the fundamentals.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: 2-phase commit

From
Satoshi Nagayasu
Date:
Andrew Sullivan <andrew@libertyrms.info> wrote:
> On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote:
> > Satoshi, the only guy who made a trial implementation of 2PC for
> > PostgreSQL, has already showed that 2PC is not that slow.
> 
> If someone has a fast implementation, so much the better.  I'm not
> opposed to fast implementations! 

The pgbench results of my experimental 2PC implementation
and plain postgresql are available.

PostgreSQL 7.3 http://snaga.org/pgsql/pgbench/pgbench-REL7_3.log

Experimental 2PC in PostgreSQL 7.3 http://snaga.org/pgsql/pgbench/pgbench-TPC0_0_2.log

I can't see a grave overhead from this comparison.


-- 
NAGAYASU Satoshi <snaga@snaga.org>



Re: 2-phase commit

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Satoshi Nagayasu [mailto:pgsql@snaga.org]
> Sent: Friday, October 10, 2003 12:26 PM
> To: Andrew Sullivan
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] 2-phase commit
>
> Andrew Sullivan <andrew@libertyrms.info> wrote:
> > On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote:
> > > Satoshi, the only guy who made a trial implementation of 2PC for
> > > PostgreSQL, has already showed that 2PC is not that slow.
> >
> > If someone has a fast implementation, so much the better.  I'm not
> > opposed to fast implementations!
>
> The pgbench results of my experimental 2PC implementation
> and plain postgresql are available.
>
> PostgreSQL 7.3
>   http://snaga.org/pgsql/pgbench/pgbench-REL7_3.log
>
> Experimental 2PC in PostgreSQL 7.3
>   http://snaga.org/pgsql/pgbench/pgbench-TPC0_0_2.log
>
> I can't see a grave overhead from this comparison.

2PC is absolutely essential when you have to have both parts of the
transaction complete for a logical unit of work.  For a project that
needs it, if you don't have it you will be forced to go to another tool,
or perform lots of custom programming to work around it.

If you have 2PC and it is ten times slower than without it, you will
still need it for projects requiring that capability.

Now, a good model to start with is a very good idea.  So some discussion
and analysis is a good thing.  From the looks of it, Satoshi Nagayasu
has done a very good job.  Having a functional 2PC would be a huge
feather in the cap of PostgreSQL.

IMO-YMMV


Re: 2-phase commit

From
Christopher Browne
Date:
Martha Stewart called it a Good Thing when DCorbit@connx.com ("Dann Corbit") wrote:
>> I can't see a grave overhead from this comparison.
>
> 2PC is absolutely essential when you have to have both parts of the
> transaction complete for a logical unit of work.  For a project that
> needs it, if you don't have it you will be forced to go to another
> tool, or perform lots of custom programming to work around it.
>
> If you have 2PC and it is ten times slower than without it, you will
> still need it for projects requiring that capability.

Just so.

I would be completely unsurprised if an attempt to use 2PC to support
generalized "multimaster replication" would involve 10-fold slowdowns
as compared to having all the activity take place on one database.

Which would imply that 2PC is not a tool that may be appropriately
used to naively do replication.  But that should not come as any grand
surprise.

To each tool the right job, and to each job the right tool...

There seems to be enough room for there to be evidence both of 2PC
being useful for improving performance, and for it to cut
performance:
- TPC benchmarks often specify the inclusion of Tuxedo as a
  component; the combination of vendors would surely NOT put it on
  the list if it were not an aid to performance;

- There is also indication that there can be a cost, notably in the
  form of the concerns of deadlock, but it should also be obvious
  that slow network links would lead to _hideous_ increases in
  latency.
 

As you say, even if there is a substantial cost, it's still worthwhile
if a project needs it.
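
The latency point can be put into rough numbers with a back-of-the-envelope sketch. The function below is purely illustrative: its name and the example round-trip times are assumptions, it supposes the coordinator contacts all participants in parallel, and it ignores disk flushes and lock waits. Under those assumptions, the two message rounds (prepare/vote, then commit/ack) add at least two round trips to the slowest participant.

```python
from typing import Dict


def added_commit_latency_ms(round_trip_ms: Dict[str, float],
                            rounds: int = 2) -> float:
    """Lower bound on the latency 2PC adds to one commit.

    round_trip_ms maps a participant name to its network round-trip time
    as seen from the coordinator. Each round waits for the slowest reply.
    """
    return rounds * max(round_trip_ms.values())


# Hypothetical figures: one participant on the local network, one far away.
print(added_commit_latency_ms({"local": 0.5, "remote": 80.0}))
```

With these made-up figures the remote link alone dominates the cost, which matches the intuition that the protocol is only as fast as its slowest network path.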

> Now, a good model to start with is a very good idea.  So some
> discussion and analysis is a good thing.  From the looks of it,
> Satoshi Nagayasu has done a very good job.  Having a functional 2PC
> would be a huge feather in the cap of PostgreSQL.

It would seem so.  I look forward to seeing how this progresses.
-- 
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','acm.org').
http://cbbrowne.com/info/linuxdistributions.html
"XFS might  (or might not)  come out before  the year 3000.  As far as
kernel patches go,  SGI are brilliant.  As far as graphics, especially
OpenGL,  go,  SGI is  untouchable.  As  far as   filing  systems go, a
concussed doormouse in a tarpit would move faster."  -- jd on Slashdot


Re: 2-phase commit

From
"Dann Corbit"
Date:
Why not apply the effort to something already done and compatibly
licensed?

This:
http://dog.intalio.com/ots.html

Appears to be a Berkeley style licensed:
http://dog.intalio.com/license.html

Transaction monitor.

"Overview
The OpenORB Transaction Service is a very scalable transaction monitor
which also provides several extensions like XA management, a management
interface to control all transaction processes and a high reliable
recovery system.

By coordinating OpenORB and OpenORB Transaction Service, you provide a
reliable and powerful foundation for building large scalable distributed
applications.

Datasheet
The OpenORB Transaction Service is a fully compliant implementation of
the OMG Transaction Service specification.
The OpenORB Transaction Service features are :
  Management of distributed transactions with a two phase commit protocol
  Sub Transactions management ( nested transactions )
  Propagation of the transaction context between CORBA objects
  Management of distributed transactions propagation through databases
    with the XA protocol
  Automatic logs to be able to make recovery in case of failures
  Can be used as a transaction initiator or subordinate
  High-performance, multiple thread architecture
  Developed with POA
  Provides a management interface to control all transactions
  Full support of JTA
  JDBC pooling and automatic resource enlistment


Download
To download the OpenORB Transaction Service, do one of the following :
  CVS : you can use CVS to grab the sources directly.
  FTP : you get either a CVS snapshot or a prebuilt version
To use one of these possibilities, go to the Download Services page.

ChangeLog
August 15th 2001. Version 1.2.0.
  Changed the transaction client side to support late binding to the
    transaction monitor.
  Bug fixed in the transactional client interceptor. This bug was due to
    a change in the OpenORB behavior concerning the slot


To get previous change log, please refer to the CHANGELOG file available
within this service distribution."


Re: 2-phase commit

From
"Dann Corbit"
Date:
Here is a sourceforge version of the same thing
http://openorb.sourceforge.net/



Re: 2-phase commit

From
"Jeroen T. Vermeulen"
Date:
On Fri, Oct 10, 2003 at 09:37:53PM -0700, Dann Corbit wrote:
> Why not apply the effort to something already done and compatibly
> licensed?
> 
> This:
> http://dog.intalio.com/ots.html
> 
> Appears to be a Berkeley style licensed:
> http://dog.intalio.com/license.html
> 
> Transaction monitor.

I'd say this is complementary, not an alternative to 2PC implementation
issues.  

The transaction monitor lives on the other side of the problem.  2PC is
needed in the database _so that_ the transaction monitor can do its job.
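
[Editor's sketch] The division of labour described here can be illustrated with a toy model. This is only a sketch under stated assumptions: Participant, Vote and two_phase_commit are hypothetical names invented for the illustration, not part of PostgreSQL or of any transaction monitor. The participant stands for a database that supports 2PC; the function plays the transaction monitor's role.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Vote(Enum):
    YES = "yes"
    NO = "no"


@dataclass
class Participant:
    """Hypothetical resource manager, for example one database."""
    name: str
    can_commit: bool = True
    state: str = "active"

    def prepare(self) -> Vote:
        # Phase 1: promise to commit when asked, or refuse.
        if self.can_commit:
            self.state = "prepared"
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def commit(self) -> None:
        # Phase 2: only legal after a successful prepare.
        assert self.state == "prepared"
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Transaction monitor side: True if every participant committed."""
    votes = [p.prepare() for p in participants]
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return True
    # Any NO vote aborts the transaction everywhere.
    for p in participants:
        p.abort()
    return False
```

The monitor can only offer all-or-nothing behaviour because each participant exposes a separate prepare step; without it there is nothing for the monitor to coordinate.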

That said, having a 3-tier model is probably a good idea if distributed
transaction management is what we want.  :-)


Jeroen



Re: 2-phase commit

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Jeroen T. Vermeulen [mailto:jtv@xs4all.nl]
> Sent: Saturday, October 11, 2003 5:36 AM
> To: Dann Corbit
> Cc: Christopher Browne; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] 2-phase commit
>
>
> On Fri, Oct 10, 2003 at 09:37:53PM -0700, Dann Corbit wrote:
> > Why not apply the effort to something already done and compatibly
> > licensed?
> >
> > This:
> > http://dog.intalio.com/ots.html
> >
> > Appears to be a Berkeley style licensed:
> > http://dog.intalio.com/license.html
> >
> > Transaction monitor.
>
> I'd say this is complementary, not an alternative to 2PC
> implementation issues.

My notion is that a specification has already been created that describes
how the system should operate, what the APIs are, etc.  I think that most
of the work is involved in that area.  The notion is that if you program
to this spec, it will already have been well thought out and it should
be standards based when completed.

> The transaction monitor lives on the other side of the
> problem.  2PC is needed in the database _so that_ the
> transaction monitor can do its job.

Theoretically, if any database in the chain supports 2PC, you could make
all connected systems 2PC compliant by using the one functional system
as a persistent store.  But you are right.  PostgreSQL still would need
the "I promise to commit when you ask" method if it is to really support
it.

I think another way it could be handled is with nested transactions.
Just have the promise phase be an inner transaction commit but have an
outer transaction bracket that one for the actual commit.
> That said, having a 3-tier model is probably a good idea if
> distributed transaction management is what we want.  :-)

In real life, I think it is _always_ done this way.


Re: 2-phase commit

From
Rod Taylor
Date:
> I think another way it could be handled is with nested transactions.
> Just have the promise phase be an inner transaction commit but have an
> outer transaction bracket that one for the actual commit.

Not really. In the event of a crash, most 2PC systems will expect the
participant to come back in the same state it crashed in.

Our nested-transaction implementation (like our standard transaction
implementation) aborts all transactions on crash.

Re: 2-phase commit

From
Jordan Henderson
Date:
On Monday 13 October 2003 20:11, Rod Taylor wrote:
> > I think another way it could be handled is with nested transactions.
> > Just have the promise phase be an inner transaction commit but have an
> > outer transaction bracket that one for the actual commit.
>
> Not really. In the event of a crash, most 2PC systems will expect the
> participant to come back in the same state it crashed in.
>

Yes, this is correct.  There are certain phases of the protocol in which the 
transaction state must be re-instated from the log file after a crash of the 
DB server.  The re-instatement must occur prior to any connections being 
accepted by the server.  Additionally, the coordinator must be fully 
recoverable as well.  The coordinator may, depending on the phase of the 
commit/abort, contact child servers after it crashes.  The requirement is 
that during log replay, the transaction structures might have to be fully 
reconstructed and remain in-place after log replay has completed, until the 
disposition of the (sub)transaction is settled by the coordinator.  All 
dependent on the phase of course.
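
[Editor's sketch] The replay requirement can be illustrated roughly as follows. The record types ("begin", "prepare", "commit", "abort") and the replay function are assumptions made up for this example; they are not PostgreSQL's actual log record names or recovery code.

```python
from typing import Dict, List, Tuple

# Each record is (transaction id, record type). The record types are
# illustrative only, not real WAL record names.
LogRecord = Tuple[int, str]


def replay(log: List[LogRecord]) -> Dict[int, str]:
    """Return the transactions that must stay in place after replay.

    A transaction that reached "prepare" but has no later "commit" or
    "abort" record is in doubt: it has to be rebuilt and kept until the
    coordinator settles it. Transactions that never prepared are simply
    dropped, as after an ordinary crash.
    """
    state: Dict[int, str] = {}
    for xid, kind in log:
        if kind == "begin":
            state[xid] = "active"
        elif kind == "prepare":
            state[xid] = "prepared"
        elif kind in ("commit", "abort"):
            state.pop(xid, None)  # outcome is final, nothing to keep
    # Unprepared transactions do not survive the crash.
    return {xid: s for xid, s in state.items() if s == "prepared"}
```

In this model the server would run replay before accepting any connection, so that the in-doubt transactions are back in place before new work can conflict with them.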

> Our nested-transaction implementation (like our standard transaction
> implementation) aborts all transactions on crash.

Jordan Henderson



Re: 2-phase commit

From
Jan Wieck
Date:
Bruce Momjian wrote:

> Tatsuo Ishii wrote:
>> > Yes.  I don't think that 2PC is a solution for robustness in face of
>> > network failure.  It's too slow, to begin with.  Some sort of
>> > multi-master system is very desirable for network failures, &c., but
>> > I don't think anybody does active/hot standby with 2PC any more; the
>> > performance is too bad.
>> 
>> I'm tired of this kind of "2PC is too slow" arguments. I think
>> Satoshi, the only guy who made a trial implementation of 2PC for
>> PostgreSQL, has already showed that 2PC is not that slow.
> 
> Agreed.  Let's get it into 7.5 and see it in action.  If we need to
> adjust it, we can, but right now, we need something for distributed
> transactions, and this seems like the logical direction.
> 

Are you guys kidding or what?

2PC is not too slow in normal operations when everything is purring like 
little kittens and you're just wasting your excess bandwidth on it. The 
point is that it behaves horribly, like a dirty backstreet cat, at the 
time when things go wrong ... basically it's a neat thing to have, but 
from the second you need it, it becomes useless.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: 2-phase commit

From
"Peter Galbavy"
Date:
Jan Wieck wrote:
> 2PC is not too slow in normal operations when everything is purring
> like little kittens and you're just wasting your excess bandwidth on
> it. The point is that it behaves horrible and like a dirty backstreet
> cat at the time when things go wrong ... basically it's a neat thing
> to have, but from the second you need it it becomes useless.

I can't see anyone being forced to use it once it maybe/is supported. Like
many tools, "ouch!" is a good reaction when used untrained/incorrectly.

Peter



Re: 2-phase commit

From
Hans-Jürgen Schönig
Date:
>>I'm tired of this kind of "2PC is too slow" arguments. I think
>>Satoshi, the only guy who made a trial implementation of 2PC for
>>PostgreSQL, has already showed that 2PC is not that slow.
> 
> 
> Where does Satoshi's implementation sit right now?  Will it patch to v7.4?
> Can it provide us with a base to work from, or is it complete?


It is not ready yet.
You can find it at ...

http://snaga.org/pgsql/

It is based on 7.3
    * the 2-phase commit protocol (precommit and commit)
    * the multi-master replication using 2PC
    * distributed transaction (distributed query)

current work
    * restarting (from 2nd phase) when the session is disconnected in
      2nd phase (XLOG stuffs)
    * XA compliance

future work
    * hot failover and recovery in PostgreSQL cluster
    * data partitioning on different servers


I compiled it a while ago.
Seems to be pretty nice :).

Hans


-- 
Cybertec Geschwinde u Schoenig
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/2952/30706 or +43/660/816 40 77
www.cybertec.at, www.postgresql.at, kernel.cybertec.at




Re: 2-phase commit

From
Heikki Linnakangas
Date:
On Thu, 9 Oct 2003, Bruce Momjian wrote:

> Agreed.  Let's get it into 7.5 and see it in action.  If we need to
> adjust it, we can, but right now, we need something for distributed
> transactions, and this seems like the logical direction.

I've started working on two-phase commits last week, and the very
basic stuff is now working. Still a lot of bugs though.

I posted the stuff I've put together to patches-list. I'd appreciate any
comments.

- Heikki



Re: 2-phase commit

From
Hans-Jürgen Schönig
Date:
>>Why would you spent time on implementing a mechanism whose ultimate
>>benefit is supposed to be increasing reliability and performance, when you
>>already realize that it will have to lock up at the slightest sight of
>>trouble?  There are better mechanisms out there that you can use instead.
> 
> 
> If you want cross-server transactions, what other methods are there that
> are more reliable?  It seems network unreliability is going to be a
> problem no matter what method you use.
> 


I guess we need something like PITR to make this work because otherwise 
I cannot see a way to get in sync again.
Maybe I should call the desired mechanism "Entire cluster back to 
transaction X recovery".
Did anybody hear about PITR recently?

How else would you recover from any kind of problem?
No matter what you are doing network reliability will be a problem so we 
have to live with it.
Having some "going back to something consistent" is necessary anyway.
People might argue now that committed transactions might be lost. If 
people knew which ones, it would be OK. 90% of all people will understand 
that in case of a crash something evil might happen.

Hans

-- 
Cybertec Geschwinde u Schoenig
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/2952/30706 or +43/660/816 40 77
www.cybertec.at, www.postgresql.at, kernel.cybertec.at




Re: 2-phase commit

From
Bruce Momjian
Date:
Satoshi, can you get this ready for inclusion in 7.5?  We need a formal
proposal of how it will work from the user's perspective (new
commands?), and how it will internally work.  It seems Heikki Linnakangas
has also started working on this and perhaps he can help.

Ideally, we should have this proposal when we start 7.5 development in a
few weeks.

I know some people have concerns about 2-phase commit, from a
performance perspective and from a network failure perspective, but I
think there are enough people who want it that we should see how this
can be implemented with the proper safeguards.

---------------------------------------------------------------------------

Satoshi Nagayasu wrote:
> 
> Andrew Sullivan <andrew@libertyrms.info> wrote:
> > On Fri, Oct 10, 2003 at 09:46:35AM +0900, Tatsuo Ishii wrote:
> > > Satoshi, the only guy who made a trial implementation of 2PC for
> > > PostgreSQL, has already showed that 2PC is not that slow.
> > 
> > If someone has a fast implementation, so much the better.  I'm not
> > opposed to fast implementations! 
> 
> The pgbench results of my experimental 2PC implementation
> and plain postgresql are available.
> 
> PostgreSQL 7.3
>   http://snaga.org/pgsql/pgbench/pgbench-REL7_3.log
> 
> Experimental 2PC in PostgreSQL 7.3
>   http://snaga.org/pgsql/pgbench/pgbench-TPC0_0_2.log
> 
> I can't see a grave overhead from this comparison.
> 
> > 
> > A
> > 
> > -- 
> > ----
> > Andrew Sullivan                         204-4141 Yonge Street
> > Afilias Canada                        Toronto, Ontario Canada
> > <andrew@libertyrms.info>                              M2P 2A8
> >                                          +1 416 646 3304 x110
> > 
> > 
> > ---------------------------(end of broadcast)---------------------------
> > TIP 8: explain analyze is your friend
> > 
> 
> 
> -- 
> NAGAYASU Satoshi <snaga@snaga.org>
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
>                http://archives.postgresql.org
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Satoshi Nagayasu
Date:
Bruce,

Ok, I will write my proposal.

BTW, my 2PC work is now suspended because of my master thesis.
My master thesis will (must) be finished in next few months.

To finish 2PC work, I feel 2 or 3 months are needed after that.



-- 
NAGAYASU Satoshi <snaga@snaga.org>



Re: 2-phase commit

From
Bruce Momjian
Date:
Satoshi Nagayasu wrote:
> Bruce,
> 
> Ok, I will write my proposal.
> 
> BTW, my 2PC work is now suspended because of my master thesis.
> My master thesis will (must) be finished in next few months.
> 
> To finish 2PC work, I feel 2 or 3 months are needed after that.

Oh, OK, that is helpful.  Perhaps Heikki Linnakangas could help too.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: 2-phase commit

From
Heikki Linnakangas
Date:
On Fri, 10 Oct 2003, Heikki Linnakangas wrote:

> On Thu, 9 Oct 2003, Bruce Momjian wrote:
>
> > Agreed.  Let's get it into 7.5 and see it in action.  If we need to
> > adjust it, we can, but right now, we need something for distributed
> > transactions, and this seems like the logical direction.
>
> I've started working on two-phase commits last week, and the very
> basic stuff is now working. Still a lot of bugs though.

I have done more work on my 2PC patch. I still need to work out
notifications and CREATE statements, but otherwise I'm quite happy with it
now. I received no feedback on the first version, so I'll try to clarify
how it works a bit.

The patch is against the current cvs tip. I'll post it to the
patches-list, and you can also grab it from here:
http://www.hut.fi/~hlinnaka/twophase2.diff

The patch introduces three new commands, PREPCOMMIT, COMMITPREPARED and
ABORTPREPARED.

PREPCOMMIT is called in place of COMMIT, to put the active transaction
block into prepared state. PREPCOMMIT takes a string argument that
becomes the Global Transaction Identifier (GID) for the transaction. The
GID is used as a handle to COMMITPREPARED/ABORTPREPARED commands to finish
the 2nd phase commit. After the PREPCOMMIT command finishes, the
transaction is no longer associated with any specific backend.

COMMITPREPARED/ABORTPREPARED commands are used to finish the prepared
transaction. They can be issued from any backend.

There's also a new system view, pg_prepared_xacts, that shows all prepared
transactions.
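
[Editor's sketch] The command semantics described above can be modelled with a small in-memory toy. This is not the patch's implementation: ToyServer and its method names are hypothetical, and it only illustrates that a prepared transaction is filed under its GID, detached from the backend that created it, and can be finished from any backend.

```python
class ToyServer:
    """Toy model of the statement-level commands; purely illustrative."""

    def __init__(self):
        self.active = {}     # backend id -> list of pending changes
        self.prepared = {}   # GID -> list of pending changes
        self.committed = []  # changes made permanent by a commit

    def begin(self, backend):
        self.active[backend] = []

    def write(self, backend, change):
        self.active[backend].append(change)

    def prepcommit(self, backend, gid):
        # Detach the transaction from its backend and file it under the GID.
        if gid in self.prepared:
            raise ValueError(f"GID {gid!r} already in use")
        self.prepared[gid] = self.active.pop(backend)

    def commitprepared(self, gid):
        # No backend argument: only the GID is needed to finish the work.
        self.committed.extend(self.prepared.pop(gid))

    def abortprepared(self, gid):
        self.prepared.pop(gid)

    def prepared_xacts(self):
        # Stand-in for the pg_prepared_xacts view.
        return sorted(self.prepared)
```

Rejecting a duplicate GID is an assumption of the sketch; the message above does not say how the patch handles that case.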

Here's a little step-by-step tutorial to trying out the patch:
---------
1. apply the patch: patch -p0 < twophase2.diff
2. compile
3. create a new database system with initdb.
4. run postmaster
5. psql template1
6. CREATE TABLE foobar (a integer);
7. INSERT INTO foobar values (1);

8. BEGIN; UPDATE foobar SET a = 2 WHERE a = 1;
9. SELECT * FROM foobar;
10. PREPCOMMIT 'foobar_update1';

The transaction is now in prepared state, and it's no longer associated
with this backend, as you can see by issuing:

11. SELECT * FROM foobar;
12. SELECT * FROM pg_prepared_xacts;

Let's commit it then.

13. COMMITPREPARED 'foobar_update1';
14. SELECT * FROM pg_prepared_xacts;
15. SELECT * FROM foobar;

Next repeat steps 8-15 but try killing postmaster somewhere after step 9,
and observe that the transaction is not lost. Also try doing another
update with a different backend, and see that the locks held by the
prepared transaction survive the crash.
--------
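
[Editor's sketch] The crash experiment in the tutorial can be mimicked with a toy that keeps prepared transactions, and the rows they lock, in a file. Everything here is an assumption for illustration: ToyBackendState, the JSON file and the row keys such as "foobar:1" are invented, and the real patch stores its state differently.

```python
import json
import os
import tempfile


class ToyBackendState:
    """Keeps prepared transactions and their locked rows in a file."""

    def __init__(self, path):
        self.path = path
        self.prepared = {}  # GID -> list of locked row keys
        if os.path.exists(path):
            with open(path) as f:
                self.prepared = json.load(f)  # "recovery" after a restart

    def _flush(self):
        with open(self.path, "w") as f:
            json.dump(self.prepared, f)

    def prepcommit(self, gid, locked_rows):
        self.prepared[gid] = list(locked_rows)
        self._flush()  # persist before acknowledging the prepare

    def commitprepared(self, gid):
        del self.prepared[gid]
        self._flush()

    def is_locked(self, row):
        return any(row in rows for rows in self.prepared.values())


def demo():
    path = os.path.join(tempfile.mkdtemp(), "prepared.json")
    ToyBackendState(path).prepcommit("foobar_update1", ["foobar:1"])
    restarted = ToyBackendState(path)  # simulated crash and restart
    locked_after_crash = restarted.is_locked("foobar:1")
    restarted.commitprepared("foobar_update1")
    locked_after_commit = ToyBackendState(path).is_locked("foobar:1")
    return locked_after_crash, locked_after_commit
```

The point of the toy is the ordering: the prepared state is written out before the prepare is acknowledged, so a restart finds both the transaction and its locks.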

I also took a look at Satoshi's patches. The main difference is that
his implementation made modifications to the BE/FE protocol, while my
implementation works at the statement level. His patches don't handle
shutdowns or broken connections yet, but that was on his TODO list.

When I started working on 2PC, I didn't know about Satoshi's patches,
otherwise I probably would have taken them as a starting point.

The next step is going to be writing 2PC support to the JDBC driver using
the new backend commands. XA interface would be very nice too, but I'm
personally not that interested in that. Any volunteers?
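
[Editor's sketch] A driver-side wrapper over the new commands might look roughly like this. It is a sketch only: TwoPhaseAdapter and quote_gid are hypothetical names, not part of the JDBC driver or any real driver, and it assumes that ABORTPREPARED takes a quoted GID in the same way as the PREPCOMMIT and COMMITPREPARED calls shown in the tutorial.

```python
from typing import Callable


def quote_gid(gid: str) -> str:
    # Standard SQL string literal quoting: double any embedded single quote.
    return "'" + gid.replace("'", "''") + "'"


class TwoPhaseAdapter:
    """Hypothetical driver-side wrapper; not part of any real driver."""

    def __init__(self, execute: Callable[[str], None]):
        self._execute = execute  # sends one SQL statement to the backend

    def prepare(self, gid: str) -> None:
        self._execute("PREPCOMMIT " + quote_gid(gid))

    def commit(self, gid: str) -> None:
        self._execute("COMMITPREPARED " + quote_gid(gid))

    def rollback(self, gid: str) -> None:
        # Assumed syntax: same quoted-GID form as the other two commands.
        self._execute("ABORTPREPARED " + quote_gid(gid))
```

Because the commands work at the statement level, a wrapper like this needs nothing beyond the ability to send a statement, which is why the execute callable is the only dependency.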

Please comment! I'd like to know what you guys think about this. Am I
heading in the right direction?

Some people have expressed concerns about performance issues with 2PC in
general. Please note that this patch doesn't change the traditional
commit routines, so it won't affect your performance if you don't use 2PC.

- Heikki



Re: 2-phase commit

From
"Rob Butler"
Date:
Of course I have no time to work on it :(, but in my opinion an XA interface
and support for the JDBC driver are absolutely necessary.  I think that 2PC
will generally be used more for supporting 2PC transactions between the DB
and JMS than it would be for 2PC across 2 DBs.

Glad to see some progress on 2PC with Postgres though.

Later
Rob

>
> The next step is going to be writing 2PC support to the JDBC driver using
> the new backend commands. XA interface would be very nice too, but I'm
> personally not that interested in that. Any volunteers?
>
> Please comment! I'd like to know what you guys think about this. Am I
> heading into the right direction?
>