Re: 2-phase commit - Mailing list pgsql-hackers

From Andrew Sullivan
Subject Re: 2-phase commit
Date
Msg-id 20030926194018.GB18244@libertyrms.info
In response to Re: 2-phase commit  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: 2-phase commit
List pgsql-hackers
On Fri, Sep 26, 2003 at 01:34:28PM -0400, Tom Lane wrote:
> 
> Example:
> 
>     Master        Slave
>     ------        -----
>     commit ready-->
>             <--OK
>     commit done->XX

> maybe he didn't.  Both sides are forced to keep information about the 
> open transaction indefinitely.  Timing out on either side could yield
> the wrong result.

If I understand the complaints, I think there are two big issues.

The first problem is the restart/rejoin problem.  When a 2PC member
goes away, it is supposed to come back with all its former locks and
everything in place, so that it can know what to do.  This is also
extremely tricky, but I think the answer is sort of easy.  A member
which re-joins without crashing (that is, it has open transactions,
&c.) just has to complete its transactions with the other
member(s).  If other members have processed new transactions since
the member left, the member is kicked out.  It's not allowed to join
without being re-initialised.  A member which crashes is just a
special case of this.  This is not elegant, not nice, &c.  But I
don't think anyone can really guarantee that a crashed member will
start up correctly (it crashed, after all; maybe there's a bug).  So
this is the safest approach, and I don't think it's a big deal.  It's
not cheap, of course, and there may be problems arising from the
conditions I describe below.  But I think they can be handled
intelligently (see the section on "compromises", below).
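
To make the rejoin rule concrete, here is a minimal Python sketch.  It
is only an illustration: the names (ReturningMember, rejoin_action, the
three actions) are hypothetical, and treating a crashed member as
always needing re-initialisation is one reading of "a special case of
this", not something the protocol dictates.

```python
from dataclasses import dataclass
from enum import Enum


class RejoinAction(Enum):
    RESUME = "resume"                  # nothing open, nothing missed
    COMPLETE_OPEN = "complete-open"    # finish open transactions with peers
    REINITIALISE = "re-initialise"     # kicked out; rebuild before joining


@dataclass(frozen=True)
class ReturningMember:
    crashed: bool             # did the member go down hard?
    open_transactions: int    # transactions it still holds open
    missed_transactions: int  # new transactions peers processed meanwhile


def rejoin_action(member: ReturningMember) -> RejoinAction:
    # Assumption: a crashed member cannot be trusted to hold valid state.
    if member.crashed:
        return RejoinAction.REINITIALISE
    # Peers moved on without it: not allowed back without re-initialising.
    if member.missed_transactions > 0:
        return RejoinAction.REINITIALISE
    # Clean return with work pending: settle it with the other member(s).
    if member.open_transactions > 0:
        return RejoinAction.COMPLETE_OPEN
    return RejoinAction.RESUME
```

The ordering of the checks is the point: anything that casts doubt on
the member's state wins over the cheaper "just finish up" path.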

The second, stickier problem is just as Tom describes.  When the
master has sent "commit done" and that message doesn't make it to the
other host(s), you might have to wait forever.  Of course, that's not
acceptable.
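
A small Python sketch of why the wait is open-ended, following Tom's
diagram.  The state and function names are hypothetical; the only
claim is the textbook one that a slave which has already answered OK
cannot safely decide on its own.

```python
from enum import Enum, auto


class SlaveState(Enum):
    WORKING = auto()     # has not yet answered "commit ready"
    PREPARED = auto()    # answered OK, waiting for "commit done"
    COMMITTED = auto()
    ABORTED = auto()


class OnSilence(Enum):
    ABORT = auto()          # safe: the master cannot have committed
    KEEP_WAITING = auto()   # in doubt: the master may or may not have committed
    NOTHING_TO_DO = auto()  # outcome already known locally


def on_master_silence(state: SlaveState) -> OnSilence:
    """What a slave may do when it stops hearing from the master."""
    if state is SlaveState.WORKING:
        # No OK was sent, so the master cannot have reached "commit done".
        return OnSilence.ABORT
    if state is SlaveState.PREPARED:
        # "commit done" may have been sent and lost; any timeout here
        # risks picking the opposite outcome from the master.
        return OnSilence.KEEP_WAITING
    return OnSilence.NOTHING_TO_DO
```

Every compromise amounts to replacing KEEP_WAITING with some rule that
is allowed to be wrong occasionally.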

But I can think of some options for how to handle this.
Note that these may not guarantee no loss of data.  That's not a
compromise one is usually willing to make; but just because I don't
want to accept that compromise doesn't mean it is unacceptable to
everyone.

Some possible compromises
=========================

1.    One machine always wins.  One could decide to pick one
machine that, in case of some sort of failure, always wins.  You need
some sort of heartbeat system which checks for the other member(s) of
the cluster.  In the event of failure, whatever is on the "winner"
machine is deemed to be correct, and everyone else has to lose.  If
the point of your 2PC is to provide synchronous access to high loads
of read-only clients, this would probably be a good solution, since
only one machine would ever see data changes.

2.    Quorum rule.  One could decide on a quorum of machines, and
the group which has quorum wins.  (Naturally, this has to be an
absolute majority.)  The quorum can continue to process queries, and
the folks who left the room have to re-sync to join.

3.    Fail to read-only status and let the DBA sort it out.  

4.    Mark the contentious rows as "bad" and let the DBA sort it
out.  This option is not dissimilar to what Access/SQL Server
disconnected multi-master replication does.  It's not elegant, but it
might be a good answer for the cases where 2PC gets used.
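
As a rough Python sketch of how options 1 to 3 differ after a
failure: each one is just a different answer to "may this machine
keep accepting changes?".  All names here are hypothetical, and
option 4 is left out because it works per row rather than per
machine.

```python
from enum import Enum


class Policy(Enum):
    ONE_WINNER = 1   # a pre-chosen machine is always right
    QUORUM = 2       # the group holding an absolute majority is right
    READ_ONLY = 3    # nobody writes until the DBA sorts it out


def has_quorum(group_size: int, cluster_size: int) -> bool:
    """True if group_size machines are an absolute majority of the cluster."""
    if not 0 <= group_size <= cluster_size:
        raise ValueError("group_size must be between 0 and cluster_size")
    # Strictly more than half, so two halves can never both win.
    return 2 * group_size > cluster_size


def may_accept_writes(policy: Policy, *, is_winner: bool,
                      group_size: int, cluster_size: int) -> bool:
    """Decide, for one machine, whether it keeps processing changes.

    group_size counts the machines this one can still reach, itself included.
    """
    if policy is Policy.ONE_WINNER:
        return is_winner
    if policy is Policy.QUORUM:
        return has_quorum(group_size, cluster_size)
    return False  # Policy.READ_ONLY
```

Note that with an even split (say 2 of 4) nobody has quorum, which is
exactly why the majority has to be absolute.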

Note that none of these can guarantee that some apparently committed
data will not later be lost.  To real database hounds, that will
sound like apostasy, but I suspect it is the sort of trade-off that
real products make all the time.  You have to have a way of
collecting the "yeah, we told you it was committed, but we lied" data
and being able to track it; and that has to be enough.  The real
security-of-data work is going to have to be done by ultra-reliable
hardware, good maintenance practices, &c.  Then when losses are down
in the .001% range from this sort of mistake, no one will care.
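
One way to picture the tracking requirement, as a minimal Python
sketch: keep a record of every commit that was acknowledged and later
taken back, so the loss rate can actually be measured.  The class and
field names are made up for illustration and do not describe any
existing facility.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RevokedCommit:
    txn_id: int      # hypothetical transaction identifier
    member: str      # machine that acknowledged the commit
    reason: str      # e.g. "lost to winner", "outside quorum"


@dataclass
class CommitLedger:
    acknowledged: int = 0                        # commits reported to clients
    revoked: list = field(default_factory=list)  # the "we lied" records

    def acknowledge(self, count: int = 1) -> None:
        self.acknowledged += count

    def revoke(self, txn_id: int, member: str, reason: str) -> None:
        self.revoked.append(RevokedCommit(txn_id, member, reason))

    def loss_percent(self) -> float:
        """Share of acknowledged commits later revoked, in percent."""
        if self.acknowledged == 0:
            return 0.0
        return 100 * len(self.revoked) / self.acknowledged
```

With such a ledger, one revoked commit per 100,000 acknowledged ones
works out to the .001% figure.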

This is not, by the way, the fully-formed set of suggestions I said
I'd deliver when I started the thread; but since it came up again
today, I thought I'd respond with what I had so far.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


