On Sat, 2011-01-01 at 18:49 +0100, Stefan Kaltenbrunner wrote:
> hmm, maybe my "surviving" standbys (the case I'm wondering about is whole
> datacenter failures, which might take out more than just the master) was
> not clear - consider three boxes, one master and two standbys, and
> semisync replication (i.e. any one of the standbys is enough to reply).
>
> 1. master fsyncs wal
> 2. standby #1 fsyncs and replies
> 3. master confirms commit
> 4. disaster strikes and destroys master and standby #1 while standby #2
> never had time to apply the change (IO/CPU load, latency, whatever)
> 5. now you have a sync standby that is missing something that was
> committed on the master and confirmed to the client, and no way to verify
> that this thing happened (same problem with more than two standbys - as
> long as you lose ONE standby and the master at the same time you will
> never be sure)
This is an obvious misconfiguration that anybody with HA experience would
spot immediately. If you have local standbys then you should mark them
as not available for sync rep, as described in the docs I've written.
> what is it that I'm missing here?
The fact that we've discussed this already and agreed to do 9.1 with
quorum_commit = 1. I proposed making this a parameter; other solutions
were also proposed, but it was considered too complex for this release.
This is a trade-off between availability and data guarantees.
MySQL and Oracle "suffer" from exactly this "problem". DB2 supports only
one master and SQLServer doesn't have sync rep at all.
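
To make the trade-off concrete, the five-step scenario quoted above can be
sketched as a toy simulation. This is only an illustration under stated
assumptions, not PostgreSQL code: the function commit_survives and the node
labels standby1/standby2 are hypothetical, and "quorum" here simply means
the number of standby acknowledgements the master waits for before
confirming a commit to the client.

```python
def commit_survives(standbys, quorum, acked, destroyed):
    """Toy model: is a confirmed commit present on any surviving standby?

    standbys  -- labels of all standbys (hypothetical names)
    quorum    -- standby acks the master waits for before confirming
    acked     -- standbys that had fsynced the commit when disaster struck
    destroyed -- standbys lost in the failure (the master is assumed lost too)
    """
    if len(acked) < quorum:
        # The master would still be waiting, so the client never saw a commit.
        raise ValueError("commit was never confirmed to the client")
    survivors = set(standbys) - set(destroyed)
    return any(s in acked for s in survivors)


standbys = ["standby1", "standby2"]

# quorum = 1: standby1 acks, then the master and standby1 are destroyed.
lost_case = commit_survives(standbys, 1, {"standby1"}, {"standby1"})

# quorum = 2: both standbys must ack before the client sees the commit.
safe_case = commit_survives(standbys, 2, {"standby1", "standby2"}, {"standby1"})

print(lost_case, safe_case)  # False True
```

With a quorum of 2 the confirmed commit survives the loss of the master and
one standby, but every commit then waits for the slower standby and blocks
if either one is down. That is the availability cost of the stronger data
guarantee.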
-- 
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services