Re: Issues with Quorum Commit - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Issues with Quorum Commit
Date
Msg-id 1286570847.2304.1015.camel@ebony
In response to Re: Issues with Quorum Commit  (Markus Wanner <markus@bluegap.ch>)
List pgsql-hackers
On Fri, 2010-10-08 at 17:06 +0200, Markus Wanner wrote:
> Well, full cluster outages are infrequent, but sadly cannot be avoided
> entirely. (Murphy's laughing). IMO we should be prepared to deal with
> those. 

I've described how I propose to deal with those. I'm not waving away
these issues, just proposing that we consciously choose simplicity and
therefore robustness.

Let me say it again for clarity. (This is written for the general case,
though my patch uses only k=1 i.e. one acknowledgement):

If we want robustness, we have multiple standbys. So if you lose one,
you continue as normal without interruption. That is the first and most
important line of defence - not software.

When we start to wait, if there aren't sufficient active standbys to
acknowledge a commit, then the commit won't wait. This behaviour helps
us avoid situations where we are hours or days away from having a
working standby to acknowledge the commit. We've had a long debate about
servers that "ought to be there" but aren't; I suggest we treat standbys
that aren't there as having a strong possibility they won't come back,
and hence not worth waiting for. Heikki disagrees; I have no problem
with adding server registration so that we can add additional waits, but
I doubt that the majority of users prefer waiting over availability. It
can be an option.

Once we are waiting, if insufficient standbys acknowledge the commit we
will wait until the timeout expires, after which we commit and continue
working. If you don't like timeouts, set the timeout to 0 to wait
forever. This behaviour is designed to emphasise availability. (I
acknowledge that some people are so worried by data loss that they would
choose to stop changes altogether, and accept unavailability; I regard
that as a minority use case, but one which I would not argue against
including as an option at some point in the future.)
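
To make the three cases above concrete, here is a minimal sketch in Python. It is not code from the patch; the function name, the string results, and the idea of modelling each active standby by the delay before it acknowledges are all assumptions made for illustration. Only k, the timeout, and "timeout 0 means wait forever" come from the description above.

```python
from typing import List, Optional


def commit_outcome(k: int, ack_delays: List[Optional[float]], timeout: float) -> str:
    """Decide how a commit proceeds under the rules described above.

    k          -- number of acknowledgements required (the patch uses k=1)
    ack_delays -- one entry per *active* standby: seconds until it
                  acknowledges, or None if it never does (hypothetical model)
    timeout    -- seconds to wait; 0 means wait forever
    """
    # Too few active standbys when we start to wait: do not wait at all.
    if len(ack_delays) < k:
        return "commit without waiting"

    # Time at which the k-th acknowledgement arrives, if it ever does.
    arrived = sorted(d for d in ack_delays if d is not None)
    kth_ack = arrived[k - 1] if len(arrived) >= k else None

    if timeout == 0:
        # No timeout: the commit only completes once k acks arrive.
        return "commit after k acks" if kth_ack is not None else "wait forever"

    if kth_ack is not None and kth_ack <= timeout:
        return "commit after k acks"

    # Availability wins: commit and continue once the timeout expires.
    return "commit after timeout"
```

For example, with k=1 and no active standbys the sketch returns "commit without waiting", while with k=2, one prompt standby and one silent standby it returns "commit after timeout".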

To cover Dimitri's observation that when a streaming standby first
connects it might take some time before it can sensibly acknowledge, we
don't activate the standby until it has caught up. Once caught up, it
will advertise its capability to offer a sync rep service. Standbys
that don't wish to be failover targets can set
synchronous_replication_service = off.
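
As a rough illustration of that rule (again a Python sketch, not the patch itself), a standby counts towards acknowledgements only when it has caught up and has not opted out. The Standby class and its caught_up field are hypothetical; only the synchronous_replication_service parameter name comes from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Standby:
    name: str
    caught_up: bool  # hypothetical flag: has the standby finished catching up?
    synchronous_replication_service: bool = True  # False models "= off"

    def offers_sync_rep(self) -> bool:
        # Advertise the sync rep service only once caught up and not opted out.
        return self.caught_up and self.synchronous_replication_service


def active_sync_standbys(standbys: List[Standby]) -> List[str]:
    """Names of the standbys whose acknowledgements a commit may wait for."""
    return [s.name for s in standbys if s.offers_sync_rep()]
```

A standby that has just connected, or one that sets synchronous_replication_service = off, is simply left out of the list.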

The paths between servers aren't defined explicitly, so the parameters
all still work even after failover.
-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services


