Re: Sync Rep v19 - Mailing list pgsql-hackers

From	Simon Riggs
Subject	Re: Sync Rep v19
Date	March 5, 2011 12:54:10
Msg-id	1299344000.10703.13319.camel@ebony
In response to	Re: Sync Rep v19 (Fujii Masao <masao.fujii@gmail.com>)
Responses	Re: Sync Rep v19 Re: Sync Rep v19
List	pgsql-hackers

On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote:
> On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Yes, that can happen. As people will no doubt observe, this seems to be
> > an argument for wait-forever. What we actually need is a wait that lasts
> > longer than it takes for us to decide to failover, if the standby is
> > actually up and this is some kind of split brain situation. That way the
> > clients are still waiting when failover occurs. WAL is missing, but
> > since we didn't acknowledge the client we are OK to treat that situation
> > as if it were an abort.
>
> Oracle Data Guard in the maximum availability mode behaves that way?
>
> I'm sure that you are implementing something like the maximum availability
> mode rather than the maximum protection one. So I'd like to know how
> the data loss situation I described can be avoided in the maximum availability
> mode.

It can't. (Oracle or otherwise...)

Once we begin waiting for sync rep, if the transaction or backend ends
then other backends will be able to see the changed data. The only way
to prevent that is to shut down the database to ensure that no readers or
writers have access to that data.

Oracle's protection mechanism is to shut down the primary if there is no
sync standby available. Maximum Protection. Any other mode must
therefore be less than maximum protection, according to Oracle, and me.
"Available" here means a standby that has not timed out, as set via a
parameter.

Shutting down the main server is cool, as long as you fail over to one of
the standbys. If there aren't any standbys, or you don't have a
mechanism for switching quickly, you have availability problems.

What shutting down the server doesn't do is keep the data safe for
transactions that were in their commit-wait phase when the disconnect
occurs. That data exists, yet will not have been transferred to the
standby.

From now on, I also say we should wait forever. It is the safest mode and I
want no argument about whether sync rep is safe or not. We can introduce
a more relaxed mode later with high availability for the primary. That
is possible and in some cases desirable.

Now, when we lose the last sync standby we have three choices:

 1. reconnect the standby, or wait for a potential standby to catch up

 2. immediate shutdown of master and failover to one of the standbys

 3. reclassify an async standby as a sync standby

More than likely we would attempt to do (1) for a while, then do (2).
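
The "try (1) for a while, then do (2)" policy can be sketched as below. This is only an illustration of the decision, assuming it lives in some external monitoring or cluster-management script. The names `Action`, `choose_action` and `failover_after_seconds` are hypothetical and are not PostgreSQL parameters or part of the patch. Choice (3) is left out to keep the sketch short.

```python
from enum import Enum


class Action(Enum):
    WAIT_FOR_STANDBY = 1       # choice (1): reconnect, or let a standby catch up
    SHUTDOWN_AND_FAILOVER = 2  # choice (2): stop the master, fail over to a standby


def choose_action(seconds_without_sync_standby: float,
                  failover_after_seconds: float) -> Action:
    """Hypothetical operator policy: try (1) for a while, then fall back to (2)."""
    if seconds_without_sync_standby < failover_after_seconds:
        return Action.WAIT_FOR_STANDBY
    return Action.SHUTDOWN_AND_FAILOVER
```

The threshold here belongs to the operator's tooling, not to the server: the primary itself just keeps waiting.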

This means that when we start up, the primary will freeze for a while
until the sync standbys connect. Which is OK, since they try to
reconnect every 5 seconds and we don't plan on shutting down the primary
much anyway.

I've removed the timeout parameter, plus if we begin waiting we wait
until released, forever if that's how long it takes.
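
As a rough model of "wait until released", here is a toy sketch in Python. It is not the patch's code. The class `SyncRepWaitQueue`, the function `wait_for_sync_rep` and the use of plain integers for LSNs are all assumptions made for illustration. The point it shows is that the wait takes no timeout, and that a waiter is only woken once a standby has confirmed WAL up to its commit position.

```python
import threading


class SyncRepWaitQueue:
    """Toy model: committing backends wait until a standby confirms their LSN."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = []  # list of (commit_lsn, threading.Event)

    def register(self, commit_lsn: int) -> threading.Event:
        """Queue a waiter for the given commit LSN and return its wake-up event."""
        event = threading.Event()
        with self._lock:
            self._waiters.append((commit_lsn, event))
        return event

    def release_up_to(self, acked_lsn: int) -> int:
        """Wake every waiter whose commit LSN the standby has confirmed."""
        with self._lock:
            ready = [w for w in self._waiters if w[0] <= acked_lsn]
            self._waiters = [w for w in self._waiters if w[0] > acked_lsn]
        for _, event in ready:
            event.set()
        return len(ready)


def wait_for_sync_rep(event: threading.Event) -> None:
    # No timeout argument: block until released, however long that takes.
    event.wait()
```

A backend would call `register()` with its commit position and then `wait_for_sync_rep()`. Whatever handles standby replies would call `release_up_to()` with the confirmed position.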

The recommendation to use more than one standby remains.

Fast shutdown will wake backends from their latch and there isn't any
changed interrupt behaviour any more.

synchronous_standby_names = '*' is no longer the default

On a positive note this is one less parameter and will improve
performance as well.

All above changes made.

Ready to commit, barring concrete objections to important behaviour.

I will do one final check tomorrow evening then commit.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
