Re: Standalone synchronous master - Mailing list pgsql-hackers

From Alexander Björnhagen
Subject Re: Standalone synchronous master
Msg-id CAO-C5=nc5VCyTFzxorpmuWX2jLEqK+FEgTbWcMqUkqLJK04wig@mail.gmail.com
In response to Re: Standalone synchronous master  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Standalone synchronous master
List pgsql-hackers
Hello, and thank you for your feedback; I appreciate it.

Updated patch : sync-standalone-v2.patch

I am sorry to be spamming the list, but I just cleaned it up a little
bit, wrote better comments and tried to move most of the logic into
syncrep.c since that's where it belongs anyway. I also fixed a small
bug that occurred when standalone mode was disabled/enabled at runtime
via SIGHUP.

> Basically I like this whole idea, but I'd like to know why do you think this functionality is required?

How should a synchronous master handle the situation where all
standbys have failed ?

Well, I think this is one of those cases where you could argue either
way. Someone caring more about high availability of the system will
want to let the master continue and just raise an alert to the
operators. Someone looking for an absolute guarantee of data
replication will say otherwise.

I don’t like introducing config variables just for the fun of it, but
I think in this case there is no right and wrong.

Oracle Data Guard replication has three different configurable modes
called “performance/availability/protection”, which for Postgres
correspond exactly to “async/sync+standalone/sync”.

> When is the replication mode switched from "standalone" to "sync"?

Good question. Currently that happens when a standby server has
connected and has also been deemed suitable for synchronous commit by
the master (meaning that its name matches the config variable
synchronous_standby_names). So in a setup with both synchronous and
asynchronous standbys, the master only considers the synchronous ones
when deciding on standalone mode. The asynchronous standbys are
“useless” to a synchronous master anyway.
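
To illustrate the rule, here is a minimal sketch in Python. It is not
code from the patch: the function names and the mode labels are made up
for illustration, and it assumes plain exact-name matching against
synchronous_standby_names.

```python
# Hypothetical model of the rule described above; not code from the patch.

def sync_standbys(connected, synchronous_standby_names):
    """Return the connected standbys the master would treat as synchronous.

    `connected` holds the names of currently connected standbys and
    `synchronous_standby_names` holds the names listed in the config
    variable. Exact-name matching is a simplifying assumption.
    """
    return [name for name in connected if name in synchronous_standby_names]


def master_mode(connected, synchronous_standby_names, standalone_enabled):
    """Classify the master as 'async', 'sync', 'standalone' or 'blocked'."""
    if not synchronous_standby_names:
        return "async"  # no synchronous replication configured at all
    if sync_standbys(connected, synchronous_standby_names):
        return "sync"  # commits wait for a synchronous standby
    # No suitable standby is connected; asynchronous ones do not count.
    return "standalone" if standalone_enabled else "blocked"


if __name__ == "__main__":
    # Only an asynchronous standby is connected, so the master stays standalone.
    print(master_mode(["s2"], ["s1"], standalone_enabled=True))
    # Once the named standby connects, the master is back in sync mode.
    print(master_mode(["s1", "s2"], ["s1"], standalone_enabled=True))
```

In this model an asynchronous standby never ends standalone mode; only
a standby whose name is listed does.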

> The former might block the transactions for a long time until the
> standby has caught up with the master even though
> synchronous_standalone_master is enabled and a user wants to avoid
> such a downtime.

If we are talking about a network “glitch”, then the standby would take
a few seconds/minutes to catch up (not hours!), which is acceptable if
you ask me.

If we are talking about say a node failure, I suppose the workaround
even on current code is to bring up the new standby first as
asynchronous and then simply switch it to synchronous by editing
synchronous_standby_names on the master. Voila ! :)

So in effect this is a non-issue since there is a possible workaround, agree ?

> 1. While synchronous replication is running normally, replication
>    connection is closed because of network outage.
> 2. The master works standalone because of
>    synchronous_standalone_master=on and some new transactions are
>    committed though their WAL records are not replicated to the
>    standby.
> 3. The master crashes for some reasons, the clusterware detects it
>    and triggers a failover.
> 4. The standby which doesn't have recent committed transactions
>    becomes the master at a failover...

> Is this scenario acceptable?

So you have two separate failures in less time than an admin would
have time to react and manually bring up a new standby.

I’d argue that your system is not designed to be redundant enough if
that kind of scenario worries you. But the point where it all goes
wrong is where the “clusterware” decides to fail over automatically.
In that kind of setup synchronous_standalone_master should most likely
be off, but again, if the “clusterware” is not smart enough to take the
right decision then it should not act at all. Better to just log
critical alerts, send SMS to people, etc.
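
To make the quoted scenario concrete, here is a toy sketch in Python of
what an automatic failover would lose. It is only an illustration under
stated assumptions: the event names and both functions are invented,
transactions are plain strings rather than WAL records, and standalone
mode is assumed to be on so that commits never block.

```python
# Hypothetical simulation of the four-step scenario; not code from the patch.

def simulate(events):
    """Replay events and return (master_wal, standby_wal).

    Events are ("commit", txn), ("disconnect",) or ("reconnect",).
    Standalone mode is assumed to be enabled, so commits never block.
    """
    master_wal, standby_wal = [], []
    connected = True
    for event in events:
        if event[0] == "commit":
            master_wal.append(event[1])
            if connected:
                # Replicated before the commit is acknowledged.
                standby_wal.append(event[1])
        elif event[0] == "disconnect":
            connected = False  # step 1: network outage
        elif event[0] == "reconnect":
            connected = True
            standby_wal[:] = master_wal  # the standby catches up
    return master_wal, standby_wal


def lost_on_failover(master_wal, standby_wal):
    """Transactions acknowledged by the master but missing on the standby."""
    return [txn for txn in master_wal if txn not in standby_wal]


if __name__ == "__main__":
    events = [("commit", "t1"), ("disconnect",),
              ("commit", "t2"), ("commit", "t3")]
    master, standby = simulate(events)
    # If the master crashes here and the standby is promoted,
    # these commits are gone:
    print(lost_on_failover(master, standby))
```

In this toy model, failing over before the standby has reconnected and
caught up loses t2 and t3; failing over after a reconnect loses nothing.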

Makes sense ? :)

Cheers,

/A
