Re: Standalone synchronous master - Mailing list pgsql-hackers
From | Alexander Björnhagen |
---|---|
Subject | Re: Standalone synchronous master |
Date | |
Msg-id | CAO-C5=nc5VCyTFzxorpmuWX2jLEqK+FEgTbWcMqUkqLJK04wig@mail.gmail.com |
In response to | Re: Standalone synchronous master (Fujii Masao <masao.fujii@gmail.com>) |
Responses |
Re: Standalone synchronous master
|
List | pgsql-hackers |
Hello, and thank you for your feedback, I appreciate it.

Updated patch: sync-standalone-v2.patch

I am sorry to be spamming the list, but I just cleaned it up a little bit, wrote better comments and tried to move most of the logic into syncrep.c, since that's where it belongs anyway. I also fixed a small bug where standalone mode was disabled/enabled at runtime via SIGHUP.

> Basically I like this whole idea, but I'd like to know why do you think this functionality is required?

How should a synchronous master handle the situation where all standbys have failed?

Well, I think this is one of those cases where you could argue either way. Someone caring more about high availability of the system will want to let the master continue and just raise an alert to the operators. Someone looking for an absolute guarantee of data replication will say otherwise.

I don't like introducing config variables just for the fun of it, but I think in this case there is no right and wrong.

Oracle Data Guard replication has three different configurable modes called "performance/availability/protection", which for Postgres correspond exactly to "async/sync+standalone/sync".

> When is the replication mode switched from "standalone" to "sync"?

Good question. Currently that happens when a standby server has connected and has also been deemed suitable for synchronous commit by the master (meaning that its name matches the config variable synchronous_standby_names). So in a setup with both synchronous and asynchronous standbys, the master only considers the synchronous ones when deciding on standalone mode. The asynchronous standbys are "useless" to a synchronous master anyway.

> The former might block the transactions for a long time until the standby has caught up with the master even though synchronous_standalone_master is enabled and a user wants to avoid such a downtime.

If we are talking about a network "glitch", then the standby would take a few seconds/minutes to catch up (not hours!), which is acceptable if you ask me.

If we are talking about, say, a node failure, I suppose the workaround even on current code is to bring up the new standby first as asynchronous and then simply switch it to synchronous by editing synchronous_standby_names on the master. Voila! :)

So in effect this is a non-issue since there is a possible workaround, agree?

> 1. While synchronous replication is running normally, replication connection is closed because of network outage.
> 2. The master works standalone because of synchronous_standalone_master=on and some new transactions are committed though their WAL records are not replicated to the standby.
> 3. The master crashes for some reasons, the clusterware detects it and triggers a failover.
> 4. The standby which doesn't have recent committed transactions becomes the master at a failover...
>
> Is this scenario acceptable?

So you have two separate failures in less time than an admin would need to react and manually bring up a new standby. I'd argue that your system is not designed to be redundant enough if that kind of scenario worries you.

But the point where it all goes wrong is where the "clusterware" decides to fail over automatically. In that kind of setup synchronous_standalone_master should most likely be off, but again, if the "clusterware" is not smart enough to take the right decision then it should not act at all. Better to just log critical alerts, send SMS to people etc.

Makes sense? :)

Cheers,
/A
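
To make the switching rule described above a bit more concrete, here is a rough Python model of it. This is only an illustrative sketch and not the code in the patch (which is C in syncrep.c): the `Standby` class, the `replication_mode` function and the "blocked" label are hypothetical names made up for this example, and synchronous_standby_names is simplified to a plain list of names without wildcards or priorities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Standby:
    """Hypothetical view of one standby as seen from the master."""
    name: str        # name the standby reports to the master
    connected: bool  # True while its replication connection is up


def replication_mode(standbys, synchronous_standby_names, standalone_allowed):
    """Return "sync", "standalone" or "blocked" for a synchronous master.

    Assumption: a standby counts only if it is connected AND its name is
    listed in synchronous_standby_names. Asynchronous standbys are ignored.
    """
    sync_names = set(synchronous_standby_names)
    has_sync_standby = any(
        s.connected and s.name in sync_names for s in standbys
    )
    if has_sync_standby:
        # At least one suitable standby: commits wait for it as usual.
        return "sync"
    # No suitable standby left: either continue alone or make commits wait,
    # depending on the (proposed) synchronous_standalone_master setting.
    return "standalone" if standalone_allowed else "blocked"


if __name__ == "__main__":
    standbys = [Standby("s1", connected=False), Standby("a1", connected=True)]
    # Only the asynchronous standby "a1" is up, so it does not count.
    print(replication_mode(standbys, ["s1"], standalone_allowed=True))
```

With the sample data the master reports "standalone", because the only connected standby is not listed in synchronous_standby_names. Once "s1" reconnects, the same call would return "sync" again.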