Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication |
Date | |
Msg-id | Yz3wUxW2a3raVbfJ@momjian.us |
In response to | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Responses |
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
|
List | pgsql-hackers |
On Sat, Oct 1, 2022 at 06:59:26AM +0530, Bharath Rupireddy wrote:
> > I have always felt this has to be done at the server level, meaning when
> > a synchronous_standby_names replica is not responding after a certain
> > timeout, the administrator must be notified by calling a shell command
> > defined in a GUC and all sessions will ignore the replica. This gives a
                         ------------------------------------
> > much more predictable and useful behavior than the one in the patch ---
> > we have discussed this approach many times on the email lists.
>
> IIUC, each walsender serving a sync standby will determine that the
> sync standby isn't responding for a configurable amount of time (less
> than wal_sender_timeout) and calls shell command to notify the admin
> if there are any backends waiting for sync replication in
> SyncRepWaitForLSN(). The shell command then provides the unresponsive
> sync standby name at the bare minimum for the admin to ignore it as
> sync standby/remove it from synchronous_standby_names to continue
> further. This still requires manual intervention which is a problem if
> running postgres server instances at scale. Also, having a new shell

As I highlighted above, by default you notify the administrator that a
synchronous replica is not responding and then ignore it. If it becomes
responsive again, you notify the administrator again and add it back as
a synchronous replica.

> command in any form may pose security risks. I'm not sure at this
> point how this new timeout is going to work alongside
> wal_sender_timeout.

We have archive_command, so I don't see a problem with another shell
command.

> I'm thinking about the possible options that an admin has to get out
> of this situation:
> 1) Removing the standby from synchronous_standby_names.

Yes, see above. We might need a read-only GUC that reports which
synchronous replicas are active. As you can see, there is a lot of API
design required here, but this is the most effective approach.

> 2) Fixing the sync standby, by restarting or restoring the lost part
> (such as network or some other).
>
> (1) is something that postgres can help admins get out of the problem
> easily and automatically without any intervention. (2) is something
> postgres can't do much about.
>
> How about we let postgres automatically remove an unresponsive (for a
> pre-configured time) sync standby from synchronous_standby_names and
> inform the user (via log message and via new walsender property and
> pg_stat_replication for monitoring purposes)? The users can then
> detect such standbys and later try to bring them back to the sync
> standbys group or do other things. I believe that a production level
> postgres HA with sync standbys will have monitoring to detect the
> replication lag, failover decision etc via monitoring
> pg_stat_replication. With this approach, a bit more monitoring is
> needed. This solution requires less or no manual intervention and
> scales well. Please note that I haven't studied the possibilities of
> implementing it yet.
>
> Thoughts?

Yes, see above.

> > Once we have that, we can consider removing the cancel ability while
> > waiting for synchronous replicas (since we have the timeout) or make it
> > optional. We can also consider how do notify the administrator during
> > query cancel (if we allow it), backend abrupt exit/crash, and
>
> Yeah. If we have the
> timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> the users can then choose to disable processing query cancels/proc
> dies while waiting for sync replication in SyncRepWaitForLSN().

Yes. We might also change things so a query cancel that happens during
synchronous replica waiting can only be done by an administrator, not
the session owner. Again, lots of design needed here.

> > if we
> > should allow users to specify a retry interval to resynchronize the
> > synchronous replicas.
>
> This is another interesting thing to consider if we were to make the
> auto-removed (by the above approach) standby a sync standby again
> without manual intervention.

Yes, see above. You are addressing the right questions here. :-)

--
  Bruce Momjian <bruce@momjian.us> https://momjian.us
  EDB https://enterprisedb.com

  Indecision is a decision. Inaction is an action. Mark Batterson
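For readers following the thread, the notify-then-ignore behavior described above (notify the administrator when a synchronous replica stops responding, ignore it, then notify again and add it back once it responds) can be modeled roughly as below. This is only an illustrative sketch and not PostgreSQL code. Every name in it (`SyncStandbyTracker`, `notify`, `active_standbys`) is hypothetical, the `notify` callback merely stands in for the proposed shell-command GUC, and the timeout value is arbitrary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class SyncStandbyTracker:
    """Toy model of the timeout-and-ignore idea; all names are hypothetical."""

    timeout: float                          # seconds of silence tolerated
    notify: Callable[[str, str], None]      # stand-in for the shell-command GUC
    last_reply: Dict[str, float] = field(default_factory=dict)
    ignored: Set[str] = field(default_factory=set)

    def reply_received(self, name: str, now: float) -> None:
        """Record a reply; add the standby back if it had been ignored."""
        self.last_reply[name] = now
        if name in self.ignored:
            self.ignored.discard(name)
            self.notify(name, "responsive")

    def check(self, now: float) -> None:
        """Ignore, and report once, each standby silent longer than timeout."""
        for name, seen in self.last_reply.items():
            if name not in self.ignored and now - seen > self.timeout:
                self.ignored.add(name)
                self.notify(name, "unresponsive")

    def active_standbys(self) -> List[str]:
        """Roughly what a read-only 'active synchronous replicas' GUC might report."""
        return sorted(n for n in self.last_reply if n not in self.ignored)


if __name__ == "__main__":
    tracker = SyncStandbyTracker(
        timeout=10.0,
        notify=lambda name, state: print(f"notify admin: {name} is {state}"),
    )
    tracker.reply_received("standby1", now=0.0)
    tracker.reply_received("standby2", now=8.0)
    tracker.check(now=12.0)                       # standby1 has been silent for 12s
    print(tracker.active_standbys())
    tracker.reply_received("standby1", now=15.0)  # standby1 comes back
    print(tracker.active_standbys())
```

The sketch deliberately leaves out the open design questions raised in the message, such as how the new timeout would interact with wal_sender_timeout, who may cancel a waiting query, and whether there should be a retry interval.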