Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers
From: Bharath Rupireddy
Subject: Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Msg-id: CALj2ACVuivbfS4zNGKPRE=rmme2VxC9obBjoOdrw5k+JKVk3UA@mail.gmail.com
In response to: Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication (Bruce Momjian <bruce@momjian.us>)
Responses: Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
List: pgsql-hackers
On Fri, Sep 30, 2022 at 4:23 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> I don't think this patch is going in the right direction, and I think we
> need to step back to see why.

Thanks a lot, Bruce, for providing insights.

> First, let's see how synchronous replication works. Each backend waits
> for one or more synchronous replicas to acknowledge the WAL related to
> its commit and then it marks the commit as done in PGPROC and then to
> the client; I wrote a blog about it:
>
> https://momjian.us/main/blogs/pgblog/2020.html#June_3_2020

Great!

> So, what happens when an insufficient number of synchronous replicas
> reply? Sessions hang because the synchronous behavior cannot be
> guaranteed. We then _allow_ query cancel so the user or administrator
> can get out of the hung sessions and perhaps modify
> synchronous_standby_names.

Right.

> What the proposed patch effectively does is to prevent the ability to
> recovery the hung sessions or auto-continue the sessions if an
> insufficient number of synchronous replicas respond. So, in a way, it
> is allowing for more strict and less strict behavior of
> synchronous_standby_names.

Yes. I do agree that it makes the sessions wait further and closes the
quick exit path that admins/users have when the problem occurs. But it
only disallows users from cancelling queries or terminating backends in
ways that would end up with locally-committed-but-not-replicated
transactions on a setup where the sync standbys are otherwise working
fine. I agree that we don't want to keep waiting on cancels or proc
dies, as that makes the system more unresponsive [1].

> However, I think trying to solve this at the session level is the wrong
> approach. If you set a timeout to continue stuck sessions, odds are the
> timeout will be too long for each session and performance will be
> unacceptable anyway, so you haven't gained much. If you prevent
> cancels, you effectively lock up the system with fewer options of
> recovery.

Yes.
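To make the problem concrete, here is a toy Python simulation (not PostgreSQL code; the class and event names are made up for illustration) of the semantics discussed above: the commit record is flushed locally before the backend enters the synchronous wait, so a query cancel can only abandon the wait, never undo the local commit.

```python
import threading

class SyncRepWaiter:
    """Toy model of a backend waiting for standby acknowledgement
    (as in SyncRepWaitForLSN()): the commit is already durable locally
    before the wait starts, so a cancel abandons the wait but leaves
    the transaction committed locally."""

    def __init__(self):
        self.ack = threading.Event()      # set by a simulated walsender
        self.cancel = threading.Event()   # set by a simulated query cancel

    def commit_and_wait(self, poll_interval=0.05):
        locally_committed = True          # WAL already flushed to local disk
        while not self.ack.is_set():
            if self.cancel.is_set():
                # The commit stands locally even though replication
                # was never confirmed -- the situation this thread
                # is trying to avoid.
                return ("committed locally, replication not confirmed",
                        locally_committed)
            self.ack.wait(poll_interval)
        return ("replicated", locally_committed)

waiter = SyncRepWaiter()
waiter.cancel.set()   # user cancels while the standby is unresponsive
print(waiter.commit_and_wait())
# -> ('committed locally, replication not confirmed', True)
```

The point of the sketch is that once the wait begins, every exit path (ack or cancel) leaves `locally_committed` true; only the replication guarantee differs.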
> I have always felt this has to be done at the server level, meaning when
> a synchronous_standby_names replica is not responding after a certain
> timeout, the administrator must be notified by calling a shell command
> defined in a GUC and all sessions will ignore the replica. This gives a
> much more predictable and useful behavior than the one in the patch ---
> we have discussed this approach many times on the email lists.

IIUC, each walsender serving a sync standby would determine that its
standby has not been responding for a configurable amount of time (less
than wal_sender_timeout) and call a shell command to notify the admin if
there are any backends waiting for sync replication in
SyncRepWaitForLSN(). The shell command would provide, at a bare minimum,
the name of the unresponsive sync standby so that the admin can remove
it from synchronous_standby_names to continue. This still requires
manual intervention, which is a problem when running postgres server
instances at scale. Also, a new shell command in any form may pose
security risks. I'm also not sure at this point how the new timeout
would work alongside wal_sender_timeout.

I'm thinking about the possible options an admin has to get out of this
situation:

1) Remove the standby from synchronous_standby_names.
2) Fix the sync standby, by restarting it or restoring whatever was lost
(the network or something else).

(1) is something postgres can help admins with easily and automatically,
without any intervention. (2) is something postgres can't do much about.

How about letting postgres automatically remove an unresponsive (for a
pre-configured time) sync standby from synchronous_standby_names and
inform the user (via a log message and a new walsender property exposed
in pg_stat_replication for monitoring purposes)? Users can then detect
such standbys and later try to bring them back into the sync standby
group, or take other action.
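The auto-removal idea can be sketched in a few lines of Python (again a standalone illustration, not a patch; the function name, the ack-timestamp map, and the timeout parameter are all hypothetical):

```python
import time

def prune_unresponsive(sync_standbys, last_ack, timeout, now=None):
    """Hypothetical helper: drop standbys whose last acknowledgement is
    older than `timeout` seconds from the sync list, returning the kept
    list and the removed names (which, in the proposal above, would be
    logged and surfaced via pg_stat_replication)."""
    now = time.monotonic() if now is None else now
    kept, removed = [], []
    for name in sync_standbys:
        if now - last_ack.get(name, float("-inf")) > timeout:
            removed.append(name)   # admin is informed, e.g. via a log line
        else:
            kept.append(name)
    return kept, removed

kept, removed = prune_unresponsive(
    ["s1", "s2"], {"s1": 100.0, "s2": 40.0}, timeout=30, now=120.0)
# s1 last acked 20s ago -> kept; s2 last acked 80s ago (> 30s) -> removed
```

In the real server this decision would have to live alongside the walsender's existing keepalive machinery; the sketch only shows the timeout-based membership change itself.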
I believe that a production-level postgres HA setup with sync standbys
will already monitor pg_stat_replication to detect replication lag, make
failover decisions, etc. With this approach, a bit more monitoring is
needed. This solution requires little or no manual intervention and
scales well. Please note that I haven't studied the implementation
possibilities yet. Thoughts?

> Once we have that, we can consider removing the cancel ability while
> waiting for synchronous replicas (since we have the timeout) or make it
> optional. We can also consider how do notify the administrator during
> query cancel (if we allow it), backend abrupt exit/crash, and

Yeah. If we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
users can then choose to disable processing of query cancels/proc dies
while waiting for sync replication in SyncRepWaitForLSN().

> if we
> should allow users to specify a retry interval to resynchronize the
> synchronous replicas.

This is another interesting thing to consider if we were to make an
auto-removed (by the above approach) standby a sync standby again
without manual intervention.

Thoughts?

[1] https://www.postgresql.org/message-id/CA%2BTgmoaCBwgMDkeBDOgtPgHcbfSYq%2BzORjL5DoU3pJyjALxtoQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com