Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Date
Msg-id Yz3wUxW2a3raVbfJ@momjian.us
In response to Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Sat, Oct  1, 2022 at 06:59:26AM +0530, Bharath Rupireddy wrote:
> > I have always felt this has to be done at the server level, meaning when
> > a synchronous_standby_names replica is not responding after a certain
> > timeout, the administrator must be notified by calling a shell command
> > defined in a GUC and all sessions will ignore the replica.  This gives a
                         ------------------------------------
> > much more predictable and useful behavior than the one in the patch ---
> > we have discussed this approach many times on the email lists.
> 
> IIUC, each walsender serving a sync standby will determine that the
> sync standby isn't responding for a configurable amount of time (less
> than wal_sender_timeout) and calls shell command to notify the admin
> if there are any backends waiting for sync replication in
> SyncRepWaitForLSN(). The shell command then provides the unresponsive
> sync standby name at the bare minimum for the admin to ignore it as
> sync standby/remove it from synchronous_standby_names to continue
> further. This still requires manual intervention which is a problem if
> running postgres server instances at scale. Also, having a new shell

As I highlighted above, by default you notify the administrator that a
synchronous replica is not responding and then ignore it.  If it becomes
responsive again, you notify the administrator again and add it back as
a synchronous replica.
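
To make the intended behavior concrete, here is a rough sketch of the
ignore/restore cycle as a toy model.  It is not PostgreSQL code: the
class, the timeout_s setting, and the notify callback (standing in for
the GUC-defined shell command) are all hypothetical names chosen for
illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class SyncStandbyTracker:
    """Toy model of the proposed behavior; names are hypothetical."""

    timeout_s: float                    # assumed "standby not responding" timeout
    notify: Callable[[str, str], None]  # stands in for the GUC-defined shell command
    last_reply: Dict[str, float] = field(default_factory=dict)
    ignored: Set[str] = field(default_factory=set)

    def record_reply(self, name: str, now: float) -> None:
        """A standby replied; if it had been ignored, add it back and notify."""
        self.last_reply[name] = now
        if name in self.ignored:
            self.ignored.discard(name)
            self.notify(name, "restored")

    def check(self, now: float) -> None:
        """Ignore any standby silent for longer than the timeout; notify once."""
        for name, seen in self.last_reply.items():
            if name not in self.ignored and now - seen > self.timeout_s:
                self.ignored.add(name)
                self.notify(name, "ignored")

    def active(self) -> List[str]:
        """Standbys that sessions would currently wait for."""
        return sorted(n for n in self.last_reply if n not in self.ignored)
```

The administrator is notified once on each transition, not on every
check, and active() plays the role of a read-only report of which
synchronous replicas are currently being waited for.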

> command in any form may pose security risks. I'm not sure at this
> point how this new timeout is going to work alongside
> wal_sender_timeout.

We have archive_command, so I don't see a problem with another shell
command.
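
As a sketch of what such a command setting could look like, the
function below expands a command template in the style of
archive_command.  The %n (standby name) and %e (event) placeholders are
assumptions made up for this example, not an existing or proposed
PostgreSQL interface; only the %% escape mirrors what archive_command
already does.  The substituted values are shell-quoted, which is one
way to limit the injection risk mentioned above.

```python
import shlex


def expand_notify_command(template: str, standby: str, event: str) -> str:
    """Expand hypothetical %n and %e placeholders, plus %%, in a template.

    Substituted values are shell-quoted so that an odd standby name
    cannot inject extra shell syntax into the resulting command line.
    """
    values = {"n": shlex.quote(standby), "e": shlex.quote(event), "%": "%"}
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        # A '%' followed by a known placeholder letter is replaced;
        # anything else is copied through unchanged.
        if ch == "%" and i + 1 < len(template) and template[i + 1] in values:
            out.append(values[template[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, the template "notify-admin %n %e" with standby "standby1"
and event "ignored" expands to "notify-admin standby1 ignored".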

> I'm thinking about the possible options that an admin has to get out
> of this situation:
> 1) Removing the standby from synchronous_standby_names.

Yes, see above.  We might need a read-only GUC that reports which
synchronous replicas are active.  As you can see, there is a lot of API
design required here, but this is the most effective approach.

> 2) Fixing the sync standby, by restarting or restoring the lost part
> (such as network or some other).
> 
> (1) is something that postgres can help admins get out of the problem
> easily and automatically without any intervention. (2) is something
> postgres can't do much about.
> 
> How about we let postgres automatically remove an unresponsive (for a
> pre-configured time) sync standby from synchronous_standby_names and
> inform the user (via log message and via new walsender property and
> pg_stat_replication for monitoring purposes)? The users can then
> detect such standbys and later try to bring them back to the sync
> standbys group or do other things. I believe that a production level
> postgres HA with sync standbys will have monitoring to detect the
> replication lag, failover decision etc via monitoring
> pg_stat_replication. With this approach, a bit more monitoring is
> needed. This solution requires less or no manual intervention and
> scales well. Please note that I haven't studied the possibilities of
> implementing it yet.
> 
> Thoughts?

Yes, see above.

> > Once we have that, we can consider removing the cancel ability while
> > waiting for synchronous replicas (since we have the timeout) or make it
> > optional.  We can also consider how to notify the administrator during
> > query cancel (if we allow it), backend abrupt exit/crash, and
> 
> Yeah. If we have the
> timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> the users can then choose to disable processing query cancels/proc
> dies while waiting for sync replication in SyncRepWaitForLSN().

Yes.  We might also change things so a query cancel that happens during
synchronous replica waiting can only be done by an administrator, not the
session owner.  Again, lots of design needed here.

> > if we
> > should allow users to specify a retry interval to resynchronize the
> > synchronous replicas.
> 
> This is another interesting thing to consider if we were to make the
> auto-removed (by the above approach) standby a sync standby again
> without manual intervention.

Yes, see above.  You are addressing the right questions here.  :-)

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson



