Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Date
Msg-id CALj2ACVuivbfS4zNGKPRE=rmme2VxC9obBjoOdrw5k+JKVk3UA@mail.gmail.com
Whole thread Raw
In response to Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication  (Bruce Momjian <bruce@momjian.us>)
Responses Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Fri, Sep 30, 2022 at 4:23 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> I don't think this patch is going in the right direction, and I think we
> need to step back to see why.

Thanks a lot Bruce for providing insights.

> First, let's see how synchronous replication works.  Each backend waits
> for one or more synchronous replicas to acknowledge the WAL related to
> its commit and then it marks the commit as done in PGPROC and then to
> the client;  I wrote a blog about it:
>
>         https://momjian.us/main/blogs/pgblog/2020.html#June_3_2020

Great!

> So, what happens when an insufficient number of synchronous replicas
> reply?  Sessions hang because the synchronous behavior cannot be
> guaranteed.  We then _allow_ query cancel so the user or administrator
> can get out of the hung sessions and perhaps modify
> synchronous_standby_names.

Right.

> What the proposed patch effectively does is to prevent the ability to
> recovery the hung sessions or auto-continue the sessions if an
> insufficient number of synchronous replicas respond.  So, in a way, it
> is allowing for more strict and less strict behavior of
> synchronous_standby_names.

Yes. I do agree that it makes the sessions further wait and closes the
quick exit path that admins/users have when the problem occurs. But it
disallows users cancelling queries or terminating backends only to end
up in locally-committed-but-not-replicated transactions on a server
setup where sync standbys all are working fine. I agree that we don't
want to further wait on cancels or proc dies as it makes the system
more unresponsive [1].

> However, I think trying to solve this at the session level is the wrong
> approach.  If you set a timeout to continue stuck sessions, odds are the
> timeout will be too long for each session and performance will be
> unacceptable anyway, so you haven't gained much.  If you prevent
> cancels, you effectively lock up the system with fewer options of
> recovery.

Yes.

> I have always felt this has to be done at the server level, meaning when
> a synchronous_standby_names replica is not responding after a certain
> timeout, the administrator must be notified by calling a shell command
> defined in a GUC and all sessions will ignore the replica.  This gives a
> much more predictable and useful behavior than the one in the patch ---
> we have discussed this approach many times on the email lists.

IIUC, each walsender serving a sync standby will determine that the
sync standby isn't responding for a configurable amount of time (less
than wal_sender_timeout) and calls shell command to notify the admin
if there are any backends waiting for sync replication in
SyncRepWaitForLSN(). The shell command then provides the unresponsive
sync standby name at the bare minimum for the admin to ignore it as
sync standby/remove it from synchronous_standby_names to continue
further. This still requires manual intervention which is a problem if
running postgres server instances at scale. Also, having a new shell
command in any form may pose security risks. I'm not sure at this
point how this new timeout is going to work alongside
wal_sender_timeout.

I'm thinking about the possible options that an admin has to get out
of this situation:
1) Removing the standby from synchronous_standby_names.
2) Fixing the sync standby, by restarting or restoring the lost part
(such as network or some other).

(1) is something that postgres can help admins get out of the problem
easily and automatically without any intervention. (2) is something
postgres can't do much about.

How about we let postgres automatically remove an unresponsive (for a
pre-configured time) sync standby from synchronous_standby_names and
inform the user (via log message and via new walsender property and
pg_stat_replication for monitoring purposes)? The users can then
detect such standbys and later try to bring them back to the sync
standbys group or do other things. I believe that a production level
postgres HA with sync standbys will have monitoring to detect the
replication lag, failover decision etc via monitoring
pg_stat_replication. With this approach, a bit more monitoring is
needed. This solution requires less or no manual intervention and
scales well. Please note that I haven't studied the possibilities of
implementing it yet.

Thoughts?

> Once we have that, we can consider removing the cancel ability while
> waiting for synchronous replicas (since we have the timeout) or make it
> optional.  We can also consider how do notify the administrator during
> query cancel (if we allow it), backend abrupt exit/crash, and

Yeah. If we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
the users can then choose to disable processing query cancels/proc
dies while waiting for sync replication in SyncRepWaitForLSN().

> if we
> should allow users to specify a retry interval to resynchronize the
> synchronous replicas.

This is another interesting thing to consider if we were to make the
auto-removed (by the above approach) standby a sync standby again
without manual intervention.

Thoughts?

[1] https://www.postgresql.org/message-id/CA%2BTgmoaCBwgMDkeBDOgtPgHcbfSYq%2BzORjL5DoU3pJyjALxtoQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [PATCH v1] [meson] add a default option prefix=/usr/local/pgsql
Next
From: Peter Geoghegan
Date:
Subject: Re: problems with making relfilenodes 56-bits