Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Date
Msg-id YzYh3NpCQAFkA6lF@momjian.us
In response to Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
List pgsql-hackers
On Tue, Sep 27, 2022 at 06:52:21PM +0530, Bharath Rupireddy wrote:
> On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > I've explained the problem with the current HA setup using synchronous
> > replication upthread at [1]. Let me reiterate it here.
> >
> > When a query is cancelled (a simple CTRL+C or a pg_cancel_backend()
> > call) while the txn is waiting for an ack in SyncRepWaitForLSN(), the
> > txn appears committed to the client like any other txn, even though
> > it is only locally committed and not yet replicated to all of the
> > sync standbys. A warning is emitted into the server logs, but it is
> > of no use to the client (think of clients as applications). There can
> > be many such txns waiting for an ack in SyncRepWaitForLSN(), and
> > query cancel can be issued on all of those sessions. The problem is
> > that subsequent reads will then be able to see the data of all of
> > those txns that are locally committed but not yet replicated to all
> > of the sync standbys - this is what I call an inconsistency (can we
> > call it a read-after-write inconsistency?) caused by the lack of
> > proper query cancel handling. And if the sync standbys are down or
> > unable to come up for some reason, the primary will keep serving
> > clients the inconsistent data until they do. BTW, I found a report of
> > this problem here [2].
> >
> > The solution proposed for the above problem is to just 'not honor
> > query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
> >
> > Similarly, when a proc die is pending, there can be txns that are
> > locally committed but not yet replicated to all of the sync standbys.
> > Typically, there are two choices for the clients: 1) reuse the
> > primary instance after restart, or 2) fail over to one of the sync
> > standbys. For case (1), there might be read-after-write inconsistency
> > as explained above. For case (2), those txns might get lost
> > completely if the failover target sync standby or the new primary
> > didn't receive them, and the other sync standbys that did receive
> > them are now ahead and need special treatment (running pg_rewind) to
> > be able to connect to the new primary.
> >
> > The solution proposed for case (1) of the above problem is to 'process
> > the ProcDiePending immediately and upon restart the first backend can
> > wait until the sync standbys are caught up to ensure no inconsistent
> > reads'.
> > The solution proposed for case (2) of the above problem is to 'either
> > run pg_rewind for the sync standbys that are ahead or use the idea
> > proposed at [3]'.
> >
> > I hope the above explanation helps.
> >
> > [1] https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
> > [2] https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
> > [3] https://www.postgresql.org/message-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw%40mail.gmail.com
> 
> I'm attaching the v2 patch rebased on the latest HEAD. Please note
> that there are still some open points; I've yet to find time to think
> more about them. Meanwhile, I'm posting the v2 patch to keep cfbot
> happy. Any further thoughts on the overall design of the patch are
> most welcome. Thanks.
... 
> In a PostgreSQL high availability setup with synchronous replication,
> typically a transaction is first committed locally, then streamed to
> the synchronous standbys, and the backend that generated the
> transaction waits for acknowledgement from the synchronous standbys.
> While waiting for acknowledgement, it may happen that the query or
> the transaction gets canceled, or the backend that's waiting for
> acknowledgement is asked to exit. In either of these cases, the wait
> for acknowledgement gets canceled, leaving the transaction in an
> inconsistent state: it is committed locally but not on the standbys.
> When the GUC synchronous_replication_naptime_before_cancel introduced
> in this patch is set, backends wait that long for the acknowledgement
> before canceling the wait, so that the transaction can become
> available on the synchronous standbys as well.
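
(To make the quoted behavior concrete, here is a rough, standalone model
of what that description implies.  This is my sketch based only on the
text above, not the patch's actual code; the GUC name comes from the
patch, the loop shape and the value are assumed.)

    #include <stdbool.h>
    #include <stdio.h>

    /* The patch's GUC; the 10-second value here is made up. */
    static int synchronous_replication_naptime_before_cancel = 10;

    /* Stand-in for the real latch wait; in this model the ack never
     * arrives. */
    static bool
    wait_one_second_for_ack(void)
    {
        return false;
    }

    /* What a cancel request would do while the backend is waiting for
     * the standby acknowledgement. */
    static void
    handle_cancel_while_waiting(void)
    {
        for (int elapsed = 0;
             elapsed < synchronous_replication_naptime_before_cancel;
             elapsed++)
        {
            if (wait_one_second_for_ack())
            {
                printf("ack arrived during the naptime; commit is replicated\n");
                return;
            }
        }
        printf("naptime expired; honoring the cancel, commit is local-only\n");
    }

    int
    main(void)
    {
        handle_cancel_while_waiting();
        return 0;
    }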

I don't think this patch is going in the right direction, and I think we
need to step back to see why.

First, let's see how synchronous replication works.  Each backend waits
for one or more synchronous replicas to acknowledge the WAL related to
its commit, then marks the commit as done in PGPROC, and then reports
success to the client;  I wrote a blog entry about it:

    https://momjian.us/main/blogs/pgblog/2020.html#June_3_2020
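
As a rough, standalone model of that ordering (a sketch, not PostgreSQL
source code): the commit record is flushed locally first, and only a
successful wait for the standby acknowledgement lets the backend report
success to the client.

    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned long XLogRecPtr;   /* stand-in for a WAL position */

    /* Models writing and flushing the commit record: the commit is now
     * durable on the primary before any standby has seen it. */
    static XLogRecPtr
    flush_commit_record_locally(void)
    {
        return 0x1000;
    }

    /* Models waiting in SyncRepWaitForLSN(): block until a sync standby
     * confirms it has the WAL up to commit_lsn, or the wait is
     * interrupted. */
    static bool
    wait_for_standby_ack(XLogRecPtr commit_lsn, bool ack_arrives)
    {
        (void) commit_lsn;
        return ack_arrives;
    }

    int
    main(void)
    {
        XLogRecPtr lsn = flush_commit_record_locally();

        if (wait_for_standby_ack(lsn, /* ack_arrives */ false))
            printf("commit reported to the client after standby ack\n");
        else
            printf("wait interrupted: commit %lx is durable locally but "
                   "not confirmed on any sync standby\n", lsn);
        return 0;
    }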

So, what happens when an insufficient number of synchronous replicas
reply?  Sessions hang because the synchronous behavior cannot be
guaranteed.  We then _allow_ query cancel so the user or administrator
can get out of the hung sessions and perhaps modify
synchronous_standby_names.
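
The escape hatch has roughly this shape (again a simplified, standalone
model rather than the actual SyncRepWaitForLSN() code, and the warning
text is paraphrased):

    #include <stdbool.h>
    #include <stdio.h>

    static bool ack_received;           /* a sync standby confirmed the LSN */
    static bool query_cancel_pending;   /* CTRL+C / pg_cancel_backend() */

    static void
    wait_for_ack_or_cancel(void)
    {
        for (;;)
        {
            if (ack_received)
                return;             /* normal path: commit fully replicated */

            if (query_cancel_pending)
            {
                printf("WARNING: canceling wait for synchronous replication; "
                       "the commit may not be on any standby\n");
                return;             /* commit stays committed locally anyway */
            }

            /* The real code sleeps on a latch here; this model simply
             * simulates the user cancelling the stuck session. */
            query_cancel_pending = true;
        }
    }

    int
    main(void)
    {
        wait_for_ack_or_cancel();
        return 0;
    }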

What the proposed patch effectively does is either prevent recovery of
the hung sessions or auto-continue the sessions when an insufficient
number of synchronous replicas respond.  So, in a way, it allows for
both stricter and less strict behavior of synchronous_standby_names.

However, I think trying to solve this at the session level is the wrong
approach.  If you set a timeout to continue stuck sessions, odds are the
timeout will be too long for each session and performance will be
unacceptable anyway, so you haven't gained much.  If you prevent
cancels, you effectively lock up the system with fewer options for
recovery.

I have always felt this has to be done at the server level, meaning
that when a synchronous_standby_names replica has not responded after a
certain timeout, the administrator is notified by calling a shell
command defined in a GUC, and all sessions then ignore the replica.
This gives much more predictable and useful behavior than the one in
the patch --- we have discussed this approach many times on the email
lists.
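
To illustrate the idea (a standalone sketch only; neither setting below
exists today, and the names sync_replica_ack_timeout and
sync_replica_failure_command are made up):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Standby
    {
        const char *name;
        int         secs_since_last_ack;
        bool        ignored;            /* dropped from the wait set */
    } Standby;

    /* Hypothetical GUCs. */
    static const int   sync_replica_ack_timeout = 30;      /* seconds */
    static const char *sync_replica_failure_command =
        "echo 'ALERT: synchronous standby stopped responding' >&2";

    static void
    check_standbys(Standby *standbys, int n)
    {
        for (int i = 0; i < n; i++)
        {
            if (!standbys[i].ignored &&
                standbys[i].secs_since_last_ack > sync_replica_ack_timeout)
            {
                /* Notify the administrator, then stop waiting on this
                 * standby so commits no longer hang on it. */
                (void) system(sync_replica_failure_command);
                standbys[i].ignored = true;
                printf("standby \"%s\" ignored after %d-second timeout\n",
                       standbys[i].name, sync_replica_ack_timeout);
            }
        }
    }

    int
    main(void)
    {
        Standby standbys[] = {
            {"standby1", 5, false},
            {"standby2", 45, false},    /* stopped responding */
        };

        check_standbys(standbys, 2);
        return 0;
    }

The point is that the degradation decision is made once, server-wide,
instead of each hung session deciding on its own.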

Once we have that, we can consider removing the cancel ability while
waiting for synchronous replicas (since we have the timeout) or making
it optional.  We can also consider how to notify the administrator
during query cancel (if we allow it) or abrupt backend exit/crash, and
whether we should allow users to specify a retry interval to
resynchronize the synchronous replicas.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson



