Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers
| From | Bruce Momjian |
|---|---|
| Subject | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication |
| Date | |
| Msg-id | YzYh3NpCQAFkA6lF@momjian.us |
| In response to | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
| Responses | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication; Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication |
| List | pgsql-hackers |
On Tue, Sep 27, 2022 at 06:52:21PM +0530, Bharath Rupireddy wrote:
> On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > I've explained the problem with the current HA setup with synchronous
> > replication upthread at [1]. Let me reiterate it here once again.
> >
> > When a query is cancelled (a simple stroke of CTRL+C or
> > pg_cancel_backend() call) while the txn is waiting for ack in
> > SyncRepWaitForLSN(); for the client, the txn is actually committed
> > (locally-committed-but-not-yet-replicated to all of sync standbys)
> > like any other txn, a warning is emitted into server logs but it is of
> > no use for the client (think of client as applications). There can be
> > many such txns waiting for ack in SyncRepWaitForLSN() and query cancel
> > can be issued on all of those sessions. The problem is that the
> > subsequent reads will then be able to read all of those
> > locally-committed-but-not-yet-replicated to all of sync standbys txns
> > data - this is what I call an inconsistency (can we call this a
> > read-after-write inconsistency?) because of lack of proper query
> > cancel handling. And if the sync standbys are down or unable to come
> > up for some reason, until then, the primary will be serving clients
> > with the inconsistent data. BTW, I found a report of this problem here
> > [2].
> >
> > The solution proposed for the above problem is to just 'not honor
> > query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
> >
> > When a proc die is pending, then also, there can be
> > locally-committed-but-not-yet-replicated to all of sync standbys txns.
> > Typically, there are two choices for the clients 1) reuse the primary
> > instance after restart 2) failover to one of sync standbys. For case
> > (1), there might be read-after-write inconsistency as explained above.
> > For case (2), those txns might get lost completely if the failover
> > target sync standby or the new primary didn't receive them and the
> > other sync standbys that have received them are now ahead and need a
> > special treatment (run pg_rewind) for them to be able to connect to
> > new primary.
> >
> > The solution proposed for case (1) of the above problem is to 'process
> > the ProcDiePending immediately and upon restart the first backend can
> > wait until the sync standbys are caught up to ensure no inconsistent
> > reads'.
> > The solution proposed for case (2) of the above problem is to 'either
> > run pg_rewind for the sync standbys that are ahead or use the idea
> > proposed at [3]'.
> >
> > I hope the above explanation helps.
> >
> > [1] https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
> > [2] https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
> > [3] https://www.postgresql.org/message-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw%40mail.gmail.com
>
> I'm attaching the v2 patch rebased on the latest HEAD. Please note
> that there are still some open points, I'm yet to find time to think
> more about them. Meanwhile, I'm posting the v2 patch for making cfbot
> happy. Any further thoughts on the overall design of the patch are
> most welcome. Thanks.

...
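For readers following along, a rough C sketch of the "do not honor query
cancels while waiting for ack in SyncRepWaitForLSN()" idea quoted above
might look like the fragment below. This is only an illustration of the
approach under discussion, not the actual patch; the real wait loop in
src/backend/replication/syncrep.c also manages the sync-rep wait queue,
proc-die handling, and interrupt bookkeeping, and the function name here
is invented.

    #include "postgres.h"

    #include "miscadmin.h"            /* QueryCancelPending, MyLatch */
    #include "replication/syncrep.h"  /* SYNC_REP_WAIT_COMPLETE */
    #include "storage/latch.h"        /* WaitLatch(), ResetLatch() */
    #include "storage/proc.h"         /* MyProc */
    #include "utils/wait_event.h"     /* WAIT_EVENT_SYNC_REP */

    /*
     * Sketch of a wait loop that refuses to act on query cancels, so a
     * locally committed transaction never becomes visible before the
     * required standbys have acknowledged it.
     */
    static void
    SyncRepWaitIgnoringCancels(void)
    {
        for (;;)
        {
            /* Clear the latch before re-checking our wakeup condition. */
            ResetLatch(MyLatch);

            /* Done once a walsender has marked our wait as complete. */
            if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
                break;

            /*
             * Proposed behavior: note a pending cancel but keep waiting,
             * instead of breaking out and returning a "committed" result
             * that the standbys have not yet seen.
             */
            if (QueryCancelPending)
            {
                QueryCancelPending = false;
                ereport(WARNING,
                        (errmsg("ignoring query cancel while waiting for synchronous replication")));
            }

            /* Sleep until a walsender sets our latch. */
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
                             -1L,
                             WAIT_EVENT_SYNC_REP);
        }
    }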
> In a PostgreSQL high availability setup with synchronous replication,
> typically all the transactions first get committed locally, then
> streamed to the synchronous standbys, and the backends that generated
> the transaction will wait for acknowledgement from the synchronous
> standbys. While waiting for acknowledgement, it may happen that the
> query or the transaction gets canceled or the backend that's waiting
> for acknowledgement is asked to exit. In either of these cases, the
> wait for acknowledgement gets canceled and leaves the transaction in
> an inconsistent state as it got committed locally but not on the
> standbys. When the GUC synchronous_replication_naptime_before_cancel
> introduced in this patch is set, it will let the backends wait that
> long for the acknowledgement before canceling the wait, so that the
> transaction can be available in the synchronous standbys as well.

I don't think this patch is going in the right direction, and I think
we need to step back to see why.

First, let's see how synchronous replication works.  Each backend waits
for one or more synchronous replicas to acknowledge the WAL related to
its commit, then marks the commit as done in PGPROC, and then reports
the commit to the client; I wrote a blog entry about it:

	https://momjian.us/main/blogs/pgblog/2020.html#June_3_2020

So, what happens when an insufficient number of synchronous replicas
reply?  Sessions hang because the synchronous behavior cannot be
guaranteed.  We then _allow_ query cancel so the user or administrator
can get out of the hung sessions and perhaps modify
synchronous_standby_names.

What the proposed patch effectively does is to prevent the ability to
recover the hung sessions, or to auto-continue the sessions if an
insufficient number of synchronous replicas respond.  So, in a way, it
is allowing for both a more strict and a less strict behavior of
synchronous_standby_names.

However, I think trying to solve this at the session level is the wrong
approach.  If you set a timeout to continue stuck sessions, odds are
the timeout will be too long for each session and performance will be
unacceptable anyway, so you haven't gained much.  If you prevent
cancels, you effectively lock up the system with fewer options for
recovery.

I have always felt this has to be done at the server level, meaning
that when a synchronous_standby_names replica is not responding after a
certain timeout, the administrator must be notified by calling a shell
command defined in a GUC, and all sessions will ignore the replica.
This gives a much more predictable and useful behavior than the one in
the patch --- we have discussed this approach many times on the email
lists.

Once we have that, we can consider removing the cancel ability while
waiting for synchronous replicas (since we have the timeout), or making
it optional.  We can also consider how to notify the administrator
during query cancel (if we allow it) or an abrupt backend exit/crash,
and whether we should allow users to specify a retry interval to
resynchronize the synchronous replicas.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
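To make the server-level idea above concrete, here is a hypothetical C
sketch of a per-standby timeout plus an administrator-notification
command.  The GUC names (sync_replica_timeout,
sync_replica_alert_command) and both helper functions are invented for
illustration; no such GUCs exist in PostgreSQL today, and deciding when
a standby counts as unresponsive and how waiting backends are released
is exactly what a real design would still have to settle.

    #include "postgres.h"

    #include <stdlib.h>               /* system() */

    #include "utils/timestamp.h"      /* GetCurrentTimestamp(),
                                         TimestampDifferenceExceeds() */

    /* Hypothetical GUCs -- not part of PostgreSQL today. */
    static int   sync_replica_timeout = 60000;        /* milliseconds */
    static char *sync_replica_alert_command = "";

    /* Has the standby been silent longer than the configured timeout? */
    static bool
    SyncStandbyTimedOut(TimestampTz last_ack_time)
    {
        return TimestampDifferenceExceeds(last_ack_time,
                                          GetCurrentTimestamp(),
                                          sync_replica_timeout);
    }

    /* Notify the administrator and stop waiting on the unresponsive standby. */
    static void
    AlertAdminAndDegrade(const char *standby_name)
    {
        /* Run the admin-defined shell command, archive_command-style. */
        if (sync_replica_alert_command[0] != '\0')
            (void) system(sync_replica_alert_command);

        ereport(LOG,
                (errmsg("synchronous standby \"%s\" is not responding; "
                        "sessions will no longer wait for it", standby_name)));

        /*
         * A real implementation would remove the standby from the set that
         * SyncRepWaitForLSN() waits on and wake up the backends currently
         * queued on it, until the standby reconnects and catches up.
         */
    }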