Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers

From SATYANARAYANA NARLAPURAM
Subject Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Msg-id CAHg+QDdd7BXB9HD9ddevk_D5TtweEBantcvJ5up5hznryZ33_w@mail.gmail.com
In response to Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
Reviving this thread.

On Sun, Jan 29, 2023 at 9:55 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
For proc die, it looks like the suggestion was to process it
immediately and upon next restart, don't allow user connections unless
all sync standbys were caught up. However, we need to be able to allow
replication connections from standbys so that they'll be able to
stream the needed WAL and catch up with primary, allow superuser or
users with pg_monitor role to connect to perform ALTER SYSTEM to
remove the unresponsive sync standbys if any from the list or disable
sync replication altogether or monitor for flush lsn/catch up status.
And block all other connections. Note that replication, superuser and
users with pg_monitor role connections are allowed only after the
server reaches a consistent state not before that to not read any
inconsistent data.

Allowing replication, superuser, and pg_monitor connections seems reasonable to me.
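
To make the gating rule above concrete, here is a minimal Python sketch of the admission decision as I read it. It is only a model for discussion, not PostgreSQL code: ConnRequest, allow_connection and their fields are hypothetical names, and the real check would live in the backend's connection startup path.

```python
from dataclasses import dataclass


@dataclass
class ConnRequest:
    """Hypothetical description of an incoming connection."""
    is_replication: bool = False   # walsender connection from a standby
    is_superuser: bool = False
    has_pg_monitor: bool = False   # member of the pg_monitor role


def allow_connection(req, reached_consistency, sync_standbys_caught_up):
    """Decide whether to admit a connection after a restart.

    Assumed rule, taken from the proposal quoted above:
    - nobody is admitted before the server reaches a consistent state;
    - once the sync standbys have caught up, everyone is admitted;
    - in between, only replication, superuser and pg_monitor connections
      are admitted, so standbys can stream WAL and an admin can intervene.
    """
    if not reached_consistency:
        return False
    if sync_standbys_caught_up:
        return True
    return req.is_replication or req.is_superuser or req.has_pg_monitor
```

For example, allow_connection(ConnRequest(is_replication=True), True, False) admits a standby while ordinary users remain blocked.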

The trickiest part of doing the above is how we detect upon restart
that the server received proc die while waiting for sync replication
ACK. One idea might be to set a flag in the control file before the
crash. Second idea might be to write a marker file (although I don't
favor this idea); presence indicates that the server was waiting for
sync replication ACK before the crash. However, we may not detect all
sorts of crashes in a backend when it is waiting for sync replication
ACK to do any of these two ideas. Therefore, this may not be a
complete solution.

You cannot control how the crash happens; it could be a simple power failure, in which case neither the control file flag nor the marker file may have reached disk.
Additionally, this write would be in the critical transaction commit path.

Third idea might be to just let the primary wait for sync standbys to
catch up upon restart irrespective of whether it was crashed or not
while waiting for sync replication ACK. While this idea works well
without having to detect all sorts of crashes, the primary may not
come up if any unresponsive standbys are present (currently, the
primary continues to be operational for read-only queries at least
irrespective of whether sync standbys have caught up or not).

I prefer this approach because, depending on the quorum policy defined in synchronous_standby_names, the primary will open connections for reads and writes once enough sync standbys have caught up.
If there is no progress from the sync standbys, the Postgres admin has to step in regardless.
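
As a rough illustration of "depending on the quorum policy", the following Python sketch models the check the primary would make after restart. It assumes the synchronous_standby_names setting has already been parsed into a method (FIRST or ANY), num_sync and an ordered name list, and that the flush LSN of each connected standby is available as text. The function names and the simplified priority handling are assumptions for illustration, not the server's implementation.

```python
def parse_lsn(text):
    """Convert an LSN such as '0/3000060' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def sync_standbys_caught_up(method, num_sync, names, flush_lsns, target_lsn):
    """Return True if the sync policy is satisfied up to target_lsn.

    method     -- 'FIRST' (priority based) or 'ANY' (quorum based)
    num_sync   -- number of standbys that must confirm
    names      -- standby names in the order they are listed
    flush_lsns -- {standby name: flush LSN text} for connected standbys
    target_lsn -- the primary's last locally flushed LSN at restart
    """
    target = parse_lsn(target_lsn)
    # Only standbys that are both listed and currently connected count.
    connected = [n for n in names if n in flush_lsns]

    if method == "FIRST":
        # The first num_sync connected standbys in list order are the
        # synchronous ones; every one of them must have flushed the target.
        chosen = connected[:num_sync]
        if len(chosen) < num_sync:
            return False
        return all(parse_lsn(flush_lsns[n]) >= target for n in chosen)

    if method == "ANY":
        # Any num_sync listed standbys that flushed the target are enough.
        acked = sum(1 for n in connected
                    if parse_lsn(flush_lsns[n]) >= target)
        return acked >= num_sync

    raise ValueError("unknown method: " + method)
```

With names s1, s2, s3 where only s3 has flushed the target, ANY 1 lets the primary open, while FIRST 1 keeps waiting on s1. If nothing ever satisfies the policy, the function keeps returning False, which is the case where the admin has to intervene.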
Thanks,
Satya
