Small fixes needed by high-availability tools

From
Andrey Borodin
Date:
Hi hackers!

I want to revive attempts to fix some old edge cases of physical quorum replication.

Please find attached draft patches that demonstrate the ideas. These patches are not actually proposed code changes; I
would rather reach a design consensus first.

1. Allow checking standby sync before making data visible after crash recovery

Problem: a Postgres instance must not allow reading data that is not yet known to be replicated.
Immediately after a crash we do not know whether we are still the cluster primary. We can disallow new
connections until standby quorum is established. Of course, walsenders and superusers must be exempt from this
restriction.

The key change is the following:
@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
     if (PostAuthDelay > 0)
         pg_usleep(PostAuthDelay * 1000000L);

+    /* Check if we need to wait for startup synchronous replication */
+    if (!am_walsender &&
+        !superuser() &&
+        !StartupSyncRepEstablished())
+    {
+        ereport(FATAL,
+                (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+                 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+    }

We might also want some kind of cache indicating that quorum was already established (a sketch follows below). Also, the
place where the check is done might not be the most appropriate.
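For illustration, a minimal sketch of such a cache, assuming a hypothetical shared flag in WalSndCtl and a hypothetical
quorum-check helper; none of the names below are actual patch code:

/*
 * Hypothetical sketch only: cache the fact that startup quorum was
 * already reached so that later backends can skip the check.
 * startup_quorum_ok and SyncRepStartupQuorumSatisfied() do not exist;
 * they stand in for whatever the patch would introduce.
 */
static bool startup_quorum_cached = false;  /* backend-local cache */

bool
StartupSyncRepEstablished(void)
{
    /* Fast path: this backend has already seen quorum established. */
    if (startup_quorum_cached)
        return true;

    /* Shared flag, set once by the first backend that observes quorum. */
    if (pg_atomic_read_u32(&WalSndCtl->startup_quorum_ok) == 1)
    {
        startup_quorum_cached = true;
        return true;
    }

    /* Slow path: actually evaluate the standby quorum. */
    if (SyncRepStartupQuorumSatisfied())
    {
        pg_atomic_write_u32(&WalSndCtl->startup_quorum_ok, 1);
        startup_quorum_cached = true;
        return true;
    }

    return false;
}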

2. Do not allow canceling a locally written transaction

The problem was discussed many times [0,1,2,3] with some agreement on the approach taken. But there were concerns that
the solution is incomplete without the first patch in the current thread.

Problem: a user might try to cancel a locally committed transaction, and if we allow that, we will show non-replicated
data as committed. This leads to losing data with UPSERTs.

The key change is how we process cancels in SyncRepWaitForLSN().
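For reference, the idea discussed in the referenced threads is that a cancel must not release the waiter, because the
transaction has already committed locally. A rough sketch of the relevant part of the wait loop in SyncRepWaitForLSN(),
with the cancel path changed to keep waiting (illustration only, not the actual patch):

for (;;)
{
    /* ... existing latch/exit logic ... */

    if (QueryCancelPending)
    {
        /*
         * Do NOT cancel the wait: the transaction has already committed
         * locally and must not become visible before it is replicated.
         * Warn and keep waiting instead of breaking out of the loop.
         */
        QueryCancelPending = false;
        ereport(WARNING,
                (errmsg("canceling wait for synchronous replication is not allowed"),
                 errdetail("The transaction has committed locally, but is not yet replicated to the standby.")));
    }

    /* ... */
}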

3. Allow reading the LSN written by the walreceiver, but not yet flushed

Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flushed LSN, but
behind by written LSN. If we do a failover, we choose node2 instead of node1 and lose data recently committed with
synchronous_commit=remote_write.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns the flushed LSN, not the written
one. I propose to add a new function which returns the LSN actually written. The internals of this function are already
implemented (GetWalRcvWriteRecPtr()), but unused.
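A sketch of what the SQL-callable wrapper could look like; the function name pg_last_wal_receive_write_lsn() is a
placeholder of mine, only GetWalRcvWriteRecPtr() is existing code:

#include "postgres.h"

#include "fmgr.h"
#include "replication/walreceiver.h"
#include "utils/pg_lsn.h"

/*
 * Return the last WAL position written (but possibly not yet flushed)
 * by the walreceiver, or NULL if nothing has been received yet.
 */
Datum
pg_last_wal_receive_write_lsn(PG_FUNCTION_ARGS)
{
    XLogRecPtr  recptr = GetWalRcvWriteRecPtr();

    if (XLogRecPtrIsInvalid(recptr))
        PG_RETURN_NULL();

    PG_RETURN_LSN(recptr);
}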

Currently we just use a separate program, lwaldump [4], which simply reads WAL until the last valid record. In case of
failover, pg_consul uses LSNs from lwaldump. This approach works well, but is cumbersome.


There are other caveats of replication, but IMO these 3 problems are the most annoying in terms of data durability.

I'd greatly appreciate any thoughts on this.


Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
[1] https://www.postgresql.org/message-id/flat/CAEET0ZHG5oFF7iEcbY6TZadh1mosLmfz1HLm311P9VOt7Z+jeg@mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel@j-davis.com#415dc2f7d41b8a251b419256407bb64d
[3] https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
[4] https://github.com/g0djan/lwaldump



Re: Small fixes needed by high-availability tools

From
Matthias van de Meent
Date:
On Fri, 2 May 2025 at 15:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers!
>
> I want to revive attempts to fix some old edge cases of physical quorum replication.
>
> Please find attached draft patches that demonstrate the ideas. These patches are not actually proposed code changes; I
> would rather reach a design consensus first.
 
[...]
> 2. Do not allow canceling a locally written transaction
>
> The problem was discussed many times [0,1,2,3] with some agreement on the approach taken. But there were concerns that
> the solution is incomplete without the first patch in the current thread.
 

I'm trying to figure out where in the thread you find this "some
agreement". Could you reference the posts you're referring to?

> Problem: a user might try to cancel a locally committed transaction, and if we allow that, we will show non-replicated
> data as committed. This leads to losing data with UPSERTs.
 

Could you explain why specifically UPSERTs would lose data (vs any
other user workload) in cancellations during SyncRepWaitForLSN?

> The key change is how we process cancels in SyncRepWaitForLSN().

I personally think we should rather move to CSN-based snapshots on
both primary and replica (with LSN as CSN), and make visibility of
other transactions depend on how much persistence your session wants
(SQL spec permitting, of course).

I.e., if you have synchronous_commit=remote_apply, you wait with
sending the commit success message until you have confirmation that
your commit LSN has been applied on the configured number of replicas,
and snapshots are taken based on the latest LSN that is known to be
applied everywhere, but if you have synchronous_commit=off, you could
read the commits (even those committed in s_c=remote_apply sessions)
immediately after they've been included in the logs (potentially with
some added slack to account for system state updates).
Similarly, all snapshots you create in a backend with
synchronous_commit=remote_apply would use the highest LSN which is
remotely applied according to the applicable rules, while
synchronous_commit=off implies "all transactions which have been
logged as committed".
Changing synchronous_commit to a value that requires higher
persistence level would cause the backend to wait for its newest
snapshot LSN to reach that persistence level; IMO an acceptable
trade-off for switching s_c at runtime.
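If I understand the proposal correctly, each backend would pick its snapshot horizon roughly like this; all of the
Known*() helpers below are placeholders for the CSN machinery, nothing like them exists today:

/*
 * Hypothetical sketch: choose the LSN up to which commits are visible,
 * based on the session's synchronous_commit setting.  The Known*()
 * helpers are placeholders for the proposed CSN infrastructure.
 */
static XLogRecPtr
SnapshotHorizonLSN(int sync_commit)
{
    switch (sync_commit)
    {
        case SYNCHRONOUS_COMMIT_REMOTE_APPLY:
            /* only commits known to be applied on the required standbys */
            return KnownRemoteApplyLSN();

        case SYNCHRONOUS_COMMIT_OFF:
            /* everything logged as committed locally is visible */
            return GetXLogInsertRecPtr();

        default:
            /* intermediate levels get correspondingly weaker rules */
            return KnownRemoteFlushLSN();
    }
}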

This is based on the assumption that if you don't want your commit to
be very durable, you probably also don't care as much about the
durability of the data you can see, and if you want your commits to be
very durable, you probably want to see only very durable data.

This would also unify the commit visibility order between primary and
secondary nodes, and would allow users to have session-level 'wait for
LSN x to be persistent' with much reduced lock times.

(CC-ed to Ants, given his interest in this topic)

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)