Small fixes needed by high-availability tools - Mailing list pgsql-hackers

From Andrey Borodin
Subject Small fixes needed by high-availability tools
Date
Msg-id 0B44E464-BA62-4056-9465-3320DD2D0302@yandex-team.ru
Whole thread Raw
List pgsql-hackers
Hi hackers!

I want to revive attempts to fix some old edge cases of physical quorum replication.

Please find attached draft patches that demonstrate ideas. These patches are not actually proposed code changes, I
ratherwant to have a design consensus first. 

1. Allow checking standby sync before making data visible after crash recovery

Problem: Postgres instance must not allow to read data, if it is not yet known to be replicated.
Instantly after the crash we do not know if we are still cluster primary. We can disallow new
connections until standby quorum is established. Of course, walsenders and superusers must be exempt from this
restriction.

Key change is following:
@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
     if (PostAuthDelay > 0)
         pg_usleep(PostAuthDelay * 1000000L);

+    /* Check if we need to wait for startup synchronous replication */
+    if (!am_walsender &&
+        !superuser() &&
+        !StartupSyncRepEstablished())
+    {
+        ereport(FATAL,
+                (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+                 errmsg("cannot connect until synchronous replication is established with standbys according to
startup_synchronous_standby_level")));
+    }

We might also want to have some kind of cache that quorum was already established. Also the place where the check is
donemight be not most appropriate. 

2. Do not allow to cancel locally written transaction

The problem was discussed many times [0,1,2,3] with some agreement on taken approach. But there was concerns that the
solutionis incomplete without first patch in the current thread. 

Problem: user might try to cancel locally committed transaction and if we do so we will show non-replicated data as
committed.This leads to loosing data with UPSERTs. 

The key change is how we process cancels in SyncRepWaitForLSN().

3. Allow reading LSN written by walreciever, but not flushed yet

Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but
beforeby written LSN. If we do a failover we choose node2 instead of node1 and loose data recently committed with
synchronous_commit=remote_write.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose
toadd a new function which returns LSN actually written. Internals of this function are already implemented
(GetWalRcvWriteRecPtr()),but unused. 

Currently we just use a separate program lwaldump [4] which just reads WAL until last valid record. In case of failover
pg_consuluses LSNs from lwaldump. This approach works well, but is cumbersome. 


There are other caveats of replication, but IMO these 3 problems are most annoying in terms of data durability.

I'd greatly appreciate any thoughts on this.


Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
[1] https://www.postgresql.org/message-id/flat/CAEET0ZHG5oFF7iEcbY6TZadh1mosLmfz1HLm311P9VOt7Z+jeg@mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel@j-davis.com#415dc2f7d41b8a251b419256407bb64d
[3] https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
[4] https://github.com/g0djan/lwaldump


Attachment

pgsql-hackers by date:

Previous
From: Álvaro Herrera
Date:
Subject: Re: queryId constant squashing does not support prepared statements
Next
From: Peter Eisentraut
Date:
Subject: Re: extension_control_path and "directory"