Re: Does having pg_last_wal_replay_lsn[replica] >= pg_current_wal_insert_lsn[master] guarantee that the replica is caught up? - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Does having pg_last_wal_replay_lsn[replica] >= pg_current_wal_insert_lsn[master] guarantee that the replica is caught up?
Date
Msg-id 20220803.163217.1789690807623885906.horikyota.ntt@gmail.com
Whole thread Raw
In response to Does having pg_last_wal_replay_lsn[replica] >= pg_current_wal_insert_lsn[master] guarantee that the replica is caught up?  (Dmitry Koterov <dmitry.koterov@gmail.com>)
Responses Re: Does having pg_last_wal_replay_lsn[replica] >= pg_current_wal_insert_lsn[master] guarantee that the replica is caught up?
List pgsql-hackers
I'm not sure this fits -hackers..

At Tue, 2 Aug 2022 18:57:41 -0700, Dmitry Koterov <dmitry.koterov@gmail.com> wrote in 
> Hi.
> 
> Suppose on master, I run a *multi-query* using PQexec and save the value
> returned by pg_current_wal_insert_lsn:
> 
> master_lsn = query(master, "INSERT INTO some VALUES (...); SELECT
> pg_current_wal_insert_lsn()")
> 
> Then I run a PQexec query on a replica and save the value returned by
> pg_last_wal_replay_lsn:
> 
> replica_lsn = query(replica, "SELECT pg_last_wal_replay_lsn()")
> 
> The question to experts in PG internals: *is it guaranteed that, as long as
> replica_lsn >= master_lsn (GREATER OR EQUAL, not just greater), then a
> subsequent read from replica will always return me the inserted record*
> (i.e. the replica is up to date), considering noone updates/deletes in that
> table?

https://www.postgresql.org/docs/devel/libpq-exec.html

> The command string can include multiple SQL commands (separated by
> semicolons). Multiple queries sent in a single PQexec call are
> processed in a single transaction, unless there are explicit
> BEGIN/COMMIT commands included in the query string to divide it into
> multiple transactions.

If the query() runs PQexec() with the same string, the call to
pg_current_wal_insert_lsn() is made before the insert is commited.
That behavior can be emulated on psql. (The backslash before semicolon
is crucial. It lets the connected queries be sent in a single
PQexec())

=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn 
---------------------------
 0/68E5038
(1 row)
=# insert into t values(0)\; select pg_current_wal_lsn();
INSERT 0 1
 pg_current_wal_lsn 
--------------------
 0/68E5038
(1 row)
=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn 
---------------------------
 0/68E50A0
(1 row)

$ pg_waldump -s'0/68E5038' -e'0/68E50A0'  $PGDATA/pg_wal/000000010000000000000006
rmgr: Heap        len (rec/tot):     59/    59, tx:        770, lsn: 0/068E5038, prev 0/068E5000, desc: INSERT off 15
flags0x00, blkref #0: rel 1663/5/16424 blk 0
 
rmgr: Transaction len (rec/tot):     34/    34, tx:        770, lsn: 0/068E5078, prev 0/068E5038, desc: COMMIT
2022-08-0315:49:43.749158 JST
 

So, the replica cannot show the inserted data at the LSN the function
returned.  If you explicitly ended transaction before
pg_current_wal_insert_lsn() call, the expected LSN would be returned.

=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn 
---------------------------
 0/68E75C8
(1 row)
=# begin\;insert into t values(0)\;commit\; select pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 0/68E7958
$ pg_waldump -s'0/68E75C8' -e'0/68E7958'  $PGDATA/pg_wal/000000010000000000000006
 prev 0/068E7590, desc: INSERT off 22 flags 0x00, blkref #0: rel 1663/5/16424 blk 0 FPW
rmgr: Transaction len (rec/tot):     34/    34, tx:        777, lsn: 0/068E7930, prev 0/068E75C8, desc: COMMIT
2022-08-0316:09:13.516498 JST
 


> I'm asking, because according to some hints in the docs, this should be
> true. But for some reason, we have to use "greater" (not "greater or
> equals") condition in the real code, since with just ">=" the replica
> doesn't sometimes read the written data.

Thus the wrong LSN appears to have caused the behavior.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: Race between KeepFileRestoredFromArchive() and restartpoint
Next
From: Masahiko Sawada
Date:
Subject: Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns