Thread: Does having pg_last_wal_replay_lsn[replica] >= pg_current_wal_insert_lsn[master] guarantee that the replica is caught up?

Hi.

Suppose on master, I run a multi-query using PQexec and save the value returned by pg_current_wal_insert_lsn:

master_lsn = query(master, "INSERT INTO some VALUES (...); SELECT pg_current_wal_insert_lsn()")

Then I run a PQexec query on a replica and save the value returned by pg_last_wal_replay_lsn:

replica_lsn = query(replica, "SELECT pg_last_wal_replay_lsn()")

The question to experts in PG internals: is it guaranteed that, as long as replica_lsn >= master_lsn (GREATER OR EQUAL, not just greater), then a subsequent read from replica will always return me the inserted record (i.e. the replica is up to date), considering noone updates/deletes in that table?

I'm asking, because according to some hints in the docs, this should be true. But for some reason, we have to use "greater" (not "greater or equals") condition in the real code, since with just ">=" the replica doesn't sometimes read the written data.

I'm not sure this fits -hackers..

At Tue, 2 Aug 2022 18:57:41 -0700, Dmitry Koterov <dmitry.koterov@gmail.com> wrote in 
> Hi.
> 
> Suppose on master, I run a *multi-query* using PQexec and save the value
> returned by pg_current_wal_insert_lsn:
> 
> master_lsn = query(master, "INSERT INTO some VALUES (...); SELECT
> pg_current_wal_insert_lsn()")
> 
> Then I run a PQexec query on a replica and save the value returned by
> pg_last_wal_replay_lsn:
> 
> replica_lsn = query(replica, "SELECT pg_last_wal_replay_lsn()")
> 
> The question to experts in PG internals: *is it guaranteed that, as long as
> replica_lsn >= master_lsn (GREATER OR EQUAL, not just greater), then a
> subsequent read from replica will always return me the inserted record*
> (i.e. the replica is up to date), considering noone updates/deletes in that
> table?

https://www.postgresql.org/docs/devel/libpq-exec.html

> The command string can include multiple SQL commands (separated by
> semicolons). Multiple queries sent in a single PQexec call are
> processed in a single transaction, unless there are explicit
> BEGIN/COMMIT commands included in the query string to divide it into
> multiple transactions.

If the query() runs PQexec() with the same string, the call to
pg_current_wal_insert_lsn() is made before the insert is commited.
That behavior can be emulated on psql. (The backslash before semicolon
is crucial. It lets the connected queries be sent in a single
PQexec())

=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn 
---------------------------
 0/68E5038
(1 row)
=# insert into t values(0)\; select pg_current_wal_lsn();
INSERT 0 1
 pg_current_wal_lsn 
--------------------
 0/68E5038
(1 row)
=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn 
---------------------------
 0/68E50A0
(1 row)

$ pg_waldump -s'0/68E5038' -e'0/68E50A0'  $PGDATA/pg_wal/000000010000000000000006
rmgr: Heap        len (rec/tot):     59/    59, tx:        770, lsn: 0/068E5038, prev 0/068E5000, desc: INSERT off 15
flags0x00, blkref #0: rel 1663/5/16424 blk 0
 
rmgr: Transaction len (rec/tot):     34/    34, tx:        770, lsn: 0/068E5078, prev 0/068E5038, desc: COMMIT
2022-08-0315:49:43.749158 JST
 

So, the replica cannot show the inserted data at the LSN the function
returned.  If you explicitly ended transaction before
pg_current_wal_insert_lsn() call, the expected LSN would be returned.

=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn 
---------------------------
 0/68E75C8
(1 row)
=# begin\;insert into t values(0)\;commit\; select pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 0/68E7958
$ pg_waldump -s'0/68E75C8' -e'0/68E7958'  $PGDATA/pg_wal/000000010000000000000006
 prev 0/068E7590, desc: INSERT off 22 flags 0x00, blkref #0: rel 1663/5/16424 blk 0 FPW
rmgr: Transaction len (rec/tot):     34/    34, tx:        777, lsn: 0/068E7930, prev 0/068E75C8, desc: COMMIT
2022-08-0316:09:13.516498 JST
 


> I'm asking, because according to some hints in the docs, this should be
> true. But for some reason, we have to use "greater" (not "greater or
> equals") condition in the real code, since with just ">=" the replica
> doesn't sometimes read the written data.

Thus the wrong LSN appears to have caused the behavior.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Thank you for the detailed explanation!

I doubt many people from -general would actually be able to provide such info since the spirit of that list is to find work-arounds for problems and questions at user level rather than dig into the details on how something actually works.

It's worth adding to the documentation, with that exact example BTW:
(I can try submitting a docs PR if you think it's a good idea).

Also, when I said that we use PQexec, I did it just for an illustration: in practice we use the node-postgres JS library which sends multi-statement protocol messages. So - transaction wise - it works the same way as PQexec with multiple queries, but it returns responses for ALL queries in the batch, not just for the last one (very convenient BTW, saves on network round-trip latency). This mode is fully supported by PG wire protocol: https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-MULTI-STATEMENT


On Wed, Aug 3, 2022 at 12:32 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

<snip> 
 
=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/68E5038
(1 row)
=# insert into t values(0)\; select pg_current_wal_lsn();
INSERT 0 1
 pg_current_wal_lsn
--------------------
 0/68E5038
(1 row)
=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/68E50A0
(1 row) 
 
<snip>

=# select pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/68E75C8
(1 row)
=# begin\;insert into t values(0)\;commit\; select pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/68E7958