Re: Slow catchup of 2PC (twophase) transactions on replica in LR - Mailing list pgsql-hackers

From Ajin Cherian
Subject Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Date
Msg-id CAFPTHDaf9sc3VZWZKr3-xf2jv+gF6q3ywipofc-+5GHyCJRSCQ@mail.gmail.com
In response to Slow catchup of 2PC (twophase) transactions on replica in LR  (Давыдов Виталий <v.davydov@postgrespro.ru>)
Responses Re: Slow catchup of 2PC (twophase) transactions on replica in LR
List pgsql-hackers


On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий <v.davydov@postgrespro.ru> wrote:

Dear All,

I'd like to present and discuss a problem where 2PC transactions are applied quite slowly on a replica during logical replication. There is a master and a replica, with logical replication established from the master to the replica using two_phase = true. Under some load on the master, the replica starts to lag behind the master, and the lag keeps increasing. We have to significantly decrease the load on the master to let the replica catch up. Such a problem may create significant difficulties in production. The problem appears at least on the REL_16_STABLE branch.

To reproduce the problem:

  • Set up logical replication from the master to the replica with the subscription parameter two_phase = true.
  • Create a moderate load on the master (use pgbench with a custom SQL script that runs PREPARE TRANSACTION followed by COMMIT PREPARED).
  • Optionally switch off the replica for some time (keeping the load on the master).
  • Switch on the replica and wait until it catches up with the master.

The replica never catches up with the master, even under a low load on the master. If the load is removed, the replica does catch up, but it takes much longer than expected. I tried the same with regular transactions, and the problem does not appear even under a decent load.
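A rough steady-state model illustrates the symptom described above. It is only an illustration that assumes constant rates, not something measured in this thread, and `catchup_seconds` is a hypothetical helper. If the publisher generates WAL at rate g and the subscriber applies it at rate a, an existing backlog shrinks at a - g per second, so the lag can only close when a > g:

```python
import math


def catchup_seconds(backlog_bytes: float, gen_rate: float, apply_rate: float) -> float:
    """Seconds needed to drain a replication backlog, assuming constant rates.

    Hypothetical illustration only:
      gen_rate   -- bytes/s of new WAL produced on the publisher (the load)
      apply_rate -- bytes/s the subscriber is able to apply
    The backlog shrinks at (apply_rate - gen_rate); if that is not positive,
    the lag never closes and math.inf is returned.
    """
    if backlog_bytes <= 0:
        return 0.0  # already caught up
    net_rate = apply_rate - gen_rate
    if net_rate <= 0:
        return math.inf  # apply cannot outpace the load: lag grows or stays flat
    return backlog_bytes / net_rate
```

For example, a 10^8-byte backlog with 10^6 bytes/s of load and 2*10^6 bytes/s of apply throughput drains in 100 s, and in 50 s once the load is removed; if apply throughput does not exceed the load, the lag never closes.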


I tried this setup, and I do see that the logical subscriber catches up with the master in a short time. I'm not sure what I'm missing. I stopped the logical subscriber partway through while pgbench was running, then started it again and ran the following on the publisher:
postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
 sent_lsn  | pg_current_wal_lsn
-----------+--------------------
 0/6793FA0 | 0/6793FA0 <=== caught up
(1 row)
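The check above compares two pg_lsn values. A pg_lsn is written as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit WAL byte position), so the lag in bytes is a plain subtraction; on the server, pg_wal_lsn_diff() performs the same computation. The sketch below mirrors that arithmetic with hypothetical helper names. Note that sent_lsn only shows what the walsender has sent, not what the subscriber has applied.

```python
def parse_lsn(text: str) -> int:
    """Convert a pg_lsn string such as '0/6793FA0' into a 64-bit byte position.

    The text form is 'HIGH/LOW', both hexadecimal: the upper and lower
    32 bits of the WAL position.
    """
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(sent_lsn: str, current_lsn: str) -> int:
    """WAL bytes generated on the publisher but not yet sent (current - sent)."""
    return parse_lsn(current_lsn) - parse_lsn(sent_lsn)


def caught_up(sent_lsn: str, current_lsn: str) -> bool:
    """True when the walsender has sent everything up to the current WAL position."""
    return lag_bytes(sent_lsn, current_lsn) <= 0
```

With the values from the output above, lag_bytes("0/6793FA0", "0/6793FA0") is 0. The server-side equivalent is SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) FROM pg_stat_replication;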

My pgbench command:
pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5

my custom sql file:
cat test.sql
SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;

regards,
Ajin Cherian
Fujitsu Australia
