Re: Slow catchup of 2PC (twophase) transactions on replica in LR - Mailing list pgsql-hackers

From Давыдов Виталий
Subject Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Msg-id 99204-65d8c800-1-735b8480@191504697
In response to Re: Slow catchup of 2PC (twophase) transactions on replica in LR  (Ajin Cherian <itsajin@gmail.com>)
List pgsql-hackers
Hi Ajin,

Thank you for your feedback. Could you please try increasing the number of clients (the pgbench -c option) to 20 or more? It seems I forgot to mention that.

With best regards,
Vitaly Davydov
 
On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий <v.davydov@postgrespro.ru> wrote:

Dear All,

I'd like to describe a problem where 2PC transactions are applied quite slowly on a replica during logical replication. There is a master and a replica, with logical replication set up from the master to the replica with two_phase = true. Above a certain load level on the master, the replica starts to lag behind it, and the lag keeps growing. We have to significantly reduce the load on the master to let the replica catch up. Such a problem may cause significant difficulties in production. The problem appears at least on the REL_16_STABLE branch.

To reproduce the problem:

  • Set up logical replication from the master to the replica with the subscription parameter two_phase = true.
  • Create a moderate load on the master (pgbench with a custom SQL script that does prepare + commit).
  • Optionally, switch off the replica for some time (keeping the load on the master).
  • Switch the replica back on and wait until it catches up with the master.
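A minimal sketch of the setup in the first step, assuming a table named test with an integer column v (to match the custom script quoted later in the thread), plus a hypothetical surrogate key id, publication name pub, subscription name sub, and a placeholder connection string. None of these names or values come from the original report, whose schema is not shown. The primary key is there because the workload runs DELETE, and a published table needs a replica identity for deletes to be allowed on the publisher.

```sql
-- On the master (publisher). wal_level = logical is required for logical
-- replication; max_prepared_transactions > 0 lets the workload run
-- PREPARE TRANSACTION. Both settings take effect only after a restart.
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_prepared_transactions = 100;   -- illustrative value

-- After restarting the master:
CREATE TABLE test (id bigserial PRIMARY KEY, v int);  -- hypothetical schema
CREATE PUBLICATION pub FOR TABLE test;                -- hypothetical name

-- On the replica (subscriber). Applying replicated prepared transactions
-- also needs max_prepared_transactions > 0 here; restart after changing it.
ALTER SYSTEM SET max_prepared_transactions = 100;   -- illustrative value

-- After restarting the replica:
CREATE TABLE test (id bigserial PRIMARY KEY, v int);  -- same schema as master
CREATE SUBSCRIPTION sub                               -- hypothetical name
    CONNECTION 'host=master_host port=5432 dbname=postgres'  -- placeholder
    PUBLICATION pub
    WITH (two_phase = true);
```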

The replica never catches up with the master, even with a fairly low load on the master. If the load is removed, the replica does catch up, but it takes much longer than expected. I tried the same with regular transactions, and the problem doesn't appear even under a decent load.
 

 
I tried this setup, and I do see the logical subscriber catch up with the master in a short time. I'm not sure what I'm missing. I stopped the logical subscriber partway through while pgbench was running, then started it again and ran the following:
postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
 sent_lsn  | pg_current_wal_lsn
-----------+--------------------
 0/6793FA0 | 0/6793FA0 <=== caught up
(1 row)
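sent_lsn shows how far the walsender has sent WAL; on its own it does not show how much the subscriber has actually applied. A hedged sketch of a complementary check on the master, using the built-in pg_wal_lsn_diff() function to express each gap in bytes: replay_lsn is the position the subscriber has reported back, and confirmed_flush_lsn is the position up to which the logical slot's consumer has confirmed receiving data. The column aliases are illustrative.

```sql
-- Per walsender: bytes between the current WAL position and what has been
-- sent to, and reported back by, the subscriber.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Per logical slot: bytes not yet confirmed by the subscriber.
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS unconfirmed_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

Sampling these every few seconds while pgbench runs shows whether the gap stays bounded or keeps growing, which is the behaviour described in the original report.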
 
My pgbench command:
pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5
 
My custom SQL file:
cat test.sql
SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;
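The original report says the slowdown does not appear with regular transactions. For that comparison, a hedged sketch of the same pgbench script committed in one phase; this is an assumption about what a regular-transaction run would look like, since the thread does not include that script.

```sql
-- Same workload as test.sql, but with an ordinary one-phase commit:
-- no GID, no PREPARE TRANSACTION / COMMIT PREPARED.
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
COMMIT;
```

Running both scripts with the same client count (for example the 20 or more clients suggested above) keeps the comparison limited to the commit protocol.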
 
regards,
Ajin Cherian
Fujitsu Australia