Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication - Mailing list pgsql-hackers

From Melih Mutlu
Subject Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date
Msg-id CAGPVpCQdZ_oj-QFcTOhTrUTs-NCKrrZ=ZNCNPR1qe27rXV-iYw@mail.gmail.com
In response to Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Melih Mutlu <m.melihmutlu@gmail.com>)
Responses Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi,

Attached are new versions of the patch with some changes/fixes.

Here are also some numbers comparing the performance of logical replication with this patch against the current master branch.

My benchmarking method is the same as what I did earlier in this thread (but on a different environment, so do not compare the results in this email with the ones from earlier emails). The results below compare this patch with the latest master branch. "max_sync_workers_per_subscription" is left at its default of 2. Each number is the average of timings from 5 consecutive runs on each branch.
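
For reference, a minimal sketch of this kind of setup is below. The table/publication/subscription names and the connection string are only placeholders, and the exact timing method may differ slightly:

-- On the publisher: create the test tables and publish them.
DO $$
BEGIN
    FOR i IN 1..100 LOOP
        EXECUTE format('CREATE TABLE test_tab_%s (a int)', i);
    END LOOP;
END $$;
CREATE PUBLICATION test_pub FOR ALL TABLES;

-- On the subscriber: create the same tables, then subscribe.
DO $$
BEGIN
    FOR i IN 1..100 LOOP
        EXECUTE format('CREATE TABLE test_tab_%s (a int)', i);
    END LOOP;
END $$;
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=... dbname=...' PUBLICATION test_pub;

-- Poll until the initial sync of every table is done ('r' = ready);
-- the elapsed time until this returns 0 is roughly what the
-- numbers below measure.
SELECT count(*) FROM pg_subscription_rel WHERE srsubstate <> 'r';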

Since this patch is expected to improve logical replication of empty/close-to-empty tables, I started by measuring performance with empty tables.

          |  10 tables   |  100 tables    |  1000 tables
----------------------------------------------------------
 master   |  283.430 ms  |  22739.107 ms  |  105226.177 ms
----------------------------------------------------------
 patch    |  189.139 ms  |  1554.802 ms   |  23091.434 ms

After the changes discussed in [1], concurrent replication origin drops by the apply worker and tablesync workers may block each other while waiting on the locks taken by replorigin_drop_by_name.
I see that this hurts the performance of logical replication quite a bit.
[1] https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com
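
These waits are easy to spot on the subscriber while the initial sync is running. A query along these lines (just a sketch, using standard pg_stat_activity columns) shows the workers blocked on a lock:

-- Run on the subscriber during the initial table sync.
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'logical replication worker'
  AND wait_event_type = 'Lock';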
 
Firstly, as I mentioned, replication origin drops made things worse for the master branch.
The locks become a more serious issue as the number of tables increases.
The patch reuses origins, so it does not need to drop them in each iteration. That's why the difference between master and the patch is more significant now than it was when I first sent the patch.
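
One way to see this on the subscriber is to watch the replication origins while the sync is in progress (a sketch; the exact pg_... origin naming is an internal detail): on master, per-table origins keep getting created and dropped, while with the patch they stick around and get reused.

-- Run repeatedly on the subscriber during the initial sync.
SELECT roname FROM pg_replication_origin ORDER BY roname;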

To show that the improvement does not come only from reusing origins, but also from reusing replication slots and workers, I reverted the commits that cause the origin drop issue and measured again.

          |  10 tables   |  100 tables   |  1000 tables
--------------------------------------------------------
 reverted |  270.012 ms  |  2483.907 ms  |  31660.758 ms
--------------------------------------------------------
 patch    |  189.139 ms  |  1554.802 ms  |  23091.434 ms

With this patch, logical replication is still faster, even if we did not have the issue with replication origin drops.

Also, here are some numbers with 10 tables loaded with some data:

         |  10 MB         |  100 MB
--------------------------------------------
 master  |  2868.524 ms   |  14281.711 ms
--------------------------------------------
 patch   |  1750.226 ms   |  14592.800 ms

As expected, the difference between master and the patch narrows as the size of the tables increases.
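
For completeness, a sketch of the kind of data loading I mean (the row count and table name are only placeholders, reusing the names from the setup sketch above):

-- Populate a table before creating the subscription, then check
-- the on-disk size it actually reached.
INSERT INTO test_tab_1 SELECT i FROM generate_series(1, 100000) AS i;
SELECT pg_size_pretty(pg_total_relation_size('test_tab_1'));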


I would appreciate any feedback/thoughts on the approach/patch/numbers, etc.

Thanks,
--
Melih Mutlu
Microsoft