Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding - Mailing list pgsql-hackers
| From | Ajin Cherian |
|---|---|
| Subject | Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding |
| Date | |
| Msg-id | CAFPTHDYvG6Bhbg_HKnJu+uX+vb24kTxE6OQTjH6TDJAZ2-LWJQ@mail.gmail.com |
| In response to | Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding (Amit Kapila <amit.kapila16@gmail.com>) |
| Responses | Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding; RE: Proposal: Filter irrelevant change before reassemble transactions during logical decoding |
| List | pgsql-hackers |
On Fri, Feb 14, 2025 at 6:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 12, 2025 at 10:41 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Jan 29, 2025 at 9:31 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Hi Ajin,
> > >
> > > Some review comments for patch v12-0001.
> > >
> > > ======
> > > Commit message
> > >
> > > 1.
> > > Track transactions which have snapshot changes with a new flag
> > > RBTXN_HAS_SNAPSHOT_CHANGES
> > >
> > > ~
> > >
> > > The commit message only says *what* it does, but not *why* this patch
> > > even exists. TBH, I don't understand why this patch needs to be
> > > separated from your patch 0002, because 0001 makes no independent use
> > > of the flag, nor is it separately tested.
> > >
> > > Anyway, if it is going to remain separated then IMO at least the
> > > message should explain the intended purpose, e.g. why the subsequent
> > > patches require this flagged info and how they will use it.
> > >
> >
> > Fixed.
> >
>
> I still can't get from 0001's commit message the reason for tracking
> the snapshot changes separately. Also, please find my comments for
> 0002's commit message.
>
> > When most changes in a transaction are unfilterable, the overhead of
> > starting a transaction for each record is significant.
>
> Can you tell what the exact overhead is by testing it? IIRC, that was
> the initial approach. It is better if you can mention in the commit
> message what the overhead was. It will be easier for reviewers.
>
> > To reduce this overhead a hash cache of relation file locators is
> > created. Even then a hash search for every record before recording has
> > considerable overhead, especially for use cases where most tables in an
> > instance are published.
>
> Again, can you share the link to the performance data for this overhead?
> If you have not published it, then please share it and also mention it
> in the commit message.
I compared this with patch 1, which does not employ a hash cache and has
the overhead of starting a transaction every time the filter is checked.
I created a test setup of 10 million inserts in 3 different scenarios:
1. All inserts on unpublished tables
2. Half of the inserts on unpublished tables and half on published tables
3. All inserts on published tables

The percentage improvement of the new optimized patch over the old patch is:
No transactions in publication: 85.39% improvement
Half transactions in publication: 72.70% improvement
All transactions in publication: 48.47% improvement

Attaching a graph to show the difference.

> > To further reduce this overhead a simple approach is used to suspend
> > filtering for a certain number of changes (100) when an unfilterable
> > change is encountered. In other words, continue filtering changes if
> > the last record was filtered out. If an unfilterable change is found,
> > skip filtering the next 100 changes.
>
> Can we try different thresholds for this like 10, 50, 100, 200, etc.
> to decide what is a good threshold value to skip filtering changes?

Of course this performance might vary from setup to setup, but I used the
above setup to compare the 4 different threshold levels. Conclusions:

- Lower throttling values yield better performance, particularly when
  transactions are included in the publication.
- Increasing the throttle limit to 200 changes causes significant
  performance degradation, particularly when half or all transactions are
  included.
- For optimal performance, a moderate throttling value (100 changes) may
  be the best balance between performance and processing efficiency.

Attaching a graph to show the difference. I'm also attaching the test
script that I used.

regards,
Ajin Cherian
Fujitsu Australia
Attachment