Re: Logical replication timeout problem - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: Logical replication timeout problem |
Date | |
Msg-id | CAExHW5uALmiA4ZHSAdVwjsqiUCRzy+BjKr4RS_MpDomHuvJcLA@mail.gmail.com Whole thread Raw |
In response to | RE: Logical replication timeout problem ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>) |
Responses |
Re: Logical replication timeout problem
RE: Logical replication timeout problem |
List | pgsql-hackers |
Hi Wang, Thanks for working on this. One of our customer faced a similar situation when running BDR with PostgreSQL. I tested your patch and it solves the problem. Please find some review comments below On Tue, Nov 8, 2022 at 8:34 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > > Attach the patch. > +/* + * Helper function for ReorderBufferProcessTXN for updating progress. + */ +static inline void +ReorderBufferUpdateProgress(ReorderBuffer *rb, ReorderBufferTXN *txn, + ReorderBufferChange *change) +{ + LogicalDecodingContext *ctx = rb->private_data; + static int changes_count = 0; It's not easy to know that a variable is static when reading the code which uses it. So it's easy to interpret code wrong. I would probably track it through logical decoding context itself OR through a global variable like other places where we track the last timestamps. But there's more below on this. + + if (!ctx->update_progress) + return; + + Assert(!ctx->fast_forward); + + /* set output state */ + ctx->accept_writes = false; + ctx->write_xid = txn->xid; + ctx->write_location = change->lsn; + ctx->end_xact = false; This patch reverts many of the changes of the previous commit which tried to fix this issue i.e. 55558df2374. end_xact was introduced by the same commit but without much explanation of that in the commit message. Its only user, WalSndUpdateProgress(), is probably making a wrong assumption as well. * We don't have a mechanism to get the ack for any LSN other than end * xact LSN from the downstream. So, we track lag only for end of * transaction LSN. IIUC, WAL sender tracks the LSN of the last WAL record read in sentPtr which is sent downstream through a keep alive message. Downstream may acknowledge this LSN. So we do get ack for any LSN, not just commit LSN. So I propose removing end_xact as well. + + /* + * We don't want to try sending a keepalive message after processing each + * change as that can have overhead. Tests revealed that there is no + * noticeable overhead in doing it after continuously processing 100 or so + * changes. + */ +#define CHANGES_THRESHOLD 100 I think a time based threashold makes more sense. What if the timeout was nearing and those 100 changes just took little more time causing a timeout? We already have a time based threashold in WalSndKeepaliveIfNecessary(). And that function is invoked after reading every WAL record in WalSndLoop(). So it does not look like it's an expensive function. If it is expensive we might want to worry about WalSndLoop as well. Does it make more sense to remove this threashold? + + /* + * After continuously processing CHANGES_THRESHOLD changes, we + * try to send a keepalive message if required. + */ + if (++changes_count >= CHANGES_THRESHOLD) + { + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false); + changes_count = 0; + } +} + On the other thread, I mentioned that we don't have a TAP test for it. I agree with Amit's opinion there that it's hard to create a test which will timeout everywhere. I think what we need is a way to control the time required for decoding a transaction. A rough idea is to induce a small sleep after decoding every change. The amount of sleep * number of changes will help us estimate and control the amount of time taken to decode a transaction. Then we create a transaction which will take longer than the timeout threashold to decode. But that's a significant code. I don't think PostgreSQL has a facility to induce a delay at a particular place in the code. -- Best Wishes, Ashutosh Bapat
pgsql-hackers by date: