Re: Logical replication timeout problem - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Logical replication timeout problem |
Date | |
Msg-id | CAD21AoDe0Cr1T4V6ReUP_Nua3MG++xO4wdiY4+TUDUoLGibA5Q@mail.gmail.com Whole thread Raw |
In response to | RE: Logical replication timeout problem ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>) |
Responses |
RE: Logical replication timeout problem
|
List | pgsql-hackers |
Hi, On Tue, Mar 8, 2022 at 10:25 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Fri, Mar 4, 2022 at 4:26 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > > > Thanks for your test and comments. > > > Some codes were added in ReorderBufferProcessTXN() for treating DDL, > > but I doubted updating accept_writes is needed. > > IMU, the parameter is read by OutputPluginPrepareWrite() in order align > > messages. > > They should have a header - like 'w' - before their body. But here only a > > keepalive message is sent, > > no meaningful changes, so I think it might be not needed. > > I commented out the line and tested like you did [1], and no timeout and errors > > were found. > > Do you have any reasons for that? > > > > https://www.postgresql.org/message- > > id/OS3PR01MB6275A95FD44DC6C46058EA389E3B9%40OS3PR01MB6275.jpnprd0 > > 1.prod.outlook.com > Yes, you are right. We should not set accept_writes to true here. > And I looked into the function WalSndUpdateProgress. I found function > WalSndUpdateProgress try to record the time of some message(by function > LagTrackerWrite) sent to subscriber, such as in function pgoutput_commit_txn. > Then, when publisher receives the reply message from the subscriber(function > ProcessStandbyReplyMessage), publisher invokes LagTrackerRead to calculate the > delay time(refer to view pg_stat_replication). > Referring to the purpose of LagTrackerWrite, I think it is no need to log time > when sending keepalive messages here. > So when the parameter send_keep_alive of function WalSndUpdateProgress is true, > skip the recording time. > > > I'm also happy if you give the version number :-). > Introduce version information, starting from version 1. > > Attach the new patch. > 1. Fix wrong variable setting and skip unnecessary time records.[suggestion by Kuroda-San and me.] > 2. Introduce version information.[suggestion by Peter, Kuroda-San] I've looked at the patch and have a question: +void +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) +{ + static int skipped_changes_count = 0; + + /* + * skipped_changes_count is reset when processing changes that do not + * need to be skipped. + */ + if (!skipped) + { + skipped_changes_count = 0; + return; + } + + /* + * After continuously skipping SKIPPED_CHANGES_THRESHOLD changes, try to send a + * keepalive message. + */ + #define SKIPPED_CHANGES_THRESHOLD 10000 + + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD) + { + /* Try to send a keepalive message. */ + OutputPluginUpdateProgress(ctx, true); + + /* After trying to send a keepalive message, reset the flag. */ + skipped_changes_count = 0; + } +} Since we send a keepalive after continuously skipping 10000 changes, the originally reported issue can still occur if skipping 10000 changes took more than the timeout and the walsender didn't send any change while that, is that right? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
pgsql-hackers by date: