RE: Logical replication timeout problem - Mailing list pgsql-hackers

From wangw.fnst@fujitsu.com
Subject RE: Logical replication timeout problem
Date
Msg-id OS3PR01MB62750A1360AB7DF6E8F40A909E0A9@OS3PR01MB6275.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Logical replication timeout problem  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Logical replication timeout problem
List pgsql-hackers
On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've looked at the patch and have a question:
Thanks for your review and comments.

> +void
> +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> +        static int skipped_changes_count = 0;
> +
> +        /*
> +         * skipped_changes_count is reset when processing changes that do not
> +         * need to be skipped.
> +         */
> +        if (!skipped)
> +        {
> +                skipped_changes_count = 0;
> +                return;
> +        }
> +
> +        /*
> +         * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> changes, try to send a
> +         * keepalive message.
> +         */
> +        #define SKIPPED_CHANGES_THRESHOLD 10000
> +
> +        if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> +        {
> +                /* Try to send a keepalive message. */
> +                OutputPluginUpdateProgress(ctx, true);
> +
> +                /* After trying to send a keepalive message, reset the flag. */
> +                skipped_changes_count = 0;
> +        }
> +}
> 
> Since we send a keepalive after continuously skipping 10000 changes, the
> originally reported issue can still occur if skipping 10000 changes took more than
> the timeout and the walsender didn't send any change while that, is that right?
Yes, theoretically so.
But after testing, I think this value should be conservative enough not to reproduce
this bug.
After the previous discussion[1], it is currently considered that it is better
to directly set a conservative threshold than to calculate the threshold based
on wal_sender_timeout.

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275FEB9F83081F1C87539B99E019%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

pgsql-hackers by date:

Previous
From: "wangw.fnst@fujitsu.com"
Date:
Subject: RE: Logical replication timeout problem
Next
From: "wangw.fnst@fujitsu.com"
Date:
Subject: RE: Logical replication timeout problem