Re: Logical replication timeout problem - Mailing list pgsql-hackers

From Fabrice Chapuis
Subject Re: Logical replication timeout problem
Date
Msg-id CAA5-nLCaX4Bz2MW-gupH8Ovn+r8KrCSwLW3jiUKcPf555J=5+w@mail.gmail.com
Whole thread Raw
In response to RE: Logical replication timeout problem  ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>)
Responses Re: Logical replication timeout problem
List pgsql-hackers
Thanks for your patch, it also works well when executing our use case, the timeout no longer appears in the logs. Is it necessary now to refine this patch and make as few changes as possible in order for it to be released?

On Fri, Jan 21, 2022 at 10:51 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> It might be not reaching the actual send_keep_alive logic in
> WalSndKeepaliveIfNecessary because of below code:
> {
> ...
> /*
> * Don't send keepalive messages if timeouts are globally disabled or
> * we're doing something not partaking in timeouts.
> */
> if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> }
>
> I think you can add elog before the above return and before updating progress
> in the below code:
> case REORDER_BUFFER_CHANGE_INSERT:
>   if (!relentry->pubactions.pubinsert)
> + {
> + OutputPluginUpdateProgress(ctx);
>   return;
>
> This will help us to rule out one possibility.

Thanks for your advices!

According to your advices, I applied 0001,0002 and 0003 to run the test script.
When subscriber timeout, I filter publisher-side log:
$ grep "before invoking update progress" pub.log | wc -l
60373557
$ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log | wc -l
0
$ grep "return because waiting_for_ping_response" pub.log | wc -l
0

Based on this result, I think function WalSndKeepaliveIfNecessary was invoked,
but function WalSndKeepalive was not invoked because (last_processing >=
ping_time) is false.
So I tried to see changes about last_processing and last_reply_timestamp
(because ping_time is based on last_reply_timestamp).

I found last_processing and last_reply_timestamp is set in function
ProcessRepliesIfAny.
last_processing is set to the time when function ProcessRepliesIfAny is
invoked.
Only when publisher receive a response from subscriber, last_reply_timestamp is
set to last_processing and the flag waiting_for_ping_response is reset to
false.

When we are during the loop to skip all the changes of transaction, IIUC, we do
not invoke function ProcessRepliesIfAny. So I think last_processing and
last_reply_timestamp will not be changed in this loop.
Therefore I think about our use case, we should modify the condition of
invoking WalSndKeepalive.(please refer to
0004-Simple-modification-of-timing.patch, and note that this is only a patch
for testing).
At the same time I modify the input of WalSndKeepalive from true to false. This
is because when input is true, waiting_for_ping_response is set to true in
WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked in the
loop, so I think waiting_for_ping_response will not be reset to false and
keepalive messages will not be sent.

I tested after applying patches(0001 and 0004), I found the timeout was not
printed in subscriber-side log. And the added messages "begin load changes" and
"commit the log" were printed in publisher-side log:
$ grep -ir "begin load changes" pub.log
2022-01-21 11:17:06.934 CST [2577699] LOG:  begin load changes
$ grep -ir "commit the log" pub.log
2022-01-21 11:21:15.564 CST [2577699] LOG:  commit the log

Attach the patches and test script mentioned above, in case someone wants to
try.

Regards,
Wang wei

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Shruthi Gowda
Date:
Subject: Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)