Re: Logical replication timeout problem - Mailing list pgsql-hackers

From: Euler Taveira
Subject: Re: Logical replication timeout problem
Date:
Msg-id: ef68bb18-37f9-4303-998e-a15d2fdb2563@www.fastmail.com
In response to: Re: Logical replication timeout problem (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Logical replication timeout problem (Amit Kapila <amit.kapila16@gmail.com>)
           RE: Logical replication timeout problem ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>)
List: pgsql-hackers
On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote:
> On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
> >
> > So I skip tracking lag during a transaction just like the current HEAD.
> > Attach the new patch.
> >
>
> Thanks, please find the updated patch where I have slightly modified
> the comments.
>
> Sawada-San, Euler, do you have any opinion on this approach? I
> personally still prefer the approach implemented in v10 [1] especially
> due to the latest finding by Wang-San that we can't update the
> lag-tracker apart from when it is invoked at the transaction end.
> However, I am fine if we like this approach more.

It seems v15 is simpler and less error prone than v10. v10 has a mix of
OutputPluginUpdateProgress() and the new function update_progress(), and it
also calls update_progress() for every change action in pgoutput_change(). That
is not a good approach for maintainability -- new change types such as
sequences would need extra calls. However, as you mentioned, the lag-tracking
case still has to be handled.

Both patches change the OutputPluginUpdateProgress() signature, so they cannot
be backpatched as-is. Are you planning to backpatch this fix? If so, the
boolean variable (last_write or end_xacts, depending on which version you are
considering) could be added to LogicalDecodingContext instead. (You should
probably consider the same approach for skipped_xact too.)
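
For illustration, a minimal, hypothetical sketch of that idea -- the struct and
function below are stand-ins, not the actual PostgreSQL definitions. Keeping
the flag inside the decoding context leaves the OutputPluginUpdateProgress()
signature untouched, which is what would make backpatching viable:

#include <stdbool.h>

/* stand-in for LogicalDecodingContext; field names are only illustrative */
typedef struct DemoDecodingContext
{
    bool    end_xact;       /* processing the last change of a transaction? */
    bool    skipped_xact;   /* was the whole transaction skipped/filtered? */
} DemoDecodingContext;

/* only end-of-transaction LSNs are acknowledged, so only then track lag */
static bool
demo_should_track_lag(const DemoDecodingContext *ctx)
{
    return ctx->end_xact;
}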

+ * For a large transaction, if we don't send any change to the downstream for a
+ * long time then it can timeout. This can happen when all or most of the
+ * changes are either not published or got filtered out.

We should probably mention that "long time" means wal_receiver_timeout on the
subscriber.

+    * change as that can have overhead. Testing reveals that there is no
+    * noticeable overhead in doing it after continuously processing 100 or so
+    * changes.

Tests revealed that ...

+    * We don't have a mechanism to get the ack for any LSN other than end xact
+    * lsn from the downstream. So, we track lag only for end xact lsn's.

s/lsn/LSN/ and s/lsn's/LSNs/

I would say "end of transaction LSN".

+ * If too many changes are processed then try to send a keepalive message to
+ * receiver to avoid timeouts.

In logical replication, if too many changes are processed then try to send a
keepalive message. It might avoid a timeout in the subscriber.
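
To make the mechanism concrete, here is a rough, self-contained sketch of the
idea only -- the names are hypothetical and this is not the v15 patch, which
works through OutputPluginUpdateProgress() instead:

#include <stdbool.h>

/* the "100 or so changes" from the patch comment */
#define CHANGES_THRESHOLD 100

/* stand-in for the walsender's keepalive/progress machinery */
extern void demo_send_keepalive_if_needed(void);

static int  changes_count = 0;

static void
demo_update_progress(bool end_xact)
{
    if (end_xact)
    {
        /* end of transaction: the only LSN the subscriber acknowledges */
        demo_send_keepalive_if_needed();
        changes_count = 0;
        return;
    }

    /* checking after every filtered change would add overhead, so batch it */
    if (++changes_count >= CHANGES_THRESHOLD)
    {
        demo_send_keepalive_if_needed();
        changes_count = 0;
    }
}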

Does this same issue occur for long transactions? I mean keeping a transaction
open for a few hours while thousands of other transactions execute, for
example:

BEGIN;
INSERT INTO foo (a) VALUES(1);
-- wait a few hours while executing 10^x transactions
INSERT INTO foo (a) VALUES(2);
COMMIT;


--
Euler Taveira
