Re: Logical replication timeout problem - Mailing list pgsql-hackers

From Fabrice Chapuis
Subject Re: Logical replication timeout problem
Date
Msg-id CAA5-nLCs+XwLma7KPo_GJTKnhXU2YcTD9igMC8uDJV76xk72zQ@mail.gmail.com
Whole thread Raw
In response to Re: Logical replication timeout problem  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Logical replication timeout problem
List pgsql-hackers
> IIUC, these are called after processing each WAL record so not
sure how is it possible in your case that these are not reached?

I don't know, as you say, to highlight the problem we would have to debug the WalSndKeepaliveIfNecessary function

> I was curious to know if the walsender has exited before walreceiver

During the last tests we made we didn't observe any timeout of the wal sender process.

> Do you mean you are planning to change from 1 minute to 5 minutes?

We set wal_sender_timeout/wal_receiver_timeout to 5' and launch new test. The result is surprising and rather positive there is no timeout any more in the log and the 20Gb of snap files are removed in less than 5 minutes.
How to explain that behaviour, why the snap files are consumed suddenly so quickly.
I choose the value arbitrarily for wal_sender_timeout/wal_receiver_timeout parameters, are theses values appropriate from your point of view?

Best Regards

Fabrice



On Tue, Sep 21, 2021 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Sep 21, 2021 at 1:52 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If I understand, the instruction to send keep alive by the wal sender has not been reached in the for loop, for what reason?
> ...
> * Check for replication timeout. */
>   WalSndCheckTimeOut();
>
> /* Send keepalive if the time has come */
>   WalSndKeepaliveIfNecessary();
> ...
>

Are you sure that these functions have not been called? Or the case is
that these are called but due to some reason the keep-alive is not
sent? IIUC, these are called after processing each WAL record so not
sure how is it possible in your case that these are not reached?

> The data load is performed on a table which is not replicated, I do not understand why the whole transaction linked to an insert is copied to snap files given that table does not take part of the logical replication.
>

It is because we don't know till the end of the transaction (where we
start sending the data) whether the table will be replicated or not. I
think specifically for this purpose the new 'streaming' feature
introduced in PG-14 will help us to avoid writing data of such tables
to snap/spill files. See 'streaming' option in Create Subscription
docs [1].

> We are going to do a test by modifying parameters wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that these parameters are global and changing them will also impact the physical replication.
>

Do you mean you are planning to change from 1 minute to 5 minutes? I
agree with the global nature of parameters and I think your approach
to finding out the root cause is good here because otherwise, under
some similar or more heavy workload, it might lead to the same
situation.

> Concerning the walsender timeout, when the worker is started again after a timeout, it will trigger a new walsender associated with it.
>

Right, I know that but I was curious to know if the walsender has
exited before walreceiver.

[1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html

--
With Regards,
Amit Kapila.

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: refactoring basebackup.c
Next
From: "Bossart, Nathan"
Date:
Subject: Re: Estimating HugePages Requirements?