Thread: Active connections are terminated because of small wal_sender_timeout

Active connections are terminated because of small wal_sender_timeout

From
AYahorau@ibagroup.eu
Date:
Hello PostgreSQL Community!

Not long ago I opened a thread regarding the
"terminating walsender process due to replication timeout" problem:
https://www.postgresql.org/message-id/OF85C33E30.171C1C23-ON432583F9.003F5B16-432583F9.003FBAD7%40iba.by

Thanks, everyone, for your comprehensive explanations.
Basically, I faced this problem because of a small wal_sender_timeout value (1 second) and a quite large volume of data to be replicated.
There were some assumptions about problems with my network, but during my investigation I made sure that it was not a network problem.


I took a look at the PostgreSQL source code. As far as I understood, the walsender sends data changes to the walreceiver as well as so-called keepalive messages.
The walsender sends these keepalive messages periodically, once per wal_sender_timeout (once per second in my case), and expects responses from the walreceiver confirming that everything is going OK.


I switched on tracing and repeated my test. I found out that the walreceiver starts processing the "keepalive" message 3 seconds after it is sent.
As far as I understand, this happened because the walreceiver was busy receiving all the changes from the master and writing them to the standby's WAL.
So the standby's postgres log was overfilled by entries about writing its own WAL:
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559  latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9C0 for Heap/LOCK: off 53: xid 559: flags 0 LOCK_ONLY EXCL_LOCK
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9F8 for Heap/UPDATE: off 53 xmax 559 ; new off 129 xmax 0
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA48 for Heap/LOCK: off 54: xid 559: flags 0 LOCK_ONLY EXCL_LOCK
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA80 for Heap/UPDATE: off 54 xmax 559 ; new off 130 xmax 0
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559

So, because it was writing a large volume of data, it was not able to handle the next messages quickly. This does not seem to be related to network bandwidth or CPU saturation.

Thus, I see some contradiction with the official description of the wal_sender_timeout parameter:
Terminate replication connections that are inactive longer than the specified number of milliseconds.
This is useful for the sending server to detect a standby crash or network outage.


During my test the connection between the master and the standby was active and there was no network outage. So, according to the description, there was no need to terminate the replication connection.


So, I have some questions:
Is there any way (e.g., via configuration or otherwise) to make the response
time to keepalive messages independent of the amount of data that the walreceiver has to process?
If there is no such way, is it possible to update the wal_sender_timeout documentation so that it reflects reality?
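One obvious mitigation, of course, is to raise the timeout itself well above the standby's worst-case apply lag (shown here with 60 seconds, which is the default in recent releases; the right value depends on the installation):

```sql
-- Run on the primary; requires superuser.  The 60s figure is just the
-- current default, not a tuned recommendation.
ALTER SYSTEM SET wal_sender_timeout = '60s';
SELECT pg_reload_conf();   -- wal_sender_timeout takes effect on reload
```

But that only hides the underlying issue I am asking about: the reply time still depends on how busy the walreceiver is.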



Best regards,
Andrei Yahorau

Re: Active connections are terminated because of small wal_sender_timeout

From
AYahorau@ibagroup.eu
Date:
Hello Everyone!
Do you have any thoughts regarding this issue?

Best regards,
Andrei Yahorau



From:        Andrei Yahorau/IBA
To:        pgsql-general@postgresql.org,
Date:        21/06/2019 11:14
Subject:        Active connections are terminated because of small wal_sender_timeout



AYahorau@ibagroup.eu writes:
> Do you have any thoughts regarding this issue? 

I do not think anybody thinks this is a bug.  Setting wal_sender_timeout
too small is a configuration mistake.

            regards, tom lane



Re: Active connections are terminated because of small wal_sender_timeout

From
Michael Paquier
Date:
On Fri, Jul 05, 2019 at 10:03:16AM -0400, Tom Lane wrote:
> AYahorau@ibagroup.eu writes:
>> Do you have any thoughts regarding this issue?
>
> I do not think anybody thinks this is a bug.  Setting wal_sender_timeout
> too small is a configuration mistake.

Yeah.  I don't see any bug here.  Please note that it can also be a
problem to set too high a value in some configuration setups.  The
lack of flexibility in this area is why wal_sender_timeout has been
switched to be user-settable in v12.  In short, you can configure it in
the connection string to enforce a custom value per standby.
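For instance, on the standby's side (v12 or newer) that could look like the
following in its configuration; the host and user values are placeholders:

```
# primary_conninfo on the standby, passing a per-standby override to the
# walsender via the libpq "options" keyword (note the doubled quotes
# required inside a postgresql.conf string):
primary_conninfo = 'host=primary.example user=replicator options=''-c wal_sender_timeout=30s'''
```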
--
Michael


Re: Active connections are terminated because of small wal_sender_timeout

From
AYahorau@ibagroup.eu
Date:
Hello Everyone!

> I do not think anybody thinks this is a bug.  Setting wal_sender_timeout
> too small is a configuration mistake.

I don't understand why it is a mistake. 1 second is an acceptable value for wal_sender_timeout.
Moreover, the behaviour contradicts the official description of wal_sender_timeout:
Terminate replication connections that are inactive longer than the specified number of milliseconds.

First of all, the connection between the master and the standby was good in my example. The problem was in keepalive message processing (because of a busy standby server).
So, two variants are possible here:
1) Inappropriate keepalive message processing, which contradicts the wal_sender_timeout description.

2) An incorrect description of the wal_sender_timeout parameter in the documentation.

> Yeah. I don't see any bug here. Please note that it can also be a problem to set too high a value in some configuration setups. The lack of flexibility in this area is why wal_sender_timeout has been switched to be
> user-settable in v12. In short, you can configure it in the connection string to enforce a custom value per standby.
Thanks for this announcement. This enhancement looks very useful.

Best regards,
Andrei Yahorau

Re: Active connections are terminated because of small wal_sender_timeout

From
AYahorau@ibagroup.eu
Date:
Hello Everyone!

Sorry for being persistent.

> I do not think anybody thinks this is a bug.  Setting wal_sender_timeout
> too small is a configuration mistake.


Why is it a configuration mistake? This value is allowed to be set. There is no restriction on it.
I would like to ask a question regarding the wal_sender_timeout description and its real behaviour.
Description:
Terminate replication connections that are inactive longer than the specified number of milliseconds.
Real behaviour:
A standby server can be busy and fail to reply to the master's "keepalive" messages in time.
(By the way, wal_sender_timeout does not have to be as small as in my tests; this situation can also happen with wal_sender_timeout=10 seconds, and so on.)

Does it work properly according to the description? Don't you see the contradiction between the description and the real behaviour? As I mentioned before, the connection between the master and the standby is good. There is no problem with it.
So where is the bug: in the description or in the implementation?

> Yeah.  I don't see any bug here.  Please note that it can also be a
> problem to set too high a value in some configuration setups.  The
> lack of flexibility in this area is why wal_sender_timeout has been
> switched to be user-settable in v12.  In short, you can configure it in
> the connection string to enforce a custom value per standby.


Will a small value of wal_sender_timeout in PostgreSQL v12 lead to the same "terminating walsender process due to replication timeout" failure as we observe in v11?

Best regards,
Andrei Yahorau
AYahorau@ibagroup.eu writes:
>> I do not think anybody thinks this is a bug.  Setting wal_sender_timeout
>> too small is a configuration mistake.

> Why is it a configuration mistake? This value is allowed to be set. There
> is no restriction on it.

The fact that a parameter can be set does not mean that we guarantee that
every possible value will work for everybody.  As an example, if you
configure work_mem to a few TB and then start getting OOM failures because
your machine can't actually support that, it's not a bug that we let you
set the value that high.  The upper limit of what's usable is too variable
and too hard to determine, so we don't even try; it's on you to choose a
suitable setting for your situation.  Similarly, the useful range of
wal_sender_timeout is hard to predict and is likely to be quite different
for different installations.  We don't want to artificially constrain what
people can use, so the range of allowed settings *ought* to include some
values that are not practically useful in specific situations.  Those
values might be just the right ones for someone else.

            regards, tom lane