Home > mailing lists

Re: terminating walsender process due to replication timeout - Mailing list pgsql-general

From	AYahorau@ibagroup.eu
Subject	Re: terminating walsender process due to replication timeout
Date	May 30, 2019 09:07:35
Msg-id	OF021E7D57.6028B6B4-ON4325840A.002EA348-4325840A.0032221F@iba.by Whole thread Raw
In response to	Re: terminating walsender process due to replication timeout (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List	pgsql-general

Tree view

Hello,

I took a look in postgresql source code. As far as I understood walsender can send some data to walreceiver regarding some changes and so-called keepalive messages.
Exactly these keepalive messages walsender sends periodically once per wal_sender_timeout seconds (once per 1 second in my case) and expects to get responses from the walreceiver that everything goes ok.

I switched on trace and repeated my test. I found out that walreceiver starts processing of the keepalive message in 3 seconds after its sending.
As far as I understand it happened because walreceiver was busy by accepting all the changes from the master and writing it to its standby WAL logs.
So the standby postgres log was overloaded overfilled by the entries regarding writing its own WALs:
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559 18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9C0forHeap/LOCK: off 53: xid 559: flags 0 LOCK_ONLY EXCL_LOCK 18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559 18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9F8for Heap/UPDATE: off 53 xmax 559 ; newoff 129 xmax 0 18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559 18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA48forHeap/LOCK: off 54: xid 559: flags 0 LOCK_ONLY EXCL_LOCK 18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559 18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA80for Heap/UPDATE: off 54 xmax 559 ; newoff 130 xmax 0 18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559

So because of writing large volume of data it was not able to handle the next messages quickly. It seems not to be related to network bandwidth or CPU saturation.

Thereby, I see some kind of a contradiction with the official description of wal_sender_timeout parameter:
Terminate replication connections that are inactive longer than the specified number of milliseconds.
This is useful for the sending server to detect a standby crash or network outage.

During my test the connection between master and standby was active and there was no any network outage. So according to the description there was no need to terminate
replication connection.

So, I have some questions:
Is there any way (e. g. via configuration of other) to make the response time to keepalive messages independent of the amount of data that the walreceiver has to process?
If there is no such a way is it possible to update wal_sender_timeout documentation so it reflects reality?

Best regards,
Andrei

From: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
To: AYahorau@ibagroup.eu,
Cc: pgsql-general@postgresql.org, rene.romero.b@gmail.com
Date: 24/05/2019 09:42
Subject: Re: terminating walsender process due to replication timeout

Hello. At Fri, 17 May 2019 11:04:58 +0300, AYahorau@ibagroup.eu wrote in <OFE11EC62A.504EB2B3-ON432583FD.002BE231-432583FD.002C666E@iba.by> > Can frequent database operations cause getting a standby server behind? Is > there a way to avoid this situation? > I checked that walsender works well in my test if I set > wal_sender_timeout at least to 5 second. It depends on the transaction (WAL) traffic and the bandwidth of your network. The performacne difference between master and standby also affects. The possibilities I can guess are: - The bandwidth is narrow to the traffic. - The standby performs poorer than the master. - Your network is having a sort of trouble. Virtual network (local network in a virtual environment) tends to suffer network troubles caused by CPU saturation or something else. regards. -- Kyotaro Horiguchi NTT Open Source Software Center

pgsql-general by date:

From: Yonatan Misgan
Date: 30 May 2019, 05:59:17
Subject: Localization

From: "Peter J. Holzer"
Date: 30 May 2019, 10:38:47
Subject: Migrating a Patroni cluster from Ubuntu 16.04 to Ubuntu 18.04

Re: terminating walsender process due to replication timeout - Mailing list pgsql-general

Previous

Next