Thread: terminating walsender process due to replication timeout
Hello PostgreSQL Community!
I faced an issue on my Linux machine using PostgreSQL 11.3.
I have a two-node cluster: a master and a standby.
I ran a number of long-running queries, which led to the databases becoming desynchronized:
terminating walsender process due to replication timeout
Here is the output in debug mode:
2019-05-13 13:21:33 FET 00000 DEBUG: sending replication keepalive
2019-05-13 13:21:34 FET 00000 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG: CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 LOG: terminating walsender process due to replication timeout
The issue is reproducible: I configure a two-node cluster, download demo_small.zip from https://edu.postgrespro.ru/, and run the following command:
psql -U user1 -f demo_small.sql db1
and I get the observed behaviour.
I know that I can increase the wal_sender_timeout value to avoid this behaviour (currently wal_sender_timeout is set to 1 second).
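For reference, this is the setting in postgresql.conf on the primary (the 1-second value is my configuration; the default is 60 seconds):

```
# postgresql.conf on the primary (illustrative)
wal_sender_timeout = 1s        # default is 60s
```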
To be honest, I don't want to increase wal_sender_timeout, because I would like to detect network issues quickly.
After some googling I found that someone faced a similar issue (https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@2ndquadrant.com), which was fixed in PostgreSQL 9.4.16.
Is my issue the same as the one described there?
Is there any other way to avoid it without increasing wal_sender_timeout?
Thank you in advance.
Regards,
Andrei
To detect network issues maybe you could monitor replication delay.
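For example, something like this on the primary shows per-standby delay (the lag columns are available since PostgreSQL 10):

```sql
-- run on the primary: replication progress and lag for each standby
SELECT application_name, state,
       write_lag, flush_lag, replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_gap_bytes
FROM pg_stat_replication;
```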
On Mon, May 13, 2019 at 6:42 AM <AYahorau@ibagroup.eu> wrote:
Hello,
Thank you for the response.
Yes, it's possible to monitor replication delay, but my questions were not about monitoring network issues.
I use wal_sender_timeout=1s precisely because it allows replication problems to be detected quickly.
So, I need clarification on the following questions:
Is it possible to use exactly this configuration and be sure that it will work properly?
What did I do wrong? Should I correct my configuration somehow?
Is this the same issue as the one mentioned here: https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@2ndquadrant.com ? If so, why do I face this problem again?
Thank you in advance.
Best regards,
Andrei
From: Rene Romero Benavides <rene.romero.b@gmail.com>
To: AYahorau@ibagroup.eu,
Cc: Postgres General <pgsql-general@postgresql.org>
Date: 14/05/2019 20:12
Subject: Re: terminating walsender process due to replication timeout
To detect network issues maybe you could monitor replication delay.
Hello.
At Wed, 15 May 2019 10:04:12 +0300, AYahorau@ibagroup.eu wrote in <OF99D0D839.6A5BCB70-ON432583FB.0025912E-432583FB.0026D664@iba.by>
> I use exactly wal_sender_timeout=1s because it allows to detect
> replication problems quickly.
Though I don't have an exact idea of your configuration, it seems to me that your standby is simply getting more than one second behind the master. If you regard that as a replication problem, then the configuration can be said to be finding the problem correctly.
Since the keepalive packet is sent in-band, it doesn't get to the standby before already-sent-but-not-yet-processed packets.
> Is this the same issue as mentioned here:
> https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@2ndquadrant.com ?
It is not the same "problem". What was mentioned there was a fast network making the sender-side loop busy, which prevented the keepalive packet from being sent.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
Thanks for the answer.
Can frequent database operations cause a standby server to fall behind? Is there a way to avoid this situation?
I checked that the walsender works fine in my test if I set wal_sender_timeout to at least 5 seconds.
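As a side note, this is how I can check how far the standby falls behind during the test (run on the standby):

```sql
-- run on the standby: gap between WAL received and WAL replayed
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_gap_bytes,
       now() - pg_last_xact_replay_timestamp() AS replay_delay;
```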
Best regards,
Andrei Yahorau
From: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
To: AYahorau@ibagroup.eu,
Cc: rene.romero.b@gmail.com, pgsql-general@postgresql.org
Date: 16/05/2019 10:36
Subject: Re: terminating walsender process due to replication timeout
Hello Everyone!
I can simplify and describe the issue I faced.
I have 2 nodes in db cluster: master and standby.
I create a simple table on the master node via psql:
CREATE TABLE table1 (a INTEGER);
After this I fill the table using the COPY command from a file which contains 2,000,000 (2 million) entries.
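For what it's worth, the same volume of data can be loaded without an external file:

```sql
-- alternative fill, no external file needed
INSERT INTO table1 SELECT g FROM generate_series(1, 2000000) AS g;
```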
Then, when I run, for example, this command:
UPDATE table1 SET a='1'
or this command:
DELETE FROM table1;
I see the following entry in the PostgreSQL log: terminating walsender process due to replication timeout.
I suppose this issue is caused by the small value of wal_sender_timeout=1s combined with the long runtime of the queries (about 15 seconds).
What is the best way to handle this? How can I avoid it? Is there any additional configuration which can help here?
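One common mitigation (a sketch, not something suggested in this thread; the 10000-row batch size is arbitrary) is to split the bulk UPDATE into smaller transactions so the standby has a chance to keep up between them:

```sql
-- batched variant of the UPDATE above; repeat until it reports UPDATE 0
UPDATE table1
SET a = 1
WHERE ctid = ANY (ARRAY(
    SELECT ctid FROM table1
    WHERE a IS DISTINCT FROM 1
    LIMIT 10000));
```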
Regards,
Andrei
From: Andrei Yahorau/IBA
To: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>,
Cc: pgsql-general@postgresql.org, rene.romero.b@gmail.com
Date: 17/05/2019 11:04
Subject: Re: terminating walsender process due to replication timeout
On 23/5/19 5:05 μ.μ., AYahorau@ibagroup.eu wrote:
> Hello Everyone!
I have set mine to 15min. No problems for over 7 months, knock on wood.
--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
Hello.
At Fri, 17 May 2019 11:04:58 +0300, AYahorau@ibagroup.eu wrote in <OFE11EC62A.504EB2B3-ON432583FD.002BE231-432583FD.002C666E@iba.by>
> Can frequent database operations cause getting a standby server behind?
> Is there a way to avoid this situation?
It depends on the transaction (WAL) traffic and the bandwidth of your network. The performance difference between the master and the standby also matters. The possibilities I can guess at are:
- The bandwidth is too narrow for the traffic.
- The standby performs more poorly than the master.
- Your network is having some sort of trouble. A virtual network (a local network in a virtual environment) tends to suffer network trouble caused by CPU saturation or something else.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,
I took a look at the PostgreSQL source code. As far as I understood, walsender sends walreceiver both data about changes and so-called keepalive messages.
walsender sends these keepalive messages periodically, once per wal_sender_timeout (once per second in my case), and expects responses from walreceiver confirming that everything is OK.
I switched on tracing and repeated my test. I found out that walreceiver started processing a keepalive message 3 seconds after it was sent.
As far as I understand, this happened because walreceiver was busy receiving all the changes from the master and writing them to the standby's WAL.
So the standby's PostgreSQL log was overfilled with entries about writing its own WAL:
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9C0 for Heap/LOCK: off 53: xid 559: flags 0 LOCK_ONLY EXCL_LOCK
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9F8 for Heap/UPDATE: off 53 xmax 559 ; new off 129 xmax 0
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA48 for Heap/LOCK: off 54: xid 559: flags 0 LOCK_ONLY EXCL_LOCK
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA80 for Heap/UPDATE: off 54 xmax 559 ; new off 130 xmax 0
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
So, because it was writing a large volume of data, walreceiver was not able to handle the next messages quickly. This seems unrelated to network bandwidth or CPU saturation.
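The apply backlog described above can also be checked on the standby itself. A minimal sketch, using the PostgreSQL 10+ function names:

```
-- On the standby: bytes received from the master but not yet
-- replayed (sketch). A persistently large value means walreceiver
-- and/or the startup process cannot keep up with incoming WAL.
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_backlog_bytes;
```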
Thereby, I see a contradiction with the official description of the wal_sender_timeout parameter:
Terminate replication connections that are inactive longer than the specified number of milliseconds.
This is useful for the sending server to detect a standby crash or network outage.
During my test the connection between master and standby was active and there was no network outage. So, according to the description, there was no need to terminate the
replication connection.
So, I have some questions:
Is there any way (e.g. via configuration or otherwise) to make the response time to keepalive messages independent of the amount of data that walreceiver has to process?
If there is no such way, is it possible to update the wal_sender_timeout documentation so that it reflects reality?
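For what it's worth, the workaround that seems to hold in my test can be written down as a configuration change. The values below are illustrative, not recommendations:

```
# postgresql.conf on the master (illustrative values)
wal_sender_timeout = 5s            # 1s proved too tight under heavy WAL traffic

# postgresql.conf on the standby
wal_receiver_status_interval = 1s  # how often the standby reports its WAL
                                   # positions (default 10s); should stay well
                                   # below wal_sender_timeout on the master
```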
Best regards,
Andrei
From: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
To: AYahorau@ibagroup.eu,
Cc: pgsql-general@postgresql.org, rene.romero.b@gmail.com
Date: 24/05/2019 09:42
Subject: Re: terminating walsender process due to replication timeout
Hello.
At Fri, 17 May 2019 11:04:58 +0300, AYahorau@ibagroup.eu wrote in <OFE11EC62A.504EB2B3-ON432583FD.002BE231-432583FD.002C666E@iba.by>
> Can frequent database operations cause getting a standby server behind? Is
> there a way to avoid this situation?
> I checked that walsender works well in my test if I set
> wal_sender_timeout at least to 5 second.
It depends on the transaction (WAL) traffic and the bandwidth of
your network. The performance difference between the master and
the standby also matters.
The possibilities I can guess are:
- The bandwidth is too narrow for the traffic.
- The standby performs more poorly than the master.
- Your network is having some sort of trouble. A virtual network
(a local network in a virtual environment) tends to suffer
network troubles caused by CPU saturation or something else.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center