Thread: Streaming replication connection break - unexpected EOF on standby connection

Hi,

  We are using PostgreSQL 10.1 and OS is Ubuntu 16.04.3 LTS, for primary as well as secondary database server.

We are using streaming replication (asynchronous). But there is frequent replication connection break between primary and secondary.

Database Log at Primary:

14:46:11.610  IST LOG:  unexpected EOF on standby connection
14:46:11.610  IST LOG:  disconnection: session time: 2:59:34.520 user=replication_user

Database Log at Secondary

14:46:38.822 IST FATAL:  terminating walreceiver due to timeout

After replication disconnection, secondary get connected after approximately 25 seconds. We want to set synchronous replication, but because of this issue we can't do that.

I have enabled debug5 log, but I didn't understand much from that. I am attaching the same.

Below are the config parameter

Primary

wal_level = replica
max_wal_senders = 10     
                         
wal_keep_segments = 300  
wal_sender_timeout = 60s 

checkpoint_timeout = 5min
max_wal_size = 1GB
min_wal_size = 80MB


Secondary

hot_standby = on
wal_receiver_timeout = 30s

Below are the issues.

1. Because of what reason, " unexpected EOF on standby connection" occurs on primary db server?
2. After replication disconnection, secondary should immediately connect to primary, but it takes some time, what could be the reason for this?

Any help will be highly appreciated.

Thanks & Regards,
Ganesh.
Attachment

Re: Streaming replication connection break - unexpected EOF onstandby connection

From
Johannes Truschnigg
Date:
Hi Ganesh,


On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde wrote:
> [...]
> 1. Because of what reason, " unexpected EOF on standby connection" occurs
> on primary db server?
> 2. After replication disconnection, secondary should immediately connect to
> primary, but it takes some time, what could be the reason for this?

From skimming the log, it seems to me that there is an issue at the
socket/network level, which yields the "connection reset by peer" eror
message.

What is the network between these two hosts like? Is it a WAN link; is a VPN
or SSH tunnel involved? Do you have other, long-running TCP sessions between
these peers, and do they experience similar or other problems? Do the hosts'
link-layer stats hint at problems, e. g. packet loss? Do the hosts' kernels
leave a message hinting at L2 connectivity problems in their debug ringbuffers
(`dmesg`) at the time you observe the replication drop out?

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.

Attachment
Hi  Johannes,

  Thanks for your reply. We are using VPN Tunnel between these two hosts. I will check with network team, with remaining questions you mentioned and will get back.

Thanks  & Regards,
Ganesh.

On Wed, Jun 27, 2018 at 6:46 PM Johannes Truschnigg <johannes@truschnigg.info> wrote:
Hi Ganesh,


On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde wrote:
> [...]
> 1. Because of what reason, " unexpected EOF on standby connection" occurs
> on primary db server?
> 2. After replication disconnection, secondary should immediately connect to
> primary, but it takes some time, what could be the reason for this?

From skimming the log, it seems to me that there is an issue at the
socket/network level, which yields the "connection reset by peer" eror
message.

What is the network between these two hosts like? Is it a WAN link; is a VPN
or SSH tunnel involved? Do you have other, long-running TCP sessions between
these peers, and do they experience similar or other problems? Do the hosts'
link-layer stats hint at problems, e. g. packet loss? Do the hosts' kernels
leave a message hinting at L2 connectivity problems in their debug ringbuffers
(`dmesg`) at the time you observe the replication drop out?

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.
Hi,

    After analysis by network team, they found packets are getting reset by Secondary server. Below are the logs.

782.822280 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822310 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822313 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822315 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822317 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822319 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822345 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740


But they didn't able to find why secondary generating reset packet. There are no any devices between these servers which can modify the packets.
Though both servers are on different firewall, but packets are getting reset at secondary server and not at the firewall level, we can see this in the log.

Below points I would like to mention about application 
1. This connection interruption happens in day time, when transactions are little bit high. In day time, average transactions per second are 5 (Inserts and processing). 
2. We are not using connection pool, so each time request comes app server creates new connection to db server and when processing is done, app server disconnects. 

We are now clue less why secondary server resetting the packets.  Any help is highly appreciated. 

Thanks & Regards,
Ganesh.




On Thu, Jun 28, 2018 at 3:38 PM Ganesh Korde <ganeshakorde@gmail.com> wrote:
Hi  Johannes,

  Thanks for your reply. We are using VPN Tunnel between these two hosts. I will check with network team, with remaining questions you mentioned and will get back.

Thanks  & Regards,
Ganesh.

On Wed, Jun 27, 2018 at 6:46 PM Johannes Truschnigg <johannes@truschnigg.info> wrote:
Hi Ganesh,


On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde wrote:
> [...]
> 1. Because of what reason, " unexpected EOF on standby connection" occurs
> on primary db server?
> 2. After replication disconnection, secondary should immediately connect to
> primary, but it takes some time, what could be the reason for this?

From skimming the log, it seems to me that there is an issue at the
socket/network level, which yields the "connection reset by peer" eror
message.

What is the network between these two hosts like? Is it a WAN link; is a VPN
or SSH tunnel involved? Do you have other, long-running TCP sessions between
these peers, and do they experience similar or other problems? Do the hosts'
link-layer stats hint at problems, e. g. packet loss? Do the hosts' kernels
leave a message hinting at L2 connectivity problems in their debug ringbuffers
(`dmesg`) at the time you observe the replication drop out?

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.

Hi Ganesh,

the logs you posted refer to timeouts in the connections.

Your configuration tells that in case of network drop, the standby server will be the first to acknowledge it. That's because wal_receiver_timeout is < than wal_sender_timeout

From the documentation:

wal_receiver_timeout (integer)

Terminate replication connections that are inactive longer than the specified number of milliseconds. This is useful for the receiving standby server to detect a primary node crash or network outage. A value of zero disables the timeout mechanism. This parameter can only be set in the postgresql.conf file or on the server command line. The default value is 60 seconds.


That might explain why your secondary server calls for a RST.

RST packages you posted are more a consequence, than a cause of your problem.

I think that the RST is sent to acknowledge master that the connection should be closed due to timeout.

What above goes together with the fact that the sequence number 1232664740 of the RST packet is retransmitted several times, meaning that it did not reach its destination at first.

I would look more carefully to your network because I suspect the real problem might be there.


regards,

fabio pardi




On 03/07/18 09:36, Ganesh Korde wrote:
Hi,

    After analysis by network team, they found packets are getting reset by Secondary server. Below are the logs.

782.822280 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822310 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822313 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822315 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822317 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822319 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822345 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740


But they didn't able to find why secondary generating reset packet. There are no any devices between these servers which can modify the packets.
Though both servers are on different firewall, but packets are getting reset at secondary server and not at the firewall level, we can see this in the log.

Below points I would like to mention about application 
1. This connection interruption happens in day time, when transactions are little bit high. In day time, average transactions per second are 5 (Inserts and processing). 
2. We are not using connection pool, so each time request comes app server creates new connection to db server and when processing is done, app server disconnects. 

We are now clue less why secondary server resetting the packets.  Any help is highly appreciated. 

Thanks & Regards,
Ganesh.




On Thu, Jun 28, 2018 at 3:38 PM Ganesh Korde <ganeshakorde@gmail.com> wrote:
Hi  Johannes,

  Thanks for your reply. We are using VPN Tunnel between these two hosts. I will check with network team, with remaining questions you mentioned and will get back.

Thanks  & Regards,
Ganesh.

On Wed, Jun 27, 2018 at 6:46 PM Johannes Truschnigg <johannes@truschnigg.info> wrote:
Hi Ganesh,


On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde wrote:
> [...]
> 1. Because of what reason, " unexpected EOF on standby connection" occurs
> on primary db server?
> 2. After replication disconnection, secondary should immediately connect to
> primary, but it takes some time, what could be the reason for this?

From skimming the log, it seems to me that there is an issue at the
socket/network level, which yields the "connection reset by peer" eror
message.

What is the network between these two hosts like? Is it a WAN link; is a VPN
or SSH tunnel involved? Do you have other, long-running TCP sessions between
these peers, and do they experience similar or other problems? Do the hosts'
link-layer stats hint at problems, e. g. packet loss? Do the hosts' kernels
leave a message hinting at L2 connectivity problems in their debug ringbuffers
(`dmesg`) at the time you observe the replication drop out?

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.

Hi,
     
   Finally issue has been resolved and issue was with network. There were two issues as below.

1. We have two links between primary and secondary, VPN tunnel and  L2TP.  Multiple routes were configured for VPN and  L2TP from the Secondary to                 primary. Tunnel and link was always up.  But, ss the source from Secondary is coming through tunnel  towards Primary the connection was getting dropped         after  reaching the destination due to ambiguity on routes between L2TP and VPN tunnel. This issue has been resolved by allowing access via VPN tunnel           removing  L2TP route.

2. After fixing the above, still disconnection was happening, but after specific time interval. It was due to certain negotiation time set IPSEC tunnel.
So they have enabled auto negotiation on the IPSEC tunnel so it won’t wait from tunnel initiation from other end.

So issues related disconnection has been resolved.

Thanks Johannes and Fabio for your help.

Regards,
Ganesh.

On Tue, Jul 3, 2018 at 5:37 PM Fabio Pardi <f.pardi@portavita.eu> wrote:

Hi Ganesh,

the logs you posted refer to timeouts in the connections.

Your configuration tells that in case of network drop, the standby server will be the first to acknowledge it. That's because wal_receiver_timeout is < than wal_sender_timeout

From the documentation:

wal_receiver_timeout (integer)

Terminate replication connections that are inactive longer than the specified number of milliseconds. This is useful for the receiving standby server to detect a primary node crash or network outage. A value of zero disables the timeout mechanism. This parameter can only be set in the postgresql.conf file or on the server command line. The default value is 60 seconds.


That might explain why your secondary server calls for a RST.

RST packages you posted are more a consequence, than a cause of your problem.

I think that the RST is sent to acknowledge master that the connection should be closed due to timeout.

What above goes together with the fact that the sequence number 1232664740 of the RST packet is retransmitted several times, meaning that it did not reach its destination at first.

I would look more carefully to your network because I suspect the real problem might be there.


regards,

fabio pardi




On 03/07/18 09:36, Ganesh Korde wrote:
Hi,

    After analysis by network team, they found packets are getting reset by Secondary server. Below are the logs.

782.822280 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822310 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822313 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822315 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822317 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822319 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740

782.822345 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740


But they didn't able to find why secondary generating reset packet. There are no any devices between these servers which can modify the packets.
Though both servers are on different firewall, but packets are getting reset at secondary server and not at the firewall level, we can see this in the log.

Below points I would like to mention about application 
1. This connection interruption happens in day time, when transactions are little bit high. In day time, average transactions per second are 5 (Inserts and processing). 
2. We are not using connection pool, so each time request comes app server creates new connection to db server and when processing is done, app server disconnects. 

We are now clue less why secondary server resetting the packets.  Any help is highly appreciated. 

Thanks & Regards,
Ganesh.




On Thu, Jun 28, 2018 at 3:38 PM Ganesh Korde <ganeshakorde@gmail.com> wrote:
Hi  Johannes,

  Thanks for your reply. We are using VPN Tunnel between these two hosts. I will check with network team, with remaining questions you mentioned and will get back.

Thanks  & Regards,
Ganesh.

On Wed, Jun 27, 2018 at 6:46 PM Johannes Truschnigg <johannes@truschnigg.info> wrote:
Hi Ganesh,


On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde wrote:
> [...]
> 1. Because of what reason, " unexpected EOF on standby connection" occurs
> on primary db server?
> 2. After replication disconnection, secondary should immediately connect to
> primary, but it takes some time, what could be the reason for this?

From skimming the log, it seems to me that there is an issue at the
socket/network level, which yields the "connection reset by peer" eror
message.

What is the network between these two hosts like? Is it a WAN link; is a VPN
or SSH tunnel involved? Do you have other, long-running TCP sessions between
these peers, and do they experience similar or other problems? Do the hosts'
link-layer stats hint at problems, e. g. packet loss? Do the hosts' kernels
leave a message hinting at L2 connectivity problems in their debug ringbuffers
(`dmesg`) at the time you observe the replication drop out?

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.

Yeah, well done!

Thanks for letting us know and glad my tips helped!

regards,

fabio pardi

On 07/17/2018 10:10 AM, Ganesh Korde wrote:
> Hi,
>      
>    Finally issue has been resolved and issue was with network. There
> were two issues as below.
> 
> 1. We have two links between primary and secondary, VPN tunnel and 
> L2TP.  Multiple routes were configured for VPN and  L2TP from the
> Secondary to                 primary. Tunnel and link was always up. 
> But, ss the source from Secondary is coming through tunnel  towards
> Primary the connection was getting dropped         after  reaching the
> destination due to ambiguity on routes between L2TP and VPN tunnel. This
> issue has been resolved by allowing access via VPN tunnel         
>  removing  L2TP route.
> 
> 2. After fixing the above, still disconnection was happening, but after
> specific time interval. It was due to certain negotiation time set IPSEC
> tunnel.
> So they have enabled auto negotiation on the IPSEC tunnel so it won’t
> wait from tunnel initiation from other end.
> 
> So issues related disconnection has been resolved.
> 
> Thanks Johannes and Fabio for your help.
> 
> Regards,
> Ganesh.
> 
> On Tue, Jul 3, 2018 at 5:37 PM Fabio Pardi <f.pardi@portavita.eu
> <mailto:f.pardi@portavita.eu>> wrote:
> 
>     Hi Ganesh,
> 
>     the logs you posted refer to timeouts in the connections.
> 
>     Your configuration tells that in case of network drop, the standby
>     server will be the first to acknowledge it. That's because
>     wal_receiver_timeout is < than wal_sender_timeout
> 
>     From the documentation:
> 
>     |wal_receiver_timeout| (|integer|)
> 
>         Terminate replication connections that are inactive longer than
>         the specified number of milliseconds. This is useful for the
>         receiving standby server to detect a primary node crash or
>         network outage. A value of zero disables the timeout mechanism.
>         This parameter can only be set in the |postgresql.conf| file or
>         on the server command line. The default value is 60 seconds.
> 
> 
>     That might explain why your secondary server calls for a RST.
> 
>     RST packages you posted are more a consequence, than a cause of your
>     problem.
> 
>     I think that the RST is sent to acknowledge master that the
>     connection should be closed due to timeout.
> 
>     What above goes together with the fact that the sequence number
>     1232664740 of the RST packet is retransmitted several times, meaning
>     that it did not reach its destination at first.
> 
>     I would look more carefully to your network because I suspect the
>     real problem might be there.
> 
> 
>     regards,
> 
>     fabio pardi
> 
> 
> 
> 
>     On 03/07/18 09:36, Ganesh Korde wrote:
>>     Hi,
>>
>>         After analysis by network team, they found packets are getting
>>     reset by Secondary server. Below are the logs.
>>
>>     782.822280 port7 in <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822310 wan2 out <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822313 port7 in <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822315 wan2 out <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822317 port7 in <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822319 wan2 out <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822345 port7 in <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>
>>     But they didn't able to find why secondary generating reset
>>     packet. There are no any devices between these servers which can
>>     modify the packets.
>>     Though both servers are on different firewall, but packets are
>>     getting reset at secondary server and not at the firewall level,
>>     we can see this in the log.
>>
>>     Below points I would like to mention about application 
>>     1. This connection interruption happens in day time, when
>>     transactions are little bit high. In day time, average
>>     transactions per second are 5 (Inserts and processing). 
>>     2. We are not using connection pool, so each time request comes
>>     app server creates new connection to db server and when processing
>>     is done, app server disconnects. 
>>
>>     We are now clue less why secondary server resetting the packets. 
>>     Any help is highly appreciated. 
>>
>>     Thanks & Regards,
>>     Ganesh.
>>
>>
>>
>>
>>     On Thu, Jun 28, 2018 at 3:38 PM Ganesh Korde
>>     <ganeshakorde@gmail.com <mailto:ganeshakorde@gmail.com>> wrote:
>>
>>         Hi  Johannes,
>>
>>           Thanks for your reply. We are using VPN Tunnel between these
>>         two hosts. I will check with network team, with remaining
>>         questions you mentioned and will get back.
>>
>>         Thanks  & Regards,
>>         Ganesh.
>>
>>         On Wed, Jun 27, 2018 at 6:46 PM Johannes Truschnigg
>>         <johannes@truschnigg.info <mailto:johannes@truschnigg.info>>
>>         wrote:
>>
>>             Hi Ganesh,
>>
>>
>>             On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde wrote:
>>             > [...]
>>             > 1. Because of what reason, " unexpected EOF on standby
>>             connection" occurs
>>             > on primary db server?
>>             > 2. After replication disconnection, secondary should
>>             immediately connect to
>>             > primary, but it takes some time, what could be the
>>             reason for this?
>>
>>             From skimming the log, it seems to me that there is an
>>             issue at the
>>             socket/network level, which yields the "connection reset
>>             by peer" eror
>>             message.
>>
>>             What is the network between these two hosts like? Is it a
>>             WAN link; is a VPN
>>             or SSH tunnel involved? Do you have other, long-running
>>             TCP sessions between
>>             these peers, and do they experience similar or other
>>             problems? Do the hosts'
>>             link-layer stats hint at problems, e. g. packet loss? Do
>>             the hosts' kernels
>>             leave a message hinting at L2 connectivity problems in
>>             their debug ringbuffers
>>             (`dmesg`) at the time you observe the replication drop out?
>>
>>             -- 
>>             with best regards:
>>             - Johannes Truschnigg ( johannes@truschnigg.info
>>             <mailto:johannes@truschnigg.info> )
>>
>>             www:   https://johannes.truschnigg.info/
>>             phone: +43 650 2 133337
>>             xmpp:  johannes@truschnigg.info
>>             <mailto:johannes@truschnigg.info>
>>
>>             Please do not bother me with HTML-email or attachments.
>>             Thank you.
>>
>