Thread: Logical WAL receiver (background worker) not detecting when publisher node has died

From: Achilleas Mantzios

Dear List,
Coming back from:
https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com

and having got absolutely no helpful answer from our infrastructure
people, I would like to ask:

Is it at all possible that the primary (publisher node) has crashed
while on the subscriber node the logical WAL receiver goes on happily
as if there were no problem at all: no messages in the log, no timeouts,
acting as if nothing had happened?

(In the meantime, the second standby instantly detected the crash of the
primary and immediately started reconnection attempts.)




> On Nov 23, 2018, at 12:20 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
> Dear List,
> Coming back from: https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com
>
> and having got absolutely no helpful answer from our infrastructure people, I would like to ask:
>
> Is it at all possible that the primary (publisher node) has crashed while on the subscriber node the logical WAL
> receiver goes on happily as if there were no problem at all: no messages in the log, no timeouts, acting as if
> nothing had happened?
>
> (In the meantime, the second standby instantly detected the crash of the primary and immediately started
> reconnection attempts.)
>

Not that I can think of without it being a bug.

If it happens again, you can try killing the WAL receiver session via Postgres and, if that fails, using tcpkill to
terminate the session.

It would be good to know the actual cause, though, and to collect as much information as possible before terminating
the session.

Interesting, but not sure if it’s related: https://www.evanjones.ca/tcp-stuck-connection-mystery.html
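
As a rough illustration of the suggestion above (collect information first, then terminate the worker from within
Postgres), here is a minimal Python sketch. It assumes PostgreSQL 10 or later on the subscriber, where the
pg_stat_subscription view exposes the worker pid and last_msg_receipt_time, and it assumes a DB-API connection such
as one from psycopg2 (the %s placeholder style is that driver's). The function names, the five-minute threshold and
the dry_run flag are hypothetical choices for illustration, not something prescribed in this thread.

```python
from datetime import datetime, timedelta, timezone

# Real view and function in PostgreSQL 10+; run both on the subscriber.
INSPECT_SQL = """
SELECT subname, pid, received_lsn, last_msg_send_time, last_msg_receipt_time
FROM pg_stat_subscription
"""
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"


def stale_workers(rows, now, max_silence):
    """Return (subname, pid) for running workers that have been silent too long.

    rows: iterable of (subname, pid, last_msg_receipt_time).
    A pid of None means no worker is running for that subscription.
    """
    stale = []
    for subname, pid, last_receipt in rows:
        if pid is None:
            continue  # nothing to terminate
        if last_receipt is None or now - last_receipt > max_silence:
            stale.append((subname, pid))
    return stale


def check_and_terminate(conn, max_silence=timedelta(minutes=5), dry_run=True):
    """Log the state of every subscription worker, then optionally terminate stale ones.

    Assumption: the publisher normally sends data or keepalive messages more
    often than max_silence, so a longer silence is suspicious.
    """
    cur = conn.cursor()
    cur.execute(INSPECT_SQL)
    full_rows = cur.fetchall()
    for row in full_rows:
        print("subscription state:", row)  # keep this output for later analysis
    rows = [(r[0], r[1], r[4]) for r in full_rows]
    victims = stale_workers(rows, datetime.now(timezone.utc), max_silence)
    for subname, pid in victims:
        print(f"worker for {subname} (pid {pid}) looks stuck")
        if not dry_run:
            cur.execute(TERMINATE_SQL, (pid,))
    return victims
```

With dry_run left at True the function only reports; the subscription launcher is expected to start a fresh worker
after a real termination, which then has to reconnect to the publisher.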





> On Nov 23, 2018, at 8:00 PM, Rui DeSousa <rui@crazybean.net> wrote:
>
>
>
>> On Nov 23, 2018, at 12:20 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>
>> Dear List,
>> Coming back from: https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com
>>
>> and having got absolutely no helpful answer from our infrastructure people, I would like to ask:
>>
>> Is it at all possible that the primary (publisher node) has crashed while on the subscriber node the logical WAL
>> receiver goes on happily as if there were no problem at all: no messages in the log, no timeouts, acting as if
>> nothing had happened?
>>
>> (In the meantime, the second standby instantly detected the crash of the primary and immediately started
>> reconnection attempts.)
>>
>
> Not that I can think of without it being a bug.
>
> If it happens again, you can try killing the WAL receiver session via Postgres and, if that fails, using tcpkill
> to terminate the session.
>
> It would be good to know the actual cause, though, and to collect as much information as possible before
> terminating the session.
>
> Interesting, but not sure if it’s related: https://www.evanjones.ca/tcp-stuck-connection-mystery.html
>
>
>


Same problem but no solution; keepalive not working.

https://superuser.com/questions/1021988/connection-remains-flagged-as-established-even-if-host-is-unconnected
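
On the keepalive point, here is a minimal Python sketch of how TCP keepalive could be requested for the subscriber's
connection through the standard libpq parameters keepalives, keepalives_idle, keepalives_interval and
keepalives_count. The host and database names and the numeric values are placeholders, and the helper names are
hypothetical. The detection-time formula is the usual approximation for an idle connection; it assumes the operating
system honors these settings, which, as the linked report suggests, is not guaranteed in every situation.

```python
def keepalive_conninfo(base, idle=30, interval=10, count=3):
    """Append libpq TCP keepalive parameters to an existing conninfo string.

    idle:     seconds of inactivity before the first keepalive probe
    interval: seconds between unanswered probes
    count:    unanswered probes allowed before the connection is considered dead
    """
    return (
        f"{base} keepalives=1 keepalives_idle={idle} "
        f"keepalives_interval={interval} keepalives_count={count}"
    )


def worst_case_detection_seconds(idle, interval, count):
    """Approximate time for an idle connection to be declared dead."""
    return idle + interval * count


# Placeholder host and database names, for illustration only.
conninfo = keepalive_conninfo("host=publisher.example dbname=appdb")
print(conninfo)
print(worst_case_detection_seconds(30, 10, 3), "seconds")
```

A string like this could be tried as the CONNECTION value of the subscription (for example via ALTER SUBSCRIPTION),
so that a dead publisher is noticed at the TCP level even if no replication traffic is flowing.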