Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" - Mailing list pgsql-admin

From Achilleas Mantzios
Subject Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Date
Msg-id e7581f2a-5c1b-a3be-7b22-88dea4c5aeeb@matrix.gatewaynet.com
In response to Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Rui DeSousa <rui@crazybean.net>)
Responses Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Rui DeSousa <rui@crazybean.net>)
Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Rui DeSousa <rui@crazybean.net>)
List pgsql-admin
On 15/11/18 4:48 p.m., Rui DeSousa wrote:


On Nov 15, 2018, at 2:20 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

On 14/11/18 6:24 μ.μ., Rui DeSousa wrote:

On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

Our sysadmins (seasoned Linux/network guys: we have been working here for more than 10 years) were adamant that we run no firewall or other traffic-shaping system between the two hosts (if we did, the problem would have manifested itself earlier). Can you recommend what exactly to look for regarding both TCP stacks? The subscriber node is a clone of the primary. We have:

# sysctl -a | grep -i keepaliv
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Those keepalive settings are Linux's defaults and work out to 18 hours before the abandoned connection is dropped. So the WAL receiver should have corrected itself after that time. For reference, I terminate abandoned sessions within 15 minutes, as they take up valuable database resources and could potentially hold on to locks, snapshots, etc.
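(A minimal sketch of what such a cleanup job could look like, run from cron; the 15-minute threshold matches what Rui describes above, but the psql invocation, database name and exact criteria are only an illustration, not his actual job:)

# Hypothetical cleanup: terminate client backends sitting idle in a transaction
# for more than 15 minutes.
psql -X -d postgres -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
    AND state_change < now() - interval '15 minutes'
    AND backend_type = 'client backend';"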

Didn't happen. Friday 15:01 + '18 hrs' = Saturday 09:01. The subscription got stuck on Friday at 15:01; unfortunately I didn't keep the whole output of pg_stat_subscription (which I took on Sunday morning). And on Sunday morning the primary, having run out of space, PANICed.

Apparently TCP keepalive is not enabled; it looks like we're missing the following, but don't go looking for it :). So the connection would stick around until it gets a reset packet, which could be days if not longer, especially if packets are being dropped. I know you say there is no firewall; but the upstream server would send a reset packet if the connection was terminated and/or Postgres was down, so there is something dropping packets.

net.inet.tcp.always_keepalive=1

This setting is from FreeBSD. I have tested changing it on my PostgreSQL 11.1 box running FreeBSD 11.2-RELEASE-p3, and it had no effect at all on the PostgreSQL settings; all three of them remained at zero. This is completely irrelevant to my problem, but anyway.
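(For reference, the three PostgreSQL-side knobs in question can be inspected like this; a value of 0 simply means "fall back to the operating system's defaults":)

psql -X -c "SELECT name, setting FROM pg_settings WHERE name LIKE 'tcp_keepalives%';"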


A quick Google search suggests that Linux defaults to not enabling keepalive, whereas FreeBSD enables it by default and globally, regardless of what the application requests. On Linux, Postgres will need to request it: you will need to set up the keepalive parameters in the Postgres configuration and restart the server.

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
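(For the archives: a sketch of what setting those parameters on the server side could look like. The values are made up and deliberately much more aggressive than the kernel's 7200/75/9; the default of 0 falls back to the OS settings.)

# Illustrative values only; pick intervals that suit your environment.
psql -X -c "ALTER SYSTEM SET tcp_keepalives_idle = 60;"
psql -X -c "ALTER SYSTEM SET tcp_keepalives_interval = 10;"
psql -X -c "ALTER SYSTEM SET tcp_keepalives_count = 5;"
psql -X -c "SELECT pg_reload_conf();"   # or the full restart Rui mentions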
So according to the official Linux docs, there are three parameters that govern TCP keepalive on Linux, and on both of the systems in question they are set as follows:
root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
root@TEST-smadb:/var/lib/pgsql#
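With those values, and assuming keepalive is actually requested on the socket and the usual reading of the three sysctls (idle time before the first probe, interval between probes, unanswered probes before giving up), a dead peer on an otherwise idle connection should be detected after roughly:

7200 s + 9 x 75 s = 7875 s, i.e. about 2 hours 11 minutes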


The keepalive setup will allow the WAL receiver to detect the broken connection, terminate the current connection, and attempt to establish a new one.

So from the looks of this, keepalive is enabled. (Also, don't confuse the WAL receiver with the logical replication worker; they are different programs, albeit similar.)
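For completeness, the knobs that should matter on the receiving side are, as far as I understand it, wal_receiver_timeout (which the logical apply worker also honours, 60 s by default) and the libpq keepalive options that can go straight into the subscription's connection string. A sketch with placeholder values (the subscription name 'sub', host 10.0.0.1, dbname and user are made up):

# Apply-side timeout; 0 would disable the check entirely.
psql -X -c "SHOW wal_receiver_timeout;"

# Hypothetical: request aggressive keepalives on the subscriber's outgoing connection.
psql -X -c "ALTER SUBSCRIPTION sub CONNECTION 'host=10.0.0.1 dbname=mydb user=repuser keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=5';"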

Does anyone have any other theories as to why the subscriber kept acting as if the publisher were still there, when it wasn't?
Is there any way (by network means?) to mock this behaviour, i.e. to fool the replication worker into believing the sender is still there?
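One way that occurs to me to mock it (on a test pair, obviously) would be to silently drop the publisher's traffic on the subscriber with iptables, so that no FIN or RST ever arrives, which is roughly what a vanished peer or a dropped firewall state looks like. The address 10.0.0.1 below is a placeholder for the publisher:

# On the subscriber: blackhole the replication connection in both directions.
iptables -A INPUT  -p tcp -s 10.0.0.1 --sport 5432 -j DROP
iptables -A OUTPUT -p tcp -d 10.0.0.1 --dport 5432 -j DROP

# Undo afterwards:
iptables -D INPUT  -p tcp -s 10.0.0.1 --sport 5432 -j DROP
iptables -D OUTPUT -p tcp -d 10.0.0.1 --dport 5432 -j DROP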





-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
