Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" - Mailing list pgsql-admin

From Rui DeSousa
Subject Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Msg-id DEE4392D-7063-4310-BFC6-EEE791D2D006@crazybean.net
In response to Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" (Rui DeSousa <rui@crazybean.net>)
List pgsql-admin


On Nov 17, 2018, at 3:47 PM, Rui DeSousa <rui@crazybean.net> wrote:



On Nov 17, 2018, at 6:07 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

You may read the PostgreSQL backend sources (grep for SO_KEEPALIVE), the code supports KEEPALIVE.


Postgres supports it, but the question is whether it is actually on for the given connection.
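To illustrate why "supported" and "enabled" are separate questions: SO_KEEPALIVE is a per-socket option that the program has to set itself. The following is a minimal sketch, assuming Python 3 on Linux; the helper name keepalive_enabled is hypothetical and this is not how Postgres itself does the check.

```python
import socket


def keepalive_enabled(sock: socket.socket) -> bool:
    """Return True if SO_KEEPALIVE is set on this particular socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


# A freshly created TCP socket has keepalive off until the program asks for it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    before = keepalive_enabled(s)
    # Enable keepalive for this one socket only; other sockets are unaffected.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    after = keepalive_enabled(s)
    print("before:", before, "after:", after)
```

The kernel tunables only control the timing of the probes; they have no effect on a socket where the option was never switched on.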


I checked on a bare-minimum default installation (after tweaking the kernel tunables to smaller values, of course): keepalive messages are sent and ACK'ed at the specified intervals, as verified with Wireshark on port 5432. You should test this yourself.



I just configured Postgres with streaming replication using the following versions, and TCP keepalive was enabled by default for the WAL receiver connection and also for psql connections.

Linux debian 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
PostgreSQL 10.6 (Debian 10.6-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

root@debian:~# netstat -anp --timers | grep -e Timer -e  EST | grep -e Timer -e 5432
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 10.6.6.101:47546        10.6.6.100:5432         ESTABLISHED 989/telnet           off (0.00/0/0)
tcp        0      0 10.6.6.101:47544        10.6.6.100:5432         ESTABLISHED 953/psql             keepalive (7103.36/0/0)
tcp        0      0 10.6.6.101:47542        10.6.6.100:5432         ESTABLISHED 922/postgres: 10/ma  keepalive (7088.03/0/0)


As you can see from the output above, telnet does not enable keepalive on its connection. I would run the same netstat command on the troubled system to verify that keepalive is in fact enabled on the WAL receiver connection.
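If there are many connections to check, the same inspection can be scripted. This is a rough sketch, assuming the Linux `netstat -anp --timers` column layout shown above; the function name connections_without_keepalive is hypothetical, and SAMPLE is simply the output pasted from my test box.

```python
from typing import List

# Output copied from the netstat run above.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 10.6.6.101:47546        10.6.6.100:5432         ESTABLISHED 989/telnet           off (0.00/0/0)
tcp        0      0 10.6.6.101:47544        10.6.6.100:5432         ESTABLISHED 953/psql             keepalive (7103.36/0/0)
tcp        0      0 10.6.6.101:47542        10.6.6.100:5432         ESTABLISHED 922/postgres: 10/ma  keepalive (7088.03/0/0)
"""


def connections_without_keepalive(netstat_output: str, port: int = 5432) -> List[str]:
    """List programs holding ESTABLISHED connections on `port` whose timer is not 'keepalive'."""
    suffix = ":" + str(port)
    flagged = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected layout: proto, recv-q, send-q, local, foreign, state,
        # program (may contain spaces), timer state, timer values.
        if len(fields) < 9 or not fields[0].startswith("tcp"):
            continue  # skips the header line and anything malformed
        local, foreign, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        if not (local.endswith(suffix) or foreign.endswith(suffix)):
            continue
        timer_state = fields[-2]
        program = " ".join(fields[6:-2])
        # Caveat: a busy connection can briefly show "on" (retransmit timer)
        # even with keepalive enabled, so re-check anything flagged here.
        if timer_state != "keepalive":
            flagged.append(program)
    return flagged


print(connections_without_keepalive(SAMPLE))
```

On the sample above only the telnet session is flagged; the psql and WAL receiver connections both show a keepalive timer.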

If it’s enabled, the connection should have terminated after the 18 hours, and hopefully sooner now with your new setting. I have no idea why it wouldn’t terminate and reconnect, unless TCP keepalive is off or there is a bug in Linux/Postgres.
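As a rough guide to how long a dead peer can go unnoticed: with keepalive on, Linux waits for the idle time, then sends probes at the probe interval, and drops the connection once the configured number of probes has gone unanswered. A small sketch of that arithmetic, assuming the stock Linux defaults (tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9); the function name is hypothetical and your tuned values will differ.

```python
def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate time for keepalive to declare an idle, unreachable peer dead."""
    # Idle period before the first probe, then `probes` unanswered probes
    # spaced `interval` seconds apart.
    return idle + interval * probes


# Stock Linux defaults; substitute the values from sysctl on your system.
LINUX_DEFAULTS = {"idle": 7200, "interval": 75, "probes": 9}

seconds = keepalive_detection_seconds(**LINUX_DEFAULTS)
print(seconds, "seconds, about", round(seconds / 60), "minutes")
```

With the defaults that works out to 7875 seconds, a little over two hours. The keepalive timers of roughly 7100 seconds in the netstat output above are consistent with the default 7200 second idle time counting down.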


 

I will also add that testing this configuration worked as expected. When I disconnected the primary node from the network, the Postgres processes were able to detect the fault, and the WAL receiver reconnected without issue as soon as the primary was back online.

What versions of Postgres and the OS are you using?
