Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" - Mailing list pgsql-admin

From Rui DeSousa
Subject Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Date
Msg-id 946FBD41-7BAF-4414-8FB9-C0F16A0680F6@crazybean.net
In response to Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
Responses Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-admin

> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
> Our sysadms (seasoned linux/network guys: we have been working here for more than 10 yrs) were adamant that we run
> no firewall or other traffic shaping system between the two hosts (if we did, the problem would have manifested
> itself earlier). Can you recommend what to look for exactly regarding both TCP stacks? The subscriber node is a
> clone of the primary. We have:
>
> # sysctl -a | grep -i keepaliv
> net.ipv4.tcp_keepalive_intvl = 75
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_time = 7200
>

Those keepalive settings are Linux's defaults and work out to roughly 2 hours and 11 minutes before an abandoned
connection is dropped: 7200 s of idle time before the first probe, then 9 probes at 75 s intervals, i.e. 7200 + 9 × 75
= 7875 s. So, the WAL receiver should have corrected itself after that time. For reference, I terminate abandoned
sessions within 15 mins, as they take up valuable database resources and could potentially hold on to locks,
snapshots, etc.
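To get the OS to give up on a dead peer in about 15 minutes, settings along these lines would do it (illustrative
values, not a tuned recommendation; 600 s of idle plus 10 probes at 30 s intervals is 900 s, i.e. 15 minutes):

# Illustrative Linux sysctls: detect a dead peer in ~15 minutes
# 600 s idle before the first probe, then 10 probes 30 s apart
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=10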

I haven’t used Postgres’s keepalive settings, as I find the OS handles it just fine.
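(For completeness: Postgres does expose the same knobs per connection. A sketch of the postgresql.conf equivalents of
the illustrative values above, untested on my side:)

# Hypothetical postgresql.conf entries mirroring the sysctls above
tcp_keepalives_idle = 600       # seconds of idle before the first probe
tcp_keepalives_interval = 30    # seconds between probes
tcp_keepalives_count = 10       # lost probes before the connection is dropped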


> Also, in addition to what you say (netstat, tcpdump), if I detect such a case (even with the primary panic'ed --
> yeah, this would take quite some nerves) I will connect with gdb and take a stack trace to know what the worker is
> doing and why it doesn't restart.
>
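
On the gdb idea: attaching and grabbing a backtrace is quick, and safe enough for a process that is already stuck.
Something like this, where the pid is whatever ps shows for the stuck worker:

gdb -p 12345          # hypothetical pid of the stuck replication worker
(gdb) bt              # print the stack trace
(gdb) detach
(gdb) quit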

Do a netstat -an; that will show you all the network connections and their current state. If you do this on both
systems you should find corresponding entries/states for the replication stream.

e.g. sample output:

Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp4       0      0  10.10.10.1.50604       10.10.10.2.5432        ESTABLISHED


In your case, what I think might have happened is that the upstream server either had no entry at all, or had one in
one of the FIN states, while the downstream server still had an ESTABLISHED connection with a growing Send-Q backlog.
If the servers were still communicating, the upstream server would have responded with a reset (RST) packet, forcing
the downstream session to terminate.
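
Purely illustrative (reusing the addresses from the sample above), the asymmetry would look something like this:

# On the downstream (10.10.10.1): the session looks alive, but nothing is draining
tcp4       0  63712  10.10.10.1.50604       10.10.10.2.5432        ESTABLISHED

# On the upstream (10.10.10.2): no matching entry at all, or a half-closed leftover like
tcp4       0      0  10.10.10.2.5432        10.10.10.1.50604       FIN_WAIT_2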

Using the root account, you could have seen what was transpiring on the given connection, e.g.:

tcpdump -i eth0 port 50604
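
If you want to keep the evidence around for later analysis, a variant along these lines (interface name assumed)
resolves nothing and writes the raw capture to a file you can open in Wireshark:

tcpdump -nn -i eth0 -w repl.pcap host 10.10.10.2 and port 5432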



