PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint - Mailing list pgsql-admin

From Achilleas Mantzios
Subject PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint
Date
Msg-id 0d25bb8c-9cc7-10bc-71ba-6a8a8c33ec49@matrix.gatewaynet.com
Responses Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint  (Rui DeSousa <rui@crazybean.net>)
Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint  (Laurenz Albe <laurenz.albe@cybertec.at>)
List pgsql-admin
On Sunday morning I started getting email warnings about running out of space on our pg_wal dir (which is a symlink to
a separate disk).

As the size was about to reach 100% of the disk (100GB) I shut down the database in order to investigate without the
pressure of a potentially worse situation, only to see that the restart solved the problem. I had not kept track of the
total number of files in pg_wal during the previous days, but normally they are about 550->600 files (see relevant
settings below). Since the checkpoint entries in the log only report the changes made by each checkpoint, I am not sure
how many files there were before the 85% threshold at which our monitoring system starts to send out alerts; but doing
the calculations, that 85% is reached when the number of files exceeds 5000.
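For reference, the segment count can also be polled from SQL with pg_ls_waldir() (available since 10); a rough sketch,
with the segment-name filter being my own approximation:

  SELECT count(*)                   AS wal_segments,
         pg_size_pretty(sum(size))  AS wal_bytes
  FROM pg_ls_waldir()
  WHERE name ~ '^[0-9A-F]{24}$';    -- plain 16MB segments only, skip .history/.backup files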
The removal of WAL files (according to checkpoint log entries) stopped some time late on Friday.

  [782] 5bbc6127.30e 2018-11-02 22:59:20.015 EET  @ line:5560 LOG: checkpoint complete: wrote 7758 buffers (0.7%); 0 WAL file(s) added, 5 removed, 21 recycled; write=777.496 s, sync=0.006 s, total=777.554 s; sync files=152, longest=0.002 s, average=0.000 s; distance=108549 kB, estimate=412619 kB
  [782] 5bbc6127.30e 2018-11-02 23:13:14.493 EET  @ line:5562 LOG: checkpoint complete: wrote 7104 buffers (0.7%); 0 WAL file(s) added, 13 removed, 18 recycled; write=712.311 s, sync=0.007 s, total=712.379 s; sync files=190, longest=0.001 s, average=0.000 s; distance=473908 kB, estimate=473908 kB
  [782] 5bbc6127.30e 2018-11-02 23:26:10.157 EET  @ line:5564 LOG: checkpoint complete: wrote 7623 buffers (0.7%); 0 WAL file(s) added, 0 removed, 20 recycled; write=638.089 s, sync=0.006 s, total=638.152 s; sync files=245, longest=0.006 s, average=0.000 s; distance=716742 kB, estimate=716742 kB
  [782] 5bbc6127.30e 2018-11-02 23:41:24.253 EET  @ line:5566 LOG: checkpoint complete: wrote 7277 buffers (0.7%); 0 WAL file(s) added, 0 removed, 0 recycled; write=652.984 s, sync=0.002 s, total=653.014 s; sync files=195, longest=0.001 s, average=0.000 s; distance=326408 kB, estimate=677709 kB
  [782] 5bbc6127.30e 2018-11-02 23:47:25.473 EET  @ line:5568 LOG: checkpoint complete: wrote 1138 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=114.098 s, sync=0.002 s, total=114.110 s; sync files=189, longest=0.002 s, average=0.000 s; distance=84617 kB, estimate=618400 kB

No deletions happened on Saturday, nor on Sunday up to the shutdown. Then the shutdown gave :
  [782] 5bbc6127.30e 2018-11-04 10:49:55.571 EET  @ line:5916 LOG: checkpoint starting: shutdown immediate
....
  [782] 5bbc6127.30e 2018-11-04 10:50:04.838 EET  @ line:5917 LOG: checkpoint complete: wrote 2206 buffers (0.2%); 0 WAL file(s) added, 5128 removed, 105 recycled; write=0.038 s, sync=0.023 s, total=9.268 s; sync files=129, longest=0.014 s, average=0.000 s; distance=261973 kB, estimate=620287 kB

The removal of those 5128 files freed 80GB of space (and the used space is still low as we speak, after about 180GB of
traffic since the incident).

Our current settings are :

wal_keep_segments = 512
max_wal_size = 2GB
min_wal_size = 1GB
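
For reference, with the default 16MB segment size these settings should bound pg_wal at roughly
wal_keep_segments * 16MB plus max_wal_size, i.e. about 10GB, nowhere near the ~80GB that accumulated. A rough sanity
check (a sketch, assuming the default segment size):

  SELECT pg_size_pretty(
           current_setting('wal_keep_segments')::bigint * 16 * 1024 * 1024
           + pg_size_bytes(current_setting('max_wal_size'))
         ) AS rough_expected_ceiling;   -- about 10GB with the settings above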

Our setup is as follows :

primary (smadb) <--> (no replication slot) physical hot standby (smadb2) (managed via repmgr) <--> (replication slot)
barman
                 ^--> (replication slot) logical subscriber (testsmadb)
                 ^--> wal archiving to host (sma) (via /usr/bin/rsync -a --delay-updates %p sma:/smadb/pgsql/pitr/%f )
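
With that topology the first suspects were the replication slots; a check roughly like the following (a sketch) shows
how much WAL each slot's restart_lsn forces the server to retain:

  SELECT slot_name, slot_type, active, restart_lsn,
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
  FROM pg_replication_slots;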

I checked for any errors in communications or delays, but could not find any: no ERRORs in the logs indicating anything
wrong with the archive command, and no abnormal lag in the physical standby, the logical subscriber or any replication
slot. Neither of those two servers had any outages. We have two alternate methods of backups, one old-style with
pg_start/stop_backup and one with barman. I checked that all backups were completed successfully with pg_stop_backup.
(barman does not interface with the primary)
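
Concretely, the archiver and streaming checks were roughly along these lines (a sketch, exact output not kept):

  -- archiver progress / failures
  SELECT archived_count, last_archived_wal, last_archived_time,
         failed_count, last_failed_wal, last_failed_time
  FROM pg_stat_archiver;

  -- lag of the physical standby and the logical subscriber walsenders
  SELECT application_name, state, sent_lsn, replay_lsn,
         write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;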

This is the first time we are facing this. No change has happened in the load or configuration since the deployment of
10 on 1st July, besides the upgrade to 10.5 (in September).

Any thoughts on this?

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt


