Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves) - Mailing list pgsql-hackers

From Dennis Kögel
Subject Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
Msg-id 8C204AC4-83F7-4B59-ABA6-4D9FE27670FF@neveragain.de
In response to Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
Hi,

On 04.09.2014 at 17:50, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> Since few months, we occasionally see .ready files appearing on some slave
> instances from various context. The two I have in mind are under 9.2.x. […]
> So it seems for some reasons, these old WALs were "forgotten" by the
> restartpoint mechanism when they should have been recylced/deleted.

On 08.10.2014 at 11:54, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> 1. Where do the FF files come from? In 9.2, FF-segments are not supposed to created, ever. […]
> 2. Why are the .done files sometimes not being created?



We’ve encountered behaviour which seems to match what has been described here: on Streaming Replication slaves, there
is an odd piling up of old WALs and .ready files in pg_xlog, going back several months.

The fine people on IRC have pointed me to this thread, and have encouraged me to revive it with our observations, so
here we go:

Environment:

Master,      9.2.9
|- Slave S1, 9.2.9, on the same network as the master
'- Slave S2, 9.2.9, some 100 km away (occasional network hiccups; *not* a cascading replication)

wal_keep_segments M=100 S1=100 S2=30
checkpoint_segments M=100 S1=30 S2=30
wal_level hot_standby (all)
archive_mode on (all)
archive_command on both slaves: /bin/true
archive_timeout 600s (all)


- On both slaves, we have "ghost" WALs and corresponding .ready files (currently >600 of each on S2, slowly becoming a
disk space problem)

- There are always gaps in the ghost WAL names, often roughly 0x20, but not always

- The slave with the "bad" network link has significantly more of these files, which suggests that disturbances of the
Streaming Replication increase the chances of triggering this bug; OTOH, the presence of a name gap pattern suggests the
opposite

- We observe files named *FF as well
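
For anyone wanting to check their own pg_xlog for the same pattern, here is a rough Python sketch of how the name gaps and the *FF files could be picked out of a directory listing. It is an illustration, not what we ran. It assumes the default 16 MB segment size, the usual 24-hex-digit WAL file name (timeline, log id, segment), and the pre-9.3 layout in which each log id holds segments 00..FE. All function names are made up for this example.

```python
import re

# A WAL segment file name is 24 hex digits: timeline, log id, segment (8 each).
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def parse_wal_name(name):
    """Split a WAL file name into (timeline, log, seg) integers."""
    if not WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def is_ff_segment(name):
    """True if the segment part is 0xFF, which 9.2 is not supposed to create."""
    return parse_wal_name(name)[2] == 0xFF


def linear_position(name, segs_per_log=0xFF):
    """Map (log, seg) to one counter. Assumes segments 00..FE per log id (pre-9.3)."""
    _, log, seg = parse_wal_name(name)
    return log * segs_per_log + seg


def name_gaps(names):
    """Distances between consecutive segment names on a single timeline.

    *FF names are skipped because they have no place in the pre-9.3 numbering.
    """
    positions = sorted(linear_position(n) for n in names if not is_ff_segment(n))
    return [b - a for a, b in zip(positions, positions[1:])]
```

Feeding it the names of the leftover WALs from one slave gives the list of gaps; a run of values around 32 would correspond to the 0x20 pattern mentioned above.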


As you can see in the directory listings below, this setup is *very* low traffic, which may explain the pattern in WAL
name gaps (?).

I’ve listed the entries by time, expecting to easily match WALs to their .ready files.
There is sometimes an interesting delay between the WAL’s mtime and that of its .ready file, especially for *FF, where there are
several days between the WAL and the .ready file.
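
The mtime comparison can also be done mechanically. The following is a minimal Python sketch, assuming the .ready files live in pg_xlog/archive_status and are named after their segment; `ready_delays` is a hypothetical helper, not an existing tool.

```python
import os


def ready_delays(pg_xlog):
    """Return (wal_name, seconds) pairs: .ready mtime minus WAL mtime."""
    status_dir = os.path.join(pg_xlog, "archive_status")
    delays = []
    for entry in sorted(os.listdir(status_dir)):
        if not entry.endswith(".ready"):
            continue
        wal_name = entry[: -len(".ready")]
        wal_path = os.path.join(pg_xlog, wal_name)
        if not os.path.exists(wal_path):
            continue  # .ready file without a segment; not considered here
        ready_mtime = os.path.getmtime(os.path.join(status_dir, entry))
        delays.append((wal_name, ready_mtime - os.path.getmtime(wal_path)))
    return delays
```

Sorting the result by the second element puts the segments with the longest delay last, which is where I would expect the *FF files to show up.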

- Master:   http://pgsql.privatepaste.com/52ad612dfb
- Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
- Slave S2: http://pgsql.privatepaste.com/a693a8d7f4


I’ve only skimmed through the thread; my understanding is that there were several patches floating around, but nothing
was committed.
If there’s any way I can help, please let me know.

- D.

