Home > mailing lists

BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves) - Mailing list pgsql-hackers

From	Jehan-Guillaume de Rorthais
Subject	BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
Date	September 18, 2014 16:08:01
Msg-id	20140918180734.361021e1@erg Whole thread
In response to	Re: .ready files appearing on slaves (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses	Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
List	pgsql-hackers

Tree view

Hi hackers,

We spend some time with Guillaume Lelarge studying this issue.

CreateRestartPoint() calls RemoveOldXlogFiles() to drop/recycle old WALs. This
one is calling XLogArchiveCheckDone() to check if the given WAL can be dropped.
As our slave has "archive_mode" & "archive_command" set, XLogArchivingActive()
returns true, forcing XLogArchiveCheckDone() to look for a ".done" file. If the
corresponding ".done" file doesn't exist, XLogArchiveCheckDone() calls
XLogArchiveNotify() to actually create the ".ready" file.

Now, on a standby, we are supposed to create the ".done" files by calling
XLogArchiveForceDone() either after a successful restore_command or when the
streaming replication finishes and flushes a WAL. It seems like the root cause
of this trouble: maybe a race condition where XLogArchiveForceDone() would not
be called for a WAL. As the ".done" file is never created, we end up calling
XLogArchiveNotify() to create a ".ready" and this WAL will never be archived by
the slave, sitting in the pg_xlog dir.

But worst, we experience this under two 9.2.8 clusters and realized
this version is not supposed to have 256 segments per WAL, finishing a WAL to
the *FE one. I'm quoting back the history of my previous mail as we clearly
have some *FF files on the slave side!

> >   000000010000040E000000FF
> >   0000000100000414000000DA
> >   000000010000046E000000FF
> >   0000000100000470000000FF
> >   00000001000004850000000F
> >   000000010000048500000010
> >   [...normal WAL sequence...]
> >   000000010000048500000052

Note that it seems to happen only during streaming replication, not in pure Log
Shipping replication. We added the debug message on the master side to make
sure it never archive *FF file

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Regards,

On Mon, 15 Sep 2014 17:37:24 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> Hi hackers,
> 
> An issue that seems related to this has been posted on pgsql-admin. See:
> 
>   http://www.postgresql.org/message-id/CAAS3tyLnXaYDZ0+zhXLPdVtOvHQOvR+jSPhp30o8kvWqQs0Tqw@mail.gmail.com
> 
> How can we help on this issue?
> 
> Cheers,
> 
> On Thu, 4 Sep 2014 17:50:36 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> 
> > Hi hackers,
> > 
> > Since few months, we occasionally see .ready files appearing on some slave
> >  instances from various context. The two I have in mind are under 9.2.x.
> > 
> > I tried to investigate a bit. These .ready files are created when a WAL file
> > from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
> > easily experience this by deleting such a file: it is created again at the
> > next restartpoint or checkpoint received from the master.
> > 
> > Looking at the WAL in pg_xlog folder corresponding to these .ready files,
> > they are all much older than the current WAL "cycle" in both mtime and name
> > logic sequence. As instance on one of these box we have currently 6 of
> > those "ghost" WALs:
> > 
> >   0000000200001E53000000FF
> >   0000000200001F18000000FF
> >   0000000200002047000000FF
> >   00000002000020BF000000FF
> >   0000000200002140000000FF
> >   0000000200002370000000FF
> >   000000020000255D000000A8
> >   000000020000255D000000A9
> >   [...normal WAL sequence...]
> >   000000020000255E0000009D
> > 
> > And on another box:
> > 
> >   000000010000040E000000FF
> >   0000000100000414000000DA
> >   000000010000046E000000FF
> >   0000000100000470000000FF
> >   00000001000004850000000F
> >   000000010000048500000010
> >   [...normal WAL sequence...]
> >   000000010000048500000052
> > 
> > So it seems for some reasons, these old WALs were "forgotten" by the
> > restartpoint mechanism when they should have been recylced/deleted. 
> > 
> > For one of these servers, I could correlate this with some brutal
> > disconnection of the streaming replication appearing in its logs. But there
> > was no known SR disconnection on the second one.
> > 
> > Any idea about this weird behaviour? What can we do to help you investigate
> > further?
> > 
> > Regards,
> 
> 
> 

-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

pgsql-hackers by date:

From: Michael Paquier
Date: 18 September 2014, 15:21:16
Subject: Re: Options OUTPUT_PLUGIN_* controlling format are confusing (Was: Misleading error message in logical decoding)

From: Josh Berkus
Date: 18 September 2014, 16:54:02
Subject: Re: jsonb format is pessimal for toast compression

BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves) - Mailing list pgsql-hackers

Previous

Next