On 2013-12-29 02:48:21 -0500, Tom Lane wrote:
> 4. The server tries to start, and fails because it can't find a WAL file
> containing the last checkpoint record. This is pretty unsurprising given
> the facts above. The reason you don't see any "no such file" report is
> that XLogFileRead() will report any BasicOpenFile() failure *other than*
> ENOENT. And nothing else makes up for that.
>
> Re point 4: the logic, if you can call it that, in xlog.c and xlogreader.c
> is making my head spin. There are about four levels of overcomplicated
> and undercommented code before you ever get down to XLogFileRead, so I
> have no idea which level to blame for the lack of error reporting in this
> specific case. But there are pretty clearly some cases in which ignoring
> ENOENT in XLogFileRead isn't such a good idea, and XLogFileRead isn't
> being told when to do that or not.

Yes, that code is pretty horrid. In Heikki's and my defense, I don't
think the xlogreader.c split had much to do with it tho. I think the
path erroring out essentially is

ReadRecord()->XLogReadRecord()*->ReadPageInternal()*->XLogPageRead()
  ->WaitForWALToBecomeAvailable()->XLogFileReadAnyTLI()->XLogFileRead()

The functions marked with * are new, but it's really code that was in
ReadRecord() before. So I don't think too much has changed since 9.0ish,
although the timeline switch didn't make it simpler.

As far as I can tell XLogFileRead() actually is told when it's ok to
ignore an error - the notfoundOK parameter. It's just that we're always
passing true for it if we're not streaming...
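
Roughly, the behaviour boils down to something like the following. This
is a simplified standalone sketch, not the actual xlog.c code:
open_segment_sketch(), its notfound_ok argument and the reported
out-parameter are made-up names for illustration, and plain open()
stands in for BasicOpenFile().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the open step of XLogFileRead().
 * Returns a file descriptor, or -1 on failure.  A failure is reported
 * unless it is ENOENT and the caller said a missing file is acceptable.
 * *reported records whether a message was emitted (demo purposes only).
 */
static int
open_segment_sketch(const char *path, bool notfound_ok, bool *reported)
{
    int fd = open(path, O_RDONLY);
    int save_errno = errno;

    *reported = false;
    if (fd >= 0)
        return fd;

    if (save_errno != ENOENT || !notfound_ok)
    {
        fprintf(stderr, "could not open file \"%s\": %s\n",
                path, strerror(save_errno));
        *reported = true;
    }
    return -1;
}
```

With notfound_ok = true a missing segment fails silently, which matches
the "no such file" report never showing up.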

I think it might be sufficient to make passing that flag additionally
conditional on fetching_ckpt. That's already passed to
WaitForWALToBecomeAvailable(), so we'd just need to pass it on to
XLogFileReadAnyTLI().
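
A minimal sketch of how the flag could be computed under that idea. The
helper notfound_ok_sketch() and its streaming argument are hypothetical
names, and treating fetching_ckpt as "a missing file is not OK" is one
reading of the proposal, not tested code.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a missing WAL file may be
 * ignored silently.  Assumptions: when streaming, a missing file is
 * never OK (as today); otherwise it is OK only if we are not fetching
 * the checkpoint record.
 */
static bool
notfound_ok_sketch(bool streaming, bool fetching_ckpt)
{
    if (streaming)
        return false;
    return !fetching_ckpt;
}
```

The point is only that a missing file while looking for the checkpoint
record would then get reported instead of being swallowed.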
Greetings,
Andres Freund
-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services