There appears to be some issues or at least a lack of clarity with the
way we zero damaged/missing pages in various circumstances.
1. If we have an invalid page header then we execute the code below
(from bufmgr.c ReadBuffer()).. In the case that we are only reading the
page, not writing to it, zero_damaged_pages causes the page to be
overwritten with zeroes in memory, yet does not set the buffer BM_DIRTY.
The page in memory is zeroed, but the on-disk copy is never fully zeroed
because the page is never flushed back onto the database (unless a
subsequent write does dirty the page). This behaviour is somewhat
desirable from a support perspective, but isn't fully documented, so
ad-hoc attempts to solve problems appear to have failed. VACUUM *does*
write pages that it has "fixed" to disk, so lazy_scan_heap() operates
differently to a SELECT.
A patch prototype to make zero_damaged_pages work as advertised is
enclosed, though the current behaviour may well be preferred, in which
case a doc patch is more appropriate.
However, since autovacuum the window of opportunity for support to
assist with data recovery is smaller and somewhat random. VACUUM can
come along at any time and destroy the remaining data on the corrupt
page, destroying any benefit from the current behaviour. Also, since
some damaged pages will be written to and some not, we have a fairly
random amount of data loss, yet the logs don't differentiate between a
page that has been zeroed-and-written-to-disk and a page that has been
zeroed-and-ignored.
Perhaps we might want a new parameter called ignore_damaged_pages, so
that VACUUM would also ignore the bad data? At the very least we should
have a doc patch explaining the current behaviour better so that people
understand what to do when messages start appearing in the logs?
in bufmgr.c ReadBuffer() we have this code starting line 240
if (!PageHeaderIsValid((PageHeader) bufBlock))
{
/*
* During WAL recovery, the first access to any data page should
* overwrite the whole page from the WAL; so a clobbered page
* header is not reason to fail. Hence, when InRecovery we may
* always act as though zero_damaged_pages is ON.
*/
if (zero_damaged_pages || InRecovery)
{
ereport(WARNING,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("invalid page header in block %u of relation \"%s\"; zeroing
out page",
blockNum, RelationGetRelationName(reln))));
MemSet((char *) bufBlock, 0, BLCKSZ);
}
else
...
}
if (isLocalBuf)
{
/* Only need to adjust flags */
bufHdr->flags |= BM_VALID | setdirty;
}
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
TerminateBufferIO(bufHdr, false, BM_VALID | setdirty);
}
So the above code writes to a page, yet doesn't set BM_DIRTY. (It
doesn't unset it either, but thats not exactly the same thing).
2. Also from the code, the assumption that InRecovery means we can
ignore corrupt pages is only true when rolling forward on a set of logs
that contains full_page_writes blocks. That seems like an invalid
assumption when the page being accessed isn't going to be fully
overwritten, as in the case when we set full_page_writes = off. If we
agree on that, it seems like a deeper fix is required, rather than the
seemingly obvious (InRecovery && full_page_writes), since the current
setting of that parameter doesn't necessarily reflect how it was set
historically over the full range of log files used for any recovery.
3. Checking on other possible InRecovery issues, I note in slru.c that
we ignore missing segments. As the comment explains, this is valid in
some circumstances, but not in all. If a file *should* be there because
the clog (etc) hasn't yet been truncated to that point we have a hole in
the sequence of segments. We do not detect that situation, just zero it
and march on even when zero_damaged_pages is *not* set. A message is
written to the log, though there is no way to tell from that the
difference between a valid and invalid occurrence of that message.
/*
* In a crash-and-restart situation, it's possible for us to receive
* commands to set the commit status of transactions whose bits are in
* already-truncated segments of the commit log (see notes in
* SlruPhysicalWritePage). Hence, if we are InRecovery, allow the case
* where the file doesn't exist, and return zeroes instead.
*/
fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
if (fd < 0)
{
if (errno != ENOENT || !InRecovery)
4. In md.c we do not throw any kind of message when InRecovery we are
asked for a block beyond the current range. Again, in most recovery
circumstances this is not an error, but in this case there is no message
to show it ever happened so difficult to confirm it as error/non-error.
(The block address is no-doubt correct because it comes from a
CRC-checked WAL record, but we do not check the case that the file
should have existed and yet does not. There's no easy way of doing this
except for keeping track of such requests and matching them with later
truncations. That's non-trivial, so it might be best to just log the
fact that it happened so we can decide if its a problem manually - which
should be sufficient for most cases.
3 and 4 are slightly different cases of zeroing data, since we aren't
overwriting anything, just working around the case of missing data. But
I see an issue that not reporting potentially missing data is still
silent data loss.
Best Regards, Simon Riggs