Zeroing damaged pages - Mailing list pgsql-hackers
| From | Simon Riggs |
|---|---|
| Subject | Zeroing damaged pages |
| Date | |
| Msg-id | 1140697901.8759.111.camel@localhost.localdomain |
| Responses | Re: Zeroing damaged pages |
| List | pgsql-hackers |
There appear to be some issues, or at least a lack of clarity, in the way we zero damaged/missing pages in various circumstances.

1. If we have an invalid page header then we execute the code below (from bufmgr.c ReadBuffer()). In the case where we are only reading the page, not writing to it, zero_damaged_pages causes the page to be overwritten with zeroes in memory, yet does not set BM_DIRTY on the buffer. The page in memory is zeroed, but the on-disk copy is never fully zeroed because the page is never flushed back to the database (unless a subsequent write does dirty the page). This behaviour is somewhat desirable from a support perspective, but isn't fully documented, so ad-hoc attempts to solve problems appear to have failed. VACUUM *does* write pages that it has "fixed" to disk, so lazy_scan_heap() operates differently to a SELECT. A patch prototype to make zero_damaged_pages work as advertised is enclosed, though the current behaviour may well be preferred, in which case a doc patch is more appropriate.

However, since the introduction of autovacuum, the window of opportunity for support to assist with data recovery is smaller and somewhat random: VACUUM can come along at any time and wipe out the remaining data on the corrupt page, destroying any benefit of the current behaviour. Also, since some damaged pages will be written back and some will not, we get a fairly random amount of data loss, yet the logs don't differentiate between a page that has been zeroed-and-written-to-disk and a page that has been zeroed-and-ignored.

Perhaps we might want a new parameter called ignore_damaged_pages, so that VACUUM would also ignore the bad data? At the very least we should have a doc patch explaining the current behaviour better, so that people understand what to do when these messages start appearing in the logs.

In bufmgr.c ReadBuffer() we have this code, starting at line 240:

    if (!PageHeaderIsValid((PageHeader) bufBlock))
    {
        /*
         * During WAL recovery, the first access to any data page should
         * overwrite the whole page from the WAL; so a clobbered page
         * header is not reason to fail. Hence, when InRecovery we may
         * always act as though zero_damaged_pages is ON.
         */
        if (zero_damaged_pages || InRecovery)
        {
            ereport(WARNING,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("invalid page header in block %u of relation \"%s\"; zeroing out page",
                            blockNum, RelationGetRelationName(reln))));
            MemSet((char *) bufBlock, 0, BLCKSZ);
        }
        else
            ...
    }

    if (isLocalBuf)
    {
        /* Only need to adjust flags */
        bufHdr->flags |= BM_VALID | setdirty;
    }
    else
    {
        /* Set BM_VALID, terminate IO, and wake up any waiters */
        TerminateBufferIO(bufHdr, false, BM_VALID | setdirty);
    }

So the above code writes to a page, yet doesn't set BM_DIRTY. (It doesn't unset it either, but that's not exactly the same thing.)
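As a minimal sketch only (this is not the enclosed patch; the placement, and the use of a local setdirty flag like the one in the quoted code, are illustrative assumptions), the zeroing branch could remember that it rewrote the page, so that the buffer is also flagged BM_DIRTY and the zeroed page eventually reaches disk:

    /*
     * Sketch only, not the enclosed patch: remember that zero_damaged_pages
     * rewrote the page, so that BM_DIRTY is applied when the buffer is
     * marked valid and the zeroed page is eventually written back to disk.
     */
    int     setdirty = 0;           /* extra buffer flag bits applied below */

    if (zero_damaged_pages || InRecovery)
    {
        /* WARNING ereport as in the existing code, then: */
        MemSet((char *) bufBlock, 0, BLCKSZ);
        if (zero_damaged_pages)
            setdirty = BM_DIRTY;    /* force the zeroed page back to disk */
    }

    /* ... later, when the buffer is marked valid ... */

    if (isLocalBuf)
        bufHdr->flags |= BM_VALID | setdirty;
    else
        TerminateBufferIO(bufHdr, false, BM_VALID | setdirty);

The InRecovery case is deliberately left alone in this sketch, since WAL replay is expected to overwrite the whole page anyway, as the comment in the quoted code notes.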
2. Also from the code, the assumption that InRecovery means we can ignore corrupt pages is only true when rolling forward on a set of logs that contains full-page-write blocks. That seems like an invalid assumption when the page being accessed isn't going to be fully overwritten, as is the case when we set full_page_writes = off. If we agree on that, it seems a deeper fix is required, rather than the seemingly obvious (InRecovery && full_page_writes), since the current setting of that parameter doesn't necessarily reflect how it was set historically over the full range of log files used for any recovery.

3. Checking on other possible InRecovery issues, I note that in slru.c we ignore missing segments. As the comment explains, this is valid in some circumstances, but not in all. If a file *should* be there because the clog (etc.) hasn't yet been truncated to that point, then we have a hole in the sequence of segments. We do not detect that situation; we just zero the page and march on, even when zero_damaged_pages is *not* set. A message is written to the log, though there is no way to tell a valid occurrence of that message from an invalid one.

    /*
     * In a crash-and-restart situation, it's possible for us to receive
     * commands to set the commit status of transactions whose bits are in
     * already-truncated segments of the commit log (see notes in
     * SlruPhysicalWritePage). Hence, if we are InRecovery, allow the case
     * where the file doesn't exist, and return zeroes instead.
     */
    fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
    if (fd < 0)
    {
        if (errno != ENOENT || !InRecovery)

4. In md.c we do not throw any kind of message when, during recovery, we are asked for a block beyond the current range. Again, in most recovery circumstances this is not an error, but here there is no message to show it ever happened, so it is difficult to confirm it as error/non-error. (The block address is no doubt correct because it comes from a CRC-checked WAL record, but we do not check the case where the file should have existed and yet does not.) There is no easy way of doing that check other than keeping track of such requests and matching them with later truncations. That is non-trivial, so it might be best to just log the fact that it happened, so we can decide manually whether it is a problem - which should be sufficient for most cases (a rough sketch of such a message follows below).

Cases 3 and 4 are slightly different cases of zeroing data, since we aren't overwriting anything, just working around missing data. But the issue I see is that not reporting potentially missing data is still silent data loss.
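To make case 4 concrete, here is a rough sketch of the kind of LOG message that could be emitted where md.c substitutes zeroes during recovery for a block beyond the end of the relation. The exact placement and the variable names (blocknum, reln) are assumptions rather than a worked patch:

    /*
     * Sketch only: record in the server log that recovery requested a
     * block beyond the end of the relation and zeroes were substituted,
     * so that valid and invalid occurrences can be told apart later.
     */
    if (InRecovery)
        ereport(LOG,
                (errmsg("block %u of relation %u/%u/%u is past end of file; returning zeroed page during recovery",
                        blocknum,
                        reln->smgr_rnode.spcNode,
                        reln->smgr_rnode.dbNode,
                        reln->smgr_rnode.relNode)));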
Best Regards,
Simon Riggs

Attachment