Exceptional md.c paths for recovery and zero_damaged_pages - Mailing list pgsql-hackers
From: Andres Freund
Subject: Exceptional md.c paths for recovery and zero_damaged_pages
Date:
Msg-id: 3qxxsnciyffyf3wyguiz4besdp5t5uxvv3utg75cbcszojlz7p@uibfzmnukkbd
Responses: Re: Exceptional md.c paths for recovery and zero_damaged_pages
           Re: Exceptional md.c paths for recovery and zero_damaged_pages
List: pgsql-hackers
Hi,

We have a fair number of special paths in md.c that are specific to recovery.
E.g. in mdreadv() we do:

    v = _mdfd_getseg(reln, forknum, blocknum, false,
                     EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);

and

    /*
     * We are at or past EOF, or we read a partial block at EOF.
     * Normally this is an error; upper levels should never try to
     * read a nonexistent block.  However, if zero_damaged_pages
     * is ON or we are InRecovery, we should instead return zeroes
     * without complaining.  This allows, for example, the case of
     * trying to update a block that was later truncated away.
     */
    if (zero_damaged_pages || InRecovery)
    {
        for (BlockNumber i = transferred_this_segment / BLCKSZ;
             i < nblocks_this_segment;
             ++i)
            memset(buffers[i], 0, BLCKSZ);
        break;
    }

There's more, but this is sufficient to discuss.

As far as I can tell, nearly all InRecovery paths in md.c - including the
above - are basically unreachable, and have been for quite a while.

XLogReadBufferExtended() takes care to

a) create the fork if it doesn't yet exist, and

b) not read from beyond EOF - if the target block is beyond EOF, we extend
   the relation to be large enough instead.

Which afaict should suffice to prevent needing the above?

There *are* a few paths that don't go through XLogReadBufferExtended(), but
they seem to have their own protection. We have a bunch of heap redo records
that access the VM, but the VM takes care to call vm_extend() before reading
a block beyond EOF.
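Roughly, that protection is shaped like the following. This is a simplified
sketch rather than the actual code: the ReadBufferMode handling, invalid-page
tracking and locking are omitted, smgr stands for the relation's
SMgrRelation, rlocator/forknum/blkno/mode are XLogReadBufferExtended()'s
arguments, and extend_rel_to_block() is a hypothetical stand-in for the real
extension logic.

    /* a) create the fork if it doesn't exist yet (no-op when it does) */
    smgrcreate(smgr, forknum, true);

    /* b) never read beyond EOF */
    if (blkno < smgrnblocks(smgr, forknum))
    {
        /* the block exists on disk - read it normally */
        buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno,
                                           mode, NULL, true);
    }
    else
    {
        /*
         * The block is beyond EOF - instead of reading it, extend the
         * relation with zeroed pages up to and including blkno.
         * (hypothetical helper standing in for the real extension code)
         */
        buffer = extend_rel_to_block(smgr, forknum, blkno);
    }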
The specific shape of that protection in XLogReadBufferExtended() has evolved
over time (e.g. in [1]), but it's been there in some form for a *long* time -
since the original REDO/UNDO commit by Vadim (b58c0411bad4). The InRecovery
paths for _mdfd_getseg() seem to originate in 2004's 303e46ea932, and the
zero-beyond-EOF behavior seems to be from 2007's ef07221997e - although in
that commit it was just *restricted* to InRecovery.

What bothers me is that I do have some dim memory of understanding why we
might still need these paths, but I sure can't see it now. They're not
reached in our tests according to coverage [2], and nuking them doesn't cause
tests to fail, even if I run the tests with -Dsegsize_blocks=7. Is this dead
code that we've whacked around a good bunch for well over 10 years? I must be
missing something.

I think the other InRecovery paths are probably reachable:

- The checks in mdtruncate() make sense: if e.g. two subsequent vacuums
  truncated a relation twice in the same checkpoint and we crash while
  replaying those records, we'll already have the relation truncated to the
  shorter size, so replaying the earlier record asks to truncate to a length
  beyond the current EOF. I wonder if we'd be better off moving that
  condition to an argument to mdtruncate(), rather than checking InRecovery.

- The !InRecovery check in mdexists() is a recovery-specific optimization. I
  think there are some callers that suffer from the automatic closing and
  don't need the protection, but that's a separate optimization.

- The mdprefetch() InRecovery check might be reachable: we call it from
  recovery readahead, which can happen before we've extended the relation in
  XLogReadBufferExtended().

Separately from the above, I am concerned about the zero_damaged_pages
specific path. For one, it's a very narrow path: we don't take
zero_damaged_pages into account in mdreadv()'s call to _mdfd_getseg(), so
there'll be an error if the actual end of the relation is in the last
segment. But even if we fixed that, just making up a page of zeroes in
mdreadv() seems like a bad idea - we'll end up with a page in shared buffers
that's well beyond EOF. Which requires us to have some protections with scary
error messages in the relation extension paths. If the page ends up being
dirtied (e.g. because we initialize it), we'll write it out and create a file
with holes - or hit ENOSPC, because we've not even tried to make sure that
the relation can be extended to cover it.

The zero_damaged_pages path in bufmgr.c makes sense to me, but this one seems
less sane. If you want to recover from a data corruption event and can't dump
the data because a seqscan stumbles over an invalid page, zero_damaged_pages
makes sense. But seqscans and tidscans won't reach the mdreadv() path,
because they check the relation size first. Which leaves access from indexes -
e.g. an index pointer beyond the end of the heap. In that case it's not sane
to use zero_damaged_pages, because it's almost a guarantee for worsening
corruption in the future: the now-empty heap page will eventually be filled
with new tuples, which will then be pointed to by index entries that were
created before the zeroing.
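For comparison, the bufmgr.c path only ever zeroes a page that was actually
read from within the relation's EOF but fails validation. Roughly - again a
simplified sketch rather than the code verbatim (the real code also honors a
per-read zero-on-error flag; blocknum, bufBlock, smgr and forknum stand for
the obvious locals):

    if (!PageIsVerifiedExtended((Page) bufBlock, blocknum,
                                PIV_LOG_WARNING | PIV_REPORT_STAT))
    {
        if (zero_damaged_pages)
        {
            /* the page exists on disk, but its header/checksum is bogus */
            ereport(WARNING,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("invalid page in block %u of relation %s; zeroing out page",
                            blocknum,
                            relpath(smgr->smgr_rlocator, forknum))));
            memset(bufBlock, 0, BLCKSZ);
        }
        else
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("invalid page in block %u of relation %s",
                            blocknum,
                            relpath(smgr->smgr_rlocator, forknum))));
    }

The page being zeroed here exists on disk; the mdreadv() path instead
manufactures a page for storage that does not exist.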
So what's the point of having this path in mdreadv()?

Greetings,

Andres Freund

[1] https://postgr.es/m/CAM-w4HNUgUcJqDNbgfAU%3DYjksHZDPPbT7RH73pYrzd7ByYrjrA%40mail.gmail.com
[2] https://coverage.postgresql.org/src/backend/storage/smgr/md.c.gcov.html