Thread: Exceptional md.c paths for recovery and zero_damaged_pages
Hi,

We have a fair number of special paths in md.c that are specific to recovery.
E.g. in mdreadv() we do:

    v = _mdfd_getseg(reln, forknum, blocknum, false,
                     EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);

and

    /*
     * We are at or past EOF, or we read a partial block at EOF.
     * Normally this is an error; upper levels should never try to
     * read a nonexistent block. However, if zero_damaged_pages
     * is ON or we are InRecovery, we should instead return zeroes
     * without complaining. This allows, for example, the case of
     * trying to update a block that was later truncated away.
     */
    if (zero_damaged_pages || InRecovery)
    {
        for (BlockNumber i = transferred_this_segment / BLCKSZ;
             i < nblocks_this_segment;
             ++i)
            memset(buffers[i], 0, BLCKSZ);
        break;
    }

There's more, but this is sufficient to discuss.

As far as I can tell, nearly all - including the above - InRecovery paths in
md.c are basically unreachable. And have been for quite a while.

XLogReadBufferExtended() takes care to
a) Create the fork if it doesn't yet exist.
b) Not to read from beyond EOF. If EOF is found, we extend the relation to
   be large enough.

Which afaict should suffice to prevent needing the above?

There *are* a few paths that don't go through XLogReadBufferExtended(), but
they seem to have their own protection. We have a bunch of heap redo records
that access the VM, but the VM takes care to call vm_extend() before reading
a block beyond EOF.

The specific shape of that protection in XLogReadBufferExtended() has evolved
over time (e.g. in [1]), but it's been there in some form for a *long* time -
since the original REDO/UNDO commit by Vadim (b58c0411bad4).

The InRecovery paths for _mdfd_getseg() seem to originate in 2004's
303e46ea932, and the zero-beyond-EOF behavior seems to be from 2007's
ef07221997e - although it was just *restricted* to InRecovery in that commit.

What bothers me is that I do have some dim memory of understanding why we
might still need these paths, but I sure can't see it now.
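To spell out what that branch does, here is a toy standalone model (not the actual md.c code; `zero_fill_tail` and its simplified types are stand-ins). One detail worth noticing: `transferred_this_segment / BLCKSZ` rounds down, so a *partially* read block at EOF is itself re-zeroed, matching the comment's "or we read a partial block at EOF".

```c
#include <stddef.h>
#include <string.h>

#define BLCKSZ 8192
typedef unsigned int BlockNumber;

/*
 * Toy model of the short-read branch quoted above: every block from the
 * first block that wasn't fully transferred up to the end of the requested
 * range is overwritten with zeroes.
 */
static void
zero_fill_tail(char **buffers, size_t transferred_this_segment,
               BlockNumber nblocks_this_segment)
{
    for (BlockNumber i = transferred_this_segment / BLCKSZ;
         i < nblocks_this_segment; ++i)
        memset(buffers[i], 0, BLCKSZ);
}
```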
They're not reached in our tests according to coverage [2], and nuking them
doesn't cause tests to fail, even if I run the tests with -Dsegsize_blocks=7.

Is this dead code that we've whacked around a good bunch for well over 10
years? I must be missing something.

I think the other InRecovery paths are probably reachable:

- The checks in mdtruncate() make sense: if e.g. two subsequent vacuums
  truncated a relation twice in the same checkpoint and we crash while
  replaying those records, we'll find the relation already truncated to the
  shorter size. I wonder if we'd be better off moving that condition to an
  argument of mdtruncate(), rather than checking InRecovery.

- The !InRecovery check in mdexists() is a recovery-specific optimization. I
  think there are some callers that suffer from the automatic closing and
  don't need the protection, but that's a separate optimization.

- The mdprefetch() InRecovery check might be reachable; we'll call it from
  the recovery readahead, which happens before we've extended the relation
  in XLogReadBufferExtended().

Separately from the above, I am concerned about that zero_damaged_pages
specific path. For one, it's a very narrow path: we don't take
zero_damaged_pages into account in mdreadv()'s call to _mdfd_getseg(), so
there'll be an error unless the actual end of the relation is in the same
segment as the block being read.

But even if we fixed that, just making up a page of zeroes in mdreadv() seems
like a bad idea: we'll end up with a page in shared buffers that's well
beyond EOF, which requires us to have some protections with scary error
messages in the relation extension paths. If the page ends up being dirtied
(e.g. because we initialize it), we'll write it out and create a file with
holes - or hit ENOSPC, because we've not even tried to make sure that the
relation can be extended to cover it.

The zero_damaged_pages path in bufmgr.c makes sense to me, but this one seems
less sane to me.
If you want to recover from a data corruption event and can't dump the data
because a seqscan stumbles over an invalid page - zero_damaged_pages makes
sense. But seqscans or tidscans won't reach the mdreadv() path, because they
check the relation size first. Which leaves access from indexes - e.g. an
index pointer beyond the end of the heap. In that case it's not sane to use
zero_damaged_pages, because that's almost a guarantee of worsening corruption
in the future: the now-empty heap page will eventually be filled with new
tuples, which will then be pointed to by index entries that were created
before the zeroing.

So what's the point of having this path in mdreadv()?

Greetings,

Andres Freund

[1] https://postgr.es/m/CAM-w4HNUgUcJqDNbgfAU%3DYjksHZDPPbT7RH73pYrzd7ByYrjrA%40mail.gmail.com
[2] https://coverage.postgresql.org/src/backend/storage/smgr/md.c.gcov.html
Andres Freund <andres@anarazel.de> writes:
> We have a fair number of special paths in md.c that are specific to
> recovery. E.g. in mdreadv() we do:
> ...
> As far as I can tell, nearly all - including the above - InRecovery paths
> in md.c are basically unreachable. And have been for quite a while.

> XLogReadBufferExtended() takes care to
> a) Create the fork if it doesn't yet exist.
> b) Not to read from beyond EOF. If EOF is found, we extend the relation to
>    be large enough.

> Which afaict should suffice to prevent needing the above?

I haven't checked the git history, but I suspect this logic is later than
the md.c code you mention, and may well render it obsolete.

> The InRecovery paths for _mdfd_getseg seem to originate in 2004's
> 303e46ea932 and the zero-beyond-eof seems to be from 2007's ef07221997e -
> although it was just *restricted* to InRecovery in that commit.

We definitely needed 303e46ea932 at the time, but that doesn't mean we
still do. Possibly ef07221997e was just copying the earlier logic.

			regards, tom lane
Hi,

On 2024-12-13 19:06:16 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > We have a fair number of special paths in md.c that are specific to
> > recovery. E.g. in mdreadv() we do:
> > ...
> > As far as I can tell, nearly all - including the above - InRecovery paths
> > in md.c are basically unreachable. And have been for quite a while.
>
> > XLogReadBufferExtended() takes care to
> > a) Create the fork if it doesn't yet exist.
> > b) Not to read from beyond EOF. If EOF is found, we extend the relation
> >    to be large enough.
> > Which afaict should suffice to prevent needing the above?
>
> I haven't checked the git history, but I suspect this logic is later
> than the md.c code you mention, and may well render it obsolete.

Oddly enough, not really - the logic originated in the original REDO/UNDO
commit, in 2000 (b58c0411bad4).

> > The InRecovery paths for _mdfd_getseg seem to originate in 2004's
> > 303e46ea932 and the zero-beyond-eof seems to be from 2007's ef07221997e -
> > although it was just *restricted* to InRecovery in that commit.
>
> We definitely needed 303e46ea932 at the time, but that doesn't mean
> we still do.

I'd assume it was needed at the time, I just can't quite figure out by what.

We used to have more code using the fake relcache stuff, which indicates
that those places weren't going through the extend-the-file logic in
XLogReadBufferExtended(). E.g. the "incomplete split" logic that Heikki
removed from a bunch of indexes.

Another theory is that it could have been related to wal_level=minimal.
Before Noah's redesign in c6b92041d38 there were rather weird mixes of
logged and unlogged modifications to the same relfilenode. I don't quite see
how it could have required this behaviour in mdreadv() though.

> Possibly ef07221997e was just copying the earlier logic.
It did note this:

    Also, remove the kluge of allowing mdread() to return zeroes from a
    read-beyond-EOF: this is now an error condition except when InRecovery
    or zero_damaged_pages = true. (Hash indexes used to require that
    behavior, but no more.)

which is about this hunk:

 	/*
- 	 * If we are at or past EOF, return zeroes without complaining. Also
- 	 * substitute zeroes if we found a partial block at EOF.
- 	 *
- 	 * XXX this is really ugly, bad design. However the current
- 	 * implementation of hash indexes requires it, because hash index
- 	 * pages are initialized out-of-order.
+ 	 * Short read: we are at or past EOF, or we read a partial block at
+ 	 * EOF. Normally this is an error; upper levels should never try to
+ 	 * read a nonexistent block. However, if zero_damaged_pages is ON
+ 	 * or we are InRecovery, we should instead return zeroes without
+ 	 * complaining. This allows, for example, the case of trying to
+ 	 * update a block that was later truncated away.
 	 */

Before this there was no mention of recovery, so perhaps the InRecovery bit
was just trying not to break recovery (there weren't any tests for that at
the time, I think, so that'd be quite reasonable).

I'm wondering if we should just put an error into the relevant paths in HEAD
and see whether it triggers for anybody in the next months. Having all these
untested paths in md.c forever doesn't seem great.

Greetings,

Andres Freund
On 14/12/2024 01:44, Andres Freund wrote:
> The zero_damaged_pages path in bufmgr.c makes sense to me, but this one
> seems less sane to me. If you want to recover from a data corruption event
> and can't dump the data because a seqscan stumbles over an invalid page -
> zero_damaged_pages makes sense.
>
> Seqscans or tidscans won't reach the mdreadv() path, because they check the
> relation size first. Which leaves access from indexes - e.g. an index
> pointer beyond the end of the heap. In that case it's not sane to use
> zero_damaged_pages, because that's almost a guarantee of worsening
> corruption in the future: the now-empty heap page will eventually be filled
> with new tuples, which will then be pointed to by index entries that were
> created before the zeroing.

Well, if you need to set zero_damaged_pages=on, you're screwed already, so I
don't think the worsening-corruption argument matters much. And you have the
same problem with pages zeroed by a seqscan too. To avoid that, you'd want
to mark the page explicitly as "damaged, do not reuse" rather than just zero
it, but that'd be a lot of new code.

Hmm, looking at index_fetch_heap(), I'm surprised it doesn't throw an error
or even a warning if the heap tuple isn't found. That would seem like a
useful sanity check. An index tuple should never point to a non-existent
heap TID, I believe.

> I'm wondering if we should just put an error into the relevant paths in
> HEAD and see whether it triggers for anybody in the next months. Having all
> these untested paths in md.c forever doesn't seem great.

+1

--
Heikki Linnakangas
Neon (https://neon.tech)
On Tue, Dec 17, 2024 at 12:57 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, looking at index_fetch_heap(), I'm surprised it doesn't throw an
> error or even a warning if the heap tuple isn't found. That would seem
> like a useful sanity check. An index tuple should never point to a
> non-existent heap TID, I believe.

I think that permissiveness is necessary. We need to be prepared to perform
a TID lookup with a wild page offset number, to account for concurrent TID
recycling - at least with nbtree plain index scans. This is also why we can
only test the sanity of TIDs in certain particular contexts. See the
comments above index_delete_check_htid().

> > I'm wondering if we should just put an error into the relevant paths in
> > HEAD and see whether it triggers for anybody in the next months. Having
> > all these untested paths in md.c forever doesn't seem great.
>
> +1

Try it with this test case:

https://postgr.es/m/CAH2-Wz=jjiNL9FCh8C1L-GUH15f4WFTWub2x+_NucngcDDcHKw@mail.gmail.com

Just adapt it to nbtree, by removing "USING GIST", and by forcing plain
index scans (concurrent TID recycling is prevented by means of holding onto
a leaf page buffer pin with index-only scans). My guess is that adapting the
test case like this will demonstrate that you really do need to be prepared
for concurrent TID recycling that leads to accessing out-of-page-bounds TIDs
in places like index_fetch_heap().

--
Peter Geoghegan
Hi,

On 2024-12-17 19:57:13 +0200, Heikki Linnakangas wrote:
> On 14/12/2024 01:44, Andres Freund wrote:
> > The zero_damaged_pages path in bufmgr.c makes sense to me, but this one
> > seems less sane to me. If you want to recover from a data corruption
> > event and can't dump the data because a seqscan stumbles over an invalid
> > page - zero_damaged_pages makes sense.
> >
> > Seqscans or tidscans won't reach the mdreadv() path, because they check
> > the relation size first. Which leaves access from indexes - e.g. an index
> > pointer beyond the end of the heap. In that case it's not sane to use
> > zero_damaged_pages, because that's almost a guarantee of worsening
> > corruption in the future: the now-empty heap page will eventually be
> > filled with new tuples, which will then be pointed to by index entries
> > that were created before the zeroing.
>
> Well, if you need to set zero_damaged_pages=on, you're screwed already, so
> I don't think the worsening-corruption argument matters much.

Well, it matters in the sense of it being important to keep seqscans
somewhat working, as that's required for extracting as much data as possible
with pg_dump. But I don't think there's an equivalent need to keep index
scans working, given that the only valid action is to reindex anyway.

> And you have the same problem with pages zeroed by a seqscan too.

I don't think so? For seqscans we should *never* hit the "zero a page beyond
EOF" path, because the heapscan will check the relation size at the start of
the scan. You definitely can hit the case of zeroing a heap page, but that
page will still correspond to an on-disk page.

I'd like to get rid of the "zero a page beyond EOF" code, without touching
the "zeroing out page" path that's currently in WaitReadBuffers(). As is
probably obvious, the reason for this is that I have to rejigger the code
for AIO, and I don't want to write and test code that never makes sense to
reach.
> > I'm wondering if we should just put an error into the relevant paths in
> > HEAD and see whether it triggers for anybody in the next months. Having
> > all these untested paths in md.c forever doesn't seem great.
>
> +1

Cool. Will write something up.

Greetings,

Andres Freund
Hi,

On 2024-12-17 16:11:53 -0500, Peter Geoghegan wrote:
> On Tue, Dec 17, 2024 at 12:57 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Hmm, looking at index_fetch_heap(), I'm surprised it doesn't throw an
> > error or even a warning if the heap tuple isn't found. That would seem
> > like a useful sanity check. An index tuple should never point to a
> > non-existent heap TID, I believe.
>
> I think that permissiveness is necessary. We need to be prepared to perform
> a TID lookup with a wild page offset number, to account for concurrent TID
> recycling - at least with nbtree plain index scans.

ISTM that we could do better with some fairly simple cooperation between
index and table AM. It should be rather rare to look up a TID that was
removed between finding the index entry and fetching the table entry, so a
small amount of extra work in that case ought to be ok.

Could we e.g., for logged tables, track the LSN of the leaf index page in
the IndexScanDesc and pass that to table_index_fetch_tuple(), and only error
out if the table relation has a newer LSN? That leaves a small window for a
false negative, but shouldn't have any false positives?

I've seen enough bugs / corruption leading to indexes pointing to wrong and
nonexistent tuples to make me think it's worth being a bit more proactive
about raising errors for such cases. Of course what I described above can
only detect index entries pointing to non-existing tuples, but that's still
a good bit better than nothing.

Greetings,

Andres Freund
On Tue, Dec 17, 2024 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
> ISTM that we could do better with some fairly simple cooperation between
> index and table AM. It should be rather rare to look up a TID that was
> removed between finding the index entry and fetching the table entry, so a
> small amount of extra work in that case ought to be ok.

Maybe, but I just want to be clear: as far as I know, as things stand we're
very permissive about what can happen around concurrent TID recycling. We
need to be, because bitmap index scans can build a bitmap based on the index
as it was some time ago (among other reasons), which cannot prevent
concurrent TID recycling for TIDs that point to dead-to-everybody tuples (or
point to LP_DEAD heap page stubs).

> Could we e.g., for logged tables, track the LSN of the leaf index page in
> the IndexScanDesc and pass that to table_index_fetch_tuple(), and only
> error out if the table relation has a newer LSN? That leaves a small window
> for a false negative, but shouldn't have any false positives?

Technically that would work, but it might not be very useful.

It's very typical for a heap page LSN to be older than a corresponding index
leaf page LSN, since inserts start with the heap tuple insertion (the index
tuple insertion must happen afterwards). Plus index pages almost always
store more tuples than their corresponding heap pages, and so are presumably
more likely to be modified.

> I've seen enough bugs / corruption leading to indexes pointing to wrong and
> nonexistent tuples to make me think it's worth being a bit more proactive
> about raising errors for such cases. Of course what I described above can
> only detect index entries pointing to non-existing tuples, but that's still
> a good bit better than nothing.

OTOH we've had index_delete_check_htid for several years now, and I've yet
to hear a report involving one of its errors. That's not conclusive, but it
does suggest that this might not be a huge problem.
--
Peter Geoghegan
Hi,

On 2024-12-17 17:07:34 -0500, Peter Geoghegan wrote:
> On Tue, Dec 17, 2024 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
> > ISTM that we could do better with some fairly simple cooperation between
> > index and table AM. It should be rather rare to look up a TID that was
> > removed between finding the index entry and fetching the table entry, so
> > a small amount of extra work in that case ought to be ok.
>
> Maybe, but I just want to be clear: as far as I know, as things stand
> we're very permissive about what can happen around concurrent TID
> recycling. We need to be, because bitmap index scans can build a bitmap
> based on the index as it was some time ago (among other reasons), which
> cannot prevent concurrent TID recycling for TIDs that point to
> dead-to-everybody tuples (or point to LP_DEAD heap page stubs).

Bitmap heapscans go through a distinct path, so it'd be trivial to accept
TIDs pointing into the void for bitmap scans but not index [only] scans.

> > Could we e.g., for logged tables, track the LSN of the leaf index page in
> > the IndexScanDesc and pass that to table_index_fetch_tuple(), and only
> > error out if the table relation has a newer LSN? That leaves a small
> > window for a false negative, but shouldn't have any false positives?
>
> Technically that would work, but it might not be very useful.
>
> It's very typical for a heap page LSN to be older than a corresponding
> index leaf page LSN, since inserts start with the heap tuple insertion
> (the index tuple insertion must happen afterwards). Plus index pages
> almost always store more tuples than their corresponding heap pages, and
> so are presumably more likely to be modified.

I think the case where we'd need to *not* error out would be if the heap
page LSN is *newer* than the index leaf page LSN, as that could be due to
lazy_vacuum_heap() marking items unused between getting the TID from the
index scan and the heap access.
For it to be legal to remove items from the heap page between the scan of
the leaf page and the heap fetch, the btree page would need to have been
processed by btvacuumpage() (with a cleanup lock) and the heap page by
lazy_vacuum_heap_page(). If the heap page has an LSN that's older than the
leaf page's LSN from when we had it pinned/locked, it could not have been
processed by lazy_vacuum_heap_page() since. But if the heap page's LSN is
newer, it might have been.

Obviously there might be other reasons for the LSN to increase, leading to
false negatives, but it's a quite tight window.

> > I've seen enough bugs / corruption leading to indexes pointing to wrong
> > and nonexistent tuples to make me think it's worth being a bit more
> > proactive about raising errors for such cases. Of course what I described
> > above can only detect index entries pointing to non-existing tuples, but
> > that's still a good bit better than nothing.
>
> OTOH we've had index_delete_check_htid for several years now, and I've
> yet to hear a report involving one of its errors. That's not
> conclusive, but it does suggest that this might not be a huge problem.

Not sure that tells us that much; there are lots of workloads where you'd
never end up in heap_index_delete_tuples().

Greetings,

Andres Freund
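The decision rule being converged on here could be sketched as follows (a hypothetical standalone model, not proposed PostgreSQL code; the function name and simplified `XLogRecPtr` stand-in are made up for illustration):

```c
#include <stdbool.h>

typedef unsigned long long XLogRecPtr;  /* stand-in for PostgreSQL's type */

/*
 * When an index TID lookup finds no heap tuple: if the heap page's LSN is
 * older than the LSN the index leaf page had while we held it pinned and
 * locked, lazy_vacuum_heap_page() cannot have removed the item in the
 * meantime, so the dangling TID is treated as corruption. A newer heap
 * page LSN means concurrent removal is possible, so we stay silent - a
 * possible false negative, but a tight window.
 */
static bool
missing_tuple_is_corruption(XLogRecPtr leaf_page_lsn, XLogRecPtr heap_page_lsn)
{
    return heap_page_lsn < leaf_page_lsn;
}
```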
On Tue, Dec 17, 2024 at 5:27 PM Andres Freund <andres@anarazel.de> wrote:
> Bitmap heapscans go through a distinct path, so it'd be trivial to accept
> TIDs pointing into the void for bitmap scans but not index [only] scans.

I wasn't clear. Plain nbtree index scans can drop their leaf page buffer
pin, which makes them prone to following TIDs subject to concurrent TID
recycling. That's the case that any new sanity check in this area will need
to not break. It's fairly similar to what we need to do with bitmap index
scans, but it *does* use this path.

I actually think that we might well need to extend this "drop pin to avoid
blocking VACUUM" behavior to GiST and SP-GiST before long (to avoid
regressions when we fix the index-only scan bug affecting GiST + SP-GiST). I
think that it (i.e. the _bt_drop_lock_and_maybe_pin stuff) needs to be
generalized, since it's a fairly index-AM-independent thing.

> I think the case where we'd need to *not* error out would be if the heap
> page LSN is *newer* than the index leaf page LSN, as that could be due to
> lazy_vacuum_heap() marking items unused between getting the TID from the
> index scan and the heap access.

Ah, okay.

> For it to be legal to remove items from the heap page between the scan of
> the leaf page and the heap fetch, the btree page would need to have been
> processed by btvacuumpage() (with a cleanup lock) and the heap page by
> lazy_vacuum_heap_page().

Just to be clear, the cleanup lock isn't only protective for scans that
don't drop their pin. This is why we need to worry about plain nbtree index
scans that use an MVCC snapshot. (I think that you know this already, but
just want to be clear.)

--
Peter Geoghegan
On 17/12/2024 23:28, Andres Freund wrote:
> On 2024-12-17 19:57:13 +0200, Heikki Linnakangas wrote:
>> On 14/12/2024 01:44, Andres Freund wrote:
>>> The zero_damaged_pages path in bufmgr.c makes sense to me, but this one
>>> seems less sane to me. If you want to recover from a data corruption
>>> event and can't dump the data because a seqscan stumbles over an invalid
>>> page - zero_damaged_pages makes sense.
>>>
>>> Seqscans or tidscans won't reach the mdreadv() path, because they check
>>> the relation size first. Which leaves access from indexes - e.g. an
>>> index pointer beyond the end of the heap. In that case it's not sane to
>>> use zero_damaged_pages, because that's almost a guarantee of worsening
>>> corruption in the future: the now-empty heap page will eventually be
>>> filled with new tuples, which will then be pointed to by index entries
>>> that were created before the zeroing.
>>
>> Well, if you need to set zero_damaged_pages=on, you're screwed already,
>> so I don't think the worsening-corruption argument matters much.
>
> Well, it matters in the sense of it being important to keep seqscans
> somewhat working, as that's required for extracting as much data as
> possible with pg_dump. But I don't think there's an equivalent need to
> keep index scans working, given that the only valid action is to reindex
> anyway.
>
>> And you have the same problem with pages zeroed by a seqscan too.
>
> I don't think so? For seqscans we should *never* hit the "zero a page
> beyond EOF" path, because the heapscan will check the relation size at the
> start of the scan. You definitely can hit the case of zeroing a heap page,
> but that page will still correspond to an on-disk page.

I meant this scenario:

0. Heap block 123 has some live tuples on it, but it is corrupt.
1. Set zero_damaged_pages=on.
2. Perform a seqscan. It zeroes block 123.
3. Set zero_damaged_pages=off.
4. Insert new tuples. They get inserted into block 123.
Any index entries for the original heap tuples on block 123 that got zeroed
out will now incorrectly point to the new tuples you inserted. No reads
beyond EOF involved.

--
Heikki Linnakangas
Neon (https://neon.tech)