Thread: Exceptional md.c paths for recovery and zero_damaged_pages

Exceptional md.c paths for recovery and zero_damaged_pages

From: Andres Freund
Hi,

We have a fair number of special paths in md.c that are specific to
recovery. E.g. in mdreadv() we do:

        v = _mdfd_getseg(reln, forknum, blocknum, false,
                         EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
and

                /*
                 * We are at or past EOF, or we read a partial block at EOF.
                 * Normally this is an error; upper levels should never try to
                 * read a nonexistent block.  However, if zero_damaged_pages
                 * is ON or we are InRecovery, we should instead return zeroes
                 * without complaining.  This allows, for example, the case of
                 * trying to update a block that was later truncated away.
                 */
                if (zero_damaged_pages || InRecovery)
                {
                    for (BlockNumber i = transferred_this_segment / BLCKSZ;
                         i < nblocks_this_segment;
                         ++i)
                        memset(buffers[i], 0, BLCKSZ);
                    break;
                }

There's more, but this is sufficient to discuss.


As far as I can tell, nearly all - including the above - InRecovery paths in
md.c are basically unreachable. And have been for quite a while.

XLogReadBufferExtended() takes care to

a) Create the fork if it doesn't yet exist.

b) Not to read from beyond EOF. If EOF is found, we extend the relation to be
   large enough.

Which afaict should suffice to prevent needing the above?

There *are* a few paths that don't go through XLogReadBufferExtended(), but
they seem to have their own protection. We have a bunch of heap redo records
that access the VM. But the VM takes care to call vm_extend() before reading a
block beyond EOF.


The specific shape of that protection in XLogReadBufferExtended() has evolved
over time (e.g. in [1]), but it's been there in some form for a *long* time -
since the original REDO/UNDO commit by Vadim (b58c0411bad4).

The InRecovery paths for _mdfd_getseg seem to originate in 2004's 303e46ea932
and the zero-beyond-eof seems to be from 2007's ef07221997e - although it was
just *restricted* to InRecovery in that commit.

What bothers me is that I do have some dim memory of understanding why we
might still need these paths, but I sure can't see it now.

They're not reached in our tests according to coverage [2], and nuking them
doesn't cause tests to fail, even if I run the tests with -Dsegsize_blocks=7.


Is this dead code that we've whacked around a good bunch for well over 10
years?  I must be missing something.


I think the other InRecovery paths are probably reachable:

- The checks in mdtruncate() make sense: if e.g. two subsequent vacuums
  truncated a relation twice in the same checkpoint and we crash while
  replaying those records, the relation will already be truncated to the
  shorter size when we replay the first truncation record.

  I wonder if we'd be better off moving that condition to an argument of
  mdtruncate(), rather than checking InRecovery.


- The !InRecovery check in mdexists() is a recovery-specific optimization. I
  think there are some callers that suffer from the automatic closing and
  don't need the protection, but that's a separate optimization.

- The mdprefetch() InRecovery check might be reachable: we'll call it from
  the recovery readahead machinery, which runs before we've extended the
  relation in XLogReadBufferExtended().



Separately from the above, I am concerned about that zero_damaged_pages
specific path.

For one, it's a very narrow path: we don't take zero_damaged_pages into
account in mdreadv()'s call to _mdfd_getseg(), so there'll be an error
whenever the target block lies beyond the segment containing the actual end
of the relation.

But even if we fixed that, just making up a page of zeroes in mdreadv() seems
like a bad idea: we'd end up with a page in shared buffers that's well beyond
EOF, which requires us to have protections with scary error messages in the
relation extension paths.

If the page ends up being dirtied (e.g. because we initialize it), we'll write
it out and create a file with holes - or hit ENOSPC, because we've never even
tried to make sure that the relation can be extended to cover it.

The zero_damaged_pages path in bufmgr.c makes sense to me, but this one seems
much less sane.  If you want to recover from a data corruption event and
can't dump the data because a seqscan stumbles over an invalid page,
zero_damaged_pages makes sense.

Seqscans or tidscans won't reach the mdreadv() path, because they check the
relation size first.  That leaves access from indexes - e.g. an index pointer
beyond the end of the heap.  But in that case it's not sane to use
zero_damaged_pages, because that's almost a guarantee of worsening corruption
in the future: the now-empty heap page will eventually be filled with new
tuples, which will then be pointed to by index entries that were created
before the zeroing.

So what's the point of having this path in mdreadv()?



Greetings,

Andres Freund

[1] https://postgr.es/m/CAM-w4HNUgUcJqDNbgfAU%3DYjksHZDPPbT7RH73pYrzd7ByYrjrA%40mail.gmail.com
[2] https://coverage.postgresql.org/src/backend/storage/smgr/md.c.gcov.html



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> We have a fair number of special paths in md.c that are specific to
> recovery. E.g. in mdreadv() we do:
> ...
> As far as I can tell, nearly all - including the above - InRecovery paths in
> md.c are basically unreachable. And have been for quite a while.

> XLogReadBufferExtended() takes care to
> a) Create the fork if it doesn't yet exist.
> b) Not to read from beyond EOF. If EOF is found, we extend the relation to be
>    large enough.
> Which afaict should suffice to prevent needing the above?

I haven't checked the git history, but I suspect this logic is later
than the md.c code you mention, and may well render it obsolete.

> The InRecovery paths for _mdfd_getseg seem to originate in 2004's 303e46ea932
> and the zero-beyond-eof seems to be from 2007's ef07221997e - although it was
> just *restricted* to InRecovery in that commit.

We definitely needed 303e46ea932 at the time, but that doesn't mean
we still do.  Possibly ef07221997e was just copying the earlier logic.

            regards, tom lane



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Andres Freund
Hi,

On 2024-12-13 19:06:16 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > We have a fair number of special paths in md.c that are specific to
> > recovery. E.g. in mdreadv() we do:
> > ...
> > As far as I can tell, nearly all - including the above - InRecovery paths in
> > md.c are basically unreachable. And have been for quite a while.
> 
> > XLogReadBufferExtended() takes care to
> > a) Create the fork if it doesn't yet exist.
> > b) Not to read from beyond EOF. If EOF is found, we extend the relation to be
> >    large enough.
> > Which afaict should suffice to prevent needing the above?
> 
> I haven't checked the git history, but I suspect this logic is later
> than the md.c code you mention, and may well render it obsolete.

Oddly enough, not really - the logic originated in the original REDO/UNDO
commit, in 2000 (b58c0411bad4).


> > The InRecovery paths for _mdfd_getseg seem to originate in 2004's 303e46ea932
> > and the zero-beyond-eof seems to be from 2007's ef07221997e - although it was
> > just *restricted* to InRecovery in that commit.
> 
> We definitely needed 303e46ea932 at the time, but that doesn't mean
> we still do.

I'd assume it was needed at the time; I just can't quite figure out what
needed it.

We used to have more code using the fake relcache stuff, which indicates that
those places weren't going through extend-the-file logic in
XLogReadBufferExtended().  E.g. the "incomplete split" logic that Heikki
removed from a bunch of indexes.

Another theory is that it could have been related to wal_level=minimal. Before
Noah's redesign in c6b92041d38 there were rather weird mixes of logged and
unlogged modifications to the same relfilenode. I don't quite see how it could
have required this behaviour in mdreadv() though.


> Possibly ef07221997e was just copying the earlier logic.

It did note this:
    Also, remove the kluge of allowing mdread()
    to return zeroes from a read-beyond-EOF: this is now an error condition
    except when InRecovery or zero_damaged_pages = true.  (Hash indexes used to
    require that behavior, but no more.)

which is about this hunk:

         /*
-         * If we are at or past EOF, return zeroes without complaining. Also
-         * substitute zeroes if we found a partial block at EOF.
-         *
-         * XXX this is really ugly, bad design.  However the current
-         * implementation of hash indexes requires it, because hash index
-         * pages are initialized out-of-order.
+         * Short read: we are at or past EOF, or we read a partial block at
+         * EOF.  Normally this is an error; upper levels should never try to
+         * read a nonexistent block.  However, if zero_damaged_pages is ON
+         * or we are InRecovery, we should instead return zeroes without
+         * complaining.  This allows, for example, the case of trying to
+         * update a block that was later truncated away.
          */

Before this there was no mention of recovery, so perhaps the InRecovery bit
was just trying not to break recovery (there weren't any tests for that at the
time, I think, so that'd be quite reasonable).


I'm wondering if we should just put an error into the relevant paths in HEAD
and see whether it triggers for anybody in the next months. Having all these
untested paths in md.c forever doesn't seem great.


Greetings,

Andres Freund



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Heikki Linnakangas
On 14/12/2024 01:44, Andres Freund wrote:
> The zero_damaged_pages path in bufmgr.c makes sense to me, but this one seems
> less sane to me.  If you want to recover from a data corruption event and
> can't dump the data because a seqscan stumbles over an invalid page -
> zero_damaged_pages makes sense.
> 
> Seqscans or tidscans won't reach the mdreadv() path, because they check the
> relation size first.  Which leaves access from indexes - e.g. an index pointer
> beyond the end of the heap.  But in that case it's not sane to use
> zero_damaged_pages, because that's almost a guarantee for worsening corruption
> in the future, because the now empty heap page will eventually be filled with
> new tuples, which will then be pointed to by index entries that were created
> before the zeroing.

Well, if you need to use zero_damaged_pages, you're screwed already, 
so I don't think the worsening-corruption argument matters much. 
And you have the same problem with pages zeroed by a seqscan too. To avoid 
that, you'd want to mark the page explicitly as "damaged, do not reuse" 
rather than just zero it, but that'd be a lot of new code.

Hmm, looking at index_fetch_heap(), I'm surprised it doesn't throw an 
error or even a warning if the heap tuple isn't found. That would seem 
like a useful sanity check. An index tuple should never point to a 
non-existent heap TID I believe.

> I'm wondering if we should just put an error into the relevant paths in HEAD
> and see whether it triggers for anybody in the next months. Having all these
> untested paths in md.c forever doesn't seem great.

+1

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Peter Geoghegan
On Tue, Dec 17, 2024 at 12:57 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, looking at index_fetch_heap(), I'm surprised it doesn't throw an
> error or even a warning if the heap tuple isn't found. That would seem
> like a useful sanity check. An index tuple should never point to a
> non-existent heap TID I believe.

I think that permissiveness is necessary. We need to be prepared to
perform a TID lookup with a wild page offset number to account for
concurrent TID recycling - at least with nbtree plain index scans.

This is also why we can only test the sanity of TIDs in certain narrow
contexts. See the comments above index_delete_check_htid().

> > I'm wondering if we should just put an error into the relevant paths in HEAD
> > and see whether it triggers for anybody in the next months. Having all these
> > untested paths in md.c forever doesn't seem great.
>
> +1

Try it with this test case:

https://postgr.es/m/CAH2-Wz=jjiNL9FCh8C1L-GUH15f4WFTWub2x+_NucngcDDcHKw@mail.gmail.com

Just adapt it to nbtree, by removing "USING GIST", and by forcing
plain index scans (concurrent TID recycling is prevented by means of
holding onto a leaf page buffer pin with index-only scans).

My guess is that adapting the test case like this will demonstrate
that you really do need to be prepared for concurrent TID recycling
that leads to accessing out-of-page-bounds TIDs in places like
index_fetch_heap().

--
Peter Geoghegan



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Andres Freund
Hi,

On 2024-12-17 19:57:13 +0200, Heikki Linnakangas wrote:
> On 14/12/2024 01:44, Andres Freund wrote:
> > The zero_damaged_pages path in bufmgr.c makes sense to me, but this one seems
> > less sane to me.  If you want to recover from a data corruption event and
> > can't dump the data because a seqscan stumbles over an invalid page -
> > zero_damaged_pages makes sense.
> > 
> > Seqscans or tidscans won't reach the mdreadv() path, because they check the
> > relation size first.  Which leaves access from indexes - e.g. an index pointer
> > beyond the end of the heap.  But in that case it's not sane to use
> > zero_damaged_pages, because that's almost a guarantee for worsening corruption
> > in the future, because the now empty heap page will eventually be filled with
> > new tuples, which will then be pointed to by index entries that were created
> > before the zeroing.
> 
> Well, if you need to use zero_damaged_pages, you're screwed already, so I
> don't think the worsening-corruption argument matters much.

Well, it matters in the sense that it's important to keep seqscans somewhat
working, as that's required for extracting as much data as possible with
pg_dump. But I don't think there's an equivalent need to keep index scans
working, given that the only valid action is to reindex anyway.


> And you have the same problem by pages zeroed by a seqscan too.

I don't think so? For seqscans we should *never* hit the "zero a page beyond
EOF" path, because the heapscan will check the relation size at the start of
the scan.  You definitely can hit the case of zeroing a heap page, but that
page will still correspond to an on-disk page.


I'd like to get rid of the "zero a page beyond EOF" code, without touching the
"zeroing out page" path that's currently in WaitReadBuffers().

As probably obvious, the reason for this is that I have to rejigger the code
for AIO, and I don't want to write and test code that never makes sense to
reach.



> > I'm wondering if we should just put an error into the relevant paths in HEAD
> > and see whether it triggers for anybody in the next months. Having all these
> > untested paths in md.c forever doesn't seem great.
> 
> +1

Cool. Will write something up.

Greetings,

Andres Freund



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Andres Freund
Hi,

On 2024-12-17 16:11:53 -0500, Peter Geoghegan wrote:
> On Tue, Dec 17, 2024 at 12:57 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Hmm, looking at index_fetch_heap(), I'm surprised it doesn't throw an
> > error or even a warning if the heap tuple isn't found. That would seem
> > like a useful sanity check. An index tuple should never point to a
> > non-existent heap TID I believe.
> 
> I think that it is necessary. We need to be prepared to perform a TID
> lookup with a wild page offset number to account for concurrent TID
> recycling. At least with nbtree plain index scans.

ISTM that we could do better with some fairly simple cooperation between index
and table AM. It should be rather rare to look up a TID that was removed
between finding the index entry and fetching the table entry, so a small
amount of extra work in that case ought to be OK.

Could we e.g., for logged tables, track the LSN of the leaf index page in the
IndexScanDesc and pass that to table_index_fetch_tuple() and only error out in
if the table relation has a newer LSN?  That leaves a small window for a
false-negative, but shouldn't have any false-positives?

I've seen enough bugs / corruption leading to indexes pointing to wrong and
nonexisting tuples to make me think it's worth being a bit more proactive
about raising errors for such cases.  Of course what I described above can
only detect index entries pointing to non-existing tuples, but that's still
a good bit better than nothing.

Greetings,

Andres Freund



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Peter Geoghegan
On Tue, Dec 17, 2024 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
> ISTM that we could do better with some fairly simple cooperation between index
> and table AM. It should be rather rare to look up a TID that was removed
> between finding the index entry and fetching the table entry, a small amount
> of extra work in that case ought to be ok.

Maybe, but I just want to be clear: as far as I know, as things stand
we're very permissive about what can happen around concurrent TID
recycling. We need to be because bitmap index scans can build a bitmap
based on the index as it was some time ago (among other reasons),
which cannot prevent concurrent TID recycling for TIDs that point to
dead-to-everybody tuples (or point to LP_DEAD heap page stubs).

> Could we e.g., for logged tables, track the LSN of the leaf index page in the
> IndexScanDesc and pass that to table_index_fetch_tuple() and only error out
> if the table relation has a newer LSN?  That leaves a small window for a
> false-negative, but shouldn't have any false-positives?

Technically that would work, but it might not be very useful.

It's very typical for a heap page LSN to be older than a corresponding
index leaf page LSN, since inserts start with the heap tuple insertion
(the index tuple insertion must happen afterwards). Plus index pages
almost always store more tuples than their corresponding heap pages
and so are presumably more likely to be modified.

> I've seen enough bugs / corruption leading to indexes pointing to wrong and
> nonexisting tuples to make me think it's worth being a bit more proactive
> about raising errors for such cases.  Of course what I described above can
> only detect index entries pointing to non-existing tuples, but that's still
> a good bit better than nothing.

OTOH we've had index_delete_check_htid for several years now, and I've
yet to hear a report involving one of its errors. That's not
conclusive, but it does suggest that this might not be a huge problem.

--
Peter Geoghegan



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Andres Freund
Hi,

On 2024-12-17 17:07:34 -0500, Peter Geoghegan wrote:
> On Tue, Dec 17, 2024 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
> > ISTM that we could do better with some fairly simple cooperation between index
> > and table AM. It should be rather rare to look up a TID that was removed
> > between finding the index entry and fetching the table entry, a small amount
> > of extra work in that case ought to be ok.
> 
> Maybe, but I just want to be clear: as far as I know, as things stand
> we're very permissive about what can happen around concurrent TID
> recycling. We need to be because bitmap index scans can build a bitmap
> based on the index as it was some time ago (among other reasons),
> which cannot prevent concurrent TID recycling for TIDs that point to
> dead-to-everybody tuples (or point to LP_DEAD heap page stubs).

Bitmap heapscans go through a distinct path, so it'd be trivial to accept tids
pointing into the void for bitmap scans but not index [only] scans.


> > Could we e.g., for logged tables, track the LSN of the leaf index page in the
> > IndexScanDesc and pass that to table_index_fetch_tuple() and only error out
> > if the table relation has a newer LSN?  That leaves a small window for a
> > false-negative, but shouldn't have any false-positives?
> 
> Technically that would work, but it might not be very useful.
> 
> It's very typical for a heap page LSN to be older than a corresponding
> index leaf page LSN, since inserts start with the heap tuple insertion
> (the index tuple insertion must happen afterwards). Plus index pages
> almost always store more tuples than their corresponding heap pages
> and so are presumably more likely to be modified.

I think the case where we'd need to *not* error out would be if the heap page
LSN is *newer* than the index leaf page LSN, as that could be due to
lazy_vacuum_heap() marking items unused between getting the tid from the index
scan and the heap access.

For it to be legal to remove items from the heap page between the scan of the
leaf page and the heap fetch, the btree page would need to have been processed
by btvacuumpage() (with a cleanup lock) and the heap page by
lazy_vacuum_heap_page().

If the heap page has an LSN that's older than the leaf page's LSN when we had
it pinned/locked, it could not have been processed by lazy_vacuum_heap_page()
since.  But if the heap page's LSN is newer, it might have. Obviously there
might be other reasons for the LSN to increase, leading to false negatives,
but it's a quite tight window.


> > I've seen enough bugs / corruption leading to indexes pointing to wrong and
> > nonexisting tuples to make me think it's worth being a bit more proactive
> > about raising errors for such cases.  Of course what I described above can
> > only detect index entries pointing to non-existing tuples, but that's still
> > a good bit better than nothing.
> 
> OTOH we've had index_delete_check_htid for several years now, and I've
> yet to hear a report involving one of its errors. That's not
> conclusive, but it does suggest that this might not be a huge problem.

Not sure that tells us that much - there are lots of workloads where you'd
never end up in heap_index_delete_tuples().

Greetings,

Andres Freund



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Peter Geoghegan
On Tue, Dec 17, 2024 at 5:27 PM Andres Freund <andres@anarazel.de> wrote:
> Bitmap heapscans go through a distinct path, so it'd be trivial to accept tids
> pointing into the void for bitmap scans but not index [only] scans.

I wasn't clear. Plain nbtree index scans can drop their leaf page
buffer pin, which makes them prone to following TIDs subject to
concurrent TID recycling. That's the case that any new sanity check in
this area will need to not break. It's fairly similar to what we need
to do with bitmap index scans, but *does* use this path.

I actually think that we might well need to extend this "drop pin to
avoid blocking VACUUM" behavior to GiST and SP-GiST before long (to
avoid regressions when we fix the index-only scan bug affecting GIST +
SP-GiST). I think that it (i.e. the _bt_drop_lock_and_maybe_pin stuff)
needs to be generalized, since it's a fairly index-AM-independent
thing.

> I think the case where we'd need to *not* error out would be if the heap page
> LSN is *newer* than the index leaf page LSN, as that could be due to
> lazy_vacuum_heap() marking items unused between getting the tid from the index
> scan and the heap access.

Ah, okay.

> For it to be legal to remove items from the heap page between the scan of the
> leaf page and the heap fetch, the btree page would need to have been processed
> by btvacuumpage() (with a cleanup lock) and the heap page by
> lazy_vacuum_heap_page().

Just to be clear, the cleanup lock isn't only protective for scans
that don't drop their pin. This is why we need to worry about plain
nbtree index scans that use an MVCC snapshot. (I think that you know
this already, but just want to be clear.)

--
Peter Geoghegan



Re: Exceptional md.c paths for recovery and zero_damaged_pages

From: Heikki Linnakangas
On 17/12/2024 23:28, Andres Freund wrote:
> On 2024-12-17 19:57:13 +0200, Heikki Linnakangas wrote:
>> On 14/12/2024 01:44, Andres Freund wrote:
>>> The zero_damaged_pages path in bufmgr.c makes sense to me, but this one seems
>>> less sane to me.  If you want to recover from a data corruption event and
>>> can't dump the data because a seqscan stumbles over an invalid page -
>>> zero_damaged_pages makes sense.
>>>
>>> Seqscans or tidscans won't reach the mdreadv() path, because they check the
>>> relation size first.  Which leaves access from indexes - e.g. an index pointer
>>> beyond the end of the heap.  But in that case it's not sane to use
>>> zero_damaged_pages, because that's almost a guarantee for worsening corruption
>>> in the future, because the now empty heap page will eventually be filled with
>>> new tuples, which will then be pointed to by index entries that were created
>>> before the zeroing.
>>
>> Well, if you need to use zero_damaged_pages, you're screwed already, so I
>> don't think the worsening-corruption argument matters much.
> 
> Well, it matters in the sense that it's important to keep seqscans somewhat
> working, as that's required for extracting as much data as possible with
> pg_dump. But I don't think there's an equivalent need to keep index scans
> working, given that the only valid action is to reindex anyway.
> 
> 
>> And you have the same problem by pages zeroed by a seqscan too.
> 
> I don't think so? For seqscans we should *never* hit the "zero a page beyond
> EOF" path, because the heapscan will check the relation size at the start of
> the scan.  You definitely can hit the case of zeroing a heap page, but that
> page will still correspond to an on-disk page.

I meant that this scenario:

0. Heap block 123 has some live tuples on it, but it is corrupt.
1. set zero_damaged_pages=on
2. Perform seqscan. It zeroes block 123
3. set zero_damaged_pages=off
4. Insert new tuples. They get inserted to block 123.

Any index entries for the original heap tuples on block 123 that got 
zeroed out will now incorrectly point to the new tuples you inserted. No 
reads beyond EOF involved.

-- 
Heikki Linnakangas
Neon (https://neon.tech)