Exceptional md.c paths for recovery and zero_damaged_pages - Mailing list pgsql-hackers
From: Andres Freund
Subject: Exceptional md.c paths for recovery and zero_damaged_pages
Date:
Msg-id: 3qxxsnciyffyf3wyguiz4besdp5t5uxvv3utg75cbcszojlz7p@uibfzmnukkbd
Responses: Re: Exceptional md.c paths for recovery and zero_damaged_pages
           Re: Exceptional md.c paths for recovery and zero_damaged_pages
List: pgsql-hackers
Hi,

We have a fair number of special paths in md.c that are specific to recovery.
E.g. in mdreadv() we do:

    v = _mdfd_getseg(reln, forknum, blocknum, false,
                     EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);

and

    /*
     * We are at or past EOF, or we read a partial block at EOF.
     * Normally this is an error; upper levels should never try to
     * read a nonexistent block.  However, if zero_damaged_pages
     * is ON or we are InRecovery, we should instead return zeroes
     * without complaining.  This allows, for example, the case of
     * trying to update a block that was later truncated away.
     */
    if (zero_damaged_pages || InRecovery)
    {
        for (BlockNumber i = transferred_this_segment / BLCKSZ;
             i < nblocks_this_segment;
             ++i)
            memset(buffers[i], 0, BLCKSZ);
        break;
    }

There's more, but this is sufficient to discuss.

As far as I can tell, nearly all InRecovery paths in md.c - including the
above - are basically unreachable, and have been for quite a while.

XLogReadBufferExtended() takes care to

a) create the fork if it doesn't yet exist, and

b) not read from beyond EOF - if the target block is beyond EOF, we extend
   the relation to be large enough instead.

Which afaict should suffice to prevent needing the above?

There *are* a few paths that don't go through XLogReadBufferExtended(), but
they seem to have their own protection. We have a bunch of heap redo records
that access the VM, but the VM takes care to call vm_extend() before reading
a block beyond EOF.
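Roughly, that protection is shaped like the following. This is a simplified
sketch rather than the actual code: the ReadBufferMode handling, invalid-page
tracking and locking are omitted, smgr stands for the relation's
SMgrRelation, rlocator/forknum/blkno/mode are XLogReadBufferExtended()'s
arguments, and extend_rel_to_block() is a hypothetical stand-in for the real
extension logic.

    /* a) create the fork if it doesn't exist yet (no-op when it does) */
    smgrcreate(smgr, forknum, true);

    /* b) never read beyond EOF */
    if (blkno < smgrnblocks(smgr, forknum))
    {
        /* the block exists on disk - read it normally */
        buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno,
                                           mode, NULL, true);
    }
    else
    {
        /*
         * The block is beyond EOF - instead of reading it, extend the
         * relation with zeroed pages up to and including blkno.
         * (hypothetical helper standing in for the real extension code)
         */
        buffer = extend_rel_to_block(smgr, forknum, blkno);
    }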
The specific shape of that protection in XLogReadBufferExtended() has evolved
over time (e.g. in [1]), but it's been there in some form for a *long* time -
since the original REDO/UNDO commit by Vadim (b58c0411bad4). The InRecovery
paths for _mdfd_getseg() seem to originate in 2004's 303e46ea932, and the
zero-beyond-EOF behavior seems to be from 2007's ef07221997e - although in
that commit it was just *restricted* to InRecovery.

What bothers me is that I do have some dim memory of understanding why we
might still need these paths, but I sure can't see it now. They're not
reached in our tests according to coverage [2], and nuking them doesn't cause
tests to fail, even if I run the tests with -Dsegsize_blocks=7. Is this dead
code that we've whacked around a good bunch for well over 10 years? I must be
missing something.

I think the other InRecovery paths are probably reachable:

- The checks in mdtruncate() make sense: if e.g. two subsequent vacuums
  truncated a relation twice in the same checkpoint and we crash while
  replaying those records, we'll already have the relation truncated to the
  shorter size, so replaying the earlier record asks to truncate to a length
  beyond the current EOF. I wonder if we'd be better off moving that
  condition to an argument to mdtruncate(), rather than checking InRecovery.

- The !InRecovery check in mdexists() is a recovery-specific optimization. I
  think there are some callers that suffer from the automatic closing and
  don't need the protection, but that's a separate optimization.

- The mdprefetch() InRecovery check might be reachable: we call it from
  recovery readahead, which can happen before we've extended the relation in
  XLogReadBufferExtended().

Separately from the above, I am concerned about the zero_damaged_pages
specific path. For one, it's a very narrow path: we don't take
zero_damaged_pages into account in mdreadv()'s call to _mdfd_getseg(), so
there'll be an error if the actual end of the relation is in the last
segment. But even if we fixed that, just making up a page of zeroes in
mdreadv() seems like a bad idea - we'll end up with a page in shared buffers
that's well beyond EOF. Which requires us to have some protections with scary
error messages in the relation extension paths. If the page ends up being
dirtied (e.g. because we initialize it), we'll write it out and create a file
with holes - or hit ENOSPC, because we've not even tried to make sure that
the relation can be extended to cover it.

The zero_damaged_pages path in bufmgr.c makes sense to me, but this one seems
less sane. If you want to recover from a data corruption event and can't dump
the data because a seqscan stumbles over an invalid page, zero_damaged_pages
makes sense. But seqscans and tidscans won't reach the mdreadv() path,
because they check the relation size first. Which leaves access from indexes -
e.g. an index pointer beyond the end of the heap. In that case it's not sane
to use zero_damaged_pages, because it's almost a guarantee for worsening
corruption in the future: the now-empty heap page will eventually be filled
with new tuples, which will then be pointed to by index entries that were
created before the zeroing.
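For comparison, the bufmgr.c path only ever zeroes a page that was actually
read from within the relation's EOF but fails validation. Roughly - again a
simplified sketch rather than the code verbatim (the real code also honors a
per-read zero-on-error flag; blocknum, bufBlock, smgr and forknum stand for
the obvious locals):

    if (!PageIsVerifiedExtended((Page) bufBlock, blocknum,
                                PIV_LOG_WARNING | PIV_REPORT_STAT))
    {
        if (zero_damaged_pages)
        {
            /* the page exists on disk, but its header/checksum is bogus */
            ereport(WARNING,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("invalid page in block %u of relation %s; zeroing out page",
                            blocknum,
                            relpath(smgr->smgr_rlocator, forknum))));
            memset(bufBlock, 0, BLCKSZ);
        }
        else
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("invalid page in block %u of relation %s",
                            blocknum,
                            relpath(smgr->smgr_rlocator, forknum))));
    }

The page being zeroed here exists on disk; the mdreadv() path instead
manufactures a page for storage that does not exist.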
So what's the point of having this path in mdreadv()?

Greetings,

Andres Freund

[1] https://postgr.es/m/CAM-w4HNUgUcJqDNbgfAU%3DYjksHZDPPbT7RH73pYrzd7ByYrjrA%40mail.gmail.com
[2] https://coverage.postgresql.org/src/backend/storage/smgr/md.c.gcov.html