Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Robert Haas
Subject Re: problems with making relfilenodes 56-bits
Date
Msg-id CA+TgmobzL+3SYaDfpzwFx+Fp6HjhMLNnjXGnBD+LLpaDYMhApA@mail.gmail.com
Whole thread Raw
In response to Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
Responses Re: problems with making relfilenodes 56-bits
List pgsql-hackers
On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
> > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > recycled and receives a new record C at the place of A with the same
> > length.
> >
> > With your proposal, record B would still be a valid record when it
> > follows C; as the page-local serial number/offset reference to the
> > previous record would still match after the torn write.
> > With the current situation and a full LSN in xl_prev, the mismatching
> > value in the xl_prev pointer allows us to detect this torn page write
> > and halt replay, before redoing an old (incorrect) record.
>
> In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> As you specified it, C has the same length as A, so B's xl_prev will be the
> same whether it's a page local offset or the full 8 bytes.
>
> The relevant protection against issues like this isn't xl_prev, it's the
> CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> than the offset.

I'm really confused. xl_prev *is* a full-width LSN currently, as I
understand it. So in the scenario that Matthias poses, let's say the
segment was previously 000000010000000400000025 and now it's
000000010000000400000049. So if a given chunk of the page is leftover
from when the page was 000000010000000400000025, it will have xl_prev
values like 4/25xxxxxx. If it's been rewritten since the segment was
recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell
whether record B has been overwritten with a new record since the
segment was recycled. But if we stored only 2 bytes in each xl_prev
field, that would no longer be possible.

So I'm lost. It seems like Matthias has correctly identified a real
hazard, and not some weird corner case but actually something that
will happen regularly. All you need is for the old segment that got
recycled to have a record stating at the same place where the page
tore, and for the previous record to have been the same length as the
one on the new page. Given that there's only <~1024 places on a page
where a record can start, and given that in many workloads the lengths
of WAL records will be fairly uniform, this doesn't seem unlikely at
all.

A way to up the chances of detecting this case would be to store only
2 or 4 bytes of xl_prev on disk, but arrange to include the full
xl_prev value in the xl_crc calculation. Then your chances of a
collision are about 2^-32, or maybe more if you posit that CRC is a
weak and crappy algorithm, but even then it's strictly better than
just hoping that there isn't a tear point at a record boundary where
the same length record precedes the tear in both the old and new WAL
segments. However, on the flip side, even if you assume that CRC is a
fantastic algorithm with beautiful and state-of-the-art bit mixing,
the chances of it failing to notice the problem are still >0, whereas
the current algorithm that compares the full xl_prev value is a sure
thing. Because xl_prev values are never repeated, it's certain that
when a segment is recycled, any values that were legal for the old one
aren't legal in the new one.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: interrupted tap tests leave postgres instances around
Next
From: Jeff Davis
Date:
Subject: Re: New strategies for freezing, advancing relfrozenxid early