Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Andres Freund
Subject Re: problems with making relfilenodes 56-bits
Date
Msg-id 20221004183003.wnzqretzgxaqdg77@awork3.anarazel.de
Whole thread Raw
In response to Re: problems with making relfilenodes 56-bits  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: problems with making relfilenodes 56-bits
List pgsql-hackers
Hi,

On 2022-10-04 13:36:33 -0400, Robert Haas wrote:
> On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
> > > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > > recycled and receives a new record C at the place of A with the same
> > > length.
> > >
> > > With your proposal, record B would still be a valid record when it
> > > follows C; as the page-local serial number/offset reference to the
> > > previous record would still match after the torn write.
> > > With the current situation and a full LSN in xl_prev, the mismatching
> > > value in the xl_prev pointer allows us to detect this torn page write
> > > and halt replay, before redoing an old (incorrect) record.
> >
> > In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> > As you specified it, C has the same length as A, so B's xl_prev will be the
> > same whether it's a page local offset or the full 8 bytes.
> >
> > The relevant protection against issues like this isn't xl_prev, it's the
> > CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> > than the offset.
>
> I'm really confused. xl_prev *is* a full-width LSN currently, as I
> understand it. So in the scenario that Matthias poses, let's say the
> segment was previously 000000010000000400000025 and now it's
> 000000010000000400000049. So if a given chunk of the page is leftover
> from when the page was 000000010000000400000025, it will have xl_prev
> values like 4/25xxxxxx. If it's been rewritten since the segment was
> recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell
> whether record B has been overwritten with a new record since the
> segment was recycled. But if we stored only 2 bytes in each xl_prev
> field, that would no longer be possible.

Oh, I think I misunderstood the scenario. I was thinking of cases where we
write out a bunch of pages, crash, only some of the pages made it to disk, we
then write new ones of the same length, and now find a record after the "real"
end of the WAL to be valid.  Not sure how I mentally swallowed the "recycled".

For the recycling scenario to be a problem we'll also need to crash, with
parts of the page ending up with the new contents and parts of the page ending
up with the old "pre recycling" content, correct? Because without a crash
we'll have zeroed out the remainder of the page (well, leaving walreceiver out
of the picture, grr).

However, this can easily happen without any record boundaries on the partially
recycled page, so we rely on the CRCs to protect against this.


Here I originally wrote a more in-depth explanation of the scenario I was
thinking about, where we alread rely on CRCs to protect us. But, ooph, I think
they don't reliably, with today's design. But maybe I'm missing more things
today. Consider the following sequence:

1) we write WAL like this:

[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]

2) crash, the sectors with A and C made it to disk, the one with B didn't

3) We replay A, discover B is invalid (possibly zeroes or old contents),
   insert a new record B' with the same length. Now it looks like this:

[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]

4) crash, the sector with B' makes it to disk

5) we replay A, B', C, because C has an xl_prev that's compatible with B'
   location and a valid CRC.

Oops.

I think this can happen both within a single page and across page boundaries.

I hope I am missing something here?


> A way to up the chances of detecting this case would be to store only
> 2 or 4 bytes of xl_prev on disk, but arrange to include the full
> xl_prev value in the xl_crc calculation.

Right, that's what I was suggesting as well.


> Then your chances of a collision are about 2^-32, or maybe more if you posit
> that CRC is a weak and crappy algorithm, but even then it's strictly better
> than just hoping that there isn't a tear point at a record boundary where
> the same length record precedes the tear in both the old and new WAL
> segments. However, on the flip side, even if you assume that CRC is a
> fantastic algorithm with beautiful and state-of-the-art bit mixing, the
> chances of it failing to notice the problem are still >0, whereas the
> current algorithm that compares the full xl_prev value is a sure
> thing. Because xl_prev values are never repeated, it's certain that when a
> segment is recycled, any values that were legal for the old one aren't legal
> in the new one.

Given that we already rely on the CRCs to detect corruption within a single
record spanning tear boundaries, this doesn't cause me a lot of heartburn. But
I suspect we might need to do something about the scenario I outlined above,
which likely would also increase the protection against this issue.

I think there might be reasonable ways to increase the guarantees based on the
2 byte xl_prev approach "alone". We don't have to store the offset from the
page header as a plain offset. What about storing something like:
  page_offset ^ (page_lsn >> wal_segsz_shift)

I think something like that'd result in prev_not_really_lsn typically not
simply matching after recycling. Of course it only provides so much
protection, given 16bits...

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: New strategies for freezing, advancing relfrozenxid early
Next
From: Andres Freund
Date:
Subject: Re: installcheck-world concurrency issues