Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Robert Haas
Subject Re: problems with making relfilenodes 56-bits
Date
Msg-id CA+TgmoYP=w5j3TCdmsZt=HtQMBe+E2X58YkZGvk+vvCdGStmWw@mail.gmail.com
Whole thread Raw
In response to Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Oct 4, 2022 at 2:30 PM Andres Freund <andres@anarazel.de> wrote:
> Consider the following sequence:
>
> 1) we write WAL like this:
>
> [record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]
>
> 2) crash, the sectors with A and C made it to disk, the one with B didn't
>
> 3) We replay A, discover B is invalid (possibly zeroes or old contents),
>    insert a new record B' with the same length. Now it looks like this:
>
> [record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]
>
> 4) crash, the sector with B' makes it to disk
>
> 5) we replay A, B', C, because C has an xl_prev that's compatible with B'
>    location and a valid CRC.
>
> Oops.
>
> I think this can happen both within a single page and across page boundaries.
>
> I hope I am missing something here?

If you are, I don't know what it is off-hand. That seems like a
plausible scenario to me. It does require the OS to write things out
of order, and I don't know how likely that is in practice, but the
answer probably isn't zero.

> I think there might be reasonable ways to increase the guarantees based on the
> 2 byte xl_prev approach "alone". We don't have to store the offset from the
> page header as a plain offset. What about storing something like:
>   page_offset ^ (page_lsn >> wal_segsz_shift)
>
> I think something like that'd result in prev_not_really_lsn typically not
> simply matching after recycling. Of course it only provides so much
> protection, given 16bits...

Maybe. That does seem somewhat better, but I feel like it's hard to
reason about whether it's safe in absolute terms or just resistant to
the precise scenario Matthias postulated while remaining vulnerable to
slightly modified versions.

How about this: remove xl_prev. widen xl_crc to 64 bits. include the
CRC of the previous WAL record in the xl_crc calculation. That doesn't
cut quite as many bytes out of the record size as your proposal, but
it feels like it should strongly resist pretty much every attack of
this general type, with only the minor disadvantage that the more
expensive CRC calculation will destroy all hope of getting anything
committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: installcheck-world concurrency issues
Next
From: Nikita Malakhov
Date:
Subject: Re: Pluggable toaster