Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: problems with making relfilenodes 56-bits
Date
Msg-id CAEze2Wh3nv2YuK1y+o_yGC9Abx-1EDgyKZ324891mowxA1OinA@mail.gmail.com
Whole thread Raw
In response to Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
Responses Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
> > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:
> > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but
> > > only store a 2 byte offset from the page header xl_prev within each record.
> >
> > With that small xl_prev we may not detect partial page writes in
> > recycled segments; or other issues in the underlying file system. With
> > small record sizes, the chance of returning incorrect data would be
> > significant for small records (it would be approximately the chance of
> > getting a record boundary on the underlying page boundary * chance of
> > getting the same MAXALIGN-adjusted size record before the persistence
> > boundary). That issue is part of the reason why my proposed change
> > upthread still contains the full xl_prev.
>
> What exactly is the theory for this significant increase? I don't think
> xl_prev provides a meaningful protection against torn pages in the first
> place?

XLog pages don't have checksums, so they do not provide torn page
protection capabilities on their own.
A singular xlog record is protected against torn page writes through
the checksum that covers the whole record - if only part of the record
was written, we can detect that through the mismatching checksum.
However, if records end at the tear boundary, we must know for certain
that any record that starts after the tear is the record that was
written after the one before the tear. Page-local references/offsets
would not work, because the record decoding doesn't know which xlog
page the record should be located on; it could be both the version of
the page before it was recycled, or the one after.
Currently, we can detect this because the value of xl_prev will point
to a record far in the past (i.e. not the expected value), but with a
page-local version of xl_prev we would be less likely to detect torn
pages (and thus be unable to handle this without risk of corruption)
due to the significant chance of the truncated xl_prev value being the
same in both the old and new record.

Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.

With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.

Kind regards,

Matthias van de Meent

PS. there are ideas floating around (I heard about this one from
Heikki) where we could concatenate WAL records into one combined
record that has only one shared xl_prev+crc; which would save these 12
bytes per record. However, that needs a lot of careful consideration
to make sure that the persistence guarantee of operations doesn't get
lost somewhere in the traffic.



pgsql-hackers by date:

Previous
From: "Daniel Verite"
Date:
Subject: Re: [patch] \g with multiple result sets and \watch with copy queries
Next
From: Tom Lane
Date:
Subject: Re: [patch] \g with multiple result sets and \watch with copy queries