Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Andres Freund
Subject Re: problems with making relfilenodes 56-bits
Date
Msg-id 20221004153409.ukahqfpudxvbiut4@awork3.anarazel.de
Whole thread Raw
In response to Re: problems with making relfilenodes 56-bits  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: problems with making relfilenodes 56-bits  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 2022-10-04 15:05:47 +0200, Matthias van de Meent wrote:
> On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote:
> > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
> > > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:
> > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but
> > > > only store a 2 byte offset from the page header xl_prev within each record.
> > >
> > > With that small xl_prev we may not detect partial page writes in
> > > recycled segments; or other issues in the underlying file system. With
> > > small record sizes, the chance of returning incorrect data would be
> > > significant for small records (it would be approximately the chance of
> > > getting a record boundary on the underlying page boundary * chance of
> > > getting the same MAXALIGN-adjusted size record before the persistence
> > > boundary). That issue is part of the reason why my proposed change
> > > upthread still contains the full xl_prev.
> >
> > What exactly is the theory for this significant increase? I don't think
> > xl_prev provides a meaningful protection against torn pages in the first
> > place?
> 
> XLog pages don't have checksums, so they do not provide torn page
> protection capabilities on their own.
> A singular xlog record is protected against torn page writes through
> the checksum that covers the whole record - if only part of the record
> was written, we can detect that through the mismatching checksum.
> However, if records end at the tear boundary, we must know for certain
> that any record that starts after the tear is the record that was
> written after the one before the tear. Page-local references/offsets
> would not work, because the record decoding doesn't know which xlog
> page the record should be located on; it could be both the version of
> the page before it was recycled, or the one after.
> Currently, we can detect this because the value of xl_prev will point
> to a record far in the past (i.e. not the expected value), but with a
> page-local version of xl_prev we would be less likely to detect torn
> pages (and thus be unable to handle this without risk of corruption)
> due to the significant chance of the truncated xl_prev value being the
> same in both the old and new record.

Think this is addressable, see below.


> Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> recycled and receives a new record C at the place of A with the same
> length.
> 
> With your proposal, record B would still be a valid record when it
> follows C; as the page-local serial number/offset reference to the
> previous record would still match after the torn write.
> With the current situation and a full LSN in xl_prev, the mismatching
> value in the xl_prev pointer allows us to detect this torn page write
> and halt replay, before redoing an old (incorrect) record.

In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.


The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.


> PS. there are ideas floating around (I heard about this one from
> Heikki) where we could concatenate WAL records into one combined
> record that has only one shared xl_prev+crc; which would save these 12
> bytes per record. However, that needs a lot of careful consideration
> to make sure that the persistence guarantee of operations doesn't get
> lost somewhere in the traffic.

One version of that is to move the CRCs to the page header, make the pages
smaller (512 bytes / 4K, depending on the hardware), and to pad out partial
pages when flushing them out. Rewriting pages is bad for hardware and prevents
having multiple WAL IOs in flight at the same time.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: get rid of Abs()
Next
From: Peter Eisentraut
Date:
Subject: Allow tests to pass in OpenSSL FIPS mode