Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers
From | Matthias van de Meent |
---|---|
Subject | Re: problems with making relfilenodes 56-bits |
Date | |
Msg-id | CAEze2Wh3nv2YuK1y+o_yGC9Abx-1EDgyKZ324891mowxA1OinA@mail.gmail.com Whole thread Raw |
In response to | Re: problems with making relfilenodes 56-bits (Andres Freund <andres@anarazel.de>) |
Responses |
Re: problems with making relfilenodes 56-bits
|
List | pgsql-hackers |
On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote: > > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > > > only store a 2 byte offset from the page header xl_prev within each record. > > > > With that small xl_prev we may not detect partial page writes in > > recycled segments; or other issues in the underlying file system. With > > small record sizes, the chance of returning incorrect data would be > > significant for small records (it would be approximately the chance of > > getting a record boundary on the underlying page boundary * chance of > > getting the same MAXALIGN-adjusted size record before the persistence > > boundary). That issue is part of the reason why my proposed change > > upthread still contains the full xl_prev. > > What exactly is the theory for this significant increase? I don't think > xl_prev provides a meaningful protection against torn pages in the first > place? XLog pages don't have checksums, so they do not provide torn page protection capabilities on their own. A singular xlog record is protected against torn page writes through the checksum that covers the whole record - if only part of the record was written, we can detect that through the mismatching checksum. However, if records end at the tear boundary, we must know for certain that any record that starts after the tear is the record that was written after the one before the tear. Page-local references/offsets would not work, because the record decoding doesn't know which xlog page the record should be located on; it could be both the version of the page before it was recycled, or the one after. Currently, we can detect this because the value of xl_prev will point to a record far in the past (i.e. not the expected value), but with a page-local version of xl_prev we would be less likely to detect torn pages (and thus be unable to handle this without risk of corruption) due to the significant chance of the truncated xl_prev value being the same in both the old and new record. Example: Page { [ record A ] | tear boundary | [ record B ] } gets recycled and receives a new record C at the place of A with the same length. With your proposal, record B would still be a valid record when it follows C; as the page-local serial number/offset reference to the previous record would still match after the torn write. With the current situation and a full LSN in xl_prev, the mismatching value in the xl_prev pointer allows us to detect this torn page write and halt replay, before redoing an old (incorrect) record. Kind regards, Matthias van de Meent PS. there are ideas floating around (I heard about this one from Heikki) where we could concatenate WAL records into one combined record that has only one shared xl_prev+crc; which would save these 12 bytes per record. However, that needs a lot of careful consideration to make sure that the persistence guarantee of operations doesn't get lost somewhere in the traffic.
pgsql-hackers by date: