Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: problems with making relfilenodes 56-bits |
Date | |
Msg-id | CA+TgmobzL+3SYaDfpzwFx+Fp6HjhMLNnjXGnBD+LLpaDYMhApA@mail.gmail.com |
In response to | Re: problems with making relfilenodes 56-bits (Andres Freund <andres@anarazel.de>) |
Responses | Re: problems with making relfilenodes 56-bits |
List | pgsql-hackers |
On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
> > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > recycled and receives a new record C at the place of A with the same
> > length.
> >
> > With your proposal, record B would still be a valid record when it
> > follows C; as the page-local serial number/offset reference to the
> > previous record would still match after the torn write.
> > With the current situation and a full LSN in xl_prev, the mismatching
> > value in the xl_prev pointer allows us to detect this torn page write
> > and halt replay, before redoing an old (incorrect) record.
>
> In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> As you specified it, C has the same length as A, so B's xl_prev will be the
> same whether it's a page local offset or the full 8 bytes.
>
> The relevant protection against issues like this isn't xl_prev, it's the
> CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> than the offset.

I'm really confused. xl_prev *is* a full-width LSN currently, as I understand it. So in the scenario that Matthias poses, let's say the segment was previously 000000010000000400000025 and now it's 000000010000000400000049. So if a given chunk of the page is leftover from when the page was 000000010000000400000025, it will have xl_prev values like 4/25xxxxxx. If it's been rewritten since the segment was recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell whether record B has been overwritten with a new record since the segment was recycled. But if we stored only 2 bytes in each xl_prev field, that would no longer be possible.

So I'm lost. It seems like Matthias has correctly identified a real hazard, and not some weird corner case but actually something that will happen regularly.
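
To make the segment example above concrete, here is a minimal Python sketch; it is not PostgreSQL code. It assumes the default 16 MB WAL segment size and models the hypothetical 2-byte xl_prev as the low 16 bits of the LSN. The function names and the record offset are made up for illustration.

```python
# Illustrative model only; names and the record offset are hypothetical.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def segment_start_lsn(filename: str) -> int:
    """LSN of the first byte of a WAL segment, from its 24-hex-digit name.

    The name is timeline (8 digits), log id (8 digits), segment no (8 digits).
    """
    log_id = int(filename[8:16], 16)
    seg_no = int(filename[16:24], 16)
    return (log_id << 32) | (seg_no * WAL_SEGMENT_SIZE)


def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN in the high/low hex style used in the text."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:08X}"


def full_check(stored_xl_prev: int, expected_prev: int) -> bool:
    """Current behaviour: compare the whole 8-byte xl_prev."""
    return stored_xl_prev == expected_prev


def truncated_check(stored_xl_prev: int, expected_prev: int) -> bool:
    """Assumed 2-byte variant: only the low 16 bits are compared."""
    return (stored_xl_prev & 0xFFFF) == (expected_prev & 0xFFFF)


old_start = segment_start_lsn("000000010000000400000025")
new_start = segment_start_lsn("000000010000000400000049")

# Record A (old) and record C (new) start at the same offset and have the
# same length, so stale record B sits exactly where the reader looks next.
record_offset = 0x1F40  # hypothetical
stale_b_xl_prev = old_start + record_offset  # written before recycling
expected_prev = new_start + record_offset    # what the reader expects after C

print(format_lsn(stale_b_xl_prev), format_lsn(expected_prev))
print("full xl_prev accepts stale B:  ", full_check(stale_b_xl_prev, expected_prev))
print("2-byte xl_prev accepts stale B:", truncated_check(stale_b_xl_prev, expected_prev))
```

In this model the full comparison rejects the stale record B, because its xl_prev still points into the 4/25 range while the reader expects a value in the 4/49 range. The 2-byte comparison accepts it, because both values reduce to the same in-page bits.
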
All you need is for the old segment that got recycled to have a record starting at the same place where the page tore, and for the previous record to have been the same length as the one on the new page. Given that there are only <~1024 places on a page where a record can start, and given that in many workloads the lengths of WAL records will be fairly uniform, this doesn't seem unlikely at all.

A way to up the chances of detecting this case would be to store only 2 or 4 bytes of xl_prev on disk, but arrange to include the full xl_prev value in the xl_crc calculation. Then your chances of a collision are about 2^-32, or maybe more if you posit that CRC is a weak and crappy algorithm, but even then it's strictly better than just hoping that there isn't a tear point at a record boundary where the same length record precedes the tear in both the old and new WAL segments.

However, on the flip side, even if you assume that CRC is a fantastic algorithm with beautiful and state-of-the-art bit mixing, the chances of it failing to notice the problem are still >0, whereas the current algorithm that compares the full xl_prev value is a sure thing. Because xl_prev values are never repeated, it's certain that when a segment is recycled, any values that were legal for the old one aren't legal in the new one.

--
Robert Haas
EDB: http://www.enterprisedb.com