Re: Lowering the default wal_blocksize to 4K - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Lowering the default wal_blocksize to 4K |
Date | |
Msg-id | 20231011024744.hyqhahep6lpvv4pp@awork3.anarazel.de |
In response to | Re: Lowering the default wal_blocksize to 4K (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
Hi,

On 2023-10-11 14:39:12 +1300, Thomas Munro wrote:
> On Wed, Oct 11, 2023 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > > the differing space is worse than the gain. Or do something in between - having
> > > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > > convinced that having segment and block size in there, as 32bit numbers no
> > > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > > the xlog block or segment size differ.
> > >
> > > Hmm. I don't think we should remove those checks, as I can see people
> > > that would want to change their XLog block size with e.g.
> > > pg_reset_wal.
> >
> > I don't think that's something we need to address in every physical
> > segment. For one, there's no option to do so. But more importantly, if they
> > don't change the xlog block size, we'll just accept random WAL as well. If
> > somebody goes to the trouble of writing a custom tool, they can live with the
> > consequences of that potentially causing breakage. Particularly if the checks
> > wouldn't meaningfully prevent that anyway.
>
> How about this idea: Put the system ID etc into the new record Robert
> is proposing for the redo point, and also into the checkpoint record,
> so that it's at both ends of the to-be-replayed range.

I think that's a very good idea.

> That just leaves the WAL segments in between. If you find yourself writing
> a new record that would go in the first usable byte of a segment, insert a
> new special system ID (etc) record that will be checked during replay.

I don't see how we can do that without incurring a lot of overhead though.
This determination would need to happen in ReserveXLogInsertLocation(), while
holding the spinlock. Which is one of the most contended bits of code in
postgres. The whole reason that we have this "byte pos" to LSN conversion
stuff is to make the spinlock-protected part of ReserveXLogInsertLocation() as
short as possible.

> For segments that start with XLP_FIRST_IS_CONTRECORD, don't worry about it:
> those already form part of a chain of verification (xlp_rem_len, xl_crc)
> that started on the preceding page, so it seems almost impossible to
> accidentally replay from a segment that came from another system.

But I think we might just be ok with logic similar to this, even for the
non-contrecord case. If recovery starts in one segment where we have verified
sysid, xlog block size etc and we encounter a WAL record starting on the first
"content byte" of a segment, we can still verify that the prev LSN is correct
etc. Sure, if you try hard you could come up with a scenario where you could
mislead such a check, but we don't need to protect against intentional malice
here, just against accidents.

Greetings,

Andres Freund
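For readers who have not looked at the "byte pos" to LSN conversion mentioned above, here is a rough, standalone sketch of the idea. It is not the PostgreSQL source: the function names are hypothetical, and the constants (8 kB WAL blocks, 16 MB segments, 24-byte short and 40-byte long page headers) are assumptions chosen for illustration. The point is that the insert position can be tracked as a count of usable bytes, so the spinlock-protected part only has to advance a counter, while the mapping to a physical LSN (which has to account for the differing header sizes) happens outside the lock.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; real builds take these from configuration. */
#define BLCKSZ_WAL      8192u                 /* WAL block size */
#define SEG_SIZE        (16u * 1024 * 1024)   /* WAL segment size */
#define SHORT_HDR       24u                   /* short page header size */
#define LONG_HDR        40u                   /* long header, first page of a segment */

#define USABLE_PER_PAGE (BLCKSZ_WAL - SHORT_HDR)
#define USABLE_PER_SEG  ((uint64_t) (SEG_SIZE / BLCKSZ_WAL) * USABLE_PER_PAGE \
                         - (LONG_HDR - SHORT_HDR))

/* Map a usable byte position (page headers excluded) to a physical LSN. */
static uint64_t
byte_pos_to_lsn(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / USABLE_PER_SEG;
    uint64_t bytesleft = bytepos % USABLE_PER_SEG;
    uint64_t seg_offset;

    if (bytesleft < BLCKSZ_WAL - LONG_HDR)
    {
        /* Lands on the first page of the segment, after the long header. */
        seg_offset = bytesleft + LONG_HDR;
    }
    else
    {
        /* Skip the first page, then count whole pages with short headers. */
        bytesleft -= BLCKSZ_WAL - LONG_HDR;
        seg_offset = BLCKSZ_WAL
            + (bytesleft / USABLE_PER_PAGE) * BLCKSZ_WAL
            + (bytesleft % USABLE_PER_PAGE) + SHORT_HDR;
    }

    assert(seg_offset < SEG_SIZE);
    return fullsegs * SEG_SIZE + seg_offset;
}

/*
 * Hypothetical variant: if every page carried the same header size, the
 * special case for the first page of a segment would disappear.
 */
static uint64_t
byte_pos_to_lsn_uniform(uint64_t bytepos, uint32_t hdr)
{
    uint64_t usable = BLCKSZ_WAL - hdr;

    return (bytepos / usable) * BLCKSZ_WAL + (bytepos % usable) + hdr;
}
```

Under these assumed sizes, byte position 0 maps to offset 40 (just past the long header), and the first usable byte of the second page maps to offset 8216. The uniform variant needs no branch, which is roughly the simplification that dropping the short/long header distinction would buy.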