Re: Make relfile tombstone files conditional on WAL level - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Make relfile tombstone files conditional on WAL level |
Date | |
Msg-id | 20210802223819.lpo4z5kigxyniytd@alap3.anarazel.de |
In response to | Re: Make relfile tombstone files conditional on WAL level (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Make relfile tombstone files conditional on WAL level
Re: Make relfile tombstone files conditional on WAL level |
List | pgsql-hackers |
Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:
> The two most principled solutions to this problem that I can see are
> (1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a lot of stuff more complicated. We imo should rather focus on optimizing the things wal_level=minimal accelerates a lot than on adding complications for wal_level=minimal. Such optimizations would have practical relevance, and there's plenty of low-hanging fruit.

> (2) use 64-bit relfilenodes. I have
> been reluctant to support #1 because it's hard for me to believe that
> there aren't cases where being able to skip a whole lot of WAL-logging
> doesn't work out to a nice performance win, but I realize opinions on
> that topic vary. And I'm pretty sure that Andres, at least, will hate
> #2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually increasing the size. We could take 3 bytes from the fork number and use that to get to a 7 byte relfilenode portion. 7 bytes are probably enough for everyone.

It's not like we can use those bytes in a useful way, due to alignment requirements. Declaring that the high 7 bytes are for the relNode portion and the low byte for the fork would still allow efficient comparisons and doesn't seem too ugly.

> So I don't really have a good idea. I agree this tombstone system is a
> bit of a wart, but I'm not sure that this patch really makes anything
> any better, and I'm not really seeing another idea that seems better
> either.
>
> Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks relfilenodes. I still think that's a decent solution for just the problem at hand.
But it'd also potentially be the way to redesign relation forks and even slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace and fork number into that table, mapping them to an 8 byte "new relfilenode". That'd only make the "new relfilenode" unique within a database, but that'd be sufficient for our purposes. It'd give us a buffertag consisting of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number

Of course, it'd add some complexity too, because a buffertag alone wouldn't be sufficient to read data (as you'd need the tablespace oid from elsewhere). But that's probably ok, I think all relevant places would have that information.

It's probably possible to remove the database oid from the tag as well, but it'd make CREATE DATABASE trickier - we'd need to change the filenames of tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund
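To make the byte counts in the message above concrete, here is a minimal C sketch of the three layouts it mentions: the current 20-byte tag, the "7 bytes of relNode plus 1 byte of fork" packing, and the slimmed 16-byte tag. All type, macro, and function names are hypothetical and chosen for illustration; they are not PostgreSQL's actual definitions, and the field order in the slim tag is an assumption made only to avoid padding.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical names throughout; not PostgreSQL's real definitions. */

/* Layout as listed in the message: five 4-byte fields, 20 bytes total. */
typedef struct CurrentTagSketch
{
    uint32_t spc_oid;    /* tablespace oid */
    uint32_t db_oid;     /* database oid */
    uint32_t rel_node;   /* "relfilenode oid" */
    int32_t  fork_num;   /* fork number */
    uint32_t block_num;  /* block number */
} CurrentTagSketch;

static_assert(sizeof(CurrentTagSketch) == 20, "five 4-byte fields");

/*
 * Packing idea: reuse the 8 bytes occupied by rel_node + fork_num.
 * The high 7 bytes hold the relNode, the low byte holds the fork.
 */
#define FORK_BITS   8
#define RELNODE_MAX ((UINT64_C(1) << 56) - 1)   /* largest 7-byte value */

/* Caller must ensure rel_node <= RELNODE_MAX. */
static inline uint64_t
pack_rel_fork(uint64_t rel_node, uint8_t fork)
{
    return (rel_node << FORK_BITS) | fork;
}

static inline uint64_t
unpack_rel_node(uint64_t packed)
{
    return packed >> FORK_BITS;
}

static inline uint8_t
unpack_fork(uint64_t packed)
{
    return (uint8_t) (packed & 0xFF);
}

/*
 * Slimmed tag from the shared-table idea: database oid, 8-byte
 * relfilenode, block number. The 8-byte field is placed first here
 * (an assumption) so that no alignment padding is needed.
 */
typedef struct SlimTagSketch
{
    uint64_t rel_node;   /* 8 byte "new relfilenode", unique per database */
    uint32_t db_oid;     /* database oid */
    uint32_t block_num;  /* block number */
} SlimTagSketch;

static_assert(sizeof(SlimTagSketch) == 16, "8 + 4 + 4 bytes, no padding");
```

Because the relNode sits in the high bytes, a single integer comparison of two packed values orders them by relNode first and by fork second, which is one way to read the "efficient comparisons" remark. The sketch says nothing about how the shared table itself would be stored or looked up.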