Re: making relfilenodes 56 bits - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: making relfilenodes 56 bits
Msg-id: 20220711185736.zc3ht4n56ceqmh6d@awork3.anarazel.de
In response to: Re: making relfilenodes 56 bits (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: making relfilenodes 56 bits
List: pgsql-hackers
Hi,

On 2022-07-07 13:26:29 -0400, Robert Haas wrote:
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

ISTM that we should redefine pg_class_tblspc_relfilenode_index to only
cover relfilenode - afaics there's no real connection to the tablespace
anymore. That'd a) reduce the size of the index and b) guarantee
uniqueness across tablespaces.

I don't know where we could fit a sanity check that connects to all
databases and detects duplicates across all the pg_class instances.
Perhaps pg_amcheck?

It may be worth changing RelidByRelfilenumber() / its infrastructure to
not use reltablespace anymore.

> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used. The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it.

I think that's a very good idea. My concern around doing an XLogFlush()
is that it could lead to a lot of tiny f[data]syncs(), because not much
else needs to be written out. But the scheme you describe would likely
lead to the XLogFlush() flushing plenty of other WAL writes as well,
addressing that.

> If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed in the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

Even if the record isn't yet flushed out by the time we need it to be,
the deferral means that there's a good chance more useful records can
also be flushed out at the same time...

> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem. Notice in particular:
>
> #define MAX_MAPPINGS 62 /* 62 * 8 + 16 = 512 */
>
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one.

+1

Perhaps MAX_MAPPINGS should be at least partially computed instead of
doing the math in a comment? sizeof(RelMapping) could be used directly,
and we could define SIZEOF_RELMAPFILE_START with a StaticAssert()
enforcing it to be equal to offsetof(RelMapFile, mappings), if we move
crc & pad to *before* mappings - afaics that should be entirely doable.
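[Editor's note: a minimal sketch of that suggestion, assuming the crc and
pad fields have been moved ahead of the mappings array and that the patch
widens RelFileNumber to 64 bits; the exact field layout and the constant
names are illustrative, not what relmapper.c currently contains.]

#include "postgres.h"

#include "port/pg_crc32c.h"

/* assumption: the 56-bit-relfilenumber patch makes this 64 bits wide */
typedef uint64 RelFileNumber;

typedef struct RelMapping
{
	Oid			mapoid;			/* OID of a catalog */
	RelFileNumber mapfilenumber;	/* its relfilenumber */
} RelMapping;

/* bytes of RelMapFile preceding the mappings array (magic, num_mappings, crc, pad) */
#define SIZEOF_RELMAPFILE_START	16

/* derive the entry count from the 512-byte sector budget instead of a comment */
#define MAX_MAPPINGS \
	((512 - SIZEOF_RELMAPFILE_START) / sizeof(RelMapping))

typedef struct RelMapFile
{
	int32		magic;			/* always RELMAPPER_FILEMAGIC */
	int32		num_mappings;	/* number of valid RelMapping entries */
	pg_crc32c	crc;			/* CRC of the rest of the file */
	int32		pad;			/* keep the mappings array 8-byte aligned */
	RelMapping	mappings[MAX_MAPPINGS];
} RelMapFile;

/* the hard-coded start size must match the real layout ... */
StaticAssertDecl(offsetof(RelMapFile, mappings) == SIZEOF_RELMAPFILE_START,
				 "SIZEOF_RELMAPFILE_START is out of sync with RelMapFile");
/* ... and the whole file must still fit in one 512-byte sector */
StaticAssertDecl(sizeof(RelMapFile) <= 512,
				 "RelMapFile must fit in a 512-byte disk sector");

With a 16-byte RelMapping this works out to 31 mappings per 512-byte
file, which is exactly the shrinkage discussed below.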
> If we do nothing fancy here, the maximum number of mappings will have to be
> reduced from 62 to 31, which is a problem because global/pg_filenode.map
> currently has 48 entries. We could try to arrange to squeeze padding out of
> the RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're using
> already, never mind leaving room for future growth.

Ugh.

> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something.

I'm not so sure that's a good idea - while the hardware sector size
likely isn't 512 bytes on much storage anymore, it's still the size that
most storage protocols use. That means you need to be confident not just
that you can rely on storage atomicity, but also that nothing in the
filesystem <-> block layer <-> driver stack somehow causes a single
larger write to be split up into two.

And if you use a filesystem with a smaller filesystem block size, there
might not even be a choice about the write being split into two writes.
E.g. XFS still supports 512 byte blocks (although I think it's
deprecating block sizes < 1024).

Maybe the easiest fix here would be to replace the file atomically. Then
we don't need this <= 512 byte stuff at all. Relmap updates are done
rarely enough that I don't think the overhead of creating a separate
file, fsyncing it, renaming it into place, and fsyncing again would be a
problem?

Greetings,

Andres Freund
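[Editor's note: a minimal sketch of the atomic-replacement idea suggested
above, using existing fd.c primitives (OpenTransientFile, pg_fsync,
durable_rename); the function name, temp-file naming, and error handling
are assumptions for illustration, not part of any posted patch.]

#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "storage/fd.h"

/* hypothetical helper: write new relmap contents and swap them in atomically */
static void
write_relmap_file_atomic(const char *mapfilename, const void *data, Size len)
{
	char		tmpname[MAXPGPATH];
	int			fd;

	snprintf(tmpname, sizeof(tmpname), "%s.tmp", mapfilename);

	/* write the new contents into a temporary file ... */
	fd = OpenTransientFile(tmpname, O_WRONLY | O_CREAT | O_TRUNC | PG_BINARY);
	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not create file \"%s\": %m", tmpname)));

	if (write(fd, data, len) != (ssize_t) len)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write file \"%s\": %m", tmpname)));

	/* ... make sure the data is on disk ... */
	if (pg_fsync(fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m", tmpname)));

	if (CloseTransientFile(fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close file \"%s\": %m", tmpname)));

	/*
	 * ... and atomically replace the old file.  durable_rename() fsyncs the
	 * new file and its directory, so after a crash we see either the
	 * complete old contents or the complete new contents.
	 */
	durable_rename(tmpname, mapfilename, ERROR);
}

Because rename() within a directory is atomic, readers never observe a
torn file, which is what removes the dependence on the 512-byte size.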