Re: making relfilenodes 56 bits - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: Re: making relfilenodes 56 bits
Msg-id: CAFiTN-t9j=z423_UcvrdwD0Hry5GBuSXJEU72sftRec9dM-1ZA@mail.gmail.com
In response to: Re: making relfilenodes 56 bits (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Thu, Jul 7, 2022 at 10:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have accepted all the suggestions; please find my inline replies where we need more thought.

> buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
> changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
>                         CLEAR_BUFFERTAG(buf->tag);
>                         ^~~~~~~~~~~~~~~~~~~~~~~~~
> ../../../../src/include/storage/buf_internals.h:122:14: note: expanded
> from macro 'CLEAR_BUFFERTAG'
>         (a).forkNum = InvalidForkNumber, \
>                     ^ ~~~~~~~~~~~~~~~~~
> 1 error generated.

Hmm, since we are now using an unsigned int field, IMHO we can define InvalidForkNumber as 255 instead of -1?

> > I think it is in line with oidCount, what do you think?
>
> Oh it definitely is, and maybe it's OK the way you have it. But the
> OID stuff has wraparound to worry about, and this doesn't; and this
> has the SetNextRelFileNumber and that doesn't; so it is not
> necessarily the case that the design which is best for that case is
> also best for this case.

Yeah, right, but now with the latest changes for piggybacking on the XLogFlush I think it is cleaner to have the count.

> I believe that the persistence model for SetNextRelFileNumber needs
> more thought. Right now I believe it's relying on the fact that, after
> we try to restore the dump, we'll try to perform a clean shutdown of
> the server before doing anything important, and that will persist the
> final value, whatever it ends up being. However, there's no comment
> explaining that theory of operation, and it seems pretty fragile
> anyway. What if things don't go as planned? Suppose the power goes out
> halfway through restoring the dump, and the user for some reason then
> gives up on running pg_upgrade and just tries to do random things with
> that server? Then I think there will be trouble, because nothing has
> updated the nextrelfilenumber value and yet there are potentially new
> files on disk.
> Maybe that's a stretch since I think other things might
> also break if you do that, but I'm also not sure that's the only
> scenario to worry about, especially if you factor in the possibility
> of future code changes, like changes to the timing of when we shut
> down and restart the server during pg_upgrade, or other uses of
> binary-upgrade mode, or whatever. I don't know. Perhaps it's not
> actually broken but I'm inclined to think it should be logging its
> changes.

But we are already logging this if we set a relfilenumber that is beyond the already-logged range; am I missing something? Check this change:

+       relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
+       if (ShmemVariableCache->relnumbercount <= relnumbercount)
+       {
+               LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH, NULL);
+               ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
+       }
+       else
+               ShmemVariableCache->relnumbercount -= relnumbercount;

> A related thought is that I don't think this patch has as many
> cross-checks as it could have. For instance, suppose that when we
> replay a WAL record that creates relation storage, we cross-check that
> the value is less than the counter. I think you have a check in there
> someplace that will error out if there is an actual collision --
> although I can't find it at the moment, and possibly we want to add
> some comments there even if it's in existing code -- but this kind of
> thing would detect bugs that could lead to collisions even if no
> collision actually occurs, e.g. because a duplicate relfilenumber is
> used but in a different database or tablespace. It might be worth
> spending some time thinking about other possible cross-checks too.
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

I had those changes in v7-0003; now I have merged them into 0002. This adds assert checks while replaying the WAL for smgr create and smgr truncate, and in the normal path, when allocating a new relfilenumber, we assert that no file with that number already exists.

> One thing that isn't great about this whole scheme is that it can lead
> to lock pile-ups. Once somebody is waiting for an
> XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
> that tries to get a new relfilenumber is going to block waiting for
> RelFileNumberGenLock. I wonder whether this effect is observable in
> practice: suppose we just create relations in a tight loop from inside
> a stored procedure, and do that simultaneously in multiple backends?
> What does the wait event distribution look like? Can we observe a lot
> of RelFileNumberGenLock events or not really? I guess if we reduce
> VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but
> how small a value is needed?

I have done some performance tests: with very small values I can see a lot of wait events for RelFileNumberGen, but with bigger numbers like 256 or 512 it is not really bad. See results at the end of the mail[1].

> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used.
> The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it. If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed in the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

I have made these changes in GetNewRelFileNumber(); this required tracking the last logged record pointer as well, but I think it looks clean. With this I can see some reduction in the RelFileNumberGen wait events[1].

> In theory I suppose there's another way we could solve this problem:
> keep using the same relfilenumber, and if the scenario described here
> occurs, just reuse the old file. The reason why we can't do that today
> is because we could be running with wal_level=minimal and replace a
> relation with one whose contents aren't logged. If WAL replay then
> replays the drop, we're in trouble. But if the only time we reuse a
> relfilenumber for new relation storage is when relations are moved
> around, then I think that scenario can't happen. However, I think
> assigning a new relfilenumber is probably better, because it gets us
> closer to a world in which relfilenumbers are never reused at all. It
> doesn't get us all the way there because of createdb() and movedb(),
> but it gets us closer and I prefer that.

I agree with you.

> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem.
> Notice in particular:
>
> #define MAX_MAPPINGS 62 /* 62 * 8 + 16 = 512 */
>
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one. If we do nothing fancy here, the
> maximum number of mappings will have to be reduced from 62 to 31,
> which is a problem because global/pg_filenode.map currently has 48
> entries. We could try to arrange to squeeze padding out of the
> RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're
> using already, never mind leaving room for future growth.
>
> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something. If that idea is judged
> unsafe, I can think of two other possible approaches offhand. One is
> that we could move away from the idea of storing the OIDs in the file
> along with the RelFileNodes, and instead store the offset for a given
> RelFileNode at a fixed offset in the file. That would require either
> hard-wiring offset tables into the code someplace, or generating them
> as part of the build process, with separate tables for shared and
> database-local relation map files. The other is that we could have
> multiple 512-byte sectors and try to arrange for each relation to be
> in the same sector with the indexes of that relation, since the
> comments in relmapper.c say this:
>
> * aborts.
> An important factor here is that the indexes and toast table of
> * a mapped catalog must also be mapped, so that the rewrites/relocations of
> * all these files commit in a single map file update rather than being tied
> * to transaction commit.
>
> This suggests that atomicity is required across a table and its
> indexes, but that it's not needed across arbitrary sets of entries in the
> file.
>
> Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

I am not sure what the best solution is here, but I agree that most modern hardware will have a sector size bigger than 512, so we can just change the file size to 1024. The current value of RELMAPPER_FILEMAGIC is 0x592717; I am not sure how this version ID is decided -- is it some random magic number or based on some logic?

> + uint32 relNumber_low;   /* relfilenumber 32 lower bits */
> + uint32 relNumber_hi:24; /* relfilenumber 24 high bits */
> + uint32 forkNum:8;       /* fork number */
>
> I still think we'd be better off with something like uint32
> relForkDetails[2]. The bitfields would be nice if they meant that we
> didn't have to do bit-shifting and masking operations ourselves, but
> with the field split this way, we do anyway. So what's the point in
> mixing the approaches?

Actually, with the bitfields we were able to access forkNum directly, but I also think changing to relForkDetails[2] is cleaner, so I have done that. And as part of the related changes in 0001 I have removed the direct access to forkNum.

[1] Wait event details

Procedure:

CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
    relname varchar;
    pid int;
    i int;
BEGIN
    SELECT pg_backend_pid() INTO pid;
    relname := 'test_' || pid;
    FOR i IN 1..count LOOP
        EXECUTE format('CREATE TABLE %s(a int)', relname);
        EXECUTE format('DROP TABLE %s', relname);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Target test: Executed "select create_table(100);" from pgbench with 32 concurrent backends.
VAR_RELNUMBER_PREFETCH = 8
  905 LWLock   | LockManager
  346 LWLock   | RelFileNumberGen
  192
  190 Activity | WalWriterMain

VAR_RELNUMBER_PREFETCH = 128
 1187 LWLock   | LockManager
  247 LWLock   | RelFileNumberGen
  139 Activity | CheckpointerMain

VAR_RELNUMBER_PREFETCH = 256
 1029 LWLock   | LockManager
  158 LWLock   | BufferContent
  134 Activity | CheckpointerMain
  134 Activity | AutoVacuumMain
  133 Activity | BgWriterMain
  132 Activity | WalWriterMain
  130 Activity | LogicalLauncherMain
  123 LWLock   | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 512
 1174 LWLock   | LockManager
  136 Activity | CheckpointerMain
  136 Activity | BgWriterMain
  136 Activity | AutoVacuumMain
  134 Activity | WalWriterMain
  134 Activity | LogicalLauncherMain
   99 LWLock   | BufferContent
   35 LWLock   | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 2048
 1070 LWLock   | LockManager
  160 LWLock   | BufferContent
  156 Activity | CheckpointerMain
  156
  155 Activity | BgWriterMain
  154 Activity | AutoVacuumMain
  153 Activity | WalWriterMain
  149 Activity | LogicalLauncherMain
   31 LWLock   | RelFileNumberGen
   28 Timeout  | VacuumDelay

VAR_RELNUMBER_PREFETCH = 4096
Note: no wait event for RelFileNumberGen at value 4096.

New patch with piggybacking XLogFlush():

VAR_RELNUMBER_PREFETCH = 8
 1105 LWLock   | LockManager
  143 LWLock   | BufferContent
  140 Activity | CheckpointerMain
  140 Activity | BgWriterMain
  139 Activity | WalWriterMain
  138 Activity | AutoVacuumMain
  137 Activity | LogicalLauncherMain
  115 LWLock   | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 256
 1130 LWLock   | LockManager
  141 Activity | CheckpointerMain
  139 Activity | BgWriterMain
  137 Activity | AutoVacuumMain
  136 Activity | LogicalLauncherMain
  135 Activity | WalWriterMain
   69 LWLock   | BufferContent
   31 LWLock   | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 1024
Note: no wait event for RelFileNumberGen at value 1024.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com