Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From:           Robert Haas
Subject:        Re: making relfilenodes 56 bits
Msg-id:         CA+TgmoZq5=LWDK7kHaUbmWXxcaTuw_QwafgG9dr-BaPym_U8WQ@mail.gmail.com
In response to: Re: making relfilenodes 56 bits (Dilip Kumar <dilipbalaut@gmail.com>)
List:           pgsql-hackers
Trying to compile with 0001 and 0002 applied and -Wall -Werror in use, I get:

buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
                        CLEAR_BUFFERTAG(buf->tag);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~
../../../../src/include/storage/buf_internals.h:122:14: note: expanded
from macro 'CLEAR_BUFFERTAG'
        (a).forkNum = InvalidForkNumber, \
                    ^ ~~~~~~~~~~~~~~~~~
1 error generated.

More review comments:

In pg_buffercache_pages_internal(), I suggest that we add an error
check. If fctx->record[i].relfilenumber is greater than the largest
value that can be represented as an OID, then let's do something like:

ERROR: relfilenode is too large to be represented as an OID
HINT: Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE

That way, instead of confusing people by giving them an incorrect
answer, we'll push them toward a step that they may have overlooked.

In src/backend/access/transam/README, I think the sentence "So cleaning
up isn't really necessary." isn't too helpful. I suggest replacing it
with "Thus, on-disk collisions aren't possible."

> I think it is in line with oidCount, what do you think?

Oh it definitely is, and maybe it's OK the way you have it. But the OID
stuff has wraparound to worry about, and this doesn't; and this has
SetNextRelFileNumber and that doesn't; so it is not necessarily the case
that the design which is best for that case is also best for this case.

I believe that the persistence model for SetNextRelFileNumber needs
more thought. Right now I believe it's relying on the fact that, after
we try to restore the dump, we'll try to perform a clean shutdown of
the server before doing anything important, and that will persist the
final value, whatever it ends up being. However, there's no comment
explaining that theory of operation, and it seems pretty fragile
anyway. What if things don't go as planned?
Suppose the power goes out halfway through restoring the dump, and the
user for some reason then gives up on running pg_upgrade and just tries
to do random things with that server? Then I think there will be
trouble, because nothing has updated the nextrelfilenumber value and
yet there are potentially new files on disk. Maybe that's a stretch
since I think other things might also break if you do that, but I'm
also not sure that's the only scenario to worry about, especially if
you factor in the possibility of future code changes, like changes to
the timing of when we shut down and restart the server during
pg_upgrade, or other uses of binary-upgrade mode, or whatever. I don't
know. Perhaps it's not actually broken, but I'm inclined to think it
should be logging its changes.

A related thought is that I don't think this patch has as many
cross-checks as it could have. For instance, suppose that when we
replay a WAL record that creates relation storage, we cross-check that
the value is less than the counter. I think you have a check in there
someplace that will error out if there is an actual collision --
although I can't find it at the moment, and possibly we want to add
some comments there even if it's in existing code -- but this kind of
thing would detect bugs that could lead to collisions even if no
collision actually occurs, e.g. because a duplicate relfilenumber is
used but in a different database or tablespace. It might be worth
spending some time thinking about other possible cross-checks too.

We're trying to create a system where the relfilenumber counter is
always ahead of all the relfilenumbers used on disk, but the coupling
between the relfilenumber-advancement machinery and the
make-files-on-disk machinery is pretty loose, and so there is a risk
that bugs could escape detection. Whatever we can do to increase the
probability of noticing when things have gone wrong, and/or to notice
it quicker, will be good.
+	if (!IsBinaryUpgrade)
+		elog(ERROR, "the RelFileNumber can be set only during binary upgrade");

I think you should remove the word "the". Primary error messages are
written telegram-style and "the" is usually omitted, especially at the
beginning of the message.

+	 * This should not impact the performance, since we are not WAL logging
+	 * it for every allocation, but only after allocating 512 RelFileNumber.

I think this claim is overly bold, and it would be better if the
current value of the constant weren't encoded in the comment. I'm not
sure we really need this part of the comment at all, but if we do,
maybe it should be reworded to something like: This is potentially a
somewhat expensive operation, but fortunately we only need to do it
once for every VAR_RELNUMBER_PREFETCH new relfilenodes.

Or maybe it's better to put this explanation in GetNewRelFileNumber
instead, e.g. "If we run out of logged RelFileNumbers, then we must log
more, and also wait for the xlog record to be flushed to disk. This is
somewhat expensive, but hopefully VAR_RELNUMBER_PREFETCH is large
enough that this doesn't slow things down too much."

One thing that isn't great about this whole scheme is that it can lead
to lock pile-ups. Once somebody is waiting for an
XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
that tries to get a new relfilenumber is going to block waiting for
RelFileNumberGenLock. I wonder whether this effect is observable in
practice: suppose we just create relations in a tight loop from inside
a stored procedure, and do that simultaneously in multiple backends.
What does the wait event distribution look like? Can we observe a lot
of RelFileNumberGenLock events or not really? I guess if we reduce
VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but how
small a value is needed?

One thing we could think about doing here is to try to stagger the xlog
and the flush.
When we've used VAR_RELNUMBER_PREFETCH/2 relfilenumbers, log a record
reserving VAR_RELNUMBER_PREFETCH from where we are now, and remember
the LSN. When we've used up our entire previous allocation, XLogFlush()
that record before allowing the additional values to be used. The
bookkeeping would be a bit more complicated than currently, but I don't
think it would be too bad. I'm not sure how much it would actually
help, though, or whether we need it. If new relfilenumbers are being
used up really quickly, then maybe the record won't get flushed in the
background before we run out of available numbers anyway, and if they
aren't, then maybe it doesn't matter. On the other hand, even one
transaction commit between when the record is logged and when we run
out of the previous allocation is enough to force a flush, at least
with synchronous_commit=on, so maybe the chances of being able to
piggyback on an existing flush are not so bad after all. I'm not sure.

+	 * Generate a new relfilenumber. We can not reuse the old relfilenumber
+	 * because the unused relfilenumber files are not unlinked until the next
+	 * checkpoint. So if move the relation to the old tablespace again, we
+	 * will get the conflicting relfilenumber file.

This is much clearer now, but the grammar has some issues, e.g. "the
unused relfilenumber" should be just "unused relfilenumber" and "So if
move" is not right either. I suggest: We cannot reuse the old
relfilenumber because of the possibility that that relation will be
moved back to the original tablespace before the next checkpoint. At
that point, the first segment of the main fork won't have been unlinked
yet, and an attempt to create new relation storage with that same
relfilenumber will fail.

In theory I suppose there's another way we could solve this problem:
keep using the same relfilenumber, and if the scenario described here
occurs, just reuse the old file.
The reason why we can't do that today is because we could be running
with wal_level=minimal and replace a relation with one whose contents
aren't logged. If WAL replay then replays the drop, we're in trouble.
But if the only time we reuse a relfilenumber for new relation storage
is when relations are moved around, then I think that scenario can't
happen. However, I think assigning a new relfilenumber is probably
better, because it gets us closer to a world in which relfilenumbers
are never reused at all. It doesn't get us all the way there because of
createdb() and movedb(), but it gets us closer and I prefer that.

+ * XXX although this all was true when the relfilenumbers were 32 bits wide but
+ * now the relfilenumbers are 56 bits wide so we don't have risk of
+ * relfilenumber being reused so in future we can immediately unlink the first
+ * segment as well. Although we can reuse the relfilenumber during createdb()
+ * using file copy method or during movedb() but the above scenario is only
+ * applicable when we create a new relation.

Here is an edited version: XXX. Although all of this was true when
relfilenumbers were 32 bits wide, they are now 56 bits wide and do not
wrap around, so in the future we can change the code to immediately
unlink the first segment of the relation along with all the others. We
still do reuse relfilenumbers when createdb() is performed using the
file-copy method or during movedb(), but the scenario described above
can only happen when creating a new relation.

I think that pg_filenode_relation,
binary_upgrade_set_next_heap_relfilenode, and other functions that are
now going to be accepting a RelFileNode using the SQL int8 datatype
should bounds-check the argument. It could be <0 or >2^56, and I
believe it'd be best to throw an error for that straight off. The three
functions in pg_upgrade_support.c could share a static subroutine for
this, to avoid duplicating code.
This bounds-checking issue also applies to the -f argument to
pg_checksums.

I notice that the patch makes no changes to relmapper.c, and I think
that's a problem. Notice in particular:

#define MAX_MAPPINGS            62      /* 62 * 8 + 16 = 512 */

I believe that making RelFileNumber into a 64-bit value will cause the
8 in the calculation above to change to 16, defeating the intention
that the size of the file ought to be the smallest imaginable size of a
disk sector. It does seem like it would have been smart to include a
StaticAssertStmt in this file someplace that checks that the data
structure has the expected size, and now might be a good time, perhaps
in a separate patch, to add one.

If we do nothing fancy here, the maximum number of mappings will have
to be reduced from 62 to 31, which is a problem because
global/pg_filenode.map currently has 48 entries. We could try to
arrange to squeeze the padding out of the RelMapping struct, which
would let us use just 12 bytes per mapping, which would increase the
limit to 41, but that's still less than we're using already, never mind
leaving room for future growth. I don't know what to do about this
exactly. I believe it's been previously suggested that the actual
minimum sector size on reasonably modern hardware is never as small as
512 bytes, so maybe the file size can just be increased to 1kB or
something.

If that idea is judged unsafe, I can think of two other possible
approaches offhand. One is that we could move away from the idea of
storing the OIDs in the file along with the RelFileNodes, and instead
store the RelFileNode for a given OID at a fixed offset in the file.
That would require either hard-wiring offset tables into the code
someplace, or generating them as part of the build process, with
separate tables for shared and database-local relation map files.
The other is that we could have multiple 512-byte sectors and try to
arrange for each relation to be in the same sector with the indexes of
that relation, since the comments in relmapper.c say this:

 * aborts.  An important factor here is that the indexes and toast table of
 * a mapped catalog must also be mapped, so that the rewrites/relocations of
 * all these files commit in a single map file update rather than being tied
 * to transaction commit.

This suggests that atomicity is required across a table and its
indexes, but not that it's needed across arbitrary sets of entries in
the file.

Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -34,6 +34,13 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	/* oid */
 	Oid			oid;
 
+	/* access method; 0 if not a table / index */
+	Oid			relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
+
+	/* identifier of physical storage file */
+	/* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
+	int64		relfilenode BKI_DEFAULT(0);
+
 	/* class name */
 	NameData	relname;
 
@@ -49,13 +56,6 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	/* class owner */
 	Oid			relowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid);
 
-	/* access method; 0 if not a table / index */
-	Oid			relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
-
-	/* identifier of physical storage file */
-	/* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
-	Oid			relfilenode BKI_DEFAULT(0);
-
 	/* identifier of table space for relation (0 means default for database) */
 	Oid			reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace);

As Andres said elsewhere, this stinks. Not sure what the resolution of
the discussion over on the "AIX support" thread is going to be yet, but
hopefully not this.
+	uint32		relNumber_low;	/* relfilenumber 32 lower bits */
+	uint32		relNumber_hi:24;	/* relfilenumber 24 high bits */
+	uint32		forkNum:8;	/* fork number */

I still think we'd be better off with something like uint32
relForkDetails[2]. The bitfields would be nice if they meant that we
didn't have to do bit-shifting and masking operations ourselves, but
with the field split this way, we do anyway. So what's the point in
mixing the approaches?

 * relNumber identifies the specific relation.  relNumber corresponds to
 * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
 * to assign new physical files to relations in some situations).
- * Notice that relNumber is only unique within a database in a particular
- * tablespace.
+ * Notice that relNumber is unique within a cluster.

I think this paragraph would benefit from more revision. I think that
we should just nuke the parenthesized part altogether, since we'll now
never use pg_class.oid as relNumber, and to suggest otherwise is just
confusing. As for the last sentence, "Notice that relNumber is unique
within a cluster." isn't wrong, but I think we could be more precise
and informative. Perhaps: "relNumber values are assigned by
GetNewRelFileNumber(), which will only ever assign the same value once
during the lifetime of a cluster. However, since CREATE DATABASE
duplicates the relfilenumbers of the template database, the values are
in practice only unique within a database, not globally."

That's all I've got for now. Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com