Re: making relfilenodes 56 bits - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: making relfilenodes 56 bits |
Date | |
Msg-id | CA+TgmoaffGaEmEGjx1QQjL5c69iMpB4ytXsPAQ33XxtpgBfvWA@mail.gmail.com Whole thread Raw |
In response to | Re: making relfilenodes 56 bits (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: making relfilenodes 56 bits
|
List | pgsql-hackers |
On Wed, Jun 29, 2022 at 5:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > PFA, the remaining set of patches. It might need to fix some > indentation but lets first see how is the overall idea then we can > work on it So just playing around with this patch set, and also looking at the code a bit, here are a few random observations: - The patch assigns relfilenumbers starting with 1. I don't see any specific problem with that, but I wonder if it would be a good idea to start with a random larger value just in case we ever need some fixed values for some purpose or other. Maybe we should start with 100000 or something? - If I use ALTER TABLE .. SET TABLESPACE to move a table around, then the relfilenode changes each time, but if I use ALTER DATABASE .. SET TABLESPACE to move a database around, the relfilenodes don't change. So, what this guarantees is that if the same filename is used twice, it will be for the same relation and not some unrelated relation. That's enough to avoid the hazard described in the comments for mdunlink(), because that scenario intrinsically involves confusion caused by two relations using the same filename after an OID wraparound. And it also means that if we pursue the idea of using an end-of-recovery record in all cases, we don't need to start creating tombstones during crash recovery. The forced checkpoint at the end of crash recovery means we don't currently need to do that, but if we change that, then the same hazard would exist there as we already have in normal running, and this fixes it. However, I don't find it entirely obvious that there are no hazards of any kind stemming from repeated use of ALTER DATABASE .. SET TABLESPACE resulting in filenames getting reused. On the other hand avoiding filename reuse completely would be more work, not closely related to what the rest of the patch set does, probably somewhat controversial in terms of what it would have to do, and I'm not sure that we really need it. It does seem like it would be quite a bit easier to reason about, though, because the current guarantee is suspiciously similar to "we don't do X, except when we do." This is not really so much a review comment for Dilip as a request for input from others ... thoughts? - Again, not a review comment for this patch specifically, but I'm wondering if we could use this as infrastructure for a tool to clean orphaned files out of the data directory. Suppose we create a file for a new relation and then crash, leaving a potentially large file on disk that will never be removed. Well, if the relfilenumber as it exists on disk is not in pg_class and old enough that a transaction inserting into pg_class can't still be running, then it must be safe to remove that file. Maybe that's safe even today, but it's a little hard to reason about it in the face of a possible OID wraparound that might result in reusing the same numbers over again. It feels like this makes easier to identify which files are old stuff that can never again be touched. - I might be missing something here, but this isn't actually making the relfilenode 56 bits, is it? The reason to do that is to make the BufferTag smaller, so I expected to see that BufferTag either used bitfields like RelFileNumber relNumber:56 and ForkNumber forkNum:8, or else that it just declared a single field for both as uint64 and used accessor macros or static inlines to separate them out. But it doesn't seem to do either of those things, which seems like it can't be right. On a related note, I think it would be better to declare RelFileNumber as an unsigned type even though we have no use for the high bit; we have, equally, no use for negative values. It's easier to reason about bit-shifting operations with unsigned types. - I also think that the cross-version compatibility stuff in pg_buffercache isn't quite right. It does values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber). But I think what it ought to do is dependent on the output type. If the output type is int8, then it ought to do values[1] = Int64GetDatum((int64) fctx->record[i].relfilenumber), and if it's OID, then it ought to do values[1] = ObjectIdGetDatum((Oid) fctx->record[i].relfilenumber)). The macro that you use needs to be based on the output SQL type, not the C data type. - I think it might be a good idea to allocate RelFileNumbers in much smaller batches than we do OIDs. 8192 feels wasteful to me. It shouldn't practically matter, because if we have 56 bits of bit space and so even if we repeatedly allocate 2^13 RelFileNumbers and then crash, we can still crash 2^41 times before we completely run out of numbers, and 2 trillion crashes ought to be enough for anyone. But I see little benefit from being so profligate. You can allocate an OID as an identifier for a catalog tuple or a TOAST chunk, but a RelFileNumber requires a filesystem operation, so the amount of work that is needed to use up 8192 RelFileNumbers is a lot bigger than the amount of work required to use up 8192 OIDs. If we dropped this down to 128, or 64, or 256, would anything bad happen? - Do we really want GetNewRelFileNumber() to call access() just for a can't-happen scenario? Can't we catch this problem later when we actually go to create the files on disk? - The patch updates the comments in XLogPrefetcherNextBlock to talk about relfilenumbers being reused rather than relfilenodes being reused, which is fine except that we're sorta kinda not doing that any more as noted above. I don't really know what these comments ought to say instead but perhaps more than a mechanical update is in order. This applies, even more, to the comments above mdunlink(). Apart from updating the existing comments, I think that the patch needs a good explanation of the new scheme someplace, and what it does and doesn't guarantee, which relates to the point above about making sure we know exactly what we're guaranteeing and why. I don't know where exactly this text should be positioned yet, or what it should say, but it needs to go someplace. This is a fairly significant change and needs to be talked about somewhere. - I think there's still a bit of a terminology problem here. With the patch set, we use RelFileNumber to refer to a single, 56-bit integer and RelFileLocator to refer to that integer combined with the DB and TS OIDs. But sometimes in the comments we want to talk about the logical sequence of files that is identified by a RelFileLocator, and that's not quite the same as either of those things. For example, in tableam.h we currently say "This callback needs to create a new relation filenode for `rel`" and how should that be changed in this new naming? We're not creating a new RelFileNumber - those would need to be allocated, not created, as all the numbers in the universe exist already. Neither are we creating a new locator; that sounds like it means assembling it from pieces. What we're doing is creating the first of what may end up being a series of similarly-named files on disk. I'm not exactly sure how we can refer to that in a way that is clear, but it's a problem that arises here and here throughout the patch. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: