Re: making relfilenodes 56 bits - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: making relfilenodes 56 bits |
Date | |
Msg-id | CA+TgmoZTJsbJXoXWceQkaHdxjFz23BL=+xZiD64FgqVAQ77jPQ@mail.gmail.com Whole thread Raw |
In response to | Re: making relfilenodes 56 bits (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: making relfilenodes 56 bits
|
List | pgsql-hackers |
On Wed, Jul 6, 2022 at 11:57 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think 0002 and 0003 need more work yet; I'll try to write a review > of those soon. Regarding 0002: I don't particularly like the names BufTagCopyRelFileLocator and BufTagRelFileLocatorEquals. My suggestion is to rename BufTagRelFileLocatorEquals to BufTagMatchesRelFileLocator, because it doesn't really make sense to me to talk about equality between values of different data types. Instead of BufTagCopyRelFileLocator I would prefer BufTagGetRelFileLocator. That would make it more similar to BufTagGetFileNumber and BufTagSetFileNumber, which I think would be a good thing. Other than that I think 0002 seems fine. Regarding 0003: /* * Don't try to prefetch anything in this database until - * it has been created, or we might confuse the blocks of - * different generations, if a database OID or - * relfilenumber is reused. It's also more efficient than + * it has been created, because it's more efficient than * discovering that relations don't exist on disk yet with * ENOENT errors. */ I'm worried that this might not be correct. The comment changes here (and I think also in some other plces) imply that we've eliminated relfilenode ruse, but I think that's not true. createdb() and movedb() don't seem to be modified, so I think it's possible to just copy a template database over without change, which means that relfilenumbers and even relfilelocators could be reused. So I feel like maybe this and similar places shouldn't be modified in this way. Am I misunderstanding? /* - * Relfilenumbers are not unique in databases across tablespaces, so we need - * to allocate a new one in the new tablespace. + * Generate a new relfilenumber. Although relfilenumber are unique within a + * cluster, we are unable to use the old relfilenumber since unused + * relfilenumber are not unlinked until commit. So if within a + * transaction, if we set the old tablespace again, we will get conflicting + * relfilenumber file. */ - newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL, - rel->rd_rel->relpersistence); + newrelfilenumber = GetNewRelFileNumber(); I can't clearly understand this comment. Is it saying that the code which follows is broken and needs to be fixed by a future patch before things are OK again? If so, that's not good. - * callers should be GetNewOidWithIndex() and GetNewRelFileNumber() in - * catalog/catalog.c. + * callers should be GetNewOidWithIndex() in catalog/catalog.c. If there is only one, it should say "caller", not "callers". Orphan files are harmless --- at worst they waste a bit of disk space --- -because we check for on-disk collisions when allocating new relfilenumber -OIDs. So cleaning up isn't really necessary. +because relfilenumber is 56 bit wide so logically there should not be any +collisions. So cleaning up isn't really necessary. I don't agree that orphaned files are harmless, but changing that is beyond the scope of this patch. I think that the way you've ended the sentence isn't sufficiently clear and correct even if we accept the principle that orphaned files are harmless. What I think we should stay instead is "because the relfilenode counter is monotonically increasing. The maximum value is 2^56-1, and there is no provision for wraparound." + /* + * Check if we set the new relfilenumber then do we run out of the logged + * relnumber, if so then we need to WAL log again. Otherwise, just adjust + * the relnumbercount. + */ + relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber; + if (ShmemVariableCache->relnumbercount <= relnumbercount) + { + LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH); + ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH; + } + else + ShmemVariableCache->relnumbercount -= relnumbercount; Would it be clearer, here and elsewhere, if VariableCacheData tracked nextRelFileNumber and nextUnloggedRelFileNumber instead of nextRelFileNumber and relnumbercount? I'm not 100% sure, but the idea seems worth considering. + * Flush xlog record to disk before returning. To protect against file + * system changes reaching the disk before the XLOG_NEXT_RELFILENUMBER log. The way this is worded, you would need it to be just one sentence, like "Flush xlog record to disk before returning to protect against...". Or else add "this is," like "This is to protect against..." But I'm thinking maybe we could reword it a little more, perhaps something like this: "Flush xlog record to disk before returning. We want to be sure that the in-memory nextRelFileNumber value is always larger than any relfilenumber that is already in use on disk. To maintain that invariant, we must make sure that the record we just logged reaches the disk before any new files are created." This isn't a full review, I think, but I'm kind of out of time and energy for today. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: