Re: making relfilenodes 56 bits - Mailing list pgsql-hackers
From:           Robert Haas
Subject:        Re: making relfilenodes 56 bits
Msg-id:         CA+TgmoYsNiF8JGZ+Kp7Zgcct67Qk++YAp+1ybOQ0qomUayn+7A@mail.gmail.com
In response to: Re: making relfilenodes 56 bits (Dilip Kumar <dilipbalaut@gmail.com>)
Responses:      Re: making relfilenodes 56 bits
                Re: making relfilenodes 56 bits
List:           pgsql-hackers
On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster. But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.

I think this aspect of the patch could use some more discussion.

To recap, the problem is that pg_upgrade mustn't discover that a relfilenode being migrated from the old cluster is already in use for some other table in the new cluster. Since the new cluster should contain only system tables, which we assume have never been rewritten, they'll all have relfilenodes equal to their OIDs, and thus less than 16384. On the other hand, all the user tables from the old cluster will have relfilenodes greater than 16384, so we're fine. pg_largeobject, which also gets migrated, is a special case: since we don't change OID assignments from version to version, it will either have the same relfilenode value in the old and new clusters (if it was never rewritten), or else the value in the old cluster will be greater than 16384, in which case no conflict is possible.

But if we just assign all relfilenode values from a central counter, then we have got trouble. If the new version has more system catalog tables than the old version, some value that was used for a user table in the old version might get used for a system table in the new version, which is a problem.

One idea for fixing this is to have two RelFileNumber ranges: a system range (small values) and a user range. System tables get values from the system range initially, and from the user range when first rewritten; user tables always get values from the user range. Everything works fine in this scenario except maybe for pg_largeobject: what if it gets one value from the system range in the old cluster and a different value from the system range in the new cluster, but some other system table in the new cluster gets the value that pg_largeobject had in the old cluster? Then we've got trouble. It doesn't help to assign pg_largeobject a starting relfilenode from the user range, either: now a relfilenode that needs to end up holding some user table from the old cluster might find itself blocked by pg_largeobject in the new cluster.

One solution to all this is to do as Dilip proposes here: for system relations, keep assigning the OID as the initial relfilenumber. Actually, we only really need to do this for pg_largeobject; all the other relfilenumber values could be assigned from a counter, as long as they come from a range distinct from the one used for user relations.
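To make that concrete, here is a minimal sketch of what a two-range allocator could look like. The names, the range boundary, and the function signature are all invented for illustration; this is not what the patch actually does:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t RelFileNumber;     /* only 56 bits would really be used */

    /* Hypothetical boundary between the system and user ranges. */
    #define FIRST_USER_RELFILENUMBER ((RelFileNumber) 100000)

    static RelFileNumber next_system_relfilenumber = 1;
    static RelFileNumber next_user_relfilenumber = FIRST_USER_RELFILENUMBER;

    /*
     * A system relation getting its initial storage draws from the low
     * (system) range; user relations, and any relation being rewritten,
     * draw from the high (user) range.  That way a user relfilenode
     * migrated by pg_upgrade can never collide with a never-rewritten
     * system table in the new cluster.
     */
    static RelFileNumber
    AssignRelFileNumber(bool is_system, bool is_rewrite)
    {
        if (is_system && !is_rewrite)
            return next_system_relfilenumber++;
        return next_user_relfilenumber++;
    }

    int
    main(void)
    {
        /* A catalog's initial storage, then the same catalog rewritten: */
        printf("%llu\n", (unsigned long long) AssignRelFileNumber(true, false));
        printf("%llu\n", (unsigned long long) AssignRelFileNumber(true, true));
        return 0;
    }

The point is just that the range decision is made once, at storage-creation time, based on whether this is a system catalog getting its initial storage.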
But I don't really like that, because I feel like the whole scheme where we start out with relfilenumber = OID is a recipe for hidden bugs. I believe we'd be better off if we decoupled those concepts more thoroughly.

So here's another idea: what if we set the next-relfilenumber counter for the new cluster to the value from the old cluster, and then rewrote all of the (thus-far-empty) system tables? Then every system relation in the new cluster would have a relfilenode value greater than any in use in the old cluster, so we could afterwards migrate every relfilenode from the old cluster with no risk of conflicting with anything.

With that, all the special cases go away. We don't need separate system and user ranges for relfilenodes, and pg_largeobject isn't a special case either. We can assign relfilenode values to system relations in exactly the same way we do for user relations: take a value from the global counter and forget about it. If the cluster later happens to be the "new cluster" for a pg_upgrade attempt, the procedure described in the previous paragraph moves everything that might conflict out of the way.

One thing to perhaps not like about this is that it's a little more expensive: rewriting every system table in every database of a new cluster isn't completely free. Perhaps it's not expensive enough to be a big problem, though.

Thoughts?
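To show the shape of the idea (emphatically not pg_upgrade's actual code), here's a rough standalone sketch using libpq. The connection strings, the catalog query, and the use of VACUUM FULL to force the rewrites are my inventions for illustration, and error handling is mostly omitted; it assumes the next-relfilenumber counter has already been pushed past the old cluster's value by some pg_resetwal-like step:

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    /*
     * Force a rewrite of every system catalog in one database, so that
     * each catalog gets a fresh relfilenode from the already-advanced
     * global counter.  Run once per database in the new cluster.
     */
    static void
    rewrite_system_catalogs(const char *conninfo)
    {
        PGconn     *conn = PQconnectdb(conninfo);
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            exit(1);
        }

        /* Every plain table in pg_catalog; all still empty at this point. */
        res = PQexec(conn,
                     "SELECT oid::regclass FROM pg_class "
                     "WHERE relnamespace = 'pg_catalog'::regnamespace "
                     "  AND relkind = 'r'");

        for (int i = 0; i < PQntuples(res); i++)
        {
            char        sql[1024];

            /* VACUUM FULL rewrites the catalog under a new relfilenode,
             * now guaranteed to be above anything the old cluster used. */
            snprintf(sql, sizeof(sql), "VACUUM FULL %s",
                     PQgetvalue(res, i, 0));
            PQclear(PQexec(conn, sql));
        }

        PQclear(res);
        PQfinish(conn);
    }

    int
    main(void)
    {
        /* In reality, we'd iterate over every database in the cluster. */
        rewrite_system_catalogs("dbname=template1");
        rewrite_system_catalogs("dbname=postgres");
        return 0;
    }

--
Robert Haas
EDB: http://www.enterprisedb.com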