Re: making relfilenodes 56 bits - Mailing list pgsql-hackers
From | Dilip Kumar |
---|---|
Subject | Re: making relfilenodes 56 bits |
Date | |
Msg-id | CAFiTN-v7Jb_v+ACbN41HfYGxZeLihV7=4mcvwHgFysg86VqVhQ@mail.gmail.com |
In response to | Re: making relfilenodes 56 bits (Dilip Kumar <dilipbalaut@gmail.com>) |
List | pgsql-hackers |
On Thu, Aug 11, 2022 at 10:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 9, 2022 at 8:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I think even if we start the range from the 4 billion we can not avoid
> > > keeping two separate ranges for system and user tables otherwise the
> > > next upgrade where old and new clusters both have 56 bits
> > > relfilenumber will get conflicting files. And, for the same reason we
> > > still have to call SetNextRelFileNumber() during upgrade.
> >
> > Well, my proposal to move everything from the new cluster up to higher
> > numbers would address this without requiring two ranges.
> >
> > > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > > range will start from 4 billion and user range maybe something around
> > > 4.1 (I think we can keep it very small though, just reserve 50k
> > > relfilenumber for system for future expansion and start user range
> > > from there).
> >
> > A disadvantage of this is that it basically means all the file names
> > in new clusters are going to be 10 characters long. That's not a big
> > disadvantage, but it's not wonderful. File names that are only 5-7
> > characters long are common today, and easier to remember.
>
> That's correct.
>
> > > So now system tables have no issues and also the user tables from the
> > > old cluster have no issues. But pg_largeobject might get conflict
> > > when both old and new cluster are using 56 bits relfilenumber, because
> > > it is possible that in the new cluster some other system table gets
> > > that relfilenumber which is used by pg_largeobject in the old cluster.
> > >
> > > This could be resolved if we allocate pg_largeobject's relfilenumber
> > > from the user range, that means this relfilenumber will always be the
> > > first value from the user range. So now if the old and new cluster
> > > both are using 56bits relfilenumber then pg_largeobject in both
> > > cluster would have got the same relfilenumber and if the old cluster
> > > is using the current 32 bits relfilenode system then the whole range
> > > of the new cluster is completely different than that of the old
> > > cluster.
> >
> > I think this can work, but it does rely to some extent on the fact
> > that there are no other tables which need to be treated like
> > pg_largeobject. If there were others, they'd need fixed starting
> > RelFileNumber assignments, or some other trick, like renumbering them
> > twice in the cluster, first two a known-unused value and then back to
> > the proper value. You'd have trouble if in the other cluster
> > pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> > cluster the reverse, without some hackery.
>
> Agree, if it has more catalog like pg_largeobject then it would
> require some hacking.
>
> > I do feel like your idea here has some advantages - my proposal
> > requires rewriting all the catalogs in the new cluster before we do
> > anything else, and that's going to take some time even though they
> > should be small. But I also feel like it has some disadvantages: it
> > seems to rely on complicated reasoning and special cases more than I'd
> > like.
>
> One other advantage with your approach is that since we are starting
> the "nextrelfilenumber" after the old cluster's relfilenumber range.
> So only at the beginning we need to set the "nextrelfilenumber" but
> after that while upgrading each object we don't need to set the
> nextrelfilenumber every time because that is already higher than the
> complete old cluster range. In other 2 approaches we will have to try
> to set the nextrelfilenumber everytime we preserve the relfilenumber
> during upgrade.

I was also wondering whether we would get the max relfilenumber from
the old cluster at the cluster level or per database. If we want it per
database, we can run a simple query on pg_class, but even there we
would need to work out how to handle mapped relations that have been
rewritten. I don't think we can get the max relfilenumber from the old
cluster at the cluster level. Maybe in newer versions we can expose a
server function that just returns NextRelFileNumber, and that would be
the max relfilenumber, but I'm not sure how to do that in the old
version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
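For readers following the thread, the two-range scheme quoted above can be sketched roughly as follows. This is only an illustration, not code from any patch: every name below is hypothetical, "4 billion" is assumed to mean 2^32 (the first value past the old 32-bit range), and the 50k system reservation is the figure floated in the quoted message.

```python
# Hypothetical constants illustrating the quoted two-range proposal.
# None of these names exist in PostgreSQL.
MAX_RELFILENUMBER = 2**56 - 1      # top of the 56-bit relfilenumber space
SYSTEM_RANGE_START = 2**32         # assumed meaning of "4 billion"
SYSTEM_RESERVED = 50_000           # slots reserved for system catalogs
USER_RANGE_START = SYSTEM_RANGE_START + SYSTEM_RESERVED

# pg_largeobject always takes the first user-range value, so two
# 56-bit clusters agree on it and no other catalog can collide with it.
PG_LARGEOBJECT_RELFILENUMBER = USER_RANGE_START


def classify(relfilenumber: int) -> str:
    """Return which range a relfilenumber falls into under this sketch."""
    if not 0 <= relfilenumber <= MAX_RELFILENUMBER:
        raise ValueError("relfilenumber outside the 56-bit space")
    if relfilenumber < SYSTEM_RANGE_START:
        return "legacy-32bit"      # only an old 32-bit cluster uses these
    if relfilenumber < USER_RANGE_START:
        return "system"
    return "user"


def conflicts(old_cluster_files, new_cluster_files):
    """Relfilenumbers present in both clusters (would clash on disk)."""
    return set(old_cluster_files) & set(new_cluster_files)
```

Under these assumptions, the relfilenumbers of a 32-bit old cluster never overlap with anything the new cluster allocates, and the smallest new-cluster file name already has 10 digits, which is the drawback raised in the quoted reply.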