Re: making relfilenodes 56 bits - Mailing list pgsql-hackers
From | Dilip Kumar |
---|---|
Subject | Re: making relfilenodes 56 bits |
Date | |
Msg-id | CAFiTN-usmDZxVjsdaAQ1wBa8DoGGUwx6uAOb0gnf60GdokF6FA@mail.gmail.com |
In response to | Re: making relfilenodes 56 bits (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses | Re: making relfilenodes 56 bits |
List | pgsql-hackers |
On Thu, Aug 4, 2022 at 5:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > One solution to all this is to do as Dilip proposes here: for system
> > relations, keep assigning the OID as the initial relfilenumber.
> > Actually, we really only need to do this for pg_largeobject; all the
> > other relfilenumber values could be assigned from a counter, as long
> > as they're assigned from a range distinct from what we use for user
> > relations.
> >
> > But I don't really like that, because I feel like the whole thing
> > where we start out with relfilenumber=oid is a recipe for hidden bugs.
> > I believe we'd be better off if we decouple those concepts more
> > thoroughly. So here's another idea: what if we set the
> > next-relfilenumber counter for the new cluster to the value from the
> > old cluster, and then rewrote all the (thus-far-empty) system tables?
>
> You mean in a new cluster start the next-relfilenumber counter from
> the highest relfilenode/Oid value in the old cluster right?. Yeah, if
> we start next-relfilenumber after the range of the old cluster then we
> can also avoid the logic of SetNextRelFileNumber() during upgrade.
>
> My very initial idea around this was to start the next-relfilenumber
> directly from the 4 billion in the new cluster so there can not be any
> conflict and we don't even need to identify the highest value of used
> relfilenode in the old cluster. In fact we don't need to rewrite the
> system table before upgrading I think. So what do we lose with this?
> just 4 billion relfilenode? does that really matter provided the range
> we get with the 56 bits relfilenumber.

I think even if we start the range from 4 billion, we cannot avoid keeping two separate ranges for system and user tables; otherwise the next upgrade, where the old and new clusters both have 56-bit relfilenumbers, will get conflicting files. And for the same reason we still have to call SetNextRelFileNumber() during upgrade.

So the idea is that we will have two ranges for relfilenumbers: the system range will start from 4 billion and the user range maybe somewhere around 4.1 billion (I think we can keep it very small though: just reserve 50k relfilenumbers for the system range for future expansion and start the user range from there). So now system tables have no issues, and the user tables from the old cluster have no issues either.

But pg_largeobject might get a conflict when both the old and new clusters are using 56-bit relfilenumbers, because it is possible that in the new cluster some other system table gets the relfilenumber that is used by pg_largeobject in the old cluster. This could be resolved if we allocate pg_largeobject's relfilenumber from the user range, which means this relfilenumber will always be the first value of the user range. So now, if the old and new clusters are both using 56-bit relfilenumbers, then pg_largeobject in both clusters will have the same relfilenumber, and if the old cluster is using the current 32-bit relfilenode system, then the whole range of the new cluster is completely different from that of the old cluster.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
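To make the two-range proposal above concrete, here is a toy model of it. It is only a sketch and not PostgreSQL code: every name and constant is hypothetical, "4 billion" is assumed to mean 2^32 (just past any 32-bit relfilenode), the 50k reservation is the figure floated in the message, and `set_next_user` is merely a stand-in for whatever SetNextRelFileNumber() would do during upgrade.

```python
"""Toy model of the two-range relfilenumber scheme (not PostgreSQL code)."""

# All names and constants here are hypothetical, chosen for illustration.
SYSTEM_RANGE_START = 2**32      # assumed meaning of "4 billion"
SYSTEM_RANGE_SIZE = 50_000      # reservation size suggested in the message
USER_RANGE_START = SYSTEM_RANGE_START + SYSTEM_RANGE_SIZE
MAX_RELFILENUMBER = 2**56 - 1   # largest value that fits in 56 bits

# pg_largeobject always takes the first value of the user range, so it
# gets the same relfilenumber in every cluster built with this scheme.
PG_LARGEOBJECT_RELFILENUMBER = USER_RANGE_START


class RelFileNumberAllocator:
    """Hands out relfilenumbers from two disjoint ranges."""

    def __init__(self):
        self.next_system = SYSTEM_RANGE_START
        # Skip the slot reserved for pg_largeobject.
        self.next_user = USER_RANGE_START + 1

    def allocate_system(self):
        # System tables must never spill into the user range.
        if self.next_system >= USER_RANGE_START:
            raise RuntimeError("system relfilenumber range exhausted")
        value = self.next_system
        self.next_system += 1
        return value

    def allocate_user(self):
        if self.next_user > MAX_RELFILENUMBER:
            raise RuntimeError("user relfilenumber range exhausted")
        value = self.next_user
        self.next_user += 1
        return value

    def set_next_user(self, value):
        # Assumed behaviour: move the user counter forward past values
        # already used by the old cluster, never backward.
        if not (USER_RANGE_START < value <= MAX_RELFILENUMBER):
            raise ValueError("value outside the user range")
        self.next_user = max(self.next_user, value)
```

In this model, an old cluster on 32-bit relfilenodes only has values below `SYSTEM_RANGE_START`, so nothing the allocator hands out can collide with it. When both clusters use the scheme, system and user values stay in disjoint ranges and `PG_LARGEOBJECT_RELFILENUMBER` is identical on both sides.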