Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: making relfilenodes 56 bits
Date
Msg-id CAFiTN-usmDZxVjsdaAQ1wBa8DoGGUwx6uAOb0gnf60GdokF6FA@mail.gmail.com
In response to Re: making relfilenodes 56 bits  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: making relfilenodes 56 bits
List pgsql-hackers
On Thu, Aug 4, 2022 at 5:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > One solution to all this is to do as Dilip proposes here: for system
> > relations, keep assigning the OID as the initial relfilenumber.
> > Actually, we really only need to do this for pg_largeobject; all the
> > other relfilenumber values could be assigned from a counter, as long
> > as they're assigned from a range distinct from what we use for user
> > relations.
> >
> > But I don't really like that, because I feel like the whole thing
> > where we start out with relfilenumber=oid is a recipe for hidden bugs.
> > I believe we'd be better off if we decouple those concepts more
> > thoroughly. So here's another idea: what if we set the
> > next-relfilenumber counter for the new cluster to the value from the
> > old cluster, and then rewrote all the (thus-far-empty) system tables?
>
> You mean in a new cluster start the next-relfilenumber counter from
> the highest relfilenode/Oid value in the old cluster right?.  Yeah, if
> we start next-relfilenumber after the range of the old cluster then we
> can also avoid the logic of SetNextRelFileNumber() during upgrade.
>
> My very initial idea around this was to start the next-relfilenumber
> directly from the 4 billion in the new cluster so there can not be any
> conflict and we don't even need to identify the highest value of used
> relfilenode in the old cluster.  In fact we don't need to rewrite the
> system table before upgrading I think.  So what do we lose with this?
> just 4 billion relfilenode? does that really matter provided the range
> we get with the 56 bits relfilenumber.

I think even if we start the range from 4 billion, we cannot avoid
keeping two separate ranges for system and user tables; otherwise the
next upgrade, where the old and new clusters both have 56-bit
relfilenumbers, will get conflicting files.  And for the same reason we
still have to call SetNextRelFileNumber() during upgrade.

So the idea is that we will have two ranges for relfilenumbers: the
system range will start from 4 billion, and the user range maybe
somewhere around 4.1 billion (I think we can keep it very small though,
just reserve 50k relfilenumbers for the system for future expansion and
start the user range from there).
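
To make the layout concrete, here is a rough sketch of how the two
ranges could be expressed.  All names below are hypothetical and not
taken from any patch; I am also assuming that "4 billion" means 2^32
(just past the 32-bit range) and that the system range is exactly 50k
wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and values, for illustration only. */
typedef uint64_t RelFileNumber;

/* Largest value representable in 56 bits. */
#define MAX_RELFILENUMBER           ((UINT64_C(1) << 56) - 1)
/* Assumption: "4 billion" means 2^32, just past any 32-bit relfilenode. */
#define FIRST_SYSTEM_RELFILENUMBER  UINT64_C(4294967296)
/* Assumption: reserve 50k values for system relations. */
#define FIRST_USER_RELFILENUMBER    (FIRST_SYSTEM_RELFILENUMBER + UINT64_C(50000))

typedef enum
{
    RANGE_LEGACY_32BIT,     /* values an old 32-bit cluster could have used */
    RANGE_SYSTEM,           /* reserved for system relations */
    RANGE_USER              /* everything else, up to 2^56 - 1 */
} RelFileRange;

/* Classify a relfilenumber into one of the three disjoint ranges. */
static RelFileRange
relfilenumber_range(RelFileNumber n)
{
    assert(n <= MAX_RELFILENUMBER);

    if (n < FIRST_SYSTEM_RELFILENUMBER)
        return RANGE_LEGACY_32BIT;
    if (n < FIRST_USER_RELFILENUMBER)
        return RANGE_SYSTEM;
    return RANGE_USER;
}
```

With this split, nothing coming from a 32-bit cluster can land in
either new range, and the system range costs only 50k values out of
2^56.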

So now the system tables have no issues, and the user tables from the
old cluster have no issues either.  But pg_largeobject might get a
conflict when both the old and new clusters are using 56-bit
relfilenumbers, because it is possible that in the new cluster some
other system table gets the relfilenumber that is used by
pg_largeobject in the old cluster.

This could be resolved if we allocate pg_largeobject's relfilenumber
from the user range, meaning that this relfilenumber will always be the
first value of the user range.  So now, if the old and new clusters
are both using 56-bit relfilenumbers, then pg_largeobject will have
the same relfilenumber in both clusters, and if the old cluster is
using the current 32-bit relfilenode system, then the whole range of
the new cluster is completely different from that of the old cluster.
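
As a toy illustration of the pg_largeobject case (again with
hypothetical names, and assuming 2^32 as the system range start and a
50k-wide system range), the check below reports whether any
relfilenumber carried over from the old cluster collides with one
already taken by a system table in the new cluster.  It is only a
sketch of the argument, not how pg_upgrade actually checks anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and values, for illustration only. */
typedef uint64_t RelFileNumber;

#define FIRST_SYSTEM_RELFILENUMBER  UINT64_C(4294967296)
#define FIRST_USER_RELFILENUMBER    (FIRST_SYSTEM_RELFILENUMBER + UINT64_C(50000))
/* Proposal: pg_largeobject always gets the first value of the user range. */
#define PG_LARGEOBJECT_RELFILENUMBER FIRST_USER_RELFILENUMBER

/*
 * Return true if any relfilenumber preserved from the old cluster is
 * already used by one of the new cluster's other system tables.
 */
static bool
upgrade_has_conflict(const RelFileNumber *preserved, size_t npreserved,
                     const RelFileNumber *new_system, size_t nnew_system)
{
    assert(preserved != NULL || npreserved == 0);
    assert(new_system != NULL || nnew_system == 0);

    for (size_t i = 0; i < npreserved; i++)
    {
        for (size_t j = 0; j < nnew_system; j++)
        {
            if (preserved[i] == new_system[j])
                return true;
        }
    }
    return false;
}
```

If the old cluster's pg_largeobject had been assigned from the system
range, a different system table in the new cluster could hold the same
value and the check would fire; with pg_largeobject pinned to the first
user-range value, the preserved set never overlaps the new cluster's
system range.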

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com