Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: making relfilenodes 56 bits
Msg-id CAA4eK1+1qS7RbcaVe7YrcZDu9=tv-YyRA1-b8sGK5-MnkTFubg@mail.gmail.com
In response to Re: making relfilenodes 56 bits  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, Aug 22, 2022 at 1:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > There was also an issue where a user table's relfilenode from the old
> > > cluster could conflict with a system table of the new cluster. As a
> > > solution, currently for system table objects (while creating their
> > > storage for the first time) we keep to the low range of
> > > relfilenumbers, basically using the same relfilenumber as the OID, so
> > > that during upgrade a normal user table from the old cluster will not
> > > conflict with the system tables in the new cluster.  But with this
> > > solution Robert pointed out to me (in an off-list chat) a problem: if
> > > in the future we want to make relfilenumbers completely unique within
> > > a cluster by implementing CREATEDB differently, then we cannot do
> > > that, as we have created fixed relfilenodes for the system tables.
> > >
> > > I am not sure what exactly we can do to avoid that, because even if
> > > we do something to avoid it in the new cluster, the old cluster might
> > > already be using the non-unique relfilenodes, so after upgrading the
> > > new cluster will also get those non-unique relfilenodes.
> >
> > I think this aspect of the patch could use some more discussion.
> >
> > To recap, the problem is that pg_upgrade mustn't discover that a
> > relfilenode that is being migrated from the old cluster is being used
> > for some other table in the new cluster. Since the new cluster should
> > only contain system tables that we assume have never been rewritten,
> > they'll all have relfilenodes equal to their OIDs, and thus less than
> > 16384. On the other hand all the user tables from the old cluster will
> > have relfilenodes greater than 16384, so we're fine. pg_largeobject,
> > which also gets migrated, is a special case. Since we don't change OID
> > assignments from version to version, it should have either the same
> > relfilenode value in the old and new clusters, if never rewritten, or
> > else the value in the old cluster will be greater than 16384, in which
> > case no conflict is possible.
> >
> > But if we just assign all relfilenode values from a central counter,
> > then we have got trouble. If the new version has more system catalog
> > tables than the old version, then some value that got used for a user
> > table in the old version might get used for a system table in the new
> > version, which is a problem. One idea for fixing this is to have two
> > RelFileNumber ranges: a system range (small values) and a user range.
> > System tables get values in the system range initially, and in the
> > user range when first rewritten. User tables always get values in the
> > user range. Everything works fine in this scenario except maybe for
> > pg_largeobject: what if it gets one value from the system range in the
> > old cluster, and a different value from the system range in the new
> > cluster, but some other system table in the new cluster gets the value
> > that pg_largeobject had in the old cluster? Then we've got trouble.
> >
>
> To solve that problem, how about rewriting the system table in the new
> cluster that has a conflicting relfilenode? I think we can probably do
> this conflict checking before processing the tables from the old
> cluster.
>

I think that while rewriting a system table during the upgrade, we need
to ensure that it gets its relfilenumber from the system range;
otherwise, if we allocate it from the user range, there will be a
chance of conflict with the user tables from the old cluster. Another
way could be to set the next-relfilenumber counter for the new cluster
to the value from the old cluster, as Robert mentioned in his previous
email [1].

[1] -
https://www.postgresql.org/message-id/CA%2BTgmoYsNiF8JGZ%2BKp7Zgcct67Qk%2B%2BYAp%2B1ybOQ0qomUayn%2B7A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.


