Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From Robert Haas
Subject Re: making relfilenodes 56 bits
Msg-id CA+Tgmob-J_70e47imyLV3Wr5Q8h21ijh=+QMsjx_hA2LMcC=gg@mail.gmail.com
In response to Re: making relfilenodes 56 bits  (Dilip Kumar <dilipbalaut@gmail.com>)
On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think even if we start the range from the 4 billion we can not avoid
> keeping two separate ranges for system and user tables otherwise the
> next upgrade where old and new clusters both have 56 bits
> relfilenumber will get conflicting files.  And, for the same reason we
> still have to call SetNextRelFileNumber() during upgrade.

Well, my proposal to move everything from the new cluster up to higher
numbers would address this without requiring two ranges.

> So the idea is, we will be having 2 ranges for relfilenumbers, system
> range will start from 4 billion and user range maybe something around
> 4.1 (I think we can keep it very small though, just reserve 50k
> relfilenumber for system for future expansion and start user range
> from there).

A disadvantage of this is that it basically means all the file names
in new clusters are going to be 10 characters long. That's not a big
disadvantage, but it's not wonderful. File names that are only 5-7
characters long are common today, and easier to remember.
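
To put rough numbers on the file name length, here is a small Python
sketch (not PostgreSQL code). It assumes "4 billion" means exactly
4,000,000,000, uses the 50k system reserve suggested in the quoted text,
and takes 16384 (the first normal object ID in current releases) as a
typical low relfilenode today. The constant and function names are made
up for illustration.

```python
# Illustrative only: relation files are named after the decimal
# relfilenumber, so the name length is just the digit count.
SYSTEM_RANGE_START = 4_000_000_000   # "4 billion", taken as exact (assumption)
SYSTEM_RANGE_SIZE = 50_000           # reserve suggested in the quoted text
USER_RANGE_START = SYSTEM_RANGE_START + SYSTEM_RANGE_SIZE
MAX_RELFILENUMBER = 2**56 - 1        # largest value that fits in 56 bits


def file_name_length(relfilenumber: int) -> int:
    """Number of characters in the decimal file name."""
    return len(str(relfilenumber))


lengths = {
    "low relfilenode today": file_name_length(16384),
    "start of system range": file_name_length(SYSTEM_RANGE_START),
    "start of user range": file_name_length(USER_RANGE_START),
    "largest 56-bit value": file_name_length(MAX_RELFILENUMBER),
}

for label, length in lengths.items():
    print(f"{label}: {length} characters")
```

Under these assumptions every relation in a new cluster starts at 10
characters, and names only get longer from there.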

> So now system tables have no issues and also the user tables from the
> old cluster have no issues.  But pg_largeobject might get conflict
> when both old and new cluster are using 56 bits relfilenumber, because
> it is possible that in the new cluster some other system table gets
> that relfilenumber which is used by pg_largeobject in the old cluster.
>
> This could be resolved if we allocate pg_largeobject's relfilenumber
> from the user range, that means this relfilenumber will always be the
> first value from the user range.  So now if the old and new cluster
> both are using 56bits relfilenumber then pg_largeobject in both
> cluster would have got the same relfilenumber and if the old cluster
> is using the current 32 bits relfilenode system then the whole range
> of the new cluster is completely different than that of the old
> cluster.

I think this can work, but it does rely to some extent on the fact
that there are no other tables which need to be treated like
pg_largeobject. If there were others, they'd need fixed starting
RelFileNumber assignments, or some other trick, like renumbering them
twice in the cluster, first to a known-unused value and then back to
the proper value. Without some hackery, you'd have trouble if in the
old cluster pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and
in the new cluster the reverse.
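
The swapped-numbers case can be sketched in Python as a toy model. This
is not how pg_upgrade works; it only simulates files as a dict from
number to relation name. pg_largeobject2 is the hypothetical second
table from the example above, and the scratch range starting at 4bn +
1,000,000 is simply assumed to be unused.

```python
BASE = 4_000_000_000  # "4bn", taken as exact (assumption)


def move(files: dict, src: int, dst: int) -> None:
    """Rename one file; refuse to overwrite an existing one."""
    if dst in files:
        raise FileExistsError(f"{dst} is already used by {files[dst]}")
    files[dst] = files.pop(src)


def renumber_twice(files: dict, target: dict, scratch_start: int) -> dict:
    """Move misplaced relations to scratch numbers, then to their targets."""
    misplaced = sorted(
        (num, name) for num, name in files.items() if target[name] != num
    )
    staged = []
    # Pass 1: park each misplaced relation on a known-unused number.
    for offset, (num, name) in enumerate(misplaced):
        tmp = scratch_start + offset
        move(files, num, tmp)
        staged.append((tmp, name))
    # Pass 2: every target slot is now free, so the final moves cannot collide.
    for tmp, name in staged:
        move(files, tmp, target[name])
    return files


# New cluster has the two numbers reversed relative to the old cluster.
new_cluster = {BASE + 2: "pg_largeobject", BASE + 1: "pg_largeobject2"}
old_numbers = {"pg_largeobject": BASE + 1, "pg_largeobject2": BASE + 2}

# A direct rename fails because the destination is still occupied.
try:
    move(dict(new_cluster), BASE + 2, BASE + 1)
    direct_move_failed = False
except FileExistsError:
    direct_move_failed = True

result = renumber_twice(dict(new_cluster), old_numbers, BASE + 1_000_000)
```

The extra pass is the "hackery": it doubles the number of renames for
the affected relations and needs a range that is guaranteed free in both
clusters.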

I do feel like your idea here has some advantages - my proposal
requires rewriting all the catalogs in the new cluster before we do
anything else, and that's going to take some time even though they
should be small. But I also feel like it has some disadvantages: it
seems to rely on complicated reasoning and special cases more than I'd
like.

What do other people think?

-- 
Robert Haas
EDB: http://www.enterprisedb.com
