Thread: Large files for relations
Big PostgreSQL databases use, and regularly open/close, huge numbers of file descriptors and directory entries for various anachronistic reasons, one of which is the 1GB RELSEG_SIZE thing. The segment management code is trickier than you might think and also still harbours known bugs. A nearby analysis of yet another obscure segment life cycle bug reminded me of this patch set to switch to simple large files and eventually drop all that.

I originally meant to develop the attached sketch-quality code further and try proposing it in the 16 cycle, while I was down the modernisation rabbit hole[1], but then I got side-tracked: at some point I believed that the 56 bit relfilenode thing might be necessary for correctness, but then I found a set of rules that seem to hold up without it. I figured I might as well post what I have early in the 17 cycle as a "concept" patch to see which way the flames blow.

There are various boring details due to Windows, then a load of fairly obvious changes, and then a whole can of worms about how we'd handle the transition for the world's fleet of existing databases. I'll cut straight to that part. Different choices on aggressiveness could be made, but here are the straw-man answers I came up with so far:

1. All new relations would be in large format only. No 16384.N files, just 16384, which can grow to MaxBlockNumber * BLCKSZ.

2. The existence of a file 16384.1 means that this smgr relation is in the legacy segmented format that came from pg_upgrade (note that we don't unlink that file once it exists, even when truncating the fork, until we eventually drop the relation).

3. Forks that were pg_upgrade'd from earlier releases using hard links or reflinks would implicitly be in large format if they only had one segment; otherwise they could stay in the traditional format for a grace period of N major releases, after which we'd plan to drop segment support.
pg_upgrade's [ref]link mode would therefore be the only way to get a segmented relation, other than a developer-only trick for testing/debugging.

4. Every opportunity to convert a multi-segment fork to large format would be taken: pg_upgrade in copy mode, basebackup, COPY DATABASE, VACUUM FULL, TRUNCATE, etc. You can see approximately working sketch versions of all the cases I thought of so far in the attached.

5. The main places that do file-level copying of relations would use copy_file_range() to do the splicing, so that on file systems that are smart enough (XFS, ZFS, BTRFS, ...), with qualifying source and destination, the operation can be very fast; other degrees of optimisation are available to the kernel too, even for file systems without block sharing magic (pushing down block range copies to hardware/network storage, etc). The copy_file_range() stuff could also be proposed independently (I vaguely recall it was discussed a few times before); it's just that it really comes into its own when you start splicing files together, as needed here, and it's also been adopted by FreeBSD with the same interface as Linux, with an efficient implementation in bleeding edge ZFS there.

Stepping back, the main ideas are: (1) for some users of large databases, the conversion would be done painlessly at upgrade time, without their even really noticing, using modern file system facilities where possible for speed; (2) anyone who wants to defer it, because of a lack of fast copy_file_range() and a desire to avoid prolonged downtime by using links or reflinks, can put concatenation off for the next N releases, giving a total of 5 + N years of option to defer the work, and in that case there are also many ways to proactively change to large format before the time comes, with varying degrees of granularity and disruption. For example, set up a new replica and fail over, or VACUUM FULL tables one at a time, etc.
There are plenty of things left to do in this patch set: pg_rewind doesn't understand optional segmentation yet, there are probably more things like that, and I expect there are some ssize_t vs pgoff_t confusions I missed that could bite a 32 bit system. But you can see the basics working on a typical system.

I am not aware of any modern/non-historic filesystem[2] that can't do large files with ease. Anyone know of anything to worry about on that front? I think the main collateral damage would be weird old external tools, like some weird old version of Windows tar I occasionally see mentioned, that sort of thing, but that'd just be another case of "well don't use that then", I guess? What else might we need to think about, outside PostgreSQL? What other problems might occur inside PostgreSQL?

Clearly we'd need to figure out a decent strategy to automate testing of all of the relevant transitions. We could test the splicing code paths with an optional test suite that you might enable along with a small segment size (as we're already testing on CI and probably the buildfarm after the last round of segmentation bugs). To test the messy Windows off_t API stuff convincingly, we'd need actual > 4GB files, I think? Maybe doable cheaply with file system hole punching tricks.

Speaking of file system holes, this patch set doesn't touch buffile.c. That code wants to use segments for two extra purposes: (1) parallel CREATE INDEX merges workers' output using segmentation tricks as if there were holes in the file; this could perhaps be replaced with large files that make use of actual OS-level holes, but I didn't feel like additionally claiming that all computers have sparse files -- perhaps another approach is needed anyway; (2) buffile.c deliberately spreads large buffiles across multiple temporary tablespaces using segments, supposedly for space management reasons.
So although it initially looks like a nice safe little place to start using large files, we'd need an answer to those design choices first.

/me dons flameproof suit and goes back to working on LLVM problems for a while

[1] https://wiki.postgresql.org/wiki/AllComputers
[2] https://en.wikipedia.org/wiki/Comparison_of_file_systems
Attachment
- 0001-Assert-that-pgoff_t-is-wide-enough.patch
- 0002-Use-pgoff_t-in-system-call-replacements-on-Windows.patch
- 0003-Support-large-files-on-Windows-in-our-VFD-API.patch
- 0004-Use-pgoff_t-instead-of-off_t-in-more-places.patch
- 0005-Use-large-files-for-relation-storage.patch
- 0006-Detect-copy_file_range-function.patch
- 0007-Use-copy_file_range-to-implement-copy_file.patch
- 0008-Teach-copy_file-to-concatenate-segmented-files.patch
- 0009-Use-copy_file_range-in-pg_upgrade.patch
- 0010-Teach-pg_upgrade-to-concatenate-segmented-files.patch
- 0011-Teach-basebackup-to-concatenate-segmented-files.patch
Hi
I like this patch - it can save some system resources - though I am not sure how much, because bigger tables usually use partitioning.
Important note - this feature breaks sharing of files on the backup side - so before dropping the 1GB segment files, this issue should be solved.
Regards
Pavel
On Tue, May 2, 2023 at 3:28 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:

> I like this patch - it can save some system resources - I am not sure how much, because bigger tables usually use partitioning.

Yeah, if you only use partitions of < 1GB it won't make a difference. Larger partitions are not uncommon, though.

> Important note - this feature breaks sharing files on the backup side - so before disabling 1GB sized files, this issue should be solved.

Hmm, right, so there is a backup granularity continuum with "whole database cluster" at one end, "only files whose size, mtime [or optionally also checksum] changed since last backup" in the middle, and "only blocks that changed since the LSN of the last backup" at the other end. Getting closer to the right end of that continuum can make backups require less reading, less network transfer, less writing and/or less storage space, depending on details. But this proposal moves the middle option further to the left, by changing the granularity from 1GB to whole relation, which can be gargantuan with this patch. Ultimately we need to be all the way at the right of that continuum, and there are clearly several people working on that goal. I'm not involved in any of those projects, but it's fun to think about an alien technology that produces complete standalone backups like rsync --link-dest (as opposed to "full" backups followed by a chain of "incremental" backups that depend on them, so you need to retain them carefully) while still sharing disk blocks with older backups, and doing so with block granularity. TL;DW something something WAL something something copy_file_range().
On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:

> rsync --link-dest

I wonder if rsync will grow a mode that can use copy_file_range() to share blocks with a reference file (= previous backup). Something like --copy-range-dest. That'd work for large-file relations (assuming a file system that has block sharing, like XFS and ZFS). You wouldn't get the "mtime is enough, I don't even need to read the bytes" optimisation, which I assume makes all database hackers feel a bit queasy anyway, but you'd get the space savings via the usual rolling checksum, or a cheaper version that only looks for strong checksum matches at the same offset, or whatever other tricks rsync might have up its sleeve.
On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> rsync --link-dest
I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup). Something
like --copy-range-dest. That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.
I understand the need to reduce open file handles, despite the possibilities enabled by using large numbers of small files. Snowflake, for instance, sees everything in 1MB chunks, which makes massively parallel sequential scans (Snowflake's _only_ query plan) possible, though I don't know whether they accomplish that via separate files or via segments within a large file.
I am curious whether a move like this to create a generational change in file format shouldn't be more ambitious, perhaps altering the block format to insert a block format version number, whether that be at every block, every megabyte, or some other interval, and whether we store it in-file or in a separate file accompanying the first non-segmented one. Having such versioning information would allow blocks of different formats to co-exist in the same table, which could be critical to future changes such as 64 bit XIDs, etc.
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > rsync --link-dest

... rsync isn't really a safe tool to use for PG backups by itself unless you're using it with archiving and with start/stop backup and with checksums enabled.

> > I wonder if rsync will grow a mode that can use copy_file_range() to
> > share blocks with a reference file (= previous backup). [...]

There are also really good reasons to have multiple full backups, and not just a single full backup and then lots and lots of incrementals, which basically boils down to "are you really sure that one copy of that one really important file won't ever disappear from your backup repository..?"

That said, pgbackrest does now have block-level incremental backups (where we define our own block size ...), and there are reasons we decided against going down the LSN-based approach (not the least of which is that the LSN isn't always updated...). But long story short, moving to larger than 1G files should be something that pgbackrest will be able to handle without as much impact as there would have been previously in terms of incremental backups.

There is a loss in the ability to use mtime to scan just the parts of the relation that changed, and that's unfortunate, but I wouldn't see it as really a game changer (and yes, there's certainly an argument for not trusting mtime, though I don't think we've yet had a report of an mtime issue that our mtime-validity checking didn't catch and force pgbackrest into checksum-based revalidation automatically, resulting in an invalid backup... of course, not enough people test their backups...).

> I understand the need to reduce open file handles, despite the
> possibilities enabled by using large numbers of small files.

I'm also generally in favor of reducing the number of open file handles that we have to deal with. Addressing the concerns raised nearby about weird corner cases, such as non-1G length ABCDEF.1 files existing while ABCDEF.2, and later, files exist, is certainly another good argument in favor of getting rid of segments.

> I am curious whether a move like this to create a generational change in
> file format shouldn't be more ambitious, perhaps altering the block
> format to insert a block format version number [...]

To the extent you're interested in this, there are patches posted which are already trying to move us in a direction that would allow for different page formats, adding in space for other features such as 64bit XIDs, better checksums, and TDE tags:

https://commitfest.postgresql.org/43/3986/

Currently those patches expect the page format to be declared at initdb time, but the way they're currently written that's more of a soft requirement, since you can tell on a per-page basis what features are enabled for that page. It might make sense to support it in that form first anyway, before going down the more ambitious route of allowing different pages to have different sets of features enabled concurrently.

When it comes to 'a separate file', we do have forks already, and those serve a very valuable but distinct use case where you can get information from the much smaller fork (be it the FSM or the VM or some future thing), while something like 64bit XIDs or a stronger checksum is something you'd really need on every page. I have serious doubts about a proposal where we'd store information needed on every page read in some far-away block that's still in the same file, such as one every 1MB, as that would turn every block access into two.

Thanks,

Stephen
On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?

There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with, say, a 20TB table running on ext4 be impacted by this change?
On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
> On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I am not aware of any modern/non-historic filesystem[2] that can't do
>> large files with ease. Anyone know of anything to worry about on that
>> front?
>
> There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with say a 20TB table running on ext4 be impacted by this change?

Hrmph. Yeah, that might be a bit of a problem. I see it discussed in various places that MySQL/InnoDB can't have tables bigger than 16TB on ext4 because of this, when it's in its default one-file-per-object mode (as opposed to its big-tablespace-files-to-hold-all-the-objects mode like DB2, Oracle etc, in which case I think you can have multiple 16TB segment files and get past that ext4 limit). It's frustrating, because 16TB is still really, really big and you probably should be using partitions, or more partitions, to avoid all kinds of other scalability problems at that size. But however hypothetical the scenario might be, it should work, and this is certainly a plausible argument against the "aggressive" plan described above with the hard cut-off where we get to drop the segmented mode.

Concretely, a 20TB pg_upgrade in copy mode would fail while trying to concatenate with the above patches, so you'd have to use link or reflink mode (you'd probably want to use that anyway due to the sheer volume of data to copy otherwise, since ext4 is also not capable of block-range sharing), but then you'd be out of luck after N future major releases, according to that plan where we start deleting the code, so you'd need to organise some smaller partitions before that time comes. Or pg_upgrade to a target on xfs etc.
I wonder if a future version of extN will increase its max file size.

A less aggressive version of the plan would be that we just keep the segment code for the foreseeable future with no planned cut-off, and make all of those "piggyback" transformations that I showed in the patch set optional. For example, I had it so that CLUSTER would quietly convert your relation to large format if it was still in segmented format (might as well if you're writing all the data out anyway, right?), but perhaps that could depend on a GUC. Likewise for base backup. Etc. Then someone concerned about hitting the 16TB limit on ext4 could opt out. Or something like that. It seems funny, though: that's exactly the user who should want this feature (they have 16,000 relation segment files).
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
>> There is some trouble in the ambiguity of what we mean by "modern" and
>> "large files". There are still a large number of users of ext4 where
>> the max file size is 16TB. […]
>
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional. […] Then someone concerned about hitting the 16TB
> limit on ext4 could opt out. Or something like that. It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).

If we're going to have to keep the segment code for the foreseeable future anyway, could we not get most of the benefit by increasing the segment size to something like 1TB? The vast majority of tables would fit in one file, and there would be less risk of hitting filesystem limits.

- ilmari
On Thu, May 11, 2023 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
> On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I am not aware of any modern/non-historic filesystem[2] that can't do
>> large files with ease. Anyone know of anything to worry about on that
>> front?
>
> There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with say a 20TB table running on ext4 be impacted by this change?
Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode like DB2, Oracle etc, in which case I think you can have multiple
16TB segment files and get past that ext4 limit). It's frustrating
because 16TB is still really, really big and you probably should be
using partitions, or more partitions, to avoid all kinds of other
scalability problems at that size. But however hypothetical the
scenario might be, it should work,
Agreed, it is frustrating, but it is not hypothetical. I have seen a number of
users with single tables larger than 16TB who don't use partitioning because
of the limitations we have today. The most common reason is needing multiple
unique constraints on the table that don't include the partition key. Something
like a user_id and email. There are workarounds for those cases, but usually
it's easier to deal with a single large table than to deal with the sharp edges
those workarounds introduce.
Greetings,

* Dagfinn Ilmari Mannsåker (ilmari@ilmari.org) wrote:
> If we're going to have to keep the segment code for the foreseeable
> future anyway, could we not get most of the benefit by increasing the
> segment size to something like 1TB? The vast majority of tables would
> fit in one file, and there would be less risk of hitting filesystem
> limits.

While I tend to agree that 1GB is too small, 1TB seems like it's possibly going to end up on the too-big side of things; or at least, if we aren't getting rid of the segment code, then it's possibly throwing away the benefits we have from the smaller segments without really giving us all that much. Going from 1G to 10G would reduce the number of open file descriptors by quite a lot without having much of a net change on other things. 50G or 100G would reduce the FD handles further but starts to make us lose out a bit more on some of the nice parts of having multiple segments. Just some thoughts.

Thanks,

Stephen
Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
I have a vague memory that filesystems have improved in this regard.
Mark Callaghan
mdcallag@gmail.com
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.

(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Right, 80s file systems like UFS (and I suspect ext and ext2, which were probably based on similar ideas and ran on non-SMP machines?) used coarse-grained locking, including at the vnode/inode level. Then over time various OSes and file systems have improved concurrency.

Brief digression, as someone who got started on IRIX in the 90s and still thinks those were probably the coolest computers: at SGI, first they replaced SysV UFS with EFS (E for extent-based allocation) and invented O_DIRECT to skip the buffer pool, and then blew the doors off everything with XFS, which maximised I/O concurrency and possibly (I guess, it's not open source so who knows?) involved a revamped VFS with weaker inode locking, motivated by monster IRIX boxes with up to 1024 CPUs and huge storage arrays. In the Linux ext3 era, I remember hearing lots of reports of various kinds of large systems going faster just by switching to XFS, and there is lots of writing about that. ext4 certainly changed enormously. One reason back in those days (mid 2000s?)
was the old fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file thing, and another was the lack of write concurrency, especially for direct I/O, and probably lots more things. But that's all ancient history...

As for ext4, we've detected and debugged clues about the gradual weakening of locking over time on this list: we know that concurrent read/write to the same page of a file was previously atomic, but when we switched to pread/pwrite for most data (ie not making use of the current file position), it ceased to be (a concurrent reader can see a mash-up of old and new data, with visible cache-line-ish stripes in it, so there isn't even a write lock for the page); then we noticed that in later kernels even read/write ceased to be atomic (implicating a change in file size/file position interlocking, I guess). I also vaguely recall reading on here a long time ago that lseek() performance was dramatically improved with weaker inode interlocking, perhaps even in response to this very program's pathological SEEK_END call frequency (something I hope to fix, but I digress). So I think it's possible that the effect you mentioned is gone?

I can think of a few differences compared to those other RDBMSs. There the discussion was about one-file-per-relation vs one-big-file-for-everything, whereas we're talking about one-file-per-relation vs many-files-per-relation (which doesn't change the point much, just making clear that I'm not proposing a 42PB file to hold everything, so you can still partition to get different files). We also usually call fsync in series in our checkpointer (after first getting the writebacks started with sync_file_range() some time sooner). Currently our code believes that it is not safe to call fdatasync() for files whose size might have changed.
There is no basis for that in POSIX or in any system that I currently know of (though I haven't looked into it seriously), but I believe there was a historical file system that at some point in history interpreted "non-essential meta data" (the stuff POSIX allows it not to flush to disk) to include "the size of the file" (whereas POSIX really just meant that you don't have to synchronise the mtime and similar), which is probably why PostgreSQL has some code that calls fsync() on newly created empty WAL segments to "make sure the indirect blocks are down on disk" before allowing itself to use only fdatasync() later to overwrite it with data. The point being that, for the most important kind of interactive/user facing I/O latency, namely WAL flushes, we already use fdatasync(). It's possible that we could use it to flush relation data too (ie the relation files in question here, usually synchronised by the checkpointer) according to POSIX but it doesn't immediately seem like something that should be at all hot and it's background work. But perhaps I lack imagination. Thanks, thought-provoking stuff.
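(To make that pattern concrete, here is a minimal standalone sketch -- not PostgreSQL's actual code, and the path/sizes are invented for illustration: fsync() once when the zero-filled file is created so its size and allocation metadata are durable, then use the cheaper fdatasync() for later same-size overwrites.)

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a zero-filled "segment" file and fsync() it once, making its
 * size and block allocations durable.  Returns 0 on success.
 */
static int
create_prezeroed_segment(const char *path, size_t size)
{
    char        buf[8192] = {0};
    int         fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    for (size_t written = 0; written < size; written += sizeof(buf))
    {
        size_t      n = size - written < sizeof(buf) ? size - written : sizeof(buf);

        if (write(fd, buf, n) != (ssize_t) n)
        {
            close(fd);
            return -1;
        }
    }
    /* The one full fsync(): flushes the data *and* the file metadata. */
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Overwrite data inside the existing file.  The size doesn't change, so
 * fdatasync() suffices: it must flush the data, and may skip
 * non-essential metadata such as the mtime.
 */
static int
overwrite_and_fdatasync(const char *path, const char *data, size_t len, off_t offset)
{
    int         fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    if (pwrite(fd, data, len, offset) != (ssize_t) len || fdatasync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```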
On Sat, May 13, 2023 at 11:01 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote: > > use XFS and O_DIRECT As for direct I/O, we're only just getting started on that. We currently can't produce more than one concurrent WAL write, and then for relation data, we just got very basic direct I/O support but we haven't yet got the asynchronous machinery to drive it properly (work in progress, more soon). I was just now trying to find out what the state of parallel direct writes is in ext4, and it looks like it's finally happening: https://www.phoronix.com/news/Linux-6.3-EXT4
On Fri, May 12, 2023 at 4:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.
(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)
Yes, although when the decision was made it was probably ext-3 -> XFS. We suffered from fsync a file == fsync the filesystem
because MySQL binlogs use buffered IO and are appended on write. Switching from ext-? to XFS was an easy perf win
so I don't have much experience with ext-? over the past decade.
Right, 80s file systems like UFS (and I suspect ext and ext2, which
The late 80s is when I last hacked on Unix filesystem code, excluding browsing XFS and ext source. Unix was easy back then -- one big kernel lock covered everything.
some time sooner). Currently our code believes that it is not safe to
call fdatasync() for files whose size might have changed. There is no
Long ago we added code for InnoDB to avoid fsync/fdatasync in some cases when O_DIRECT was used. While great for performance
we also forgot to make sure they were still done when files were extended. Eventually we fixed that.
Thanks for all of the details.
Mark Callaghan
mdcallag@gmail.com
On Fri, May 12, 2023 at 9:53 AM Stephen Frost <sfrost@snowman.net> wrote: > While I tend to agree that 1GB is too small, 1TB seems like it's > possibly going to end up on the too big side of things, or at least, > if we aren't getting rid of the segment code then it's possibly throwing > away the benefits we have from the smaller segments without really > giving us all that much. Going from 1G to 10G would reduce the number > of open file descriptors by quite a lot without having much of a net > change on other things. 50G or 100G would reduce the FD handles further > but starts to make us lose out a bit more on some of the nice parts of > having multiple segments. This is my view as well, more or less. I don't really like our current handling of relation segments; we know it has bugs, and making it non-buggy feels difficult. And there are performance issues as well -- file descriptor consumption, for sure, but also probably that crossing a file boundary likely breaks the operating system's ability to do readahead to some degree. However, I think we're going to find that moving to a system where we have just one file per relation fork and that file can be arbitrarily large is not fantastic, either. Jim's point about running into filesystem limits is a good one (hi Jim, long time no see!) and the problem he points out with ext4 is almost certainly not the only one. It doesn't just have to be filesystems, either. It could be a limitation of an archiving tool (tar, zip, cpio) or a file copy utility or whatever as well. A quick Google search suggests that most such things have been updated to use 64-bit sizes, but my point is that the set of things that can potentially cause problems is broader than just the filesystem. Furthermore, even when there's no hard limit at play, a smaller file size can occasionally be *convenient*, as in Pavel's example of using hard links to share storage between backups. 
From that point of view, a 16GB or 64GB or 256GB file size limit seems more convenient than no limit and more convenient than a large limit like 1TB. However, the bugs are the flies in the ointment (ahem). If we just make the segment size bigger but don't get rid of segments altogether, then we still have to fix the bugs that can occur when you do have multiple segments. I think part of Thomas's motivation is to dodge that whole category of problems. If we gradually deprecate multi-segment mode in favor of single-file-per-relation-fork, then the fact that the segment handling code has bugs becomes progressively less relevant. While that does make some sense, I'm not sure I really agree with the approach. The problem is that we're trading problems that we at least theoretically can fix somehow by hitting our code with a big enough hammer for an unknown set of problems that stem from limitations of software we don't control, maybe don't even know about. -- Robert Haas EDB: http://www.enterprisedb.com
Thanks all for the feedback. It was a nice idea and it *almost* works, but it seems like we just can't drop segmented mode. And the automatic transition schemes I showed don't make much sense without that goal. What I'm hearing is that something simple like this might be more acceptable: * initdb --rel-segsize (cf --wal-segsize), default unchanged * pg_upgrade would convert if source and target don't match I would probably also leave out those Windows file API changes, too. --rel-segsize would simply refuse larger sizes until someone does the work on that platform, to keep the initial proposal small. I would probably leave the experimental copy_on_write() ideas out too, for separate discussion in a separate proposal.
On 24.05.23 02:34, Thomas Munro wrote: > Thanks all for the feedback. It was a nice idea and it *almost* > works, but it seems like we just can't drop segmented mode. And the > automatic transition schemes I showed don't make much sense without > that goal. > > What I'm hearing is that something simple like this might be more acceptable: > > * initdb --rel-segsize (cf --wal-segsize), default unchanged makes sense > * pg_upgrade would convert if source and target don't match This would be good, but it could also be an optional or later feature. Maybe that should be a different mode, like --copy-and-adjust-as-necessary, so that users would have to opt into what would presumably be slower than plain --copy, rather than being surprised by it, if they unwittingly used incompatible initdb options. > I would probably also leave out those Windows file API changes, too. > --rel-segsize would simply refuse larger sizes until someone does the > work on that platform, to keep the initial proposal small. Those changes from off_t to pgoff_t? Yes, it would be good to do without those. Apart of the practical problems that have been brought up, this was a major annoyance with the proposed patch set IMO. > I would probably leave the experimental copy_on_write() ideas out too, > for separate discussion in a separate proposal. right
On Wed, May 24, 2023 at 2:18 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > What I'm hearing is that something simple like this might be more acceptable: > > > > * initdb --rel-segsize (cf --wal-segsize), default unchanged > > makes sense +1. > > * pg_upgrade would convert if source and target don't match > > This would be good, but it could also be an optional or later feature. +1. I think that would be nice to have, but not absolutely required. IMHO it's best not to overcomplicate these projects. Not everything needs to be part of the initial commit. If the initial commit happens 2 months from now and then stuff like this gets added over the next 8, that's strictly better than trying to land the whole patch set next March. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote: > On 24.05.23 02:34, Thomas Munro wrote: > > Thanks all for the feedback. It was a nice idea and it *almost* > > works, but it seems like we just can't drop segmented mode. And the > > automatic transition schemes I showed don't make much sense without > > that goal. > > > > What I'm hearing is that something simple like this might be more acceptable: > > > > * initdb --rel-segsize (cf --wal-segsize), default unchanged > > makes sense Agreed, this seems alright in general. Having more initdb-time options to help with certain use-cases rather than having things be compile-time is definitely just generally speaking a good direction to be going in, imv. > > * pg_upgrade would convert if source and target don't match > > This would be good, but it could also be an optional or later feature. Agreed. > Maybe that should be a different mode, like --copy-and-adjust-as-necessary, > so that users would have to opt into what would presumably be slower than > plain --copy, rather than being surprised by it, if they unwittingly used > incompatible initdb options. I'm curious as to why it would be slower than a regular copy..? > > I would probably also leave out those Windows file API changes, too. > > --rel-segsize would simply refuse larger sizes until someone does the > > work on that platform, to keep the initial proposal small. > > Those changes from off_t to pgoff_t? Yes, it would be good to do without > those. Apart of the practical problems that have been brought up, this was > a major annoyance with the proposed patch set IMO. > > > I would probably leave the experimental copy_on_write() ideas out too, > > for separate discussion in a separate proposal. > > right You mean copy_file_range() here, right? Shouldn't we just add support for that today into pg_upgrade, independently of this? Seems like a worthwhile improvement even without the benefit it would provide to changing segment sizes. 
Thanks, Stephen
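(To make the splicing idea concrete: a rough standalone sketch of concatenating segment files with copy_file_range(), falling back to a plain read/write loop where the call isn't supported for the source/destination pair. Function names and paths are invented for illustration; this is not the patch's code, and a real version would need much more careful error handling.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append the whole of src_fd onto dst_fd, preferring copy_file_range(). */
static int
append_file(int dst_fd, int src_fd)
{
    struct stat st;
    off_t       remaining;

    if (fstat(src_fd, &st) < 0)
        return -1;
    remaining = st.st_size;
    while (remaining > 0)
    {
        ssize_t     n = copy_file_range(src_fd, NULL, dst_fd, NULL, remaining, 0);

        if (n < 0)
        {
            if (errno == EXDEV || errno == EINVAL || errno == ENOSYS)
                break;          /* fall back to plain read/write below */
            return -1;
        }
        remaining -= n;
    }
    while (remaining > 0)       /* portable fallback */
    {
        char        buf[65536];
        ssize_t     n = read(src_fd, buf, sizeof(buf));

        if (n <= 0 || write(dst_fd, buf, n) != n)
            return -1;
        remaining -= n;
    }
    return 0;
}

/* Concatenate base.1 .. base.(nsegs-1) onto the end of base. */
static int
concat_segments(const char *base, int nsegs)
{
    int         dst = open(base, O_WRONLY);

    if (dst < 0)
        return -1;
    /* Note: no O_APPEND; copy_file_range() rejects O_APPEND fds. */
    if (lseek(dst, 0, SEEK_END) < 0)
    {
        close(dst);
        return -1;
    }
    for (int i = 1; i < nsegs; i++)
    {
        char        path[4096];
        int         src;

        snprintf(path, sizeof(path), "%s.%d", base, i);
        src = open(path, O_RDONLY);
        if (src < 0 || append_file(dst, src) < 0)
        {
            if (src >= 0)
                close(src);
            close(dst);
            return -1;
        }
        close(src);
    }
    return close(dst);
}

/* Tiny self-check: build two fake 4-byte segments, splice, verify. */
static int
concat_demo(void)
{
    const char *base = "/tmp/concat_demo_seg";
    char        buf[16] = {0};
    int         fd;

    fd = open(base, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0 || write(fd, "AAAA", 4) != 4 || close(fd) != 0)
        return -1;
    fd = open("/tmp/concat_demo_seg.1", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0 || write(fd, "BBBB", 4) != 4 || close(fd) != 0)
        return -1;
    if (concat_segments(base, 2) < 0)
        return -1;
    fd = open(base, O_RDONLY);
    if (fd < 0 || read(fd, buf, sizeof(buf)) != 8 || close(fd) != 0)
        return -1;
    return strcmp(buf, "AAAABBBB") == 0 ? 0 : -1;
}
```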
Attachment
On Thu, May 25, 2023 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote: > * Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote: > > On 24.05.23 02:34, Thomas Munro wrote: > > > * pg_upgrade would convert if source and target don't match > > > > This would be good, but it could also be an optional or later feature. > > Agreed. OK. I do have a patch for that, but I'll put that (+ copy_file_range) aside for now so we can talk about the basic feature. Without that, pg_upgrade just rejects mismatching clusters as it always did, no change required. > > > I would probably also leave out those Windows file API changes, too. > > > --rel-segsize would simply refuse larger sizes until someone does the > > > work on that platform, to keep the initial proposal small. > > > > Those changes from off_t to pgoff_t? Yes, it would be good to do without > > those. Apart of the practical problems that have been brought up, this was > > a major annoyance with the proposed patch set IMO. +1, it was not nice. Alright, since I had some time to kill in an airport, here is a starter patch for initdb --rel-segsize. Some random thoughts: Another potential option name would be --segsize, if we think we're going to use this for temp files too eventually. Maybe it's not so beautiful to have that global variable rel_segment_size (which replaces REL_SEGSIZE everywhere). Another idea would be to make it static in md.c and call smgrsetsegmentsize(), or something like that. That could be a nice place to compute the "shift" value up front, instead of computing it each time in blockno_to_segno(), but that's probably not worth bothering with (?). BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's about the only place where someone could say that this change makes things worse for people not interested in the new feature, so I was careful to get rid of / and % operations with no-longer-constant RHS. 
I had to promote segment size to int64 (global variable, field in control file), because otherwise it couldn't represent --rel-segsize=32TB (it'd be too big by one). Other ideas would be to store the shift value instead of the size, or store the max block number, eg subtract one, or use InvalidBlockNumber to mean "no limit" (with more branches to test for it). The only problem I ran into with the larger type was that 'SHOW segment_size' now needs a custom show function because we don't have int64 GUCs. A C type confusion problem that I noticed: some code uses BlockNumber and some code uses int for segment numbers. It's not really a reachable problem for practical reasons (you'd need over 2 billion directories and VFDs to reach it), but it's wrong to use int if segment size can be set as low as BLCKSZ (one file per block); you could have more segments than an int can represent. We could go for uint32, BlockNumber or create SegmentNumber (which I think I've proposed before, and lost track of...). We can address that separately (perhaps by finding my old patch...)
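(For anyone following along, the shift/mask arithmetic being discussed looks something like this -- a simplified standalone sketch using the function names mentioned above, with made-up values for the globals; not the actual patch code.)

```c
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Set once at startup from the control file; must be a power of two. */
static uint64_t rel_segment_size = 131072;  /* blocks per segment: 1GB at 8KB blocks */
static int      rel_segment_shift = 17;     /* log2(rel_segment_size) */

/* Which segment file holds this block?  Was blockno / rel_segment_size. */
static inline BlockNumber
blockno_to_segno(BlockNumber blockno)
{
    return blockno >> rel_segment_shift;
}

/* Block's index within its segment.  Was blockno % rel_segment_size. */
static inline BlockNumber
blockno_within_segment(BlockNumber blockno)
{
    return blockno & (rel_segment_size - 1);
}

/* Byte offset within the segment file to seek to for this block. */
static inline uint64_t
blockno_to_seekpos(BlockNumber blockno, uint64_t block_size)
{
    return (uint64_t) blockno_within_segment(blockno) * block_size;
}
```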
Attachment
On Sun, May 28, 2023 at 2:48 AM Thomas Munro <thomas.munro@gmail.com> wrote: > (you'd need over 2 billion > directories ... directory *entries* (segment files), I meant to write there.
On 28.05.23 02:48, Thomas Munro wrote: > Another potential option name would be --segsize, if we think we're > going to use this for temp files too eventually. > > Maybe it's not so beautiful to have that global variable > rel_segment_size (which replaces REL_SEGSIZE everywhere). Another > idea would be to make it static in md.c and call smgrsetsegmentsize(), > or something like that. I think one way to look at this is that the segment size is a configuration property of the md.c smgr. I have been thinking a bit about how smgr-level configuration could look. You can't use a catalog table, but we also can't have smgr plugins get space in pg_control. Anyway, I'm not asking you to design this now. A global variable via pg_control seems fine for now. But it wouldn't be an smgr API call, I think.
On 5/28/23 08:48, Thomas Munro wrote:
> Alright, since I had some time to kill in an airport, here is a
> starter patch for initdb --rel-segsize.

I've gone through this patch and it looks pretty good to me. A few things:

+ * rel_setment_size, we will truncate the K+1st segment to 0 length

rel_setment_size -> rel_segment_size

+ * We used a phony GUC with a custome show function, because we don't

custome -> custom

+ if (strcmp(endptr, "kB") == 0)

Why kB here instead of KB to match MB, GB, TB below?

+ int64 relseg_size; /* blocks per segment of large relation */

This will require PG_CONTROL_VERSION to be bumped -- but you are probably waiting until commit time to avoid annoying conflicts, though I don't think it is as likely as with CATALOG_VERSION_NO.

> Some random thoughts:
>
> Another potential option name would be --segsize, if we think we're
> going to use this for temp files too eventually.

I feel like temp file segsize should be separately configurable, for the same reason that we are leaving it as 1GB for now.

> Maybe it's not so beautiful to have that global variable
> rel_segment_size (which replaces REL_SEGSIZE everywhere).

Maybe not, but it is the way these things are done in general, e.g. wal_segment_size, so I don't think it will be too controversial.

> Another idea would be to make it static in md.c and call smgrsetsegmentsize(),
> or something like that. That could be a nice place to compute the
> "shift" value up front, instead of computing it each time in
> blockno_to_segno(), but that's probably not worth bothering with (?).
> BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
> about the only place where someone could say that this change makes
> things worse for people not interested in the new feature, so I was
> careful to get rid of / and % operations with no-longer-constant RHS.
Right -- not sure we should be troubling ourselves with trying to optimize away ops that are very fast, unless they are computed trillions of times. > I had to promote segment size to int64 (global variable, field in > control file), because otherwise it couldn't represent > --rel-segsize=32TB (it'd be too big by one). Other ideas would be to > store the shift value instead of the size, or store the max block > number, eg subtract one, or use InvalidBlockNumber to mean "no limit" > (with more branches to test for it). The only problem I ran into with > the larger type was that 'SHOW segment_size' now needs a custom show > function because we don't have int64 GUCs. A custom show function seems like a reasonable solution here. > A C type confusion problem that I noticed: some code uses BlockNumber > and some code uses int for segment numbers. It's not really a > reachable problem for practical reasons (you'd need over 2 billion > directories and VFDs to reach it), but it's wrong to use int if > segment size can be set as low as BLCKSZ (one file per block); you > could have more segments than an int can represent. We could go for > uint32, BlockNumber or create SegmentNumber (which I think I've > proposed before, and lost track of...). We can address that > separately (perhaps by finding my old patch...) I think addressing this separately is fine, though maybe enforcing some reasonable minimum in initdb would be a good idea for this patch. For my 2c SEGSIZE == BLOCKSZ just makes very little sense. Lastly, I think the blockno_to_segno(), blockno_within_segment(), and blockno_to_seekpos() functions add enough readability that they should be committed regardless of how this patch proceeds. Regards, -David
On Mon, Jun 12, 2023 at 8:53 PM David Steele <david@pgmasters.net> wrote:
> + if (strcmp(endptr, "kB") == 0)
>
> Why kB here instead of KB to match MB, GB, TB below?

Those are SI prefixes[1], and we use kB elsewhere too. ("K" was used for kelvins, so they went with "k" for kilo. Obviously these aren't fully SI, because B is supposed to mean bel. A gigabel would be pretty loud... more than "sufficient power to create a black hole"[2], hehe.)

> + int64 relseg_size; /* blocks per segment of large relation */
>
> This will require PG_CONTROL_VERSION to be bumped -- but you are
> probably waiting until commit time to avoid annoying conflicts, though I
> don't think it is as likely as with CATALOG_VERSION_NO.

Oh yeah, thanks.

> > Another
> > idea would be to make it static in md.c and call smgrsetsegmentsize(),
> > or something like that. That could be a nice place to compute the
> > "shift" value up front, instead of computing it each time in
> > blockno_to_segno(), but that's probably not worth bothering with (?).
> > BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
> > about the only place where someone could say that this change makes
> > things worse for people not interested in the new feature, so I was
> > careful to get rid of / and % operations with no-longer-constant RHS.
>
> Right -- not sure we should be troubling ourselves with trying to
> optimize away ops that are very fast, unless they are computed trillions
> of times.

This obviously has some things in common with David Christensen's nearby patch for block sizes[3], and we should be shifting and masking there too if that route is taken (as opposed to a specialise-the-code route or something else). My binary-log trick is probably a little too cute though... I should probably just go and set a shift variable.

Thanks for looking!
[1] https://en.wikipedia.org/wiki/Metric_prefix [2] https://en.wiktionary.org/wiki/gigabel [3] https://www.postgresql.org/message-id/flat/CAOxo6XKx7DyDgBkWwPfnGSXQYNLpNrSWtYnK6-1u%2BQHUwRa1Gg%40mail.gmail.com
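(Setting a shift variable once could amount to something like this hypothetical helper -- not from the patch -- which computes log2 of the power-of-two segment size with a count-trailing-zeros builtin, with a loop fallback for compilers that lack it.)

```c
#include <stdint.h>

/*
 * Return log2(size) for a power-of-two size, i.e. the shift that turns
 * division by size into a right shift and modulo into a mask.
 */
static int
segment_size_to_shift(uint64_t size)
{
#if defined(__GNUC__) || defined(__clang__)
    /* A power of two has exactly one set bit; count the zeros below it. */
    return __builtin_ctzll(size);
#else
    int         shift = 0;

    while ((size >>= 1) != 0)
        shift++;
    return shift;
#endif
}
```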
Rebased. I had intended to try to get this into v17, but a couple of unresolved problems came up while rebasing over the new incremental backup stuff. You snooze, you lose. Hopefully we can sort these out in time for the next commitfest: * should pg_combinebasebackup read the control file to fetch the segment size? * hunt for other segment-size related problems that may be lurking in new incremental backup stuff * basebackup_incremental.c wants to use memory in proportion to segment size, which looks like a problem, and I wrote about that in a new thread[1] [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com
Attachment
On 06.03.24 22:54, Thomas Munro wrote:
> Rebased. I had intended to try to get this into v17, but a couple of
> unresolved problems came up while rebasing over the new incremental
> backup stuff. You snooze, you lose. Hopefully we can sort these out
> in time for the next commitfest:
>
> * should pg_combinebasebackup read the control file to fetch the segment size?
> * hunt for other segment-size related problems that may be lurking in
> new incremental backup stuff
> * basebackup_incremental.c wants to use memory in proportion to
> segment size, which looks like a problem, and I wrote about that in a
> new thread[1]

Overall, I like this idea, and the patch seems to have many bases covered.

The patch will need a rebase. I was able to test it on master@{2024-03-13}, but after that there are conflicts.

In .cirrus.tasks.yml, one of the test tasks uses --with-segsize-blocks=6, but you are removing that option. You could replace that with something like

PG_TEST_INITDB_EXTRA_OPTS='--rel-segsize=48kB'

But that won't work exactly because

initdb: error: argument of --rel-segsize must be a power of two

I suppose that's ok as a change, since it makes the arithmetic more efficient. But maybe it should be called out explicitly in the commit message.

If I run it with 64kB, the test pgbench/001_pgbench_with_server fails consistently, so it seems there is still a gap somewhere.

A minor point: the initdb error message

initdb: error: argument of --rel-segsize must be a multiple of BLCKSZ

would be friendlier if it actually showed the value of the block size instead of just the symbol. Similarly for the nearby error message about the off_t size.

In the control file, all the other fields use unsigned types. Should relseg_size be uint64?

PG_CONTROL_VERSION needs to be changed.
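(For illustration, the two rules those error messages describe -- a multiple of the block size, and a power of two -- boil down to something like this hypothetical check, which is not the patch's actual code. It also shows why 48kB is rejected: 49152 bytes is six 8KB blocks, so it passes the multiple-of-BLCKSZ test but is not a power of two.)

```c
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192             /* assuming the default block size */

/*
 * Validate a proposed --rel-segsize value (in bytes).  It must be a
 * multiple of the block size, and a power of two so that segment
 * arithmetic can use shifts and masks instead of division and modulo.
 */
static bool
rel_segsize_is_valid(uint64_t segsize)
{
    if (segsize == 0 || segsize % BLCKSZ != 0)
        return false;
    /* A power of two has no bits in common with its predecessor. */
    return (segsize & (segsize - 1)) == 0;
}
```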