Re: Large files for relations - Mailing list pgsql-hackers

From MARK CALLAGHAN
Subject Re: Large files for relations
Msg-id CAFbpF8O2BAyyn0gifSNfrdfUdvjf0vergwKUh9osG-O-W+4_pg@mail.gmail.com
In response to Re: Large files for relations  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Repeating what was mentioned on Twitter, because I have some experience with the topic: with fewer files per table there will be more contention on the per-inode mutex (which might now be a per-inode rwsem). I haven't read filesystem source in a long time, but back in the day, and perhaps still today, that lock was held within the kernel for the duration of a write to storage and only briefly while setting up a read.

The workaround for writes was one of:
1) enable the disk write cache or use battery-backed HW RAID to make writes faster (yes, disks; I encountered this prior to 2010)
2) use XFS with O_DIRECT, in which case the per-inode mutex (rwsem) wasn't held for the duration of a write

I have a vague memory that filesystems have improved in this regard.
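
For readers who have not used it, here is a minimal sketch of what an O_DIRECT write looks like from user space on Linux. It only shows the API shape (aligned buffer, aligned offset, open flag); it does not demonstrate or measure the locking behaviour described above. The function name, the 4096-byte alignment, and the fallback to a buffered write when O_DIRECT is rejected are all assumptions made for the illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* asks glibc to expose O_DIRECT */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0          /* not exposed here: degrade to a buffered write */
#endif

#define SKETCH_ALIGN 4096   /* assumed alignment; the real requirement varies */

/*
 * Hypothetical helper: write one SKETCH_ALIGN-sized block of `fill` bytes at
 * `offset`, trying O_DIRECT first and retrying buffered if that is rejected.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t sketch_write_block(const char *path, off_t offset, int fill)
{
    void   *buf = NULL;
    ssize_t n = -1;
    int     flags[2] = { O_WRONLY | O_CREAT | O_DIRECT, O_WRONLY | O_CREAT };

    assert(offset % SKETCH_ALIGN == 0);     /* O_DIRECT wants aligned offsets */
    if (posix_memalign(&buf, SKETCH_ALIGN, SKETCH_ALIGN) != 0)
        return -1;                          /* could not get an aligned buffer */
    memset(buf, fill, SKETCH_ALIGN);

    for (int i = 0; i < 2 && n < 0; i++) {
        int fd = open(path, flags[i], 0600);
        if (fd < 0)
            continue;                       /* e.g. EINVAL: no O_DIRECT support */
        n = pwrite(fd, buf, SKETCH_ALIGN, offset);
        close(fd);
    }
    free(buf);
    return n;
}
```

A caller would use it as sketch_write_block("/path/to/file", 0, 0xAB). Whether the per-inode lock is held across such a write depends on the filesystem and kernel version, which is the point being made above.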


On Thu, May 11, 2023 at 4:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
> On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I am not aware of any modern/non-historic filesystem[2] that can't do
>> large files with ease.  Anyone know of anything to worry about on that
>> front?
>
> There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with say a 20TB table running on ext4 be impacted by this change?

Hrmph.  Yeah, that might be a bit of a problem.  I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode like DB2, Oracle etc, in which case I think you can have multiple
16TB segment files and get past that ext4 limit).  It's frustrating
because 16TB is still really, really big and you probably should be
using partitions, or more partitions, to avoid all kinds of other
scalability problems at that size.  But however hypothetical the
scenario might be, it should work, and this is certainly a plausible
argument against the "aggressive" plan described above with the hard
cut-off where we get to drop the segmented mode.
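
The arithmetic behind the "cut the max table size in half" concern can be sketched as follows. The figures are assumptions for the illustration: the default 8 KiB page size, 32-bit block numbers (giving roughly 32 TiB per relation), an ext4 per-file limit of 16 TiB with 4 KiB filesystem blocks, and binary units throughout. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TIB (UINT64_C(1) << 40)

/* Assumed figures: 8 KiB pages, 32-bit block numbers, ext4 with 4 KiB blocks. */
#define SKETCH_BLCKSZ         UINT64_C(8192)
#define SKETCH_MAX_BLOCKS     (UINT64_C(1) << 32)
#define SKETCH_EXT4_MAX_FILE  (16 * TIB)

/* Approximate upper bound on the size of one relation, in bytes. */
uint64_t sketch_max_relation_bytes(void)
{
    return SKETCH_BLCKSZ * SKETCH_MAX_BLOCKS;
}

/* Would a relation of this size fit in a single file under the given limit? */
bool sketch_fits_in_one_file(uint64_t relation_bytes, uint64_t max_file_bytes)
{
    assert(max_file_bytes > 0);
    return relation_bytes <= max_file_bytes;
}
```

Under these assumptions the relation limit is twice the ext4 file limit, and sketch_fits_in_one_file(20 * TIB, SKETCH_EXT4_MAX_FILE) is false, which is the 20TB case raised above.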

Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
concatenate with the above patches, so you'd have to use link or
reflink mode (you'd probably want to use that anyway, due to the
sheer volume of data to copy otherwise, since ext4 is also not capable
of block-range sharing), but then you'd be out of luck after N future
major releases, according to that plan where we start deleting the
code, so you'd need to organise some smaller partitions before that
time comes.  Or pg_upgrade to a target on xfs etc.  I wonder if a
future version of extN will increase its max file size.

A less aggressive version of the plan would be that we just keep the
segment code for the foreseeable future with no planned cut off, and
we make all of those "piggy back" transformations that I showed in the
patch set optional.  For example, I had it so that CLUSTER would
quietly convert your relation to large format, if it was still in
segmented format (might as well if you're writing all the data out
anyway, right?), but perhaps that could depend on a GUC.  Likewise for
base backup.  Etc.  Then someone concerned about hitting the 16TB
limit on ext4 could opt out.  Or something like that.  It seems funny,
though: that's exactly the user who should want this feature (they
have 16,000 relation segment files).
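
To make the "16,000 relation segment files" figure concrete, here is a rough sketch of the segmented layout's arithmetic, assuming the default build settings of 8 KiB blocks and 1 GiB segments, with segment files named after the relfilenode plus a numeric suffix. The function and type names are hypothetical and are not taken from the PostgreSQL source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_BLCKSZ       8192u     /* assumed default block size */
#define SKETCH_RELSEG_SIZE  131072u   /* blocks per segment: 1 GiB / 8 KiB */

typedef struct {
    uint32_t segno;     /* which segment file holds the block */
    uint64_t offset;    /* byte offset of the block within that file */
} SketchBlockLoc;

/* Number of segment files needed for a relation of the given size. */
uint64_t sketch_segment_count(uint64_t relation_bytes)
{
    uint64_t seg_bytes = (uint64_t) SKETCH_BLCKSZ * SKETCH_RELSEG_SIZE;
    return (relation_bytes + seg_bytes - 1) / seg_bytes;   /* round up */
}

/* Map a block number to its segment file and the offset inside it. */
SketchBlockLoc sketch_locate_block(uint32_t blkno)
{
    SketchBlockLoc loc;
    loc.segno = blkno / SKETCH_RELSEG_SIZE;
    loc.offset = (uint64_t) (blkno % SKETCH_RELSEG_SIZE) * SKETCH_BLCKSZ;
    return loc;
}

/* Segment 0 is "<relfilenode>"; later segments are "<relfilenode>.<N>". */
int sketch_segment_name(char *dst, size_t len, unsigned relfilenode, uint32_t segno)
{
    assert(dst != NULL && len > 0);
    if (segno == 0)
        return snprintf(dst, len, "%u", relfilenode);
    return snprintf(dst, len, "%u.%u", relfilenode, (unsigned) segno);
}
```

With these assumptions a 16 TiB relation needs 16384 segment files and a 20 TiB one needs 20480; in large-file format each would be a single file.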




--
Mark Callaghan
mdcallag@gmail.com
