Thread: Large files for relations
Big PostgreSQL databases use, and regularly open/close, huge numbers of file descriptors and directory entries for various anachronistic reasons, one of which is the 1GB RELSEG_SIZE thing. The segment management code is trickier than you might think and also still harbours known bugs. A nearby analysis of yet another obscure segment life cycle bug reminded me of this patch set to switch to simple large files and eventually drop all that.

I originally meant to develop the attached sketch-quality code further and try proposing it in the 16 cycle, while I was down the modernisation rabbit hole[1], but then I got side-tracked: at some point I believed that the 56 bit relfilenode thing might be necessary for correctness, but then I found a set of rules that seem to hold up without it. I figured I might as well post what I have early in the 17 cycle as a "concept" patch to see which way the flames blow.

There are various boring details due to Windows, then a load of fairly obvious changes, and then a whole can of worms about how we'd handle the transition for the world's fleet of existing databases. I'll cut straight to that part. Different choices on aggressiveness could be made, but here are the straw-man answers I came up with so far:

1. All new relations would be in large format only. No 16384.N files, just 16384, which can grow to MaxBlockNumber * BLCKSZ.

2. The existence of a file 16384.1 means that this smgr relation is in the legacy segmented format that came from pg_upgrade (note that we don't unlink that file once it exists, even when truncating the fork, until we eventually drop the relation).

3. Forks that were pg_upgrade'd from earlier releases using hard links or reflinks would implicitly be in large format if they only had one segment; otherwise they could stay in the traditional format for a grace period of N major releases, after which we'd plan to drop segment support.
pg_upgrade's [ref]link mode would therefore be the only way to get a segmented relation, other than a developer-only trick for testing/debugging.

4. Every opportunity to convert a multi-segment fork to large format would be taken: pg_upgrade in copy mode, basebackup, COPY DATABASE, VACUUM FULL, TRUNCATE, etc. You can see approximately working sketch versions of all the cases I thought of so far in the attached.

5. The main places that do file-level copying of relations would use copy_file_range() to do the splicing, so that on file systems that are smart enough (XFS, ZFS, BTRFS, ...), with qualifying source and destination, the operation can be very fast; other degrees of optimisation are available to the kernel too, even for file systems without block sharing magic (pushing down block range copies to hardware/network storage, etc). The copy_file_range() stuff could also be proposed independently (I vaguely recall it was discussed a few times before); it's just that it really comes into its own when you start splicing files together, as needed here, and it's also been adopted by FreeBSD with the same interface as Linux, with an efficient implementation in bleeding edge ZFS there.

Stepping back, the main ideas are: (1) for some users of large databases, the conversion would be done painlessly at upgrade time, without their even really noticing, using modern file system facilities where possible for speed; (2) anyone who wants to defer it, because of a lack of fast copy_file_range() and a desire to avoid prolonged downtime by using links or reflinks, can put concatenation off for the next N releases, giving a total of 5 + N years of option to defer the work, and in that case there are also many ways to proactively change to large format before the time comes, with varying degrees of granularity and disruption. For example, set up a new replica and fail over, or VACUUM FULL tables one at a time, etc.
There are plenty of things left to do in this patch set: pg_rewind doesn't understand optional segmentation yet, there are probably more things like that, and I expect there are some ssize_t vs pgoff_t confusions I missed that could bite a 32 bit system. But you can see the basics working on a typical system.

I am not aware of any modern/non-historic filesystem[2] that can't do large files with ease. Anyone know of anything to worry about on that front? I think the main collateral damage would be weird old external tools, like some weird old version of Windows tar I occasionally see mentioned, that sort of thing, but that'd just be another case of "well don't use that then", I guess? What else might we need to think about, outside PostgreSQL? What other problems might occur inside PostgreSQL?

Clearly we'd need to figure out a decent strategy to automate testing of all of the relevant transitions. We could test the splicing code paths with an optional test suite that you might enable along with a small segment size (as we're already testing on CI and probably the buildfarm after the last round of segmentation bugs). To test the messy Windows off_t API stuff convincingly, we'd need actual > 4GB files, I think? Maybe doable cheaply with file system hole punching tricks.

Speaking of file system holes, this patch set doesn't touch buffile.c. That code wants to use segments for two extra purposes: (1) parallel CREATE INDEX merges workers' output using segmentation tricks as if there were holes in the file; this could perhaps be replaced with large files that make use of actual OS-level holes, but I didn't feel like additionally claiming that all computers have sparse files -- perhaps another approach is needed anyway; (2) buffile.c deliberately spreads large buffiles across multiple temporary tablespaces using segments, supposedly for space management reasons.
So although it initially looks like a nice safe little place to start using large files, we'd need an answer to those design choices first.

/me dons flameproof suit and goes back to working on LLVM problems for a while

[1] https://wiki.postgresql.org/wiki/AllComputers
[2] https://en.wikipedia.org/wiki/Comparison_of_file_systems
Attachment
- 0001-Assert-that-pgoff_t-is-wide-enough.patch
- 0002-Use-pgoff_t-in-system-call-replacements-on-Windows.patch
- 0003-Support-large-files-on-Windows-in-our-VFD-API.patch
- 0004-Use-pgoff_t-instead-of-off_t-in-more-places.patch
- 0005-Use-large-files-for-relation-storage.patch
- 0006-Detect-copy_file_range-function.patch
- 0007-Use-copy_file_range-to-implement-copy_file.patch
- 0008-Teach-copy_file-to-concatenate-segmented-files.patch
- 0009-Use-copy_file_range-in-pg_upgrade.patch
- 0010-Teach-pg_upgrade-to-concatenate-segmented-files.patch
- 0011-Teach-basebackup-to-concatenate-segmented-files.patch
Hi
I like this patch - it can save some system resources - though I am not sure how much, because bigger tables usually use partitioning.
Important note - this feature breaks sharing of files on the backup side - so before dropping the 1GB segment files, this issue should be solved.
Regards
Pavel
On Tue, May 2, 2023 at 3:28 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:

> I like this patch - it can save some system resources - I am not sure how much, because bigger tables usually use partitioning.

Yeah, if you only use partitions of < 1GB it won't make a difference. Larger partitions are not uncommon, though.

> Important note - this feature breaks sharing files on the backup side - so before disabling 1GB sized files, this issue should be solved.

Hmm, right, so there is a backup granularity continuum with "whole database cluster" at one end, "only files whose size, mtime [or optionally also checksum] changed since last backup" in the middle, and "only blocks that changed since the LSN of the last backup" at the other end. Getting closer to the right end of that continuum can make backups require less reading, less network transfer, less writing and/or less storage space, depending on details. But this proposal moves the middle option further to the left, by changing the granularity from 1GB to whole relation, which can be gargantuan with this patch. Ultimately we need to be all the way at the right of that continuum, and there are clearly several people working on that goal. I'm not involved in any of those projects, but it's fun to think about an alien technology that produces complete standalone backups like rsync --link-dest (as opposed to "full" backups followed by a chain of "incremental" backups that depend on them, so you need to retain them carefully) while still sharing disk blocks with older backups, and doing so with block granularity. TL;DW something something WAL something something copy_file_range().
On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:

> rsync --link-dest

I wonder if rsync will grow a mode that can use copy_file_range() to share blocks with a reference file (= previous backup). Something like --copy-range-dest. That'd work for large-file relations (assuming a file system that has block sharing, like XFS and ZFS). You wouldn't get the "mtime is enough, I don't even need to read the bytes" optimisation, which I assume makes all database hackers feel a bit queasy anyway, but you'd get the space savings via the usual rolling checksum, or a cheaper version that only looks for strong checksum matches at the same offset, or whatever other tricks rsync might have up its sleeve.
On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> rsync --link-dest
I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup). Something
like --copy-range-dest. That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.
I understand the need to reduce open file handles, despite the possibilities enabled by using large numbers of small files. Snowflake, for instance, sees everything in 1MB chunks, which makes massively parallel sequential scans (Snowflake's _only_ query plan) possible, though I don't know whether they accomplish that via separate files or via segments within a large file.
I am curious whether a move like this to create a generational change in file format shouldn't be more ambitious, perhaps altering the block format to insert a block format version number, whether that be at every block, every megabyte, or some other interval, and whether we store it in-file or in a separate file accompanying the first non-segmented one. Having such versioning information would allow blocks of different formats to co-exist in the same table, which could be critical to future changes such as 64 bit XIDs, etc.
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > rsync --link-dest

... rsync isn't really a safe tool to use for PG backups by itself unless you're using it with archiving and with start/stop backup and with checksums enabled.

> > I wonder if rsync will grow a mode that can use copy_file_range() to
> > share blocks with a reference file (= previous backup). [...]

There are also really good reasons to have multiple full backups, and not just a single full backup and then lots and lots of incrementals, which basically boils down to "are you really sure that one copy of that one really important file won't ever disappear from your backup repository..?"

That said, pgbackrest does now have block-level incremental backups (where we define our own block size ...), and there are reasons we decided against going down the LSN-based approach (not the least of which is that the LSN isn't always updated...). But long story short, moving to larger than 1G files should be something that pgbackrest will be able to handle without as much impact as there would have been previously in terms of incremental backups.

There is a loss in the ability to use mtime to scan just the parts of the relation that changed, and that's unfortunate, but I wouldn't see it as really a game changer (and yes, there's certainly an argument for not trusting mtime, though I don't think we've yet had a report of an mtime issue that our mtime-validity checking didn't catch and force pgbackrest into checksum-based revalidation automatically, resulting in an invalid backup... of course, not enough people test their backups...).

> I understand the need to reduce open file handles, despite the
> possibilities enabled by using large numbers of small files.

I'm also generally in favor of reducing the number of open file handles that we have to deal with. Addressing the concerns raised nearby about weird corner cases, such as non-1G length ABCDEF.1 files existing while ABCDEF.2, and later, files exist, is certainly another good argument in favor of getting rid of segments.

> I am curious whether a move like this to create a generational change in
> file format shouldn't be more ambitious, perhaps altering the block
> format to insert a block format version number [...]

To the extent you're interested in this, there are patches posted which are already trying to move us in a direction that would allow for different page formats, adding in space for other features such as 64bit XIDs, better checksums, and TDE tags:

https://commitfest.postgresql.org/43/3986/

Currently those patches expect the page format to be declared at initdb time, but the way they're currently written that's more of a soft requirement, since you can tell on a per-page basis what features are enabled for that page. It might make sense to support it in that form first anyway, before going down the more ambitious route of allowing different pages to have different sets of features enabled concurrently.

When it comes to 'a separate file', we do have forks already, and those serve a very valuable but distinct use case where you can get information from the much smaller fork (be it the FSM or the VM or some future thing), while something like 64bit XIDs or a stronger checksum is something you'd really need on every page. I have serious doubts about a proposal where we'd store information needed on every page read in some far-away block that's still in the same file, such as one every 1MB, as that would turn every block access into two.

Thanks,

Stephen
On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?

There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with, say, a 20TB table running on ext4 be impacted by this change?
On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
> On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I am not aware of any modern/non-historic filesystem[2] that can't do
>> large files with ease. Anyone know of anything to worry about on that
>> front?
>
> There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with say a 20TB table running on ext4 be impacted by this change?

Hrmph. Yeah, that might be a bit of a problem. I see it discussed in various places that MySQL/InnoDB can't have tables bigger than 16TB on ext4 because of this, when it's in its default one-file-per-object mode (as opposed to its big-tablespace-files-to-hold-all-the-objects mode like DB2, Oracle etc, in which case I think you can have multiple 16TB segment files and get past that ext4 limit). It's frustrating, because 16TB is still really, really big and you probably should be using partitions, or more partitions, to avoid all kinds of other scalability problems at that size. But however hypothetical the scenario might be, it should work, and this is certainly a plausible argument against the "aggressive" plan described above with the hard cut-off where we get to drop the segmented mode.

Concretely, a 20TB pg_upgrade in copy mode would fail while trying to concatenate with the above patches, so you'd have to use link or reflink mode (you'd probably want to use that anyway due to the sheer volume of data to copy otherwise, since ext4 is also not capable of block-range sharing), but then you'd be out of luck after N future major releases, according to that plan where we start deleting the code, so you'd need to organise some smaller partitions before that time comes. Or pg_upgrade to a target on xfs etc.
I wonder if a future version of extN will increase its max file size.

A less aggressive version of the plan would be that we just keep the segment code for the foreseeable future with no planned cut-off, and make all of those "piggyback" transformations that I showed in the patch set optional. For example, I had it so that CLUSTER would quietly convert your relation to large format if it was still in segmented format (might as well if you're writing all the data out anyway, right?), but perhaps that could depend on a GUC. Likewise for base backup. Etc. Then someone concerned about hitting the 16TB limit on ext4 could opt out. Or something like that. It seems funny, though: that's exactly the user who should want this feature (they have 16,000 relation segment files).
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
>> There is some trouble in the ambiguity of what we mean by "modern" and
>> "large files". There are still a large number of users of ext4 where
>> the max file size is 16TB. […]
>
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional. […] Then someone concerned about hitting the 16TB
> limit on ext4 could opt out. Or something like that. It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).

If we're going to have to keep the segment code for the foreseeable future anyway, could we not get most of the benefit by increasing the segment size to something like 1TB? The vast majority of tables would fit in one file, and there would be less risk of hitting filesystem limits.

- ilmari
On Thu, May 11, 2023 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
> On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I am not aware of any modern/non-historic filesystem[2] that can't do
>> large files with ease. Anyone know of anything to worry about on that
>> front?
>
> There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with say a 20TB table running on ext4 be impacted by this change?
Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode like DB2, Oracle etc, in which case I think you can have multiple
16TB segment files and get past that ext4 limit). It's frustrating
because 16TB is still really, really big and you probably should be
using partitions, or more partitions, to avoid all kinds of other
scalability problems at that size. But however hypothetical the
scenario might be, it should work,
Agreed, it is frustrating, but it is not hypothetical. I have seen a number of
users with single tables larger than 16TB who don't use partitioning because
of the limitations we have today. The most common reason is needing multiple
unique constraints on the table that don't include the partition key. Something
like a user_id and email. There are workarounds for those cases, but usually
it's easier to deal with a single large table than to deal with the sharp edges
those workarounds introduce.
Greetings,

* Dagfinn Ilmari Mannsåker (ilmari@ilmari.org) wrote:
> If we're going to have to keep the segment code for the foreseeable
> future anyway, could we not get most of the benefit by increasing the
> segment size to something like 1TB? The vast majority of tables would
> fit in one file, and there would be less risk of hitting filesystem
> limits.

While I tend to agree that 1GB is too small, 1TB seems like it's possibly going to end up on the too-big side of things; or at least, if we aren't getting rid of the segment code, then it's possibly throwing away the benefits we have from the smaller segments without really giving us all that much. Going from 1G to 10G would reduce the number of open file descriptors by quite a lot without having much of a net change on other things. 50G or 100G would reduce the FD handles further but starts to make us lose out a bit more on some of the nice parts of having multiple segments. Just some thoughts.

Thanks,

Stephen
Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
I have a vague memory that filesystems have improved in this regard.
Mark Callaghan
mdcallag@gmail.com
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.

(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Right, 80s file systems like UFS (and I suspect ext and ext2, which were probably based on similar ideas and ran on non-SMP machines?) used coarse-grained locking, including at the vnode/inode level. Then over time various OSes and file systems have improved concurrency.

Brief digression, as someone who got started on IRIX in the 90s and still thinks those were probably the coolest computers: at SGI, first they replaced SysV UFS with EFS (E for extent-based allocation) and invented O_DIRECT to skip the buffer pool, and then blew the doors off everything with XFS, which maximised I/O concurrency and possibly (I guess, it's not open source so who knows?) involved a revamped VFS with weaker inode locking, motivated by monster IRIX boxes with up to 1024 CPUs and huge storage arrays. In the Linux ext3 era, I remember hearing lots of reports of various kinds of large systems going faster just by switching to XFS, and there is lots of writing about that. ext4 certainly changed enormously. One reason back in those days (mid 2000s?)
was the old fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file thing, and another was the lack of write concurrency, especially for direct I/O, and probably lots more things. But that's all ancient history...

As for ext4, we've detected and debugged clues about the gradual weakening of locking over time on this list: we know that concurrent read/write to the same page of a file was previously atomic, but when we switched to pread/pwrite for most data (ie not making use of the current file position), it ceased to be (a concurrent reader can see a mash-up of old and new data, with visible cache-line-ish stripes in it, so there isn't even a write lock for the page); then we noticed that in later kernels even read/write ceased to be atomic (implicating a change in file size/file position interlocking, I guess). I also vaguely recall reading on here a long time ago that lseek() performance was dramatically improved with weaker inode interlocking, perhaps even in response to this very program's pathological SEEK_END call frequency (something I hope to fix, but I digress). So I think it's possible that the effect you mentioned is gone?

I can think of a few differences compared to those other RDBMSs. There the discussion was about one-file-per-relation vs one-big-file-for-everything, whereas we're talking about one-file-per-relation vs many-files-per-relation (which doesn't change the point much, just making clear that I'm not proposing a 42PB file to hold everything, so you can still partition to get different files). We also usually call fsync in series in our checkpointer (after first getting the writebacks started with sync_file_range() some time sooner). Currently our code believes that it is not safe to call fdatasync() for files whose size might have changed.
There is no basis for that in POSIX or in any system that I currently know of (though I haven't looked into it seriously), but I believe there was a historical file system that at some point in history interpreted "non-essential meta data" (the stuff POSIX allows it not to flush to disk) to include "the size of the file" (whereas POSIX really just meant that you don't have to synchronise the mtime and similar), which is probably why PostgreSQL has some code that calls fsync() on newly created empty WAL segments to "make sure the indirect blocks are down on disk" before allowing itself to use only fdatasync() later to overwrite it with data. The point being that, for the most important kind of interactive/user facing I/O latency, namely WAL flushes, we already use fdatasync(). It's possible that we could use it to flush relation data too (ie the relation files in question here, usually synchronised by the checkpointer) according to POSIX but it doesn't immediately seem like something that should be at all hot and it's background work. But perhaps I lack imagination. Thanks, thought-provoking stuff.
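(To make that pattern concrete, here is a minimal standalone sketch -- not PostgreSQL's actual code, and the path/sizes are invented for illustration: fsync() once when the zero-filled file is created so its size and allocation metadata are durable, then use the cheaper fdatasync() for later same-size overwrites.)

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a zero-filled "segment" file and fsync() it once, making its
 * size and block allocations durable.  Returns 0 on success.
 */
static int
create_prezeroed_segment(const char *path, size_t size)
{
    char        buf[8192] = {0};
    int         fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    for (size_t written = 0; written < size; written += sizeof(buf))
    {
        size_t      n = size - written < sizeof(buf) ? size - written : sizeof(buf);

        if (write(fd, buf, n) != (ssize_t) n)
        {
            close(fd);
            return -1;
        }
    }
    /* The one full fsync(): flushes the data *and* the file metadata. */
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Overwrite data inside the existing file.  The size doesn't change, so
 * fdatasync() suffices: it must flush the data, and may skip
 * non-essential metadata such as the mtime.
 */
static int
overwrite_and_fdatasync(const char *path, const char *data, size_t len, off_t offset)
{
    int         fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    if (pwrite(fd, data, len, offset) != (ssize_t) len || fdatasync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```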
On Sat, May 13, 2023 at 11:01 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote: > > use XFS and O_DIRECT As for direct I/O, we're only just getting started on that. We currently can't produce more than one concurrent WAL write, and then for relation data, we just got very basic direct I/O support but we haven't yet got the asynchronous machinery to drive it properly (work in progress, more soon). I was just now trying to find out what the state of parallel direct writes is in ext4, and it looks like it's finally happening: https://www.phoronix.com/news/Linux-6.3-EXT4
On Fri, May 12, 2023 at 4:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.
(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)
Yes, although when the decision was made it was probably ext-3 -> XFS. We suffered from fsync a file == fsync the filesystem
because MySQL binlogs use buffered IO and are appended on write. Switching from ext-? to XFS was an easy perf win
so I don't have much experience with ext-? over the past decade.
Right, 80s file systems like UFS (and I suspect ext and ext2, which
The late 80s is when I last hacked on Unix filesystem code, excluding browsing XFS and ext source. Unix was easy back then -- one big kernel lock covered everything.
some time sooner). Currently our code believes that it is not safe to
call fdatasync() for files whose size might have changed. There is no
Long ago we added code for InnoDB to avoid fsync/fdatasync in some cases when O_DIRECT was used. While great for performance
we also forgot to make sure they were still done when files were extended. Eventually we fixed that.
Thanks for all of the details.
Mark Callaghan
mdcallag@gmail.com
On Fri, May 12, 2023 at 9:53 AM Stephen Frost <sfrost@snowman.net> wrote: > While I tend to agree that 1GB is too small, 1TB seems like it's > possibly going to end up on the too big side of things, or at least, > if we aren't getting rid of the segment code then it's possibly throwing > away the benefits we have from the smaller segments without really > giving us all that much. Going from 1G to 10G would reduce the number > of open file descriptors by quite a lot without having much of a net > change on other things. 50G or 100G would reduce the FD handles further > but starts to make us lose out a bit more on some of the nice parts of > having multiple segments. This is my view as well, more or less. I don't really like our current handling of relation segments; we know it has bugs, and making it non-buggy feels difficult. And there are performance issues as well -- file descriptor consumption, for sure, but also probably that crossing a file boundary likely breaks the operating system's ability to do readahead to some degree. However, I think we're going to find that moving to a system where we have just one file per relation fork and that file can be arbitrarily large is not fantastic, either. Jim's point about running into filesystem limits is a good one (hi Jim, long time no see!) and the problem he points out with ext4 is almost certainly not the only one. It doesn't just have to be filesystems, either. It could be a limitation of an archiving tool (tar, zip, cpio) or a file copy utility or whatever as well. A quick Google search suggests that most such things have been updated to use 64-bit sizes, but my point is that the set of things that can potentially cause problems is broader than just the filesystem. Furthermore, even when there's no hard limit at play, a smaller file size can occasionally be *convenient*, as in Pavel's example of using hard links to share storage between backups. 
From that point of view, a 16GB or 64GB or 256GB file size limit seems more convenient than no limit and more convenient than a large limit like 1TB. However, the bugs are the flies in the ointment (ahem). If we just make the segment size bigger but don't get rid of segments altogether, then we still have to fix the bugs that can occur when you do have multiple segments. I think part of Thomas's motivation is to dodge that whole category of problems. If we gradually deprecate multi-segment mode in favor of single-file-per-relation-fork, then the fact that the segment handling code has bugs becomes progressively less relevant. While that does make some sense, I'm not sure I really agree with the approach. The problem is that we're trading problems that we at least theoretically can fix somehow by hitting our code with a big enough hammer for an unknown set of problems that stem from limitations of software we don't control, maybe don't even know about. -- Robert Haas EDB: http://www.enterprisedb.com
Thanks all for the feedback. It was a nice idea and it *almost* works, but it seems like we just can't drop segmented mode. And the automatic transition schemes I showed don't make much sense without that goal. What I'm hearing is that something simple like this might be more acceptable: * initdb --rel-segsize (cf --wal-segsize), default unchanged * pg_upgrade would convert if source and target don't match I would probably also leave out those Windows file API changes, too. --rel-segsize would simply refuse larger sizes until someone does the work on that platform, to keep the initial proposal small. I would probably leave the experimental copy_on_write() ideas out too, for separate discussion in a separate proposal.
On 24.05.23 02:34, Thomas Munro wrote: > Thanks all for the feedback. It was a nice idea and it *almost* > works, but it seems like we just can't drop segmented mode. And the > automatic transition schemes I showed don't make much sense without > that goal. > > What I'm hearing is that something simple like this might be more acceptable: > > * initdb --rel-segsize (cf --wal-segsize), default unchanged makes sense > * pg_upgrade would convert if source and target don't match This would be good, but it could also be an optional or later feature. Maybe that should be a different mode, like --copy-and-adjust-as-necessary, so that users would have to opt into what would presumably be slower than plain --copy, rather than being surprised by it, if they unwittingly used incompatible initdb options. > I would probably also leave out those Windows file API changes, too. > --rel-segsize would simply refuse larger sizes until someone does the > work on that platform, to keep the initial proposal small. Those changes from off_t to pgoff_t? Yes, it would be good to do without those. Apart of the practical problems that have been brought up, this was a major annoyance with the proposed patch set IMO. > I would probably leave the experimental copy_on_write() ideas out too, > for separate discussion in a separate proposal. right
On Wed, May 24, 2023 at 2:18 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > What I'm hearing is that something simple like this might be more acceptable: > > > > * initdb --rel-segsize (cf --wal-segsize), default unchanged > > makes sense +1. > > * pg_upgrade would convert if source and target don't match > > This would be good, but it could also be an optional or later feature. +1. I think that would be nice to have, but not absolutely required. IMHO it's best not to overcomplicate these projects. Not everything needs to be part of the initial commit. If the initial commit happens 2 months from now and then stuff like this gets added over the next 8, that's strictly better than trying to land the whole patch set next March. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote: > On 24.05.23 02:34, Thomas Munro wrote: > > Thanks all for the feedback. It was a nice idea and it *almost* > > works, but it seems like we just can't drop segmented mode. And the > > automatic transition schemes I showed don't make much sense without > > that goal. > > > > What I'm hearing is that something simple like this might be more acceptable: > > > > * initdb --rel-segsize (cf --wal-segsize), default unchanged > > makes sense Agreed, this seems alright in general. Having more initdb-time options to help with certain use-cases rather than having things be compile-time is definitely just generally speaking a good direction to be going in, imv. > > * pg_upgrade would convert if source and target don't match > > This would be good, but it could also be an optional or later feature. Agreed. > Maybe that should be a different mode, like --copy-and-adjust-as-necessary, > so that users would have to opt into what would presumably be slower than > plain --copy, rather than being surprised by it, if they unwittingly used > incompatible initdb options. I'm curious as to why it would be slower than a regular copy..? > > I would probably also leave out those Windows file API changes, too. > > --rel-segsize would simply refuse larger sizes until someone does the > > work on that platform, to keep the initial proposal small. > > Those changes from off_t to pgoff_t? Yes, it would be good to do without > those. Apart of the practical problems that have been brought up, this was > a major annoyance with the proposed patch set IMO. > > > I would probably leave the experimental copy_on_write() ideas out too, > > for separate discussion in a separate proposal. > > right You mean copy_file_range() here, right? Shouldn't we just add support for that today into pg_upgrade, independently of this? Seems like a worthwhile improvement even without the benefit it would provide to changing segment sizes. 
Thanks, Stephen
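(To make the splicing idea concrete: a rough standalone sketch of concatenating segment files with copy_file_range(), falling back to a plain read/write loop where the call isn't supported for the source/destination pair. Function names and paths are invented for illustration; this is not the patch's code, and a real version would need much more careful error handling.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append the whole of src_fd onto dst_fd, preferring copy_file_range(). */
static int
append_file(int dst_fd, int src_fd)
{
    struct stat st;
    off_t       remaining;

    if (fstat(src_fd, &st) < 0)
        return -1;
    remaining = st.st_size;
    while (remaining > 0)
    {
        ssize_t     n = copy_file_range(src_fd, NULL, dst_fd, NULL, remaining, 0);

        if (n < 0)
        {
            if (errno == EXDEV || errno == EINVAL || errno == ENOSYS)
                break;          /* fall back to plain read/write below */
            return -1;
        }
        remaining -= n;
    }
    while (remaining > 0)       /* portable fallback */
    {
        char        buf[65536];
        ssize_t     n = read(src_fd, buf, sizeof(buf));

        if (n <= 0 || write(dst_fd, buf, n) != n)
            return -1;
        remaining -= n;
    }
    return 0;
}

/* Concatenate base.1 .. base.(nsegs-1) onto the end of base. */
static int
concat_segments(const char *base, int nsegs)
{
    int         dst = open(base, O_WRONLY);

    if (dst < 0)
        return -1;
    /* Note: no O_APPEND; copy_file_range() rejects O_APPEND fds. */
    if (lseek(dst, 0, SEEK_END) < 0)
    {
        close(dst);
        return -1;
    }
    for (int i = 1; i < nsegs; i++)
    {
        char        path[4096];
        int         src;

        snprintf(path, sizeof(path), "%s.%d", base, i);
        src = open(path, O_RDONLY);
        if (src < 0 || append_file(dst, src) < 0)
        {
            if (src >= 0)
                close(src);
            close(dst);
            return -1;
        }
        close(src);
    }
    return close(dst);
}

/* Tiny self-check: build two fake 4-byte segments, splice, verify. */
static int
concat_demo(void)
{
    const char *base = "/tmp/concat_demo_seg";
    char        buf[16] = {0};
    int         fd;

    fd = open(base, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0 || write(fd, "AAAA", 4) != 4 || close(fd) != 0)
        return -1;
    fd = open("/tmp/concat_demo_seg.1", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0 || write(fd, "BBBB", 4) != 4 || close(fd) != 0)
        return -1;
    if (concat_segments(base, 2) < 0)
        return -1;
    fd = open(base, O_RDONLY);
    if (fd < 0 || read(fd, buf, sizeof(buf)) != 8 || close(fd) != 0)
        return -1;
    return strcmp(buf, "AAAABBBB") == 0 ? 0 : -1;
}
```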
Attachment
On Thu, May 25, 2023 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote: > * Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote: > > On 24.05.23 02:34, Thomas Munro wrote: > > > * pg_upgrade would convert if source and target don't match > > > > This would be good, but it could also be an optional or later feature. > > Agreed. OK. I do have a patch for that, but I'll put that (+ copy_file_range) aside for now so we can talk about the basic feature. Without that, pg_upgrade just rejects mismatching clusters as it always did, no change required. > > > I would probably also leave out those Windows file API changes, too. > > > --rel-segsize would simply refuse larger sizes until someone does the > > > work on that platform, to keep the initial proposal small. > > > > Those changes from off_t to pgoff_t? Yes, it would be good to do without > > those. Apart of the practical problems that have been brought up, this was > > a major annoyance with the proposed patch set IMO. +1, it was not nice. Alright, since I had some time to kill in an airport, here is a starter patch for initdb --rel-segsize. Some random thoughts: Another potential option name would be --segsize, if we think we're going to use this for temp files too eventually. Maybe it's not so beautiful to have that global variable rel_segment_size (which replaces REL_SEGSIZE everywhere). Another idea would be to make it static in md.c and call smgrsetsegmentsize(), or something like that. That could be a nice place to compute the "shift" value up front, instead of computing it each time in blockno_to_segno(), but that's probably not worth bothering with (?). BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's about the only place where someone could say that this change makes things worse for people not interested in the new feature, so I was careful to get rid of / and % operations with no-longer-constant RHS. 
I had to promote segment size to int64 (global variable, field in control file), because otherwise it couldn't represent --rel-segsize=32TB (it'd be too big by one). Other ideas would be to store the shift value instead of the size, or store the max block number, eg subtract one, or use InvalidBlockNumber to mean "no limit" (with more branches to test for it). The only problem I ran into with the larger type was that 'SHOW segment_size' now needs a custom show function because we don't have int64 GUCs. A C type confusion problem that I noticed: some code uses BlockNumber and some code uses int for segment numbers. It's not really a reachable problem for practical reasons (you'd need over 2 billion directories and VFDs to reach it), but it's wrong to use int if segment size can be set as low as BLCKSZ (one file per block); you could have more segments than an int can represent. We could go for uint32, BlockNumber or create SegmentNumber (which I think I've proposed before, and lost track of...). We can address that separately (perhaps by finding my old patch...)
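(For anyone following along, the shift/mask arithmetic being discussed looks something like this -- a simplified standalone sketch using the function names mentioned above, with made-up values for the globals; not the actual patch code.)

```c
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Set once at startup from the control file; must be a power of two. */
static uint64_t rel_segment_size = 131072;  /* blocks per segment: 1GB at 8KB blocks */
static int      rel_segment_shift = 17;     /* log2(rel_segment_size) */

/* Which segment file holds this block?  Was blockno / rel_segment_size. */
static inline BlockNumber
blockno_to_segno(BlockNumber blockno)
{
    return blockno >> rel_segment_shift;
}

/* Block's index within its segment.  Was blockno % rel_segment_size. */
static inline BlockNumber
blockno_within_segment(BlockNumber blockno)
{
    return blockno & (rel_segment_size - 1);
}

/* Byte offset within the segment file to seek to for this block. */
static inline uint64_t
blockno_to_seekpos(BlockNumber blockno, uint64_t block_size)
{
    return (uint64_t) blockno_within_segment(blockno) * block_size;
}
```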
Attachment
On Sun, May 28, 2023 at 2:48 AM Thomas Munro <thomas.munro@gmail.com> wrote: > (you'd need over 2 billion > directories ... directory *entries* (segment files), I meant to write there.
On 28.05.23 02:48, Thomas Munro wrote: > Another potential option name would be --segsize, if we think we're > going to use this for temp files too eventually. > > Maybe it's not so beautiful to have that global variable > rel_segment_size (which replaces REL_SEGSIZE everywhere). Another > idea would be to make it static in md.c and call smgrsetsegmentsize(), > or something like that. I think one way to look at this is that the segment size is a configuration property of the md.c smgr. I have been thinking a bit about how smgr-level configuration could look. You can't use a catalog table, but we also can't have smgr plugins get space in pg_control. Anyway, I'm not asking you to design this now. A global variable via pg_control seems fine for now. But it wouldn't be an smgr API call, I think.
On 5/28/23 08:48, Thomas Munro wrote:
> Alright, since I had some time to kill in an airport, here is a
> starter patch for initdb --rel-segsize.

I've gone through this patch and it looks pretty good to me. A few things:

+ * rel_setment_size, we will truncate the K+1st segment to 0 length

rel_setment_size -> rel_segment_size

+ * We used a phony GUC with a custome show function, because we don't

custome -> custom

+ if (strcmp(endptr, "kB") == 0)

Why kB here instead of KB to match MB, GB, TB below?

+ int64 relseg_size; /* blocks per segment of large relation */

This will require PG_CONTROL_VERSION to be bumped -- but you are probably waiting until commit time to avoid annoying conflicts, though I don't think it is as likely as with CATALOG_VERSION_NO.

> Some random thoughts:
>
> Another potential option name would be --segsize, if we think we're
> going to use this for temp files too eventually.

I feel like temp file segsize should be separately configurable, for the same reason that we are leaving it as 1GB for now.

> Maybe it's not so beautiful to have that global variable
> rel_segment_size (which replaces REL_SEGSIZE everywhere).

Maybe not, but it is the way these things are done in general, e.g. wal_segment_size, so I don't think it will be too controversial.

> Another idea would be to make it static in md.c and call smgrsetsegmentsize(),
> or something like that. That could be a nice place to compute the
> "shift" value up front, instead of computing it each time in
> blockno_to_segno(), but that's probably not worth bothering with (?).
> BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
> about the only place where someone could say that this change makes
> things worse for people not interested in the new feature, so I was
> careful to get rid of / and % operations with no-longer-constant RHS.
Right -- not sure we should be troubling ourselves with trying to optimize away ops that are very fast, unless they are computed trillions of times. > I had to promote segment size to int64 (global variable, field in > control file), because otherwise it couldn't represent > --rel-segsize=32TB (it'd be too big by one). Other ideas would be to > store the shift value instead of the size, or store the max block > number, eg subtract one, or use InvalidBlockNumber to mean "no limit" > (with more branches to test for it). The only problem I ran into with > the larger type was that 'SHOW segment_size' now needs a custom show > function because we don't have int64 GUCs. A custom show function seems like a reasonable solution here. > A C type confusion problem that I noticed: some code uses BlockNumber > and some code uses int for segment numbers. It's not really a > reachable problem for practical reasons (you'd need over 2 billion > directories and VFDs to reach it), but it's wrong to use int if > segment size can be set as low as BLCKSZ (one file per block); you > could have more segments than an int can represent. We could go for > uint32, BlockNumber or create SegmentNumber (which I think I've > proposed before, and lost track of...). We can address that > separately (perhaps by finding my old patch...) I think addressing this separately is fine, though maybe enforcing some reasonable minimum in initdb would be a good idea for this patch. For my 2c SEGSIZE == BLOCKSZ just makes very little sense. Lastly, I think the blockno_to_segno(), blockno_within_segment(), and blockno_to_seekpos() functions add enough readability that they should be committed regardless of how this patch proceeds. Regards, -David
On Mon, Jun 12, 2023 at 8:53 PM David Steele <david@pgmasters.net> wrote:
> + if (strcmp(endptr, "kB") == 0)
>
> Why kB here instead of KB to match MB, GB, TB below?

Those are SI prefixes[1], and we use kB elsewhere too. ("K" was used for kelvins, so they went with "k" for kilo. Obviously these aren't fully SI, because B is supposed to mean bel. A gigabel would be pretty loud... more than "sufficient power to create a black hole"[2], hehe.)

> + int64 relseg_size; /* blocks per segment of large relation */
>
> This will require PG_CONTROL_VERSION to be bumped -- but you are
> probably waiting until commit time to avoid annoying conflicts, though I
> don't think it is as likely as with CATALOG_VERSION_NO.

Oh yeah, thanks.

> > Another
> > idea would be to make it static in md.c and call smgrsetsegmentsize(),
> > or something like that. That could be a nice place to compute the
> > "shift" value up front, instead of computing it each time in
> > blockno_to_segno(), but that's probably not worth bothering with (?).
> > BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
> > about the only place where someone could say that this change makes
> > things worse for people not interested in the new feature, so I was
> > careful to get rid of / and % operations with no-longer-constant RHS.
>
> Right -- not sure we should be troubling ourselves with trying to
> optimize away ops that are very fast, unless they are computed trillions
> of times.

This obviously has some things in common with David Christensen's nearby patch for block sizes[3], and we should be shifting and masking there too if that route is taken (as opposed to a specialise-the-code route or something else). My binary-log trick is probably a little too cute though... I should probably just go and set a shift variable.

Thanks for looking!
[1] https://en.wikipedia.org/wiki/Metric_prefix [2] https://en.wiktionary.org/wiki/gigabel [3] https://www.postgresql.org/message-id/flat/CAOxo6XKx7DyDgBkWwPfnGSXQYNLpNrSWtYnK6-1u%2BQHUwRa1Gg%40mail.gmail.com
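(Setting a shift variable once could amount to something like this hypothetical helper -- not from the patch -- which computes log2 of the power-of-two segment size with a count-trailing-zeros builtin, with a loop fallback for compilers that lack it.)

```c
#include <stdint.h>

/*
 * Return log2(size) for a power-of-two size, i.e. the shift that turns
 * division by size into a right shift and modulo into a mask.
 */
static int
segment_size_to_shift(uint64_t size)
{
#if defined(__GNUC__) || defined(__clang__)
    /* A power of two has exactly one set bit; count the zeros below it. */
    return __builtin_ctzll(size);
#else
    int         shift = 0;

    while ((size >>= 1) != 0)
        shift++;
    return shift;
#endif
}
```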
Rebased. I had intended to try to get this into v17, but a couple of unresolved problems came up while rebasing over the new incremental backup stuff. You snooze, you lose. Hopefully we can sort these out in time for the next commitfest: * should pg_combinebasebackup read the control file to fetch the segment size? * hunt for other segment-size related problems that may be lurking in new incremental backup stuff * basebackup_incremental.c wants to use memory in proportion to segment size, which looks like a problem, and I wrote about that in a new thread[1] [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com
Attachment
On 06.03.24 22:54, Thomas Munro wrote:
> Rebased. I had intended to try to get this into v17, but a couple of
> unresolved problems came up while rebasing over the new incremental
> backup stuff. You snooze, you lose. Hopefully we can sort these out
> in time for the next commitfest:
>
> * should pg_combinebasebackup read the control file to fetch the segment size?
> * hunt for other segment-size related problems that may be lurking in
> new incremental backup stuff
> * basebackup_incremental.c wants to use memory in proportion to
> segment size, which looks like a problem, and I wrote about that in a
> new thread[1]

Overall, I like this idea, and the patch seems to have many bases covered.

The patch will need a rebase. I was able to test it on master@{2024-03-13}, but after that there are conflicts.

In .cirrus.tasks.yml, one of the test tasks uses --with-segsize-blocks=6, but you are removing that option. You could replace that with something like

PG_TEST_INITDB_EXTRA_OPTS='--rel-segsize=48kB'

But that won't work exactly because

initdb: error: argument of --rel-segsize must be a power of two

I suppose that's ok as a change, since it makes the arithmetic more efficient. But maybe it should be called out explicitly in the commit message.

If I run it with 64kB, the test pgbench/001_pgbench_with_server fails consistently, so it seems there is still a gap somewhere.

A minor point: the initdb error message

initdb: error: argument of --rel-segsize must be a multiple of BLCKSZ

would be friendlier if it actually showed the value of the block size instead of just the symbol. Similarly for the nearby error message about the off_t size.

In the control file, all the other fields use unsigned types. Should relseg_size be uint64?

PG_CONTROL_VERSION needs to be changed.
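(For illustration, the two rules those error messages describe -- a multiple of the block size, and a power of two -- boil down to something like this hypothetical check, which is not the patch's actual code. It also shows why 48kB is rejected: 49152 bytes is six 8KB blocks, so it passes the multiple-of-BLCKSZ test but is not a power of two.)

```c
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192             /* assuming the default block size */

/*
 * Validate a proposed --rel-segsize value (in bytes).  It must be a
 * multiple of the block size, and a power of two so that segment
 * arithmetic can use shifts and masks instead of division and modulo.
 */
static bool
rel_segsize_is_valid(uint64_t segsize)
{
    if (segsize == 0 || segsize % BLCKSZ != 0)
        return false;
    /* A power of two has no bits in common with its predecessor. */
    return (segsize & (segsize - 1)) == 0;
}
```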