Re: Large files for relations - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Large files for relations |
Date | |
Msg-id | CA+hUKGJsT8G_YyjUzMZaJTWyua6PbwC3TAUMv_kDS0F0vzr2Pw@mail.gmail.com |
In response to | Re: Large files for relations (MARK CALLAGHAN <mdcallag@gmail.com>) |
Responses | Re: Large files for relations, Re: Large files for relations |
List | pgsql-hackers |
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.

(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Right, 80s file systems like UFS (and I suspect ext and ext2, which were probably based on similar ideas and ran on non-SMP machines?) used coarse-grained locking, including at the vnode/inode level. Then over time various OSes and file systems have improved concurrency.

Brief digression, as someone who got started on IRIX in the 90s and still thinks those were probably the coolest computers: at SGI, first they replaced SysV UFS with EFS (E for extent-based allocation) and invented O_DIRECT to skip the buffer pool, and then blew the doors off everything with XFS, which maximised I/O concurrency and possibly (I guess, it's not open source so who knows?) involved a revamped VFS to lower stuff like inode locks, motivated by monster IRIX boxes with up to 1024 CPUs and huge storage arrays.

In the Linux ext3 era, I remember hearing lots of reports of various kinds of large systems going faster just by switching to XFS, and there is lots of writing about that. ext4 certainly changed enormously. One reason back in those days (mid 2000s?) was the old fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file thing, another was the lack of write concurrency, especially for direct I/O, and probably lots more things. But that's all ancient history...

As for ext4, we've detected and debugged clues about the gradual weakening of locking over time on this list: we know that concurrent read/write to the same page of a file was previously atomic, but when we switched to pread/pwrite for most data (ie not making use of the current file position), it ceased to be (a concurrent reader can see a mash-up of old and new data with visible cache-line-ish stripes in it, so there isn't even a write-lock for the page); then we noticed that in later kernels even read/write ceased to be atomic (implicating a change in file size/file position interlocking, I guess). I also vaguely recall reading on here a long time ago that lseek() performance was dramatically improved with weaker inode interlocking, perhaps even in response to this very program's pathological SEEK_END call frequency (something I hope to fix, but I digress). So I think it's possible that the effect you mentioned is gone?

I can think of a few differences compared to those other RDBMSs. There the discussion was about one-file-per-relation vs one-big-file-for-everything, whereas we're talking about one-file-per-relation vs many-files-per-relation (which doesn't change the point much, just making clear that I'm not proposing a 42PB file to hold everything, so you can still partition to get different files).
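To make the torn-read observation a couple of paragraphs up concrete, here is a minimal sketch of the kind of check involved (a hypothetical stand-alone test with a made-up file name, not PostgreSQL code): one process keeps overwriting an 8kB page with all-'A' or all-'B' bytes via pwrite() while a forked reader pread()s the same offset and reports if it ever sees a mixture of the two.

    /* Hypothetical torn-read check (not PostgreSQL code). */
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define PAGE 8192

    int
    main(void)
    {
        int     fd = open("atomicity-test.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        char    buf[PAGE];

        if (fd < 0)
            exit(1);
        memset(buf, 'A', PAGE);
        pwrite(fd, buf, PAGE, 0);       /* initialise the page */

        if (fork() == 0)
        {
            /* reader: look for a page containing both 'A' and 'B' bytes */
            for (long i = 0; i < 10 * 1000 * 1000; i++)
            {
                bool    saw_a = false, saw_b = false;

                pread(fd, buf, PAGE, 0);
                for (int j = 0; j < PAGE; j++)
                {
                    saw_a |= (buf[j] == 'A');
                    saw_b |= (buf[j] == 'B');
                }
                if (saw_a && saw_b)
                {
                    printf("torn read observed\n");
                    _exit(0);
                }
            }
            printf("no torn read observed\n");
            _exit(0);
        }

        /* writer: alternate the page's contents until the reader exits */
        for (long n = 0; waitpid(-1, NULL, WNOHANG) == 0; n++)
        {
            memset(buf, (n % 2) ? 'B' : 'A', PAGE);
            pwrite(fd, buf, PAGE, 0);
        }
        return 0;
    }

Whether the mixture ever shows up depends on the file system and kernel version, which is exactly the kind of behavioural drift described above.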
We also usually call fsync() in series in our checkpointer (after first getting the writebacks started with sync_file_range() some time sooner). Currently our code believes that it is not safe to call fdatasync() for files whose size might have changed. There is no basis for that in POSIX or in any system that I currently know of (though I haven't looked into it seriously), but I believe there was a historical file system that at some point interpreted "non-essential meta data" (the stuff POSIX allows it not to flush to disk) to include "the size of the file", whereas POSIX really just meant that you don't have to synchronise the mtime and similar. That is probably why PostgreSQL has some code that calls fsync() on newly created empty WAL segments to "make sure the indirect blocks are down on disk" before allowing itself to use only fdatasync() later when overwriting them with data.

The point being that, for the most important kind of interactive/user-facing I/O latency, namely WAL flushes, we already use fdatasync(). It's possible that, according to POSIX, we could use it to flush relation data too (ie the relation files in question here, usually synchronised by the checkpointer), but that doesn't immediately seem like something that should be at all hot, and it's background work. But perhaps I lack imagination.

Thanks, thought-provoking stuff.
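For what it's worth, here is a minimal sketch of the two flush patterns mentioned above (made-up file names and sizes, error handling omitted; not the actual PostgreSQL code paths): start writeback early with sync_file_range() and fsync() at checkpoint time for relation data, and fsync() a freshly zero-filled WAL-like segment once at creation so that only fdatasync() is needed for later overwrites.

    /* Hypothetical sketch of the two flush patterns (not PostgreSQL code). */
    #define _GNU_SOURCE             /* for sync_file_range() on Linux */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define SEG_SIZE (16 * 1024 * 1024)

    int
    main(void)
    {
        char    page[8192];

        memset(page, 0x42, sizeof(page));

        /* Checkpointer-style: kick off writeback early, pay for durability later. */
        int     rel = open("relation-segment.dat", O_RDWR | O_CREAT, 0644);

        pwrite(rel, page, sizeof(page), 0);
    #ifdef __linux__
        sync_file_range(rel, 0, 0, SYNC_FILE_RANGE_WRITE);  /* start writeback, don't wait */
    #endif
        /* ... some time later, at checkpoint ... */
        fsync(rel);

        /*
         * WAL-style: fsync() the newly created, zero-filled segment once so its
         * metadata (size/allocation) is durable, then rely on the cheaper
         * fdatasync() when overwriting already-allocated blocks with real data.
         */
        int     wal = open("wal-segment.dat", O_RDWR | O_CREAT, 0644);
        char    zeros[8192];

        memset(zeros, 0, sizeof(zeros));
        for (off_t off = 0; off < SEG_SIZE; off += sizeof(zeros))
            pwrite(wal, zeros, sizeof(zeros), off);
        fsync(wal);                         /* once, at creation time */

        pwrite(wal, page, sizeof(page), 0); /* later: overwrite with log data */
        fdatasync(wal);                     /* data only; metadata already down */

        return 0;
    }

(The checkpointer path in the sketch just mirrors the current fsync() behaviour described above; per the POSIX point, fdatasync() could in principle be used there too.)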