On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.
(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)
Yes, although when the decision was made it was probably ext-3 -> XFS. We suffered from fsync on one file behaving like fsync on the whole filesystem,
because MySQL binlogs use buffered IO and are appended on write. Switching from ext-? to XFS was an easy perf win,
so I don't have much experience with ext-? over the past decade.
Right, 80s file systems like UFS (and I suspect ext and ext2, which
Late 80s is when I last hacked on Unix filesystem code, excluding browsing XFS and ext source. Unix was easy back then -- one big kernel lock covered everything.
some time sooner). Currently our code believes that it is not safe to call fdatasync() for files whose size might have changed. There is no
Long ago we added code for InnoDB to avoid fsync/fdatasync in some cases when O_DIRECT was used. While that was great for performance, we forgot to make sure the syncs were still done when files were extended. Eventually we fixed that.
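
To make the rule concrete, here is a rough sketch in Python of the kind of bookkeeping I mean. It is not the InnoDB code. The class name TrackedFile and the assume_direct_io flag are made up for illustration, and the sketch does not actually open the file with O_DIRECT (that needs aligned buffers), it only models the policy: skip the sync for in-place overwrites when direct IO is assumed, but always do a full fsync once a write has grown the file.

```python
import os

# Fall back to fsync on platforms where Python does not expose fdatasync.
_fdatasync = getattr(os, "fdatasync", os.fsync)


class TrackedFile:
    """Hypothetical wrapper that remembers whether unsynced writes grew the file."""

    def __init__(self, path, assume_direct_io=False):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        # Policy flag only (an assumption of this sketch): the file is NOT
        # opened with O_DIRECT here.
        self.assume_direct_io = assume_direct_io
        self.size = os.fstat(self.fd).st_size
        self.dirty = False      # any write since the last sync decision
        self.extended = False   # a write went past the previously known size

    def write_at(self, data, offset):
        view = memoryview(data)
        pos = offset
        while len(view) > 0:
            n = os.pwrite(self.fd, view, pos)  # may be a partial write
            view = view[n:]
            pos += n
        self.dirty = True
        if pos > self.size:
            self.size = pos
            self.extended = True

    def sync(self):
        """Return which kind of sync was issued: 'none', 'fdatasync' or 'fsync'."""
        if not self.dirty:
            return "none"
        if self.extended:
            # Conservative choice: the size changed, so flush metadata too.
            os.fsync(self.fd)
            kind = "fsync"
        elif self.assume_direct_io:
            # Overwrite in place with direct IO assumed: skip the sync.
            kind = "none"
        else:
            _fdatasync(self.fd)
            kind = "fdatasync"
        self.dirty = False
        self.extended = False
        return kind

    def close(self):
        os.close(self.fd)
```

The bug I described amounts to taking the "skip" branch without checking the extended flag first. Whether skipping is safe at all depends on the storage and the filesystem, so treat that branch as an assumption, not a recommendation.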