On Fri, Feb 14, 2025 at 2:02 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> On Wed, 12 Feb 2025 at 17:49, Andres Freund <andres@anarazel.de> wrote:
> > Obviously not your fault, but I do think it's pretty crazy that with the same
> > available resources, netbsd and openbsd take considerably longer than linux
> > and freebsd, which both do a lot more (linux tests 32 and 64 bit with ubsan
> > with autoconf and asan with meson, freebsd tests a bunch of debugging
> > options). I wonder what is going wrong. I suspect we might be waiting for
> > the filesystem a lot, according to cirrus-ci's CPU usage graph we're not CPU
> > bound during the test phase. Or maybe they just scale badly.
>
> Yes, I could not find the reason why either.
I don't know much about NetBSD and OpenBSD, but my guess would be that
their UFS implementations are serialising on synchronous I/O for some
file/directory operations that we do a lot of. Perhaps something like commit
0265e5c1 could be ported to those guys too? I suspect that Windows
has the same problem, I just don't know how to script a RAM disk as
all the web instructions start with "open the blah GUI and click on
the ...". I think the Linux file systems are just much better at that
sort of metadata-heavy work thanks to journaling and deferred I/O, but these other systems
probably issue synchronous writes, possibly with drive cache flushes
in some cases that you might not be able to turn off*. As for macOS,
I don't know much except that it doesn't seem too interested in
flushing the write cache generally, and is currently our 2nd fastest
OS on CI anyway.
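One rough way to poke at the synchronous-metadata theory (just a sketch, not something from the tree or the CI scripts) would be to time a loop of tiny create/rename/unlink operations in a directory on each filesystem and compare the numbers. The helper name `time_metadata_ops` and the default of 1000 files are made up for illustration, and a slow result is only suggestive: it doesn't by itself prove that the operations are being serialised on synchronous writes.

```python
import os
import tempfile
import time


def time_metadata_ops(directory: str, n: int = 1000) -> float:
    """Hypothetical microbenchmark: create, rename and unlink n one-byte
    files in a scratch subdirectory of `directory`, returning elapsed
    wall-clock seconds. Mostly exercises directory/inode metadata updates."""
    start = time.perf_counter()
    # Work in a private scratch directory that is removed afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as work:
        for i in range(n):
            src = os.path.join(work, f"f{i}")
            dst = os.path.join(work, f"g{i}")
            with open(src, "wb") as f:  # create + tiny write
                f.write(b"x")
            os.rename(src, dst)  # directory entry update
            os.unlink(dst)  # directory entry removal
    return time.perf_counter() - start
```

E.g. run it once against the build directory on a NetBSD or OpenBSD CI instance and once against a tmpfs mount, and see whether the gap is anything like the difference in test-phase times.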
Maybe we could try this?
https://man.netbsd.org/mount_tmpfs.8
https://man.openbsd.org/mount_tmpfs.8
FreeBSD has one too, but I haven't looked into how that compares to
mounting UFS on a RAM disk device, which is how I did it. Linux has
a tmpfs too, but we know that it refuses O_DIRECT, so I wouldn't
propose that there: it's essential to test those code paths. OpenBSD is the
only system in our universe without O_DIRECT, and I don't know what
NetBSD's policy is on O_DIRECT on tmpfs and wouldn't worry about it
too much anyway. (This is like a Zen riddle: what does it mean to
skip the page cache if the filesystem is the page cache? Mu. The
Linux answer is: no, go away.)
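To check which answer a particular kernel and filesystem actually give, a small probe along these lines could be run on each CI image. It's a sketch, not anything in the tree: `supports_o_direct` is a hypothetical helper, it assumes a refusal shows up as EINVAL from open() (anything else is re-raised), and it reports None where Python's os module doesn't expose O_DIRECT at all, as on OpenBSD.

```python
import errno
import os
import tempfile
from typing import Optional


def supports_o_direct(directory: str) -> Optional[bool]:
    """Hypothetical probe: True if a file in `directory` can be opened with
    O_DIRECT, False if open() rejects it with EINVAL, None if the platform
    has no O_DIRECT flag at all."""
    flag = getattr(os, "O_DIRECT", None)
    if flag is None:
        return None  # e.g. OpenBSD: the flag simply doesn't exist
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        fd = os.open(path, os.O_RDWR | flag)
    except OSError as e:
        if e.errno == errno.EINVAL:
            return False  # filesystem refuses direct I/O
        raise
    else:
        os.close(fd)
        return True
    finally:
        os.unlink(path)  # always clean up the probe file
```

Only open() is tested here; actual O_DIRECT reads and writes would additionally need suitably aligned buffers, which this doesn't attempt.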
Thanks for working on these! I'll have another go at my workaround
for the SysV IPC resource limits on those two OSes, which is relevant
to the I/O worker stuff.
*In my hobbyist OS hacking I had an on/off switch for that for
throwaway FreeBSD CI systems where you don't care about durability,
though I didn't go anywhere with it after the RAM disk did the trick.