Hi,
IRIX gave the world O_DIRECT, and then every Unix I've used followed
their lead except Apple's, which gave the world fcntl(fd, F_NOCACHE,
1). From what I could find in public discussion, this API difference
may stem from the caching policy being controlled at the per-file
(vnode) level in older macOS (and perhaps ancestors), but since 10.4
it's per file descriptor, so approximately like O_DIRECT on other
systems. The precise effects and constraints of O_DIRECT/F_NOCACHE
are different across operating systems and file systems in some subtle
and not-so-subtle ways, but the general concept is the same: try to
avoid buffering.
I thought about a few different ways to encapsulate this API
difference in PostgreSQL, and toyed with two:
1. We could define our own fake O_DIRECT flag, and translate that to
the right thing inside BasicOpenFilePerm(). That seems a bit icky.
We'd have to be careful not to collide with system defined flags and
worry about changes. We do that sort of thing for Windows, though
that's a bit different: there we translate *all* the flags from
POSIXesque to Windowsian.
2. We could make an extended BasicOpenFilePerm() variant that takes a
separate boolean parameter for direct, so that we don't have to hijack
any flag space, but now we need new interfaces just to tolerate a
rather niche system.
Here's a draft patch like #2, just for discussion. Better ideas?
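To make the shape of approach #2 concrete, here is a minimal
standalone sketch. It is not the patch itself: open_maybe_direct() is
a hypothetical name invented for illustration, and the real change
would live in an extended BasicOpenFilePerm() variant. It assumes
that a platform exposes either O_DIRECT or F_NOCACHE, and silently
ignores the request where neither exists.

```c
#define _GNU_SOURCE				/* exposes O_DIRECT on Linux/glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a PostgreSQL function): open a file, asking
 * the OS to avoid buffering if "direct" is true.  Returns a file
 * descriptor, or -1 with errno set.
 */
int
open_maybe_direct(const char *path, int flags, mode_t mode, bool direct)
{
	int			fd;

#if defined(O_DIRECT)
	/* Most Unixes: direct I/O is requested with an open() flag. */
	if (direct)
		flags |= O_DIRECT;
#endif

	fd = open(path, flags, mode);
	if (fd < 0)
		return -1;

#if !defined(O_DIRECT) && defined(F_NOCACHE)
	/* macOS: no open() flag, so turn caching off after opening. */
	if (direct && fcntl(fd, F_NOCACHE, 1) < 0)
	{
		int			save_errno = errno;

		close(fd);
		errno = save_errno;
		return -1;
	}
#endif

	return fd;
}
```

Callers would pass direct as an ordinary argument, e.g.
open_maybe_direct(path, O_RDWR, 0600, true), so no flag bits need to
be reserved. Note that on many systems direct I/O also imposes
alignment requirements on buffers, offsets and lengths, which this
sketch does not address.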
The reason I want to get direct I/O working on this "client" OS is
that the AIO project will propose to use direct I/O for the buffer
pool as an option, and I would like Macs to be able to do that
primarily for the sake of developers trying out the patch set. Based
on memories from the good old days of attending conferences, a decent
percentage of PostgreSQL developers are on Macs.
As it stands, the patch only actually has any effect if you set
wal_level=minimal and max_wal_senders=0, which is a configuration that
I guess almost no-one uses. Otherwise xlog.c assumes that the
filesystem is going to be used for data exchange with replication
processes (something we should replace with WAL buffers in shmem some
time soon) so for now it's better to keep the data in page cache since
it'll be accessed again soon.
Unfortunately, this change makes pg_test_fsync show a very slightly
lower number for open_data_sync on my ancient Intel Mac, but
pg_test_fsync isn't really representative anymore since minimal
logging is by now unusual (I guess pg_test_fsync would ideally do the
test with and without direct to make that clearer). Whether this is a
good option for the WAL is separate from whether it's a good option
for relation data (i.e. a way to avoid large-scale double buffering,
but with new, different problems), and later patches will propose new
separate GUCs to control that.