Thread: O_DIRECT on macOS
Hi, IRIX gave the world O_DIRECT, and then every Unix I've used followed their lead except Apple's, which gave the world fcntl(fd, F_NOCACHE, 1). From what I could find in public discussion, this API difference may stem from the caching policy being controlled at the per-file (vnode) level in older macOS (and perhaps ancestors), but since 10.4 it's per file descriptor, so approximately like O_DIRECT on other systems. The precise effects and constraints of O_DIRECT/F_NOCACHE are different across operating systems and file systems in some subtle and not-so-subtle ways, but the general concept is the same: try to avoid buffering. I thought about a few different ways to encapsulate this API difference in PostgreSQL, and toyed with two: 1. We could define our own fake O_DIRECT flag, and translate that to the right thing inside BasicOpenFilePerm(). That seems a bit icky. We'd have to be careful not to collide with system defined flags and worry about changes. We do that sort of thing for Windows, though that's a bit different, there we translate *all* the flags from POSIXesque to Windowsian. 2. We could make an extended BasicOpenFilePerm() variant that takes a separate boolean parameter for direct, so that we don't have to hijack any flag space, but now we need new interfaces just to tolerate a rather niche system. Here's a draft patch like #2, just for discussion. Better ideas? The reason I want to get direct I/O working on this "client" OS is because the AIO project will propose to use direct I/O for the buffer pool as an option, and I would like Macs to be able to do that primarily for the sake of developers trying out the patch set. Based on memories from the good old days of attending conferences, a decent percentage of PostgreSQL developers are on Macs. As it stands, the patch only actually has any effect if you set wal_level=minimal and max_wal_senders=0, which is a configuration that I guess almost no-one uses. Otherwise xlog.c assumes that the filesystem is going to be used for data exchange with replication processes (something we should replace with WAL buffers in shmem some time soon) so for now it's better to keep the data in page cache since it'll be accessed again soon. Unfortunately, this change makes pg_test_fsync show a very slightly lower number for open_data_sync on my ancient Intel Mac, but pg_test_fsync isn't really representative anymore since minimal logging is by now unusual (I guess pg_test_fsync would ideally do the test with and without direct to make that clearer). Whether this is a good option for the WAL is separate from whether it's a good option for relation data (ie a way to avoid large scale double buffering, but have new, different problems), and later patches will propose new separate GUCs to control that.
Attachment
On Sun, May 30, 2021 at 04:39:48PM +1200, Thomas Munro wrote: > +BasicOpenFilePermDirect(const char *fileName, int fileFlags, mode_t fileMode, > + bool direct) > ... > +#if !defined(O_DIRECT) && defined(F_NOCACHE) > + /* macOS requires an extra step. */ > + if (direct && fcntl(fd, F_NOCACHE, 1) < 0) > + { > + int save_errno = errno; > + > + close(fd); > + errno = save_errno; > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not disable kernel file caching for file \"%s\": %m", > + fileName))); > + } > +#endif Should there be an "else" to warn/error in the case that "direct" is requested but not supported? -- Justin
Hi, Thanks for starting the discussion on this! On 2021-05-30 16:39:48 +1200, Thomas Munro wrote: > I thought about a few different ways to encapsulate this API > difference in PostgreSQL, and toyed with two: > > 1. We could define our own fake O_DIRECT flag, and translate that to > the right thing inside BasicOpenFilePerm(). That seems a bit icky. > We'd have to be careful not to collide with system defined flags and > worry about changes. We do that sort of thing for Windows, though > that's a bit different, there we translate *all* the flags from > POSIXesque to Windowsian. > > 2. We could make an extended BasicOpenFilePerm() variant that takes a > separate boolean parameter for direct, so that we don't have to hijack > any flag space, but now we need new interfaces just to tolerate a > rather niche system. I don't think 2) really covers the problem on its own. It's fine for things that directly use BasicOpenFilePerm(), but what about "virtual file descriptors" (PathNameOpenFile())? I.e. what md.c et al use? There we need to store the fact that we want non-buffered IO as part of the vfd, otherwise we'll loose that information when re-opening the file later. Greetings, Andres Freund
On Mon, May 31, 2021 at 8:12 AM Andres Freund <andres@anarazel.de> wrote: > On 2021-05-30 16:39:48 +1200, Thomas Munro wrote: > > I thought about a few different ways to encapsulate this API > > difference in PostgreSQL, and toyed with two: > > > > 1. We could define our own fake O_DIRECT flag, and translate that to > > the right thing inside BasicOpenFilePerm(). That seems a bit icky. > > We'd have to be careful not to collide with system defined flags and > > worry about changes. We do that sort of thing for Windows, though > > that's a bit different, there we translate *all* the flags from > > POSIXesque to Windowsian. > > > > 2. We could make an extended BasicOpenFilePerm() variant that takes a > > separate boolean parameter for direct, so that we don't have to hijack > > any flag space, but now we need new interfaces just to tolerate a > > rather niche system. > > I don't think 2) really covers the problem on its own. It's fine for > things that directly use BasicOpenFilePerm(), but what about "virtual > file descriptors" (PathNameOpenFile())? I.e. what md.c et al use? There > we need to store the fact that we want non-buffered IO as part of the > vfd, otherwise we'll loose that information when re-opening the file > later. Right, a bit more API perturbation is required to traffic the separate flags around for VFDs, which is all a bit unpleasant for a feature that most people don't care about. For comparison, here is my sketch of idea #1. I pick an arbitrary value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear of breaking other code that might see it and try to pass it into open()... for all I know, it might happen to match OS-internal value O_NASAL_DEMONS), and statically assert that it doesn't collide with standard flags we're using, and I strip it out of the flags I pass in to open(). As I said, a bit icky, but it's a tiny and localised patch, which is nice. I also realised that it probably wasn't right to raise an ERROR, so in this version I return -1 when fcntl() fails.
Attachment
On Mon, May 31, 2021 at 4:19 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > Should there be an "else" to warn/error in the case that "direct" is requested > but not supported? The way we use O_DIRECT currently is extremely minimal, it's just "if you've got it, we'll use it, but otherwise not complain", and I wasn't trying to change that yet, but you're right that if we add explicit GUCs to turn on direct I/O for WAL and data files we should definitely not let you turn them on if we can't do it.
On Mon, May 31, 2021 at 10:29 AM Thomas Munro <thomas.munro@gmail.com> wrote: > For comparison, here is my sketch of idea #1. I pick an arbitrary > value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear > of breaking other code that might see it and try to pass it into > open()... for all I know, it might happen to match OS-internal value > O_NASAL_DEMONS), and statically assert that it doesn't collide with > standard flags we're using, and I strip it out of the flags I pass in > to open(). As I said, a bit icky, but it's a tiny and localised > patch, which is nice. I'm planning to go with that idea (#1), if there are no objections.
Hi, On 2021-07-13 13:25:50 +1200, Thomas Munro wrote: > On Mon, May 31, 2021 at 10:29 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > For comparison, here is my sketch of idea #1. I pick an arbitrary > > value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear > > of breaking other code that might see it and try to pass it into > > open()... for all I know, it might happen to match OS-internal value > > O_NASAL_DEMONS), and statically assert that it doesn't collide with > > standard flags we're using, and I strip it out of the flags I pass in > > to open(). As I said, a bit icky, but it's a tiny and localised > > patch, which is nice. > > I'm planning to go with that idea (#1), if there are no objections. The only other viable approach I see is to completely separate our internal flag representation from the OS representation and do the whole mapping inside fd.c - but that seems like a too big hammer right now. Greetings, Andres Freund
On Tue, Jul 13, 2021 at 1:56 PM Andres Freund <andres@anarazel.de> wrote: > On 2021-07-13 13:25:50 +1200, Thomas Munro wrote: > > I'm planning to go with that idea (#1), if there are no objections. > > The only other viable approach I see is to completely separate our > internal flag representation from the OS representation and do the whole > mapping inside fd.c - but that seems like a too big hammer right now. Agreed. Pushed! For the record, Solaris has directio() that could be handled the same way. I'm not planning to look into that myself, but patches welcome. Illumos (née OpenSolaris) got with the programme and added O_DIRECT. Of our 10-or-so target systems I guess that'd leave just HPUX (judging by an old man page found on the web) and OpenBSD with no direct I/O support.
Thomas Munro <thomas.munro@gmail.com> writes: > Agreed. Pushed! prairiedog thinks that Assert is too optimistic about whether all those flags exist. regards, tom lane
On Mon, Jul 19, 2021 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > prairiedog thinks that Assert is too optimistic about whether all > those flags exist. Fixed. (Huh, I received no -committers email for 2dbe8905.)
On Mon, Jul 19, 2021 at 12:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Mon, Jul 19, 2021 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > prairiedog thinks that Assert is too optimistic about whether all
> > those flags exist.
>
> Fixed.
>
> (Huh, I received no -committers email for 2dbe8905.)
It didn't show up in the archives, either. Neither did your follow-up 04cad8f7bc.
--
John Naylor
EDB: http://www.enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes: > On Mon, Jul 19, 2021 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> prairiedog thinks that Assert is too optimistic about whether all >> those flags exist. > Fixed. Hmm ... we used to have to avoid putting #if constructs in the arguments of macros (such as StaticAssertStmt). Maybe that's not a thing anymore with C99, and in any case this whole stanza is fairly platform-specific so we may not run into a compiler that complains. But my hindbrain wants to see this done with separate statements, eg #if defined(O_CLOEXEC) StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0, "PG_O_DIRECT collides with O_CLOEXEC"); #endif regards, tom lane
On Tue, Jul 20, 2021 at 2:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmm ... we used to have to avoid putting #if constructs in the arguments > of macros (such as StaticAssertStmt). Maybe that's not a thing anymore > with C99, and in any case this whole stanza is fairly platform-specific > so we may not run into a compiler that complains. But my hindbrain wants > to see this done with separate statements, eg > > #if defined(O_CLOEXEC) > StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0, > "PG_O_DIRECT collides with O_CLOEXEC"); > #endif Ok, done. While I was here again, I couldn't resist trying to extend this to Solaris, since it looked so easy. I don't have access, but I tested on Illumos by undefining O_DIRECT. Thoughts?
Attachment
Thomas Munro <thomas.munro@gmail.com> writes: > While I was here again, I couldn't resist trying to extend this to > Solaris, since it looked so easy. I don't have access, but I tested > on Illumos by undefining O_DIRECT. Thoughts? I can try that on the gcc farm in a bit. regards, tom lane
I wrote: > Thomas Munro <thomas.munro@gmail.com> writes: >> While I was here again, I couldn't resist trying to extend this to >> Solaris, since it looked so easy. I don't have access, but I tested >> on Illumos by undefining O_DIRECT. Thoughts? > I can try that on the gcc farm in a bit. Hmm, it compiles cleanly, but something seems drastically wrong, because performance is just awful. On the other hand, I don't know what sort of storage is underlying this instance, so maybe that's to be expected? If I set fsync = off, the speed seems comparable to what wrasse reports, but with fsync on it's like test tablespace ... ok 87990 ms parallel group (20 tests, in groups of 1): boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txiduuid enum money rangetypes pg_lsn regproc boolean ... ok 3229 ms char ... ok 2758 ms name ... ok 2229 ms varchar ... ok 7373 ms text ... ok 722 ms int2 ... ok 342 ms int4 ... ok 1303 ms int8 ... ok 1095 ms oid ... ok 1086 ms float4 ... ok 6360 ms float8 ... ok 5224 ms bit ... ok 6254 ms numeric ... ok 44304 ms txid ... ok 377 ms uuid ... ok 3946 ms enum ... ok 33189 ms money ... ok 622 ms rangetypes ... ok 17301 ms pg_lsn ... ok 798 ms regproc ... ok 145 ms (I stopped running it at that point...) Also, the results of pg_test_fsync seem wrong; it refuses to run tests for the cases we're interested in: $ pg_test_fsync 5 seconds per test DIRECTIO_ON supported on this platform for open_datasync and open_sync. Compare file sync methods using one 8kB write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync n/a* fdatasync 8.324 ops/sec 120139 usecs/op fsync 0.906 ops/sec 1103936 usecs/op fsync_writethrough n/a open_sync n/a* * This file system and its mount options do not support direct I/O, e.g. ext4 in journaled mode. Compare file sync methods using two 8kB writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync n/a* fdatasync 7.329 ops/sec 136449 usecs/op fsync 0.788 ops/sec 1269258 usecs/op fsync_writethrough n/a open_sync n/a* * This file system and its mount options do not support direct I/O, e.g. ext4 in journaled mode. Compare open_sync with different write sizes: (This is designed to compare the cost of writing 16kB in different write open_sync sizes.) 1 * 16kB open_sync write n/a* 2 * 8kB open_sync writes n/a* 4 * 4kB open_sync writes n/a* 8 * 2kB open_sync writes n/a* 16 * 1kB open_sync writes n/a* Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close 16.388 ops/sec 61020 usecs/op write, close, fsync 9.084 ops/sec 110082 usecs/op Non-sync'ed 8kB writes: write 39855.686 ops/sec 25 usecs/op regards, tom lane
On Tue, Jul 20, 2021 at 12:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I can try that on the gcc farm in a bit. Thanks! > Hmm, it compiles cleanly, but something seems drastically wrong, > because performance is just awful. On the other hand, I don't > know what sort of storage is underlying this instance, so maybe > that's to be expected? Ouch. I assume this was without wal_method=minimal (or it'd have reached the new code and failed completely, based on the pg_test_fsync result). > open_datasync n/a* I'm waiting for access, but I see from man pages that closed source ZFS doesn't accept DIRECTIO_ON, so it may not be possible to see it work on an all-ZFS system that you can't mount a new FS on. Hmm. Well, many OSes have file systems that can't do it (ext4 journal=data, etc). One problem is that we don't treat all OSes the same when selecting wal_sync_method, even though O_DIRECT is complicated on many OSes. It would also be nice if the choice to use direct I/O were independently controlled, and ... [trails off]. Alright, I'll leave this on ice for now.