Thread: O_DIRECT on macOS

O_DIRECT on macOS

From
Thomas Munro
Date:
Hi,

IRIX gave the world O_DIRECT, and then every Unix I've used followed
their lead except Apple's, which gave the world fcntl(fd, F_NOCACHE,
1).  From what I could find in public discussion, this API difference
may stem from the caching policy being controlled at the per-file
(vnode) level in older macOS (and perhaps ancestors), but since 10.4
it's per file descriptor, so approximately like O_DIRECT on other
systems.  The precise effects and constraints of O_DIRECT/F_NOCACHE
are different across operating systems and file systems in some subtle
and not-so-subtle ways, but the general concept is the same: try to
avoid buffering.

I thought about a few different ways to encapsulate this API
difference in PostgreSQL, and toyed with two:

1.  We could define our own fake O_DIRECT flag, and translate that to
the right thing inside BasicOpenFilePerm().  That seems a bit icky.
We'd have to be careful not to collide with system defined flags and
worry about changes.  We do that sort of thing for Windows, though
that's a bit different, there we translate *all* the flags from
POSIXesque to Windowsian.

2.  We could make an extended BasicOpenFilePerm() variant that takes a
separate boolean parameter for direct, so that we don't have to hijack
any flag space, but now we need new interfaces just to tolerate a
rather niche system.

Here's a draft patch like #2, just for discussion.  Better ideas?

The reason I want to get direct I/O working on this "client" OS is
because the AIO project will propose to use direct I/O for the buffer
pool as an option, and I would like Macs to be able to do that
primarily for the sake of developers trying out the patch set.  Based
on memories from the good old days of attending conferences, a decent
percentage of PostgreSQL developers are on Macs.

As it stands, the patch only actually has any effect if you set
wal_level=minimal and max_wal_senders=0, which is a configuration that
I guess almost no-one uses.  Otherwise xlog.c assumes that the
filesystem is going to be used for data exchange with replication
processes (something we should replace with WAL buffers in shmem some
time soon) so for now it's better to keep the data in page cache since
it'll be accessed again soon.

Unfortunately, this change makes pg_test_fsync show a very slightly
lower number for open_data_sync on my ancient Intel Mac, but
pg_test_fsync isn't really representative anymore since minimal
logging is by now unusual (I guess pg_test_fsync would ideally do the
test with and without direct to make that clearer).  Whether this is a
good option for the WAL is separate from whether it's a good option
for relation data (ie a way to avoid large scale double buffering, but
have new, different problems), and later patches will propose new
separate GUCs to control that.

Attachment

Re: O_DIRECT on macOS

From
Justin Pryzby
Date:
On Sun, May 30, 2021 at 04:39:48PM +1200, Thomas Munro wrote:

> +BasicOpenFilePermDirect(const char *fileName, int fileFlags, mode_t fileMode,
> +                                               bool direct)
> ...
> +#if !defined(O_DIRECT) && defined(F_NOCACHE)
> +       /* macOS requires an extra step. */
> +       if (direct && fcntl(fd, F_NOCACHE, 1) < 0)
> +       {
> +               int                     save_errno = errno;
> +
> +               close(fd);
> +               errno = save_errno;
> +               ereport(ERROR,
> +                               (errcode_for_file_access(),
> +                                errmsg("could not disable kernel file caching for file \"%s\": %m",
> +                                               fileName)));
> +       }
> +#endif

Should there be an "else" to warn/error in the case that "direct" is requested
but not supported?

-- 
Justin



Re: O_DIRECT on macOS

From
Andres Freund
Date:
Hi,

Thanks for starting the discussion on this!

On 2021-05-30 16:39:48 +1200, Thomas Munro wrote:
> I thought about a few different ways to encapsulate this API
> difference in PostgreSQL, and toyed with two:
> 
> 1.  We could define our own fake O_DIRECT flag, and translate that to
> the right thing inside BasicOpenFilePerm().  That seems a bit icky.
> We'd have to be careful not to collide with system defined flags and
> worry about changes.  We do that sort of thing for Windows, though
> that's a bit different, there we translate *all* the flags from
> POSIXesque to Windowsian.
> 
> 2.  We could make an extended BasicOpenFilePerm() variant that takes a
> separate boolean parameter for direct, so that we don't have to hijack
> any flag space, but now we need new interfaces just to tolerate a
> rather niche system.

I don't think 2) really covers the problem on its own. It's fine for
things that directly use BasicOpenFilePerm(), but what about "virtual
file descriptors" (PathNameOpenFile())? I.e. what md.c et al use? There
we need to store the fact that we want non-buffered IO as part of the
vfd, otherwise we'll loose that information when re-opening the file
later.

Greetings,

Andres Freund



Re: O_DIRECT on macOS

From
Thomas Munro
Date:
On Mon, May 31, 2021 at 8:12 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-05-30 16:39:48 +1200, Thomas Munro wrote:
> > I thought about a few different ways to encapsulate this API
> > difference in PostgreSQL, and toyed with two:
> >
> > 1.  We could define our own fake O_DIRECT flag, and translate that to
> > the right thing inside BasicOpenFilePerm().  That seems a bit icky.
> > We'd have to be careful not to collide with system defined flags and
> > worry about changes.  We do that sort of thing for Windows, though
> > that's a bit different, there we translate *all* the flags from
> > POSIXesque to Windowsian.
> >
> > 2.  We could make an extended BasicOpenFilePerm() variant that takes a
> > separate boolean parameter for direct, so that we don't have to hijack
> > any flag space, but now we need new interfaces just to tolerate a
> > rather niche system.
>
> I don't think 2) really covers the problem on its own. It's fine for
> things that directly use BasicOpenFilePerm(), but what about "virtual
> file descriptors" (PathNameOpenFile())? I.e. what md.c et al use? There
> we need to store the fact that we want non-buffered IO as part of the
> vfd, otherwise we'll loose that information when re-opening the file
> later.

Right, a bit more API perturbation is required to traffic the separate
flags around for VFDs, which is all a bit unpleasant for a feature
that most people don't care about.

For comparison, here is my sketch of idea #1.  I pick an arbitrary
value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear
of breaking other code that might see it and try to pass it into
open()... for all I know, it might happen to match OS-internal value
O_NASAL_DEMONS), and statically assert that it doesn't collide with
standard flags we're using, and I strip it out of the flags I pass in
to open().  As I said, a bit icky, but it's a tiny and localised
patch, which is nice.

I also realised that it probably wasn't right to raise an ERROR, so in
this version I return -1 when fcntl() fails.

Attachment

Re: O_DIRECT on macOS

From
Thomas Munro
Date:
On Mon, May 31, 2021 at 4:19 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Should there be an "else" to warn/error in the case that "direct" is requested
> but not supported?

The way we use O_DIRECT currently is extremely minimal, it's just "if
you've got it, we'll use it, but otherwise not complain", and I wasn't
trying to change that yet, but you're right that if we add explicit
GUCs to turn on direct I/O for WAL and data files we should definitely
not let you turn them on if we can't do it.



Re: O_DIRECT on macOS

From
Thomas Munro
Date:
On Mon, May 31, 2021 at 10:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> For comparison, here is my sketch of idea #1.  I pick an arbitrary
> value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear
> of breaking other code that might see it and try to pass it into
> open()... for all I know, it might happen to match OS-internal value
> O_NASAL_DEMONS), and statically assert that it doesn't collide with
> standard flags we're using, and I strip it out of the flags I pass in
> to open().  As I said, a bit icky, but it's a tiny and localised
> patch, which is nice.

I'm planning to go with that idea (#1), if there are no objections.



Re: O_DIRECT on macOS

From
Andres Freund
Date:
Hi,

On 2021-07-13 13:25:50 +1200, Thomas Munro wrote:
> On Mon, May 31, 2021 at 10:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > For comparison, here is my sketch of idea #1.  I pick an arbitrary
> > value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear
> > of breaking other code that might see it and try to pass it into
> > open()... for all I know, it might happen to match OS-internal value
> > O_NASAL_DEMONS), and statically assert that it doesn't collide with
> > standard flags we're using, and I strip it out of the flags I pass in
> > to open().  As I said, a bit icky, but it's a tiny and localised
> > patch, which is nice.
> 
> I'm planning to go with that idea (#1), if there are no objections.

The only other viable approach I see is to completely separate our
internal flag representation from the OS representation and do the whole
mapping inside fd.c - but that seems like a too big hammer right now.

Greetings,

Andres Freund



Re: O_DIRECT on macOS

From
Thomas Munro
Date:
On Tue, Jul 13, 2021 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-07-13 13:25:50 +1200, Thomas Munro wrote:
> > I'm planning to go with that idea (#1), if there are no objections.
>
> The only other viable approach I see is to completely separate our
> internal flag representation from the OS representation and do the whole
> mapping inside fd.c - but that seems like a too big hammer right now.

Agreed.  Pushed!

For the record, Solaris has directio() that could be handled the same
way.  I'm not planning to look into that myself, but patches welcome.
Illumos (née OpenSolaris) got with the programme and added O_DIRECT.
Of our 10-or-so target systems I guess that'd leave just HPUX (judging
by an old man page found on the web) and OpenBSD with no direct I/O
support.



Re: O_DIRECT on macOS

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Agreed.  Pushed!

prairiedog thinks that Assert is too optimistic about whether all
those flags exist.

            regards, tom lane



Re: O_DIRECT on macOS

From
Thomas Munro
Date:
On Mon, Jul 19, 2021 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> prairiedog thinks that Assert is too optimistic about whether all
> those flags exist.

Fixed.

(Huh, I received no -committers email for 2dbe8905.)



Re: O_DIRECT on macOS

From
John Naylor
Date:

On Mon, Jul 19, 2021 at 12:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Mon, Jul 19, 2021 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > prairiedog thinks that Assert is too optimistic about whether all
> > those flags exist.
>
> Fixed.
>
> (Huh, I received no -committers email for 2dbe8905.)

It didn't show up in the archives, either. Neither did your follow-up 04cad8f7bc.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: O_DIRECT on macOS

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, Jul 19, 2021 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> prairiedog thinks that Assert is too optimistic about whether all
>> those flags exist.

> Fixed.

Hmm ... we used to have to avoid putting #if constructs in the arguments
of macros (such as StaticAssertStmt).  Maybe that's not a thing anymore
with C99, and in any case this whole stanza is fairly platform-specific
so we may not run into a compiler that complains.  But my hindbrain wants
to see this done with separate statements, eg

#if defined(O_CLOEXEC)
    StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
                     "PG_O_DIRECT collides with O_CLOEXEC");
#endif

            regards, tom lane



Re: O_DIRECT on macOS

From
Thomas Munro
Date:
On Tue, Jul 20, 2021 at 2:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm ... we used to have to avoid putting #if constructs in the arguments
> of macros (such as StaticAssertStmt).  Maybe that's not a thing anymore
> with C99, and in any case this whole stanza is fairly platform-specific
> so we may not run into a compiler that complains.  But my hindbrain wants
> to see this done with separate statements, eg
>
> #if defined(O_CLOEXEC)
>     StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
>                      "PG_O_DIRECT collides with O_CLOEXEC");
> #endif

Ok, done.

While I was here again, I couldn't resist trying to extend this to
Solaris, since it looked so easy.  I don't have access, but I tested
on Illumos by undefining O_DIRECT.  Thoughts?

Attachment

Re: O_DIRECT on macOS

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> While I was here again, I couldn't resist trying to extend this to
> Solaris, since it looked so easy.  I don't have access, but I tested
> on Illumos by undefining O_DIRECT.  Thoughts?

I can try that on the gcc farm in a bit.

            regards, tom lane



Re: O_DIRECT on macOS

From
Tom Lane
Date:
I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> While I was here again, I couldn't resist trying to extend this to
>> Solaris, since it looked so easy.  I don't have access, but I tested
>> on Illumos by undefining O_DIRECT.  Thoughts?

> I can try that on the gcc farm in a bit.

Hmm, it compiles cleanly, but something seems drastically wrong,
because performance is just awful.  On the other hand, I don't
know what sort of storage is underlying this instance, so maybe
that's to be expected?  If I set fsync = off, the speed seems
comparable to what wrasse reports, but with fsync on it's like

test tablespace                   ... ok        87990 ms
parallel group (20 tests, in groups of 1):  boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric
txiduuid enum money rangetypes pg_lsn regproc 
     boolean                      ... ok         3229 ms
     char                         ... ok         2758 ms
     name                         ... ok         2229 ms
     varchar                      ... ok         7373 ms
     text                         ... ok          722 ms
     int2                         ... ok          342 ms
     int4                         ... ok         1303 ms
     int8                         ... ok         1095 ms
     oid                          ... ok         1086 ms
     float4                       ... ok         6360 ms
     float8                       ... ok         5224 ms
     bit                          ... ok         6254 ms
     numeric                      ... ok        44304 ms
     txid                         ... ok          377 ms
     uuid                         ... ok         3946 ms
     enum                         ... ok        33189 ms
     money                        ... ok          622 ms
     rangetypes                   ... ok        17301 ms
     pg_lsn                       ... ok          798 ms
     regproc                      ... ok          145 ms

(I stopped running it at that point...)

Also, the results of pg_test_fsync seem wrong; it refuses to run
tests for the cases we're interested in:

$ pg_test_fsync
5 seconds per test
DIRECTIO_ON supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                                  n/a*
        fdatasync                             8.324 ops/sec  120139 usecs/op
        fsync                                 0.906 ops/sec  1103936 usecs/op
        fsync_writethrough                              n/a
        open_sync                                      n/a*
* This file system and its mount options do not support direct
  I/O, e.g. ext4 in journaled mode.

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                                  n/a*
        fdatasync                             7.329 ops/sec  136449 usecs/op
        fsync                                 0.788 ops/sec  1269258 usecs/op
        fsync_writethrough                              n/a
        open_sync                                      n/a*
* This file system and its mount options do not support direct
  I/O, e.g. ext4 in journaled mode.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write                      n/a*
         2 *  8kB open_sync writes                     n/a*
         4 *  4kB open_sync writes                     n/a*
         8 *  2kB open_sync writes                     n/a*
        16 *  1kB open_sync writes                     n/a*

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                  16.388 ops/sec   61020 usecs/op
        write, close, fsync                   9.084 ops/sec  110082 usecs/op

Non-sync'ed 8kB writes:
        write                             39855.686 ops/sec      25 usecs/op



            regards, tom lane



Re: O_DIRECT on macOS

From
Thomas Munro
Date:
On Tue, Jul 20, 2021 at 12:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I can try that on the gcc farm in a bit.

Thanks!

> Hmm, it compiles cleanly, but something seems drastically wrong,
> because performance is just awful.  On the other hand, I don't
> know what sort of storage is underlying this instance, so maybe
> that's to be expected?

Ouch.  I assume this was without wal_method=minimal (or it'd have
reached the new code and failed completely, based on the pg_test_fsync
result).

>         open_datasync                                  n/a*

I'm waiting for access, but I see from man pages that closed source
ZFS doesn't accept DIRECTIO_ON, so it may not be possible to see it
work on an all-ZFS system that you can't mount a new FS on.  Hmm.
Well, many OSes have file systems that can't do it (ext4 journal=data,
etc).  One problem is that we don't treat all OSes the same when
selecting wal_sync_method, even though O_DIRECT is complicated on many
OSes.  It would also be nice if the choice to use direct I/O were
independently controlled, and ... [trails off].  Alright, I'll leave
this on ice for now.