Direct I/O - Mailing list pgsql-hackers

From Thomas Munro
Subject Direct I/O
Msg-id CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
List pgsql-hackers
Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the
AIO patch-set[1].  It adds three new settings, defaulting to off:

  io_data_direct = whether to use O_DIRECT for main data files
  io_wal_direct = ... for WAL
  io_wal_init_direct = ... for WAL-file initialisation

O_DIRECT asks the kernel to avoid caching file data as much as
possible.  Here's a fun quote about it[2]:

"The exact semantics of Direct I/O (O_DIRECT) are not well specified.
It is not a part of POSIX, or SUS, or any other formal standards
specification. The exact meaning of O_DIRECT has historically been
negotiated in non-public discussions between powerful enterprise
database companies and proprietary Unix systems, and its behaviour has
generally been passed down as oral lore rather than as a formal set of
requirements and specifications."

It gives the kernel the opportunity to move data directly between
PostgreSQL's user space buffers and the storage hardware using DMA,
that is, without CPU involvement or copying.  Not all storage stacks
can do that, for various reasons, but even when they can't, the kernel
should ideally still use temporary buffers and avoid polluting the
page cache.

These settings currently destroy performance, and are not yet intended
to be used by end users!  That's why we filed them under
DEVELOPER_OPTIONS.  You don't get automatic read-ahead, concurrency,
clustering or (of course) buffering from the kernel.  The idea is that
later parts of the AIO patch-set will introduce mechanisms to replace
what the kernel is doing for us today, and then more, since we ought
to be even better at predicting our own future I/O than the kernel is,
so that we'll finish up ahead.  Even with all that, you wouldn't want
to turn it on by default because the default shared_buffers would be
insufficient for any real system, and there are portability problems.

Examples of slowness:

* every 8KB sequential read or write becomes a full round trip to the
storage, one at a time

* data that is written to WAL and then read back in by WAL sender will
incur a full I/O round trip (that's probably not really an AIO problem,
that's something we should probably address by using shared memory
instead of files, as noted as a TODO item in the source code)

Memory alignment patches:

Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work).  We need to deal with buffers on the stack, the heap and
in shmem.  For the stack, see patch 0001.  For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency.  The main direct I/O patch is 0003.
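
To make the alignment requirement concrete, here is a rough sketch of
the two cases (stack and heap).  It is not taken from the patches: all
names prefixed with sketch_/Sketch and the constants are hypothetical,
the attribute syntax assumes GCC or Clang, and the heap helper uses
plain malloc() with over-allocation rather than anything
MemoryContext-based.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BLCKSZ   8192    /* assumed block size */
#define SKETCH_IO_ALIGN 4096    /* assumed "standard" page size for direct I/O */

/* A block-sized buffer whose storage is forced onto a 4KB boundary. */
typedef struct SketchIOAlignedBlock
{
    char data[SKETCH_BLCKSZ] __attribute__((aligned(SKETCH_IO_ALIGN)));
} SketchIOAlignedBlock;

/* True if the address is a multiple of the I/O alignment. */
static bool
sketch_is_io_aligned(const void *p)
{
    return ((uintptr_t) p % SKETCH_IO_ALIGN) == 0;
}

/* Stack case: the compiler places the local variable on an aligned address. */
static bool
sketch_stack_buffer_is_aligned(void)
{
    SketchIOAlignedBlock block;

    block.data[0] = 0;
    return sketch_is_io_aligned(block.data);
}

/*
 * Heap case: over-allocate by (alignment - 1) bytes and round the address
 * up.  The caller must free *raw, not the returned pointer.
 */
static void *
sketch_alloc_aligned(size_t size, void **raw)
{
    uintptr_t addr;

    assert(raw != NULL);
    *raw = malloc(size + SKETCH_IO_ALIGN - 1);
    if (*raw == NULL)
        return NULL;
    addr = ((uintptr_t) *raw + SKETCH_IO_ALIGN - 1) &
        ~((uintptr_t) SKETCH_IO_ALIGN - 1);
    return (void *) addr;
}
```

The over-allocation trick wastes up to one page per allocation, which
is one reason an allocator-level solution is worth its own discussion.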

Assorted portability notes:

I expect this to "work" (that is, successfully destroy performance) on
typical developer systems running at least Linux, macOS, Windows and
FreeBSD.  By work, I mean: not be rejected by PostgreSQL, not be
rejected by the kernel, and influence kernel cache behaviour on common
filesystems.  It might be rejected with ENOTSUP, EINVAL etc on some
more exotic filesystems and OSes.  Of currently supported OSes, only
OpenBSD and Solaris don't have O_DIRECT at all, and we'll reject the
GUCs.  For macOS and Windows we internally translate our own
PG_O_DIRECT flag to the correct flags/calls (committed a while
back[3]).
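
For readers who haven't seen that translation, the following is a
simplified sketch of the idea, not the committed code.  SKETCH_O_DIRECT
and sketch_open() are hypothetical names, and the private flag value is
an arbitrary assumption.  Where the OS defines O_DIRECT it is passed
straight to open(); on macOS the flag is stripped and
fcntl(F_NOCACHE) is applied to the descriptor instead.  Windows is not
shown; there the equivalent is FILE_FLAG_NO_BUFFERING passed to
CreateFile().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc only exposes O_DIRECT with this */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#if defined(O_DIRECT)
#define SKETCH_O_DIRECT O_DIRECT
#elif defined(__APPLE__) && defined(F_NOCACHE)
#define SKETCH_O_DIRECT 0x80000000  /* private bit, never passed to open() */
#define SKETCH_USE_F_NOCACHE
#else
#define SKETCH_O_DIRECT 0           /* no direct I/O on this platform */
#endif

/* open() wrapper that understands SKETCH_O_DIRECT on every platform. */
static int
sketch_open(const char *path, int flags, mode_t mode)
{
    int fd;
#ifdef SKETCH_USE_F_NOCACHE
    int want_direct = (flags & SKETCH_O_DIRECT) != 0;

    flags &= ~SKETCH_O_DIRECT;
#endif

    assert(path != NULL);
    fd = open(path, flags, mode);

#ifdef SKETCH_USE_F_NOCACHE
    /* macOS: ask for uncached I/O on the already-open descriptor. */
    if (fd >= 0 && want_direct && fcntl(fd, F_NOCACHE, 1) < 0)
    {
        int save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
#endif
    return fd;
}
```

A caller would pass O_RDWR | SKETCH_O_DIRECT and must be prepared for
open() to fail with EINVAL on filesystems that reject the flag.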

On Windows, scatter/gather is available only with direct I/O, so a
true pwritev would in theory be possible, but that has some more
complications and is left for later patches (probably using native
interfaces, not disguising as POSIX).

There may be systems on which 8KB offset alignment will not work at
all or not work well, and that's expected.  For example, BTRFS, ZFS,
JFS "big file", UFS etc allow larger-than-8KB blocks/records, and an
8KB write will have to trigger a read-before-write.  Note that
offset/length alignment requirements (blocks) are independent of
buffer alignment requirements (memory pages, 4KB).
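
The distinction between the two kinds of alignment can be spelled out
as a few predicates.  This is only an illustration with hypothetical
names; the 128KB record size used in the usage note is an example
value, and the read-before-write test ignores special cases such as
writes past end-of-file or records that are already cached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IO_ALIGN 4096    /* assumed memory page size for direct I/O */

/* Buffer alignment: where the data lives in memory. */
static bool
sketch_buffer_ok(const void *buf)
{
    return ((uintptr_t) buf % SKETCH_IO_ALIGN) == 0;
}

/* Offset/length alignment: where the data lives in the file. */
static bool
sketch_range_ok(uint64_t offset, uint64_t len, uint64_t unit)
{
    assert(unit > 0);
    return offset % unit == 0 && len % unit == 0;
}

/*
 * A write that covers only part of a filesystem record forces the
 * filesystem to read the rest of the record first.
 */
static bool
sketch_needs_read_before_write(uint64_t offset, uint64_t len,
                               uint64_t record_size)
{
    return !sketch_range_ok(offset, len, record_size);
}
```

For example, an 8KB write at offset 8192 is fine on a filesystem with
4KB or 8KB blocks, but with a 128KB record size
sketch_needs_read_before_write(8192, 8192, 131072) returns true.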

The behaviour and cache coherency of files that have open descriptors
using both direct and non-direct flags may be complicated and vary
between systems.  The patch currently lets you change the GUCs at
runtime so backends can disagree: that should probably not be allowed,
but is like that now for experimentation.  More study is required.

If someone has a compiler that we don't know how to do
pg_attribute_aligned() for, then we can't make correctly aligned stack
buffers, so in that case direct I/O is disabled, but I don't know of
such a system (maybe aCC, but we dropped it).  That's why smgr code
can only assert that pointers are IO-aligned if PG_O_DIRECT != 0, and
why PG_O_DIRECT is forced to 0 if there is no pg_attribute_aligned()
macro, disabling the GUCs.
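
The compile-time gating described above might look roughly like this.
It is a sketch under assumptions, not the patch itself: the sketch_*
and SKETCH_* names are hypothetical stand-ins for
pg_attribute_aligned() and PG_O_DIRECT, and only the GCC/Clang and
MSVC spellings of the alignment attribute are covered.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc only exposes O_DIRECT with this */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Define the alignment macro only for compilers we know how to handle. */
#if defined(__GNUC__) || defined(__clang__)
#define sketch_attribute_aligned(a) __attribute__((aligned(a)))
#elif defined(_MSC_VER)
#define sketch_attribute_aligned(a) __declspec(align(a))
#endif

/* Without the macro we cannot align stack buffers, so force the flag to 0. */
#if defined(O_DIRECT) && defined(sketch_attribute_aligned)
#define SKETCH_O_DIRECT O_DIRECT
#else
#define SKETCH_O_DIRECT 0
#endif

#define SKETCH_IO_ALIGN 4096    /* assumed alignment for direct I/O buffers */

/* Only insist on aligned pointers when direct I/O can actually be used. */
static void
sketch_check_io_pointer(const void *buf)
{
#if SKETCH_O_DIRECT != 0
    assert(((uintptr_t) buf % SKETCH_IO_ALIGN) == 0);
#else
    (void) buf;
#endif
}

/* What a settings check hook could consult before accepting "on". */
static int
sketch_direct_io_available(void)
{
    return SKETCH_O_DIRECT != 0;
}
```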

This seems to be an independent enough piece to get into the tree on
its own, with the proviso that it's not actually useful yet other than
for experimentation.  Thoughts?

These patches have been hacked on at various times by Andres Freund,
David Rowley and me.

[1] https://wiki.postgresql.org/wiki/AIO
[2] https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
[3] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BADiyyHe0cun2wfT%2BSVnFVqNYPxoO6J9zcZkVO7%2BNGig%40mail.gmail.com
