Direct I/O - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Direct I/O |
Msg-id | CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com |
List | pgsql-hackers |
Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the
AIO patch-set[1].  It adds three new settings, defaulting to off:

  io_data_direct = whether to use O_DIRECT for main data files
  io_wal_direct = ... for WAL
  io_wal_init_direct = ... for WAL-file initialisation

O_DIRECT asks the kernel to avoid caching file data as much as
possible.  Here's a fun quote about it[2]:

"The exact semantics of Direct I/O (O_DIRECT) are not well specified.
It is not a part of POSIX, or SUS, or any other formal standards
specification.  The exact meaning of O_DIRECT has historically been
negotiated in non-public discussions between powerful enterprise
database companies and proprietary Unix systems, and its behaviour has
generally been passed down as oral lore rather than as a formal set of
requirements and specifications."

It gives the kernel the opportunity to move data directly between
PostgreSQL's user space buffers and the storage hardware using DMA
hardware, that is, without CPU involvement or copying.  Not all
storage stacks can do that, for various reasons, but even if not, the
caching policy should ideally still use temporary buffers and avoid
polluting the page cache.

These settings currently destroy performance, and are not intended to
be used by end-users, yet!  That's why we filed them under
DEVELOPER_OPTIONS.  You don't get automatic read-ahead, concurrency,
clustering or (of course) buffering from the kernel.  The idea is that
later parts of the AIO patch-set will introduce mechanisms to replace
what the kernel is doing for us today, and then more, since we ought
to be even better at predicting our own future I/O than it, so that
we'll finish up ahead.  Even with all that, you wouldn't want to turn
it on by default because the default shared_buffers would be
insufficient for any real system, and there are portability problems.
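As a rough illustration of what these settings ask of the kernel, here
is a minimal sketch of an 8KB block read through O_DIRECT.  It is not
code from the patches: the SKETCH_* names and sketch_read_block() are
hypothetical, the 4KB buffer alignment is the assumption discussed
later in this mail, and falling back to a flag value of 0 where
O_DIRECT is not defined only loosely imitates the PG_O_DIRECT idea.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for a portable direct I/O flag. */
#ifdef O_DIRECT
#define SKETCH_O_DIRECT O_DIRECT
#else
#define SKETCH_O_DIRECT 0       /* no O_DIRECT here: plain buffered I/O */
#endif

#define SKETCH_BLCKSZ   8192    /* PostgreSQL's default block size */
#define SKETCH_IO_ALIGN 4096    /* assumed "standard" VM page size */

/*
 * Read block number blkno of the file at path into a newly allocated,
 * page-aligned buffer.  Returns the buffer (caller frees it), or NULL
 * with errno set, e.g. EINVAL if the filesystem rejects O_DIRECT.
 */
void *
sketch_read_block(const char *path, long blkno, int use_direct)
{
    void       *buf = NULL;
    ssize_t     nread;
    int         rc;
    int         fd;

    fd = open(path, O_RDONLY | (use_direct ? SKETCH_O_DIRECT : 0));
    if (fd < 0)
        return NULL;

    /* Direct I/O generally needs a page-aligned memory buffer. */
    rc = posix_memalign(&buf, SKETCH_IO_ALIGN, SKETCH_BLCKSZ);
    if (rc != 0)
    {
        close(fd);
        errno = rc;
        return NULL;
    }
    assert(((uintptr_t) buf % SKETCH_IO_ALIGN) == 0);

    /* The file offset is a multiple of the block size by construction. */
    nread = pread(fd, buf, SKETCH_BLCKSZ, (off_t) blkno * SKETCH_BLCKSZ);
    if (nread != SKETCH_BLCKSZ)
    {
        int         save_errno = (nread < 0) ? errno : EIO;

        free(buf);
        close(fd);
        errno = save_errno;
        return NULL;
    }

    close(fd);
    return buf;
}
```

Calling sketch_read_block(path, 1, 1) would request the second 8KB
block with the direct flag set.  Each such call is one synchronous
round trip with no kernel read-ahead behind it, which is the slowness
this mail describes.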
Examples of slowness:

* every 8KB sequential read or write becomes a full round trip to the
  storage, one at a time

* data that is written to WAL and then read back in by WAL sender will
  incur a full I/O round trip (that's probably not really an AIO
  problem, that's something we should probably address by using shared
  memory instead of files, as noted as a TODO item in the source code)

Memory alignment patches:

Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work).  We need to deal with buffers on the stack, the heap and
in shmem.  For the stack, see patch 0001.  For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency.  The main direct I/O patch is 0003.

Assorted portability notes:

I expect this to "work" (that is, successfully destroy performance) on
typical developer systems running at least Linux, macOS, Windows and
FreeBSD.  By work, I mean: not be rejected by PostgreSQL, not be
rejected by the kernel, and influence kernel cache behaviour on common
filesystems.  It might be rejected with ENOTSUP, EINVAL etc on some
more exotic filesystems and OSes.  Of currently supported OSes, only
OpenBSD and Solaris don't have O_DIRECT at all, and we'll reject the
GUCs.  For macOS and Windows we internally translate our own
PG_O_DIRECT flag to the correct flags/calls (committed a while
back[3]).

On Windows, scatter/gather is available only with direct I/O, so a
true pwritev would in theory be possible, but that has some more
complications and is left for later patches (probably using native
interfaces, not disguising as POSIX).

There may be systems on which 8KB offset alignment will not work at
all or not work well, and that's expected.
For example, BTRFS, ZFS, JFS "big file", UFS etc allow
larger-than-8KB blocks/records, and an 8KB write will have to trigger
a read-before-write.  Note that offset/length alignment requirements
(blocks) are independent of buffer alignment requirements (memory
pages, 4KB).

The behaviour and cache coherency of files that have open descriptors
using both direct and non-direct flags may be complicated and vary
between systems.  The patch currently lets you change the GUCs at
runtime so backends can disagree: that should probably not be allowed,
but is like that now for experimentation.  More study is required.

If someone has a compiler that we don't know how to do
pg_attribute_aligned() for, then we can't make correctly aligned stack
buffers, so in that case direct I/O is disabled, but I don't know of
such a system (maybe aCC, but we dropped it).  That's why smgr code
can only assert that pointers are IO-aligned if PG_O_DIRECT != 0, and
why PG_O_DIRECT is forced to 0 if there is no pg_attribute_aligned()
macro, disabling the GUCs.

This seems to be an independent enough piece to get into the tree on
its own, with the proviso that it's not actually useful yet other than
for experimentation.  Thoughts?

These patches have been hacked on at various times by Andres Freund,
David Rowley and me.

[1] https://wiki.postgresql.org/wiki/AIO
[2] https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
[3] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BADiyyHe0cun2wfT%2BSVnFVqNYPxoO6J9zcZkVO7%2BNGig%40mail.gmail.com
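
To illustrate the two independent alignment requirements from the
portability notes, here is a small sketch.  It is not taken from the
patches: SKETCH_ALIGNED() is a hypothetical stand-in for
pg_attribute_aligned(), SketchAlignedBlock and the helper functions
are invented for this example, and the compiler checks cover only
GCC/Clang-style and MSVC-style attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IO_ALIGN_SIZE 4096   /* buffer alignment: 4KB memory pages */
#define SKETCH_BLCKSZ        8192   /* I/O unit: one 8KB block */

/* Hypothetical stand-in for pg_attribute_aligned(). */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_ALIGNED(a) __attribute__((aligned(a)))
#define SKETCH_HAVE_DIRECT_IO 1
#elif defined(_MSC_VER)
#define SKETCH_ALIGNED(a) __declspec(align(a))
#define SKETCH_HAVE_DIRECT_IO 1
#else
#define SKETCH_ALIGNED(a)
#define SKETCH_HAVE_DIRECT_IO 0     /* can't align stack buffers: no direct I/O */
#endif

/* A block-sized buffer that stays page-aligned even on the stack. */
typedef struct SketchAlignedBlock
{
    SKETCH_ALIGNED(SKETCH_IO_ALIGN_SIZE) char data[SKETCH_BLCKSZ];
} SketchAlignedBlock;

/* Buffer alignment: is the memory address a multiple of 4KB? */
int
sketch_buffer_is_io_aligned(const void *p)
{
    return ((uintptr_t) p % SKETCH_IO_ALIGN_SIZE) == 0;
}

/*
 * Offset/length alignment: does the I/O cover whole filesystem blocks?
 * If not, a write may need a read-before-write.
 */
int
sketch_range_is_block_aligned(uint64_t offset, uint64_t length,
                              uint64_t fs_block)
{
    return fs_block != 0 &&
        offset % fs_block == 0 &&
        length % fs_block == 0;
}

/* Only assert pointer alignment when direct I/O can be enabled at all. */
void
sketch_check_io_buffer(const void *p)
{
#if SKETCH_HAVE_DIRECT_IO
    assert(sketch_buffer_is_io_aligned(p));
#else
    (void) p;
#endif
}
```

With a SketchAlignedBlock declared as a local variable, its data
member passes the buffer check, while an 8KB write at offset 8192
passes the range check for a 4KB filesystem block but not for a 16KB
record, which is the read-before-write case mentioned above.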