Hi,
Continuing a topic from earlier threads[1][2], I've been wondering
about how to de-klugify wal_sync_method=fsync_writethrough (a setting
that actually affects much more than just WAL), and how to do the
right thing for our users on macOS and Windows by default. Commit
d0c28601 was a very small cleanup in this area. Here are some bigger
ideas I'd like to try out.
Short version:
* Make wal_sync_method=fdatasync the default everywhere
* Drop wal_sync_method=fsync_writethrough
* Add new macOS-only level for the fsync GUC: fsync=full
* Make fsync=full redirect both pg_fsync() and pg_fdatasync()
* Make fsync=full the default on macOS
Motivation:
I think expectations might have changed quite a bit since ~2000. Back
then, fsync() didn't flush write caches on any OS (you were supposed
to use battery-backed controllers and SCSI as found on expensive
proprietary Unix systems if you were serious, IDE/ATA protocols didn't
originally have flush commands, and some consumer drives famously
ignored them or lied, so users of cheap drives were advised to turn
write caches off). Around 2005, Linux decided to start sending the
flush command in fsync(). Windows' FlushFileBuffers() does the same,
and I gathered from Raymond Chen's blog that by the Windows 8
timeframe all consumer drive vendors supported and respected the flush
command. macOS *still* doesn't send it for fsync(), but has had
fcntl(F_FULLFSYNC) since 2003. In Apple's defence, they seem to have
been ahead of the curve on this problem[3]... I suppose they didn't
anticipate that everyone else was going to do it in their main
fsync()/fdatasync() call, they blazed their own trail, and now it all
looks a bit weird.
In other words, back then all systems running PostgreSQL risked data
loss unless you had fancy hardware or turned off unsafe caching. But
now, due to the changing landscape and our policy choices, that is
true only for rarer systems by default while most in our community are
on Linux where this is all just a historical footnote. People's
baseline expectations have moved, and although we try to document the
situation, they are occasionally very surprised: "Loaded footgun
open_datasync on Windows" was Laurenz Albe's reaction[4] to those
paragraphs. Surely we should be able to recover after power loss by
default even on a lowly desktop PC or basic server loaded with SATA
drives, out of the box?
Proposal for Windows:
The existing default use of FILE_FLAG_WRITE_THROUGH is probably a
better choice on hardware where it works reliably (cache disabled,
non-volatile cache, or working FUA support), since it skips a system
call and doesn't wait for incidental other stuff in the cache to
flush, but it's well documented that Windows' SATA drivers neither
pass the "FUA" flag down to the device nor fall back to sending a full
cache flush command. It's also easy to see in the pg_test_fsync
numbers, which are too good to be true on consumer gear. Therefore
wal_sync_method=fdatasync is a better default level. We map that to
NtFlushBuffersFileEx(FLUSH_FLAGS_FILE_DATA_SYNC_ONLY). (The "SYNC" in
that flag name means flush the drive cache; the "DATA...ONLY" in that
flag name means skip non-essential stuff like file modification time
etc just like fdatasync() in POSIX, and goes visibly faster thanks to
not journaling metadata.)
Proposal for macOS:
Our current default isn't nice to users who run a database on
mains-powered Macs. I don't have one myself to try it, but "man
fsync" clearly states that you can lose data and it is easily
demonstrated with a traditional cord-yanking test[5]. You could
certainly lose some recent commits; you could probably also get more
subtle corruption or a total recovery failure like [6] too, if for
example the control file can make it to durable storage and while
pointing to a checkpoint that did not (maybe a ZFS-like atomic
root-switch prevents that sort of disorder in APFS, I dunno, but I
read some semi-informed speculation that it doesn't work that way
*shrug*).
We do currently offer a non-default setting
wal_sync_method=fsync_writethough to address all this already.
Despite its name, it affects every caller of pg_fsync() (control file,
data files, etc). It's certainly essential to flush all those files
fully too as part of our recovery protocol, but they're not "WAL".
The new idea here is to provide a separate way of controlling that
global behaviour, and I propose fsync=full. Furthermore, I think that
setting should also affect pg_fdatasync(), given that Apple doesn't
even really have fdatasync() (perhaps if they carry out their threat
to implement it, they'll also invent F_FULLFDATASYNC; for now it
*seems* to be basically just another name for fsync() albeit
undeclared by <unistd.h>).
It's possible that fcntl(F_FULLFSYNC) might fail with ENOSUPP or other
errors in obscure cases (eg unusual file systems). In that case, you
could manually lower fsync to just "on" and do your own research on
whether power loss can toast your database, but that doesn't seem like
a reason for us not to ship good solid defaults for typical users.
Rationale for changing wal_sync_method globally (for now):
With wal_sync_method=fdatasync as default for Linux, FreeBSD, OpenBSD,
DragonflyBSD already, if we added macOS and Windows, that'd leave only
NetBSD, AIX, Solaris/illumos. I don't like having different and more
magical defaults on rare target OSes with no expert users left in our
community (as [6] reminded me), so I figure we'd be better off with
the same less magical setting everywhere, as a baseline.
Later we might want a per-platform default again. For example, Linux
(like Windows) has policies on whether to believe FUA works reliably
for the purposes of O_DSYNC, but (unlike Windows) falls back to
sending cache flushes instead of doing nothing, so in theory
open_datasync might be a safe and sometimes better performing default
there. If we decided to do that, we'd just restore the
PLATFORM_DEFAULT_SYNC_METHOD mechanism.
The only other OS where I have detailed enough knowledge to comment is
FreeBSD. Its ZFS flushes caches for all levels just fine, so it
doesn't much matter, while its UFS never got that memo (so it's like a
Mac and probably other old Unixes; maybe I'll get that fixed, see
FreeBSD proposal D36371 if interested). The reasons for using
fdatasync on both FreeBSD and Linux wasn't cache control policies, but
rather some obscure logic of ours that would turn on O_DIRECT in some
cases (and I think in the past when wal_level was lower by default, it
would have been common), which might have complications or fail. The
last trace of that is gone since d4e71df6, so if we were to put Linux
on a 'known-good-for-open_datasync' list I'd probably also consider
putting FreeBSD on the list too.
Note that while this'll slow down some real world databases by being
more careful, 'meson test' time shouldn't be affected on any OS due to
use of fsync=off in tests.
Draft patches attached.
[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[3] https://lists.apple.com/archives/darwin-dev/2005/Feb/msg00087.html
[4] https://www.postgresql.org/message-id/flat/1527846213.2475.31.camel%40cybertec.at
[5] https://news.ycombinator.com/item?id=30372194
[6] https://www.postgresql.org/message-id/flat/18009-40a42f84af3fbda1%40postgresql.org