Thread: Volatile write caches on macOS and Windows, redux

Volatile write caches on macOS and Windows, redux

From
Thomas Munro
Date:
Hi,

Continuing a topic from earlier threads[1][2], I've been wondering
about how to de-klugify wal_sync_method=fsync_writethrough (a setting
that actually affects much more than just WAL), and how to do the
right thing for our users on macOS and Windows by default.  Commit
d0c28601 was a very small cleanup in this area.  Here are some bigger
ideas I'd like to try out.

Short version:

 * Make wal_sync_method=fdatasync the default everywhere
 * Drop wal_sync_method=fsync_writethrough
 * Add new macOS-only level for the fsync GUC: fsync=full
 * Make fsync=full redirect both pg_fsync() and pg_fdatasync()
 * Make fsync=full the default on macOS

Motivation:

I think expectations might have changed quite a bit since ~2000.  Back
then, fsync() didn't flush write caches on any OS (you were supposed
to use battery-backed controllers and SCSI as found on expensive
proprietary Unix systems if you were serious, IDE/ATA protocols didn't
originally have flush commands, and some consumer drives famously
ignored them or lied, so users of cheap drives were advised to turn
write caches off).  Around 2005, Linux decided to start sending the
flush command in fsync().  Windows' FlushFileBuffers() does the same,
and I gathered from Raymond Chen's blog that by the Windows 8
timeframe all consumer drive vendors supported and respected the flush
command.  macOS *still* doesn't send it for fsync(), but has had
fcntl(F_FULLFSYNC) since 2003.  In Apple's defence, they seem to have
been ahead of the curve on this problem[3]... I suppose they didn't
anticipate that everyone else was going to do it in their main
fsync()/fdatasync() call, they blazed their own trail, and now it all
looks a bit weird.

In other words, back then all systems running PostgreSQL risked data
loss unless you had fancy hardware or turned off unsafe caching.  But
now, due to the changing landscape and our policy choices, that is
true only for rarer systems by default while most in our community are
on Linux where this is all just a historical footnote.  People's
baseline expectations have moved, and although we try to document the
situation, they are occasionally very surprised: "Loaded footgun
open_datasync on Windows" was Laurenz Albe's reaction[4] to those
paragraphs.  Surely we should be able to recover after power loss by
default even on a lowly desktop PC or basic server loaded with SATA
drives, out of the box?

Proposal for Windows:

The existing default use of FILE_FLAG_WRITE_THROUGH is probably a
better choice on hardware where it works reliably (cache disabled,
non-volatile cache, or working FUA support), since it skips a system
call and doesn't wait for incidental other stuff in the cache to
flush, but it's well documented that Windows' SATA drivers neither
pass the "FUA" flag down to the device nor fall back to sending a full
cache flush command.  It's also easy to see in the pg_test_fsync
numbers, which are too good to be true on consumer gear.  Therefore
wal_sync_method=fdatasync is a better default level.  We map that to
NtFlushBuffersFileEx(FLUSH_FLAGS_FILE_DATA_SYNC_ONLY).  (The "SYNC" in
that flag name means flush the drive cache; the "DATA...ONLY" in that
flag name means skip non-essential stuff like file modification time
etc just like fdatasync() in POSIX, and goes visibly faster thanks to
not journaling metadata.)

Proposal for macOS:

Our current default isn't nice to users who run a database on
mains-powered Macs.  I don't have one myself to try it, but "man
fsync" clearly states that you can lose data and it is easily
demonstrated with a traditional cord-yanking test[5].  You could
certainly lose some recent commits; you could probably also get more
subtle corruption or a total recovery failure like [6] too, if for
example the control file can make it to durable storage and while
pointing to a checkpoint that did not (maybe a ZFS-like atomic
root-switch prevents that sort of disorder in APFS, I dunno, but I
read some semi-informed speculation that it doesn't work that way
*shrug*).

We do currently offer a non-default setting
wal_sync_method=fsync_writethough to address all this already.
Despite its name, it affects every caller of pg_fsync() (control file,
data files, etc).  It's certainly essential to flush all those files
fully too as part of our recovery protocol, but they're not "WAL".
The new idea here is to provide a separate way of controlling that
global behaviour, and I propose fsync=full.  Furthermore, I think that
setting should also affect pg_fdatasync(), given that Apple doesn't
even really have fdatasync() (perhaps if they carry out their threat
to implement it, they'll also invent F_FULLFDATASYNC; for now it
*seems* to be basically just another name for fsync() albeit
undeclared by <unistd.h>).

It's possible that fcntl(F_FULLFSYNC) might fail with ENOSUPP or other
errors in obscure cases (eg unusual file systems).  In that case, you
could manually lower fsync to just "on" and do your own research on
whether power loss can toast your database, but that doesn't seem like
a reason for us not to ship good solid defaults for typical users.

Rationale for changing wal_sync_method globally (for now):

With wal_sync_method=fdatasync as default for Linux, FreeBSD, OpenBSD,
DragonflyBSD already, if we added macOS and Windows, that'd leave only
NetBSD, AIX, Solaris/illumos.  I don't like having different and more
magical defaults on rare target OSes with no expert users left in our
community (as [6] reminded me), so I figure we'd be better off with
the same less magical setting everywhere, as a baseline.

Later we might want a per-platform default again.  For example, Linux
(like Windows) has policies on whether to believe FUA works reliably
for the purposes of O_DSYNC, but (unlike Windows) falls back to
sending cache flushes instead of doing nothing, so in theory
open_datasync might be a safe and sometimes better performing default
there.  If we decided to do that, we'd just restore the
PLATFORM_DEFAULT_SYNC_METHOD mechanism.

The only other OS where I have detailed enough knowledge to comment is
FreeBSD.  Its ZFS flushes caches for all levels just fine, so it
doesn't much matter, while its UFS never got that memo (so it's like a
Mac and probably other old Unixes; maybe I'll get that fixed, see
FreeBSD proposal D36371 if interested).  The reasons for using
fdatasync on both FreeBSD and Linux wasn't cache control policies, but
rather some obscure logic of ours that would turn on O_DIRECT in some
cases (and I think in the past when wal_level was lower by default, it
would have been common), which might have complications or fail.  The
last trace of that is gone since d4e71df6, so if we were to put Linux
on a 'known-good-for-open_datasync' list I'd probably also consider
putting FreeBSD on the list too.

Note that while this'll slow down some real world databases by being
more careful, 'meson test' time shouldn't be affected on any OS due to
use of fsync=off in tests.

Draft patches attached.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[3] https://lists.apple.com/archives/darwin-dev/2005/Feb/msg00087.html
[4] https://www.postgresql.org/message-id/flat/1527846213.2475.31.camel%40cybertec.at
[5] https://news.ycombinator.com/item?id=30372194
[6] https://www.postgresql.org/message-id/flat/18009-40a42f84af3fbda1%40postgresql.org

Attachment

Re: Volatile write caches on macOS and Windows, redux

From
Peter Smith
Date:
2024-01 Commitfest.

Hi, this patch was marked in CF as "Needs Review" [1], but there has
been no activity on this thread for 6+ months.

Is anything else planned, or can you post something to elicit more
interest in the patch? Otherwise, if nothing happens then the CF entry
will be closed ("Returned with feedback") at the end of this CF.

======
[1] https://commitfest.postgresql.org/46/4453/

Kind Regards,
Peter Smith.



Re: Volatile write caches on macOS and Windows, redux

From
vignesh C
Date:
On Mon, 22 Jan 2024 at 07:46, Peter Smith <smithpb2250@gmail.com> wrote:
>
> 2024-01 Commitfest.
>
> Hi, this patch was marked in CF as "Needs Review" [1], but there has
> been no activity on this thread for 6+ months.
>
> Is anything else planned, or can you post something to elicit more
> interest in the patch? Otherwise, if nothing happens then the CF entry
> will be closed ("Returned with feedback") at the end of this CF.

With no update to the thread and the patch not applying I'm marking
this as returned with feedback.  Please feel free to resubmit to the
next CF when there is a new version of the patch.

Regards,
Vignesh



Re: Volatile write caches on macOS and Windows, redux

From
Thomas Munro
Date:
Short sales pitch for these patches:

* the default settings eat data on Macs and Windows
* nobody understands what wal_sync_method=fsync_writethrough means anyway
* it's a weird kludge that it affects not only WAL, let's clean that up



Re: Volatile write caches on macOS and Windows, redux

From
Nathan Bossart
Date:
On Thu, Mar 14, 2024 at 01:12:05PM +1300, Thomas Munro wrote:
> Short sales pitch for these patches:
> 
> * the default settings eat data on Macs and Windows
> * nobody understands what wal_sync_method=fsync_writethrough means anyway
> * it's a weird kludge that it affects not only WAL, let's clean that up

I recently started using macOS for hacking on Postgres and noticed this
problem, so I was delighted to find this thread.  I intend to review
further soon, but +1 for improving the default settings.  I think we might
also need some additional fcntl(F_FULLFSYNC) calls in sync_pgdata(),
sync_dir_recurse(), etc., which are used by initdb, pg_basebackup, and
more.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Volatile write caches on macOS and Windows, redux

From
Jelte Fennema-Nio
Date:
On Tue, 18 Jul 2023 at 05:29, Thomas Munro <thomas.munro@gmail.com> wrote:
> It's possible that fcntl(F_FULLFSYNC) might fail with ENOSUPP or other
> errors in obscure cases (eg unusual file systems).  In that case, you
> could manually lower fsync to just "on" and do your own research on
> whether power loss can toast your database, but that doesn't seem like
> a reason for us not to ship good solid defaults for typical users.

Is this the only reason why you're suggesting adding fsync=full,
instead of simply always setting F_FULLFSYNC when fsync=true on MacOS.
If so, I'm not sure we really gain anything by this tri-state. I think
people either care about data loss on power loss, or they don't. I
doubt many people want his third intermediate option, which afaict
basically means lose data on powerloss less often than fsync=false but
still lose data most of the time.

If you're going to keep this tri-state for MacOS, then it still seems
nicer to me to "fix" fsync=true on MacOS and introduce a fsync=partial
or something. Then defaults are the same across platforms and anyone
setting fsync=yes currently in their postgresql.conf would get the
fixed behaviour on upgrade.



Re: Volatile write caches on macOS and Windows, redux

From
Peter Eisentraut
Date:
On 25.05.24 04:01, Jelte Fennema-Nio wrote:
> Is this the only reason why you're suggesting adding fsync=full,
> instead of simply always setting F_FULLFSYNC when fsync=true on MacOS.
> If so, I'm not sure we really gain anything by this tri-state. I think
> people either care about data loss on power loss, or they don't. I
> doubt many people want his third intermediate option, which afaict
> basically means lose data on powerloss less often than fsync=false but
> still lose data most of the time.

I agree, two states should be enough.  It could basically just be

pg_fsync(int fd)
{
#if macos
     fcntl(fd, F_FULLFSYNC);
#else
     fsync(fd);
#endif
}




Re: Volatile write caches on macOS and Windows, redux

From
Nathan Bossart
Date:
On Wed, May 29, 2024 at 06:49:57AM -0700, Peter Eisentraut wrote:
> On 25.05.24 04:01, Jelte Fennema-Nio wrote:
>> Is this the only reason why you're suggesting adding fsync=full,
>> instead of simply always setting F_FULLFSYNC when fsync=true on MacOS.
>> If so, I'm not sure we really gain anything by this tri-state. I think
>> people either care about data loss on power loss, or they don't. I
>> doubt many people want his third intermediate option, which afaict
>> basically means lose data on powerloss less often than fsync=false but
>> still lose data most of the time.
> 
> I agree, two states should be enough.  It could basically just be
> 
> pg_fsync(int fd)
> {
> #if macos
>     fcntl(fd, F_FULLFSYNC);
> #else
>     fsync(fd);
> #endif
> }

IIUC with this approach, anyone who is using a file system that fails
fcntl(F_FULLSYNC) with ENOSUPP would have to turn fsync off.  That might be
the right thing to do since having a third option that sends the data to
the disk cache but doesn't provide any real guarantees if you lose power
may not be worth much.  However, if such a file system _did_ provide such
guarantees with just fsync(), then it would be unfortunate to force people
to turn fsync off.  But this could very well all be hypothetical, for all I
know...  In any case, I agree that we should probably use F_FULLFSYNC by
default on macOS.

-- 
nathan



Re: Volatile write caches on macOS and Windows, redux

From
Peter Eisentraut
Date:
On 03.06.24 17:28, Nathan Bossart wrote:
>> I agree, two states should be enough.  It could basically just be
>>
>> pg_fsync(int fd)
>> {
>> #if macos
>>      fcntl(fd, F_FULLFSYNC);
>> #else
>>      fsync(fd);
>> #endif
>> }
> IIUC with this approach, anyone who is using a file system that fails
> fcntl(F_FULLSYNC) with ENOSUPP would have to turn fsync off.  That might be
> the right thing to do since having a third option that sends the data to
> the disk cache but doesn't provide any real guarantees if you lose power
> may not be worth much.  However, if such a file system_did_  provide such
> guarantees with just fsync(), then it would be unfortunate to force people
> to turn fsync off.  But this could very well all be hypothetical, for all I
> know...  In any case, I agree that we should probably use F_FULLFSYNC by
> default on macOS.

Yeah, my example code above says "#if macos", not "#ifdef F_FULLSYNC". 
The latter might be a problem along the lines you describe if other 
systems use that symbol in a slightly different manner.