Re: Windows now has fdatasync() - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Windows now has fdatasync()
Date
Msg-id CA+hUKG+a-7r4GpADsasCnuDBiqC1c31DAQQco2FayVtB9V3sQw@mail.gmail.com
In response to Re: Windows now has fdatasync()  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
David kindly ran some tests of this thing on real hardware.  The
results were mostly in line with expectations, but we learned some new
things.

TL;DR: We should probably consider wal_sync_method=fdatasync as a safer
default on Windows, but it'd be good for someone more hands-on with this
OS and knowledgeable about storage to investigate and propose that.  My
original goal here was
primarily Unix/Windows harmonisation and cleanup since I'm doing a
bunch of hacking on I/O, but I can't unsee an
unsafe-at-least-on-consumer-gear default now that I've seen it.  The
main thing I'm aware of that we don't know yet is what happens if you
try it on a non-NTFS file system (ReFS? SMB?) -- hopefully it falls
back to fsync behaviour.

Observations from an old Windows 8.1 system with a SATA drive:

1.  So far you can apparently still actually compile and run on 8.1,
despite recent commits to de-support it.
2.  You can use the new wal_sync_method=fdatasync, without error, and
timings are consistent with falling back to full fsync behaviour.
That makes sense, I guess, because the function existed.  It's just a
new flag bit, and the default behaviour for flags == 0 was already
their fsync.  That seems like a good outcome even though 8.1 isn't a
target anymore.
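
For readers who want to see what "just a new flag bit" looks like, here
is a minimal sketch, not the code from the patch.  It assumes the flush
call in question is NtFlushBuffersFileEx() from ntdll.dll and that the
data-only bit is FLUSH_FLAGS_FILE_DATA_SYNC_ONLY (0x4); the helper name
sketch_fdatasync() is hypothetical, and the error mapping is
deliberately crude.  On non-Windows systems it simply calls the POSIX
fdatasync() so the sketch stays compilable there.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 200809L     /* for fdatasync() and fileno() */
#endif

#include <errno.h>
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#include <winternl.h>               /* NTSTATUS, IO_STATUS_BLOCK */
#include <io.h>                     /* _get_osfhandle() */

/* Assumed value of the data-only flag bit; flags == 0 means a full flush. */
#ifndef FLUSH_FLAGS_FILE_DATA_SYNC_ONLY
#define FLUSH_FLAGS_FILE_DATA_SYNC_ONLY 0x00000004
#endif

typedef NTSTATUS (NTAPI *flush_ex_fn) (HANDLE, ULONG, PVOID, ULONG,
                                       PIO_STATUS_BLOCK);

/* Hypothetical helper: flush file data only.  Returns 0 or -1 (errno set). */
int
sketch_fdatasync(int fd)
{
    static flush_ex_fn flush_ex = NULL;
    IO_STATUS_BLOCK iosb;
    NTSTATUS    status;
    HANDLE      h = (HANDLE) _get_osfhandle(fd);

    if (h == INVALID_HANDLE_VALUE)
    {
        errno = EBADF;
        return -1;
    }

    /* The function is not in the import library, so look it up once. */
    if (flush_ex == NULL)
    {
        HMODULE     ntdll = GetModuleHandleA("ntdll.dll");

        if (ntdll != NULL)
            flush_ex = (flush_ex_fn) GetProcAddress(ntdll,
                                                    "NtFlushBuffersFileEx");
        if (flush_ex == NULL)
        {
            errno = ENOSYS;
            return -1;
        }
    }

    memset(&iosb, 0, sizeof(iosb));
    status = flush_ex(h, FLUSH_FLAGS_FILE_DATA_SYNC_ONLY, NULL, 0, &iosb);
    if (status < 0)                 /* negative NTSTATUS means failure */
    {
        errno = EIO;                /* crude mapping, for illustration only */
        return -1;
    }
    return 0;
}
#else
#include <unistd.h>

/* Elsewhere, defer to the native call. */
int
sketch_fdatasync(int fd)
{
    return fdatasync(fd);
}
#endif
```

A caller would write to a file and then call
sketch_fdatasync(fileno(f)).  On a system that doesn't know the flag
bit, the timings above suggest the call behaves like a full flush
rather than failing, but this sketch doesn't attempt to detect that.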

Observations from a current Windows 11 system with an NVMe drive:

1.  fdatasync is faster than fsync, as expected.  Twice as fast with
write cache disabled, a bit faster with write cache enabled.
2.  Timings seem to suggest that open_datasync (the current default)
is not really writing through the drive cache.  I'd previously thought
that was a SATA-only problem based on [1], which said that EIDE/SATA
drivers did not pass through the FUA flag that NTFS sends for
FILE_FLAG_WRITE_THROUGH (= O_DSYNC) on the basis that many drives
ignored it anyway, but these numbers seem to suggest that David's
recent-ish NVMe system has the same problem as the old SATA system.

Generally, Windows' approach seems to be that NTFS
FILE_FLAG_WRITE_THROUGH fires an FUA flag into the storage stack, and
either the driver or the drive is free to fling it out the window, and
it's the user's problem to worry about that, whereas Linux at least
asks nicely if the drive understands FUA and falls back to flushing
the whole cache if not[2].  I also know that Linux has been flaky
around this in the past too, especially on consumer storage, and macOS
and at least some of the older BSD/UFS systems just don't do this
stuff at all for user data (yet) so it's not like there is anything
universal about this topic.  Note that drive caches are enabled by
default in Windows, and our manual does already tell you about this
problem[3].

One thing to note about the numbers below: pg_test_fsync's
open_datasync test also uses FILE_FLAG_NO_BUFFERING (= O_DIRECT),
unlike PostgreSQL itself, which muddies the waters slightly.  (There
was a patch upthread to fix that and report both numbers; I may come
back to that.)
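
To make that difference concrete, here is a small sketch of the two
flag combinations as I understand them from the paragraph above.  The
function name is hypothetical, and the constants are defined locally
(with the values documented for CreateFile()) only so that it compiles
without <windows.h>; it is not the actual code from either program.

```c
#include <stdbool.h>

/* Documented CreateFile() flag values, repeated here for portability. */
#ifndef FILE_FLAG_WRITE_THROUGH
#define FILE_FLAG_WRITE_THROUGH 0x80000000UL    /* = O_DSYNC */
#endif
#ifndef FILE_FLAG_NO_BUFFERING
#define FILE_FLAG_NO_BUFFERING  0x20000000UL    /* = O_DIRECT */
#endif

/*
 * Hypothetical helper: flags for an open_datasync-style file.
 * direct = false models the write-through-only case described for the
 * server; direct = true models pg_test_fsync's open_datasync test,
 * which also bypasses the OS buffer cache.
 */
unsigned long
sketch_open_datasync_flags(bool direct)
{
    unsigned long flags = FILE_FLAG_WRITE_THROUGH;

    if (direct)
        flags |= FILE_FLAG_NO_BUFFERING;
    return flags;
}
```

So the open_datasync rows below measure the direct = true combination,
which is not quite what the server would do with the same setting.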

Windows 11, NVMe, write cache enabled:

        open_datasync                     27306.286 ops/sec      37 usecs/op
        fdatasync                          3065.428 ops/sec     326 usecs/op
        fsync                              2577.498 ops/sec     388 usecs/op

Windows 11, NVMe, write cache disabled:

        open_datasync                      3477.258 ops/sec     288 usecs/op
        fdatasync                          3263.418 ops/sec     306 usecs/op
        fsync                              1641.502 ops/sec     609 usecs/op

Windows 8.1, SATA:

        open_datasync                     19934.532 ops/sec      50 usecs/op
        fdatasync                           231.429 ops/sec    4321 usecs/op
        fsync                               240.050 ops/sec    4166 usecs/op

(We couldn't figure out how to disable the write cache on the 8.1
machine -- the usual checkbox had no effect -- but we didn't waste
time investigating that old system beyond the curiosity of checking if
it'd work at all.)
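
For anyone who wants to redo the arithmetic behind the comparisons
above, here is a trivial sketch.  The helper names are made up for
illustration; the only inputs are the ops/sec figures from the tables.

```c
#include <assert.h>

/* Convert a pg_test_fsync-style ops/sec figure to microseconds per op. */
double
usecs_per_op(double ops_per_sec)
{
    assert(ops_per_sec > 0.0);
    return 1000000.0 / ops_per_sec;
}

/* How many times faster method A is than method B, given ops/sec. */
double
speedup(double a_ops_per_sec, double b_ops_per_sec)
{
    assert(b_ops_per_sec > 0.0);
    return a_ops_per_sec / b_ops_per_sec;
}
```

With the Windows 11 numbers, speedup(3263.418, 1641.502) is about 1.99
(fdatasync vs fsync, write cache disabled), speedup(3065.428, 2577.498)
is about 1.19 (same pair, write cache enabled), and
speedup(27306.286, 3477.258) is about 7.9 (open_datasync with the write
cache enabled vs disabled).  That last ratio is what suggests
open_datasync isn't really writing through the drive cache.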

[1] https://devblogs.microsoft.com/oldnewthing/20170510-00/?p=95505
[2] https://techcommunity.microsoft.com/t5/sql-server-blog/sql-server-on-linux-forced-unit-access-fua-internals/ba-p/3199102
[3] https://www.postgresql.org/docs/devel/wal-reliability.html


