Thread: wal_sync_method=fsync_writethrough

From: Thomas Munro
Hi,

We allow $SUBJECT on Windows.  I'm not sure exactly how we finished up
with that, maybe a historical mistake, but I find it misleading today.
Modern Windows flushes drive write caches for fsync (= _commit()) and
fdatasync (= FLUSH_FLAGS_FILE_DATA_SYNC_ONLY).  In fact it is possible
to tell Windows to write out file data without flushing the drive
cache (= FLUSH_FLAGS_NO_SYNC), but I don't believe anyone is
interested in new weaker levels.  Any reason not to just get rid of
it?

On macOS, our fsync and fdatasync levels *don't* flush drive caches,
because those system calls don't on that OS, and they offer a weird
special fcntl, so there we offer $SUBJECT for a good reason.  Now that
macOS 10.2 systems are thoroughly extinct, I think we might as well
drop the configure probe, though, while we're doing a lot of that sort
of thing.
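
To make the macOS case concrete, here is a minimal sketch in C of what a
write-through flush amounts to there: use fcntl(F_FULLFSYNC) where the
platform defines it, and otherwise fall back to plain fsync().  The
helper names flush_writethrough() and write_durably() are hypothetical,
and this is not PostgreSQL's actual code; it only illustrates the
distinction described above.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush fd as far down the storage stack as the
 * platform lets us ask for.  Returns 0 on success, -1 on failure.
 */
static int
flush_writethrough(int fd)
{
#ifdef F_FULLFSYNC
    /* macOS: also asks the drive to flush its write cache. */
    if (fcntl(fd, F_FULLFSYNC, 0) == 0)
        return 0;
    /* If F_FULLFSYNC fails (e.g. unsupported), fall back to fsync(). */
#endif
    return fsync(fd);
}

/*
 * Hypothetical helper: write a buffer to path and flush it before
 * reporting success.  Returns 0 on success, -1 on failure.
 */
static int
write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int         fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    while (len > 0)
    {
        ssize_t n = write(fd, p, len);

        if (n < 0)
        {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    if (flush_writethrough(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller would use it as write_durably("some.file", data, size) and
treat a nonzero result as "not known to be on stable storage".  On
systems without F_FULLFSYNC the sketch degrades to ordinary fsync(),
whose effect on drive caches is up to the OS.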

The documentation also says a couple of things that aren't quite
correct about wal_sync_method.  (I would also like to revise other
nearby outdated paragraphs about volatile write caches, sector sizes
etc, but that'll take some more research.)

Attachment

Re: wal_sync_method=fsync_writethrough

From: Magnus Hagander
On Fri, Aug 26, 2022 at 6:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Hi,
>
> We allow $SUBJECT on Windows.  I'm not sure exactly how we finished up
> with that, maybe a historical mistake, but I find it misleading today.
> Modern Windows flushes drive write caches for fsync (= _commit()) and
> fdatasync (= FLUSH_FLAGS_FILE_DATA_SYNC_ONLY).  In fact it is possible
> to tell Windows to write out file data without flushing the drive
> cache (= FLUSH_FLAGS_NO_SYNC), but I don't believe anyone is
> interested in new weaker levels.  Any reason not to just get rid of
> it?

So, I don't know how it works now, but the history at least was this:
it was not about the disk caches, it was about raid controller caches.

Basically, we determined that windows didn't fsync it all the way. But
it would with  But if we changed wal_sync_method=fsync to actually
*do* that, then people who had paid big money for raid controllers
with flash or battery backed cache would lose a ton of performance. So
we needed one level that would sync out of the OS but not through the
RAID cache, and another one that would sync it out of the RAID cache
as well. Which would/could be different from the drive caches
themselves, and they often behaved differently. And I think it may
have even been dependent on the individual RAID drivers what the
default would  be.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: wal_sync_method=fsync_writethrough

From: Thomas Munro
On Sat, Aug 27, 2022 at 12:17 AM Magnus Hagander <magnus@hagander.net> wrote:
> So, I don't know how it works now, but the history at least was this:
> it was not about the disk caches, it was about raid controller caches.
> Basically, we determined that windows didn't fsync it all the way. But
> it would with  But if we changed wal_sync_method=fsync to actually
> *do* that, then people who had paid big money for raid controllers
> with flash or battery backed cache would lose a ton of performance. So
> we needed one level that would sync out of the OS but not through the
> RAID cache, and another one that would sync it out of the RAID cache
> as well. Which would/could be different from the drive caches
> themselves, and they often behaved differently. And I think it may
> have even been dependent on the individual RAID drivers what the
> default would  be.

Thanks for the background.  Yeah, that makes sense to motivate
open_datasync for Windows.  Not sure what you meant about fsync or
meant to write after "would with".

It seems like the 2005 discussions were primarily about open_datasync
but also had the by-product of introducing the name
fsync_writethrough.  If I'm reading between the lines[1] correctly,
perhaps the logic went like this:

1.  We noticed that _commit() AKA FlushFileBuffers() issued
SYNCHRONIZE CACHE (or equivalent) on Windows.

2.  At that time in history, Linux (and other Unixes) probably did not
issue SYNCHRONIZE CACHE when you called fsync()/fdatasync().

3.  We concluded therefore that Windows was strange and we needed to
use a different level name for the setting to reflect this extra
effect.

Now it looks strange: we have both "fsync" and "fsync_writethrough"
doing exactly the same thing while vaguely implying otherwise, and the
contrast with other operating systems (if I divined that aspect
correctly) mostly doesn't apply.  How flush commands affect various
caches in modern storage stacks is also not really OS-specific AFAIK.

(Obviously macOS is a different story...)

[1] https://www.postgresql.org/message-id/flat/26109.1111084860%40sss.pgh.pa.us#e7f8c2e14d76cad76b1857e89c8a6314



Re: wal_sync_method=fsync_writethrough

From: Magnus Hagander
On Fri, Aug 26, 2022 at 11:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Aug 27, 2022 at 12:17 AM Magnus Hagander <magnus@hagander.net> wrote:
> > So, I don't know how it works now, but the history at least was this:
> > it was not about the disk caches, it was about raid controller caches.
> > Basically, we determined that windows didn't fsync it all the way. But
> > it would with  But if we changed wal_sync_method=fsync to actually
> > *do* that, then people who had paid big money for raid controllers
> > with flash or battery backed cache would lose a ton of performance. So
> > we needed one level that would sync out of the OS but not through the
> > RAID cache, and another one that would sync it out of the RAID cache
> > as well. Which would/could be different from the drive caches
> > themselves, and they often behaved differently. And I think it may
> > have even been dependent on the individual RAID drivers what the
> > default would  be.
>
> Thanks for the background.  Yeah, that makes sense to motivate
> open_datasync for Windows.  Not sure what you meant about fsync or
> meant to write after "would with".

That's a good question indeed :) I think I meant it would with
FILE_FLAG_WRITE_THROUGH.


> It seems like the 2005 discussions were primarily about open_datasync
> but also had the by-product of introducing the name
> fsync_writethrough.  If I'm reading between the lines[1] correctly,
> perhaps the logic went like this:
>
> 1.  We noticed that _commit() AKA FlushFileBuffers() issued
> SYNCHRONIZE CACHE (or equivalent) on Windows.
>
> 2.  At that time in history, Linux (and other Unixes) probably did not
> issue SYNCHRONIZE CACHE when you called fsync()/fdatasync().

I think it may have been driver dependent there (as well), at the time.


> 3.  We concluded therefore that Windows was strange and we needed to
> use a different level name for the setting to reflect this extra
> effect.

It was certainly strange to us :)


> Now it looks strange: we have both "fsync" and "fsync_writethrough"
> doing exactly the same thing while vaguely implying otherwise, and the
> contrast with other operating systems (if I divined that aspect
> correctly) mostly doesn't apply.  How flush commands affect various
> caches in modern storage stacks is also not really OS-specific AFAIK.
>
> (Obviously macOS is a different story...)

Given that it does vary (because macOS is actually an OS :D), we might
need to start from a matrix of exactly what happens in different
states, and then try to map that to a set? I fully agree that if
things actually behave the same, they should be called the same.

And it may also be that there is no longer a difference between
direct-drive and RAID-with-battery-or-flash, which used to be the huge
difference back then, where you had to tune for it. For many cases
that has been negated by just not using that (and using NVMe and
possibly software raid instead), but there are certainly still people
using such systems...

//Magnus



Re: wal_sync_method=fsync_writethrough

From: Thomas Munro
On Tue, Aug 30, 2022 at 3:44 AM Magnus Hagander <magnus@hagander.net> wrote:
> On Fri, Aug 26, 2022 at 11:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Now it looks strange: we have both "fsync" and "fsync_writethrough"
> > doing exactly the same thing while vaguely implying otherwise, and the
> > contrast with other operating systems (if I divined that aspect
> > correctly) mostly doesn't apply.  How flush commands affect various
> > caches in modern storage stacks is also not really OS-specific AFAIK.
> >
> > (Obviously macOS is a different story...)
>
> Given that it does vary (because macOS is actually an OS :D), we might
> need to start from a matrix of exactly what happens in different
> states, and then try to map that to a set? I fully agree that if
> things actually behave the same, they should be called the same.

Thanks, I'll take that as a +1 for dropping the redundant level for
Windows.  (Of course it stays for macOS).

I like that our current levels are the literal names of standard
interfaces we call, since the rest is out of our hands.  I'm not sure
what you could actually *do* with the information that some OS doesn't
flush write caches, other than document it and suggest a remedy (e.g.
turn it off).  I would even prefer it if fsync_writethrough were
called F_FULLFSYNC, following that just-say-what-it-does-directly
philosophy, but that horse is already over the horizon.

> And it may also be that there is no longer a difference between
> direct-drive and RAID-with-battery-or-flash, which used to be the huge
> difference back then, where you had to tune for it. For many cases
> that has been negated by just not using that (and using NVMe and
> possibly software raid instead), but there are certainly still people
> using such systems...

I believe modern systems are a lot better at negotiating the need for
flushes (i.e. for *volatile* caches).  In contrast, the FUA situation
(as used for FILE_FLAG_WRITE_THROUGH) seems like a multi-level
dumpster fire on anything but high-end gear, from what I've been able
to figure out so far, though I'm no expert.