Re: O_DIRECT for WAL writes - Mailing list pgsql-patches

From Ron Mayer
Subject Re: O_DIRECT for WAL writes
Msg-id 429AC920.6080809@cheapcomplexdevices.com
In response to Re: O_DIRECT for WAL writes  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
>>is opening a file with O_DIRECT sufficient to ensure that
>>a write(2) does not return until the data has hit disk?
>
> Some googling suggests so, eg
> http://www.die.net/doc/linux/man/man2/open.2.html

Really?  On that page I read:
  "O_DIRECT...at the completion of the read(2) or write(2)
   system call, data is guaranteed to have been transferred."
which sounds to me like transferred to the device's cache,
but not necessarily flushed through that cache to the media.
It says nothing about physical media.  That wording feels
different to me from O_SYNC which reads:
  "O_SYNC will block the calling process until the data has
   been physically written to the underlying hardware."
which does suggest to me that it writes to physical media.
Or am I reading that wrong?
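
To make the distinction concrete, here is a minimal sketch of a write
that asks for both behaviours at once: O_DIRECT to bypass the page
cache and O_SYNC to block until the kernel reports the data written.
The helper name write_block_direct_sync and the 4096-byte alignment
are my own assumptions for illustration, not anything from the patch
under discussion, and whether the data actually reaches the platters
still depends on the kernel, the filesystem and the drive's write cache.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT on Linux/glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed alignment; the real requirement depends on device and filesystem. */
#define BLOCK 4096

/*
 * Hypothetical helper: write one zero-filled, aligned block to `path`
 * with O_DIRECT (bypass the page cache) and O_SYNC (block until the
 * kernel considers the write complete).
 * Returns 0 on success, -1 with errno set on failure.
 */
int write_block_direct_sync(const char *path)
{
    void *buf = NULL;
    int rc = posix_memalign(&buf, BLOCK, BLOCK);
    if (rc != 0) {
        errno = rc;          /* posix_memalign returns the error code */
        return -1;
    }
    memset(buf, 0, BLOCK);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0600);
    if (fd < 0) {
        int e = errno;       /* e.g. EINVAL if the filesystem lacks O_DIRECT */
        free(buf);
        errno = e;
        return -1;
    }

    ssize_t n = write(fd, buf, BLOCK);
    int saved = errno;       /* keep the write error across close/free */
    close(fd);
    free(buf);

    if (n != BLOCK) {
        errno = (n < 0) ? saved : EIO;   /* treat a short write as an error */
        return -1;
    }
    return 0;
}
```

O_DIRECT needs the buffer, offset and length aligned, which is why the
sketch uses posix_memalign() rather than a plain malloc().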



PS: I've gotten way out of my depth here, but...

     ...attempting to browse the Linux source(!!)

   Looking at the O_SYNC stuff in ext3:
       http://lxr.linux.no/source/fs/ext3/file.c#L67
   it looks like in this conditional:
    if (file->f_flags & O_SYNC) {
       ...
       goto force_commit;
    }
   the goto branch calls ext3_force_commit() in much the
   same way that it seems fsync() does here:
       http://lxr.linux.no/source/fs/ext3/fsync.c#L71
   so I believe O_SYNC does at least as much as fsync().
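
   Seen from user space, the two paths being compared look roughly
   like the sketch below: variant A sets O_SYNC at open time, variant B
   writes normally and then calls fsync(). The function names
   append_osync and append_fsync are made up for illustration; this is
   only meant to show the two call patterns, not what the kernel does
   underneath them.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Variant A (hypothetical helper): O_SYNC makes each write() block
 * until the kernel reports the data as written. */
int append_osync(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, data, len);
    int ok = (n == (ssize_t)len);       /* a short write counts as failure */

    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Variant B (hypothetical helper): ordinary buffered write(),
 * followed by an explicit fsync() on the descriptor. */
int append_fsync(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, data, len);
    int ok = (n == (ssize_t)len) && (fsync(fd) == 0);

    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```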

   However, I can't find O_DIRECT anywhere in the ext3 code,
   so if it does work there, it's less obvious how.

   Moreover I see O_SYNC used lots of places:
       http://lxr.linux.no/ident?i=O_SYNC
   including fs/ext3/, and I don't
   see O_DIRECT in nearly as many places:
       http://lxr.linux.no/ident?i=O_DIRECT
   It looks like reiserfs and xfs check O_DIRECT,
   but ext3 doesn't appear to, unless it does so somewhere
   outside the fs/ext3 directory.


PPS: Of course not even fsync() flushed correctly until very recent kernels:
     http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
     In that article Jeff Garzik (the Linux SATA driver guy) suggests
     that until very recent kernels ext3 did not have write barrier
     support that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE
     (SCSI) commands even on fsync.


PPPS: No, I don't understand the kernel - I'm just showing what quick
       grep commands showed without any deep understanding.
