Use of O_DIRECT only for open_* sync options

From
Bruce Momjian
Date:
Is there a reason we only use O_DIRECT with open_* sync options?
xlogdefs.h says:

/*
 *  Because O_DIRECT bypasses the kernel buffers, and because we never
 *  read those buffers except during crash recovery, it is a win to use
 *  it in all cases where we sync on each write().  We could allow O_DIRECT
 *  with fsync(), but because skipping the kernel buffer forces writes out
 *  quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
 *  how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
 *  Also, O_DIRECT is never enough to force data to the drives, it merely
 *  tries to bypass the kernel cache, so we still need O_SYNC or fsync().
 */

This seems wrong because fsync() can win if there are two writes before
the sync call.  Can kernels not issue fsync() if the write was O_DIRECT?
If that is the cause, we should document it.
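
To make the concern concrete, here is a minimal, self-contained sketch (not
PostgreSQL code; the file names and block size are made up, and error
handling is omitted) contrasting the two patterns -- syncing on every
write() via O_SYNC versus buffered writes followed by one fsync():

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(void)
{
    char   *buf = calloc(1, BLKSZ);
    int     fd;

    /* Pattern 1: sync on each write(), as with wal_sync_method=open_sync.
     * Every write() blocks until the data reaches stable storage. */
    fd = open("sync_each_write.dat", O_WRONLY | O_CREAT | O_SYNC, 0600);
    write(fd, buf, BLKSZ);      /* flushed here ... */
    write(fd, buf, BLKSZ);      /* ... and flushed again here: two waits */
    close(fd);

    /* Pattern 2: buffered writes plus a single fsync(), as with
     * wal_sync_method=fsync.  If two writes arrive before the sync point,
     * the kernel can combine them and we pay for only one flush. */
    fd = open("fsync_after_writes.dat", O_WRONLY | O_CREAT, 0600);
    write(fd, buf, BLKSZ);
    write(fd, buf, BLKSZ);
    fsync(fd);                  /* one flush covers both writes */
    close(fd);

    free(buf);
    return 0;
}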

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Use of O_DIRECT only for open_* sync options

From
Robert Haas
Date:
On Wed, Jan 19, 2011 at 1:53 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Is there a reason we only use O_DIRECT with open_* sync options?
> xlogdefs.h says:
>
> /*
>  *  Because O_DIRECT bypasses the kernel buffers, and because we never
>  *  read those buffers except during crash recovery, it is a win to use
>  *  it in all cases where we sync on each write().  We could allow O_DIRECT
>  *  with fsync(), but because skipping the kernel buffer forces writes out
>  *  quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
>  *  how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
>  *  Also, O_DIRECT is never enough to force data to the drives, it merely
>  *  tries to bypass the kernel cache, so we still need O_SYNC or fsync().
>  */
>
> This seems wrong because fsync() can win if there are two writes before
> the sync call.

Well, the comment does say "...in all cases where we sync on each
write()".  But that's certainly not true of WAL, so I dunno.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Use of O_DIRECT only for open_* sync options

From
Greg Smith
Date:
Bruce Momjian wrote:
> xlogdefs.h says:
>
> /*
>  *  Because O_DIRECT bypasses the kernel buffers, and because we never
>  *  read those buffers except during crash recovery, it is a win to use
>  *  it in all cases where we sync on each write().  We could allow O_DIRECT
>  *  with fsync(), but because skipping the kernel buffer forces writes out
>  *  quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
>  *  how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
>  *  Also, O_DIRECT is never enough to force data to the drives, it merely
>  *  tries to bypass the kernel cache, so we still need O_SYNC or fsync().
>  */
>
> This seems wrong because fsync() can win if there are two writes before
> the sync call.  Can kernels not issue fsync() if the write was O_DIRECT?
> If that is the cause, we should document it.
>   

The comment does look busted, because you did imagine exactly a case 
where they might be combined.  The only incompatibility that I'm aware 
of is that O_DIRECT requires reads and writes to be aligned properly, so 
you can't use it in random application code unless it's aware of that.  
O_DIRECT and fsync are compatible; for example, MySQL allows combining 
the two:  http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html

(That whole bit of documentation around innodb_flush_method includes 
some very interesting observations around O_DIRECT actually)
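
For what it's worth, combining the two is straightforward to exercise on
Linux.  A rough sketch follows (error handling omitted; the 4096-byte
alignment and file name are just assumptions for illustration -- the real
alignment requirement depends on the filesystem and kernel version):

#define _GNU_SOURCE             /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(void)
{
    void   *buf;
    int     fd;

    /* O_DIRECT requires the buffer, the file offset, and the transfer
     * size to be suitably aligned (typically to the logical block size),
     * which is why it can't be dropped into arbitrary application code. */
    posix_memalign(&buf, 4096, BLKSZ);
    memset(buf, 0, BLKSZ);

    fd = open("direct_plus_fsync.dat", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    write(fd, buf, BLKSZ);      /* bypasses the kernel cache ... */
    fsync(fd);                  /* ... but we still need this to force the
                                 * data and metadata out to the drive */
    close(fd);
    free(buf);
    return 0;
}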

I'm starting to consider the idea that much of the performance gain 
seen on earlier systems with O_DIRECT was because it decreased CPU usage 
shuffling things into the OS cache, rather than its impact on avoiding 
pollution of said cache.  On Linux for example, its main accomplishment 
is described like this:  "File I/O is done directly to/from user space 
buffers."  
http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html  The 
earliest paper on the implementation suggests a big decrease in CPU 
overhead from that:  
http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html

Impossible to guess whether that's more true ("CPU cache pollution is a 
bigger problem now") or less true ("drives are much slower relative to 
CPUs now") today.  I'm trying to remain agnostic and let the benchmarks 
offer an opinion instead.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Use of O_DIRECT only for open_* sync options

From
Bruce Momjian
Date:
Greg Smith wrote:
> Bruce Momjian wrote:
> > xlogdefs.h says:
> >
> > /*
> >  *  Because O_DIRECT bypasses the kernel buffers, and because we never
> >  *  read those buffers except during crash recovery, it is a win to use
> >  *  it in all cases where we sync on each write().  We could allow O_DIRECT
> >  *  with fsync(), but because skipping the kernel buffer forces writes out
> >  *  quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
> >  *  how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> >  *  Also, O_DIRECT is never enough to force data to the drives, it merely
> >  *  tries to bypass the kernel cache, so we still need O_SYNC or fsync().
> >  */
> >
> > This seems wrong because fsync() can win if there are two writes before
> > the sync call.  Can kernels not issue fsync() if the write was O_DIRECT?
> > If that is the cause, we should document it.
> >   
> 
> The comment does look busted, because you did imagine exactly a case 
> where they might be combined.  The only incompatibility that I'm aware 
> of is that O_DIRECT requires reads and writes to be aligned properly, so 
> you can't use it in random application code unless it's aware of that.  
> O_DIRECT and fsync are compatible; for example, MySQL allows combining 
> the two:  http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html
> 
> (That whole bit of documentation around innodb_flush_method includes 
> some very interesting observations around O_DIRECT actually)
> 
> I'm starting to consider the idea that much of the performance gain 
> seen on earlier systems with O_DIRECT was because it decreased CPU usage 
> shuffling things into the OS cache, rather than its impact on avoiding 
> pollution of said cache.  On Linux for example, its main accomplishment 
> is described like this:  "File I/O is done directly to/from user space 
> buffers."  
> http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html  The 
> earliest paper on the implementation suggests a big decrease in CPU 
> overhead from that:  
> http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html
> 
> Impossible to guess whether that's more true ("CPU cache pollution is a 
> bigger problem now") or less true ("drives are much slower relative to 
> CPUs now") today.  I'm trying to remain agnostic and let the benchmarks 
> offer an opinion instead.

Agreed.  Perhaps we need a separate setting to turn direct I/O on and
off, and decouple wal_sync_method and direct I/O.
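
To make the idea concrete, here is a purely hypothetical sketch -- the enum,
the function name, and the use_direct_io flag are made up for illustration
and are not existing PostgreSQL code -- of how the WAL open() flags could be
computed if direct I/O were its own knob:

#include <fcntl.h>
#include <stdbool.h>

/* Illustrative only: sync method chosen separately from direct I/O. */
typedef enum { SM_FSYNC, SM_FDATASYNC, SM_OPEN_SYNC, SM_OPEN_DSYNC } SyncMethod;

static int
wal_open_flags(SyncMethod sync_method, bool use_direct_io)
{
    int     flags = O_WRONLY;

    if (sync_method == SM_OPEN_SYNC)
        flags |= O_SYNC;
    else if (sync_method == SM_OPEN_DSYNC)
        flags |= O_DSYNC;
    /* fsync/fdatasync variants add no open() flag; they sync later */

    if (use_direct_io)
        flags |= O_DIRECT;      /* decoupled: allowed with any sync method */

    return flags;
}

Today O_DIRECT is only ORed in for the open_sync/open_datasync methods, so
this is just a way of visualizing the proposal.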

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Use of O_DIRECT only for open_* sync options

From
Bruce Momjian
Date:
Greg Smith wrote:
> Bruce Momjian wrote:
> > xlogdefs.h says:
> >
> > /*
> >  *  Because O_DIRECT bypasses the kernel buffers, and because we never
> >  *  read those buffers except during crash recovery, it is a win to use
> >  *  it in all cases where we sync on each write().  We could allow O_DIRECT
> >  *  with fsync(), but because skipping the kernel buffer forces writes out
> >  *  quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
> >  *  how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> >  *  Also, O_DIRECT is never enough to force data to the drives, it merely
> >  *  tries to bypass the kernel cache, so we still need O_SYNC or fsync().
> >  */
> >
> > This seems wrong because fsync() can win if there are two writes before
> > the sync call.  Can kernels not issue fsync() if the write was O_DIRECT?
> > If that is the cause, we should document it.
> >   
> 
> The comment does look busted, because you did imagine exactly a case 
> where they might be combined.  The only incompatibility that I'm aware 
> of is that O_DIRECT requires reads and writes to be aligned properly, so 
> you can't use it in random application code unless it's aware of that.  
> O_DIRECT and fsync are compatible; for example, MySQL allows combining 
> the two:  http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html

C comment updated in git head:
/*
 *  Because O_DIRECT bypasses the kernel buffers, and because we never
 *  read those buffers except during crash recovery or if wal_level != minimal,
 *  it is a win to use it in all cases where we sync on each write().  We could
 *  allow O_DIRECT with fsync(), but it is unclear if fsync() could process
 *  writes not buffered in the kernel.  Also, O_DIRECT is never enough to force
 *  data to the drives, it merely tries to bypass the kernel cache, so we still
 *  need O_SYNC/O_DSYNC.
 */

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +