Re: Use of O_DIRECT only for open_* sync options - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Use of O_DIRECT only for open_* sync options
Msg-id 4D3C306F.8030209@2ndquadrant.com
In response to Use of O_DIRECT only for open_* sync options  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Use of O_DIRECT only for open_* sync options  (Bruce Momjian <bruce@momjian.us>)
Re: Use of O_DIRECT only for open_* sync options  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
Bruce Momjian wrote:
> xlogdefs.h says:
>
> /*
>  *  Because O_DIRECT bypasses the kernel buffers, and because we never
>  *  read those buffers except during crash recovery, it is a win to use
>  *  it in all cases where we sync on each write().  We could allow O_DIRECT
>  *  with fsync(), but because skipping the kernel buffer forces writes out
>  *  quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
>  *  how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
>  *  Also, O_DIRECT is never enough to force data to the drives, it merely
>  *  tries to bypass the kernel cache, so we still need O_SYNC or fsync().
>  */
>
> This seems wrong because fsync() can win if there are two writes before
> the sync call.  Can kernels not issue fsync() if the write was O_DIRECT?
> If that is the cause, we should document it.
>   

The comment does look busted, because you did imagine exactly a case 
where they might be combined.  The only incompatibility that I'm aware 
of is that O_DIRECT requires reads and writes to be aligned properly, so 
you can't use it in random application code unless it's aware of that.  
O_DIRECT and fsync are compatible; for example, MySQL allows combining 
the two:  http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html
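
As a rough illustration of the combination discussed above, here is a
minimal sketch (not PostgreSQL code) that opens a file with O_DIRECT,
issues two aligned writes, and then calls fsync() once.  The function
name, the ALIGN_BYTES value of 4096, and the fallback to buffered I/O
when the filesystem rejects O_DIRECT are all assumptions made for the
example; real code would need to determine the alignment the device
actually requires.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN_BYTES 4096        /* assumed alignment, not queried from the device */

/*
 * Hypothetical helper: write two aligned blocks, then sync once.
 * Returns 0 on success, -1 on failure.  If used_direct is non-NULL it is
 * set to 1 when O_DIRECT was accepted and 0 when we fell back to
 * ordinary buffered I/O.
 */
int
write_two_blocks_then_fsync(const char *path, int *used_direct)
{
    void   *buf = NULL;
    int     fd;
    int     direct = 1;
    int     rc = -1;

    /* O_DIRECT wants the buffer address, length, and offset aligned. */
    if (posix_memalign(&buf, ALIGN_BYTES, ALIGN_BYTES) != 0)
        return -1;
    memset(buf, 'x', ALIGN_BYTES);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)
    {
        /* Some filesystems reject O_DIRECT; retry without it. */
        direct = 0;
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    }
    if (fd < 0)
        goto out;

    /* Two writes, one sync: the case where fsync() can win over O_SYNC. */
    if (write(fd, buf, ALIGN_BYTES) == ALIGN_BYTES &&
        write(fd, buf, ALIGN_BYTES) == ALIGN_BYTES &&
        fsync(fd) == 0)
        rc = 0;

    if (close(fd) != 0)
        rc = -1;

out:
    free(buf);
    if (used_direct != NULL)
        *used_direct = direct;
    return rc;
}
```

Calling write_two_blocks_then_fsync("/tmp/example.bin", &used) should
return 0 whether or not the filesystem honors O_DIRECT; the used flag
only reports which path was taken.  Nothing here measures whether the
combination is faster, which is a question for benchmarks.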

(The documentation around innodb_flush_method on that page includes 
some very interesting observations about O_DIRECT, actually.)

I'm starting to consider the idea that much of the performance gain 
seen on earlier systems with O_DIRECT came from the CPU time it saved 
by not shuffling data into the OS cache, rather than from avoiding 
pollution of said cache.  On Linux, for example, its main accomplishment 
is described like this:  "File I/O is done directly to/from user space 
buffers."  
http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html  The 
earliest paper on the implementation suggests a big decrease in CPU 
overhead from that:  
http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html

It's impossible to guess whether that's more true today ("CPU cache 
pollution is a bigger problem now") or less true ("drives are much 
slower relative to CPUs now").  I'm trying to remain agnostic and let 
the benchmarks offer an opinion instead.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


