Re: O_DIRECT for WAL writes - Mailing list pgsql-patches

From Mary Edie Meredith
Subject Re: O_DIRECT for WAL writes
Date
Msg-id 1117738168.2922.411.camel@localhost
In response to Re: O_DIRECT for WAL writes  (Neil Conway <neilc@samurai.com>)
Responses Re: O_DIRECT for WAL writes
List pgsql-patches
On Thu, 2005-06-02 at 11:39 +1000, Neil Conway wrote:
> On Wed, 2005-06-01 at 17:08 -0700, Mary Edie Meredith wrote:
> > I know I'm late to this discussion, and I haven't made it all the way
> > through this thread to see if your questions on Linux writes were
> > resolved.   If you are still interested, I recommend reading a very good
> > one page description of reliable writes buried in the Data Center Linux
> > Goals and Capabilities document.
>
> This suggests that on Linux a write() on a file opened with O_DIRECT has
> the same synchronization guarantees as a write() on a file opened with
> O_SYNC, which is precisely the opposite of what was concluded down
> thread. So now I'm more confused :)
>
> (Regardless of behavior on Linux, I would guess O_DIRECT doesn't behave
> this way on all platforms -- for example, FreeBSD's open(2) manpage does
> not mention I/O synchronization when referring to O_DIRECT. So even if
> we can skip the fsync() with O_DIRECT on Linux, I doubt we'll be able to
> do that on all platforms.)

My understanding is that O_DIRECT means "direct" as in "no buffering by
the OS", which implies that if you write from your buffer, the write is
not going to return until the OS thinks the write is completed (unless
you are using async IO).  Otherwise you might reuse your buffer (there
_is_ no other buffer, after all), and if the first write were still
incomplete when you refilled your buffer for the next one, the first
write might go out from your buffer with the wrong data.
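
To make the synchronous case concrete, here is a minimal Python sketch (not
PostgreSQL code) of a blocking write from a single page-aligned buffer. The
name write_block, the 4096-byte BLOCK size, and the fallback to an ordinary
buffered write are assumptions made for illustration. The fsync() is kept on
purpose: as Neil notes above, it is not established in this thread that
O_DIRECT alone gives O_SYNC-style guarantees on every platform.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on device and filesystem


def write_block(path, payload):
    """Write one zero-padded block synchronously.

    Returns True if the write went through an O_DIRECT descriptor,
    False if we had to fall back to an ordinary buffered write.
    """
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")

    # Anonymous mmap memory is page-aligned, which O_DIRECT generally requires.
    buf = mmap.mmap(-1, BLOCK)
    buf[:len(payload)] = payload

    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    direct = getattr(os, "O_DIRECT", 0)  # the flag is absent on some platforms
    try:
        for extra in (direct, 0):
            try:
                fd = os.open(path, base | extra, 0o600)
            except OSError:
                if not extra:
                    raise
                continue  # O_DIRECT rejected at open time; retry without it
            try:
                # Blocks until the OS reports the write as done, so the
                # buffer may safely be reused after this call returns.
                if os.write(fd, buf) != BLOCK:
                    raise OSError("short write")
                os.fsync(fd)  # kept because O_DIRECT semantics vary by platform
                return bool(extra)
            except OSError:
                if not extra:
                    raise  # the plain write failed too; give up
            finally:
                os.close(fd)
    finally:
        buf.close()
```

Calling write_block("/some/path", b"commit record") returns only after the
block has been handed to the OS and fsync() has completed, which is the point
at which refilling the buffer becomes safe.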

Now if you want to avoid _waiting_ for the write to complete, you need to
employ async IO, which is why most databases that support direct IO for
their datafiles have also implemented some form of async IO (either via
OS calls or some built-in mechanism, as is the case with SAP-DB).  With
AIO you have to manage your buffers so that you reuse them only when you
are notified that the IO is completed.  Historically this was done with
raw datafiles, but currently (at least for Linux) you can also do this
with files.  For logging, though, I think you want synchronous IO to
guarantee order.
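
The buffer-management rule can be sketched as follows. This is only an
illustration: a thread pool stands in for real kernel AIO, and the class name
AsyncWriter and its parameters are hypothetical. The one point it is meant to
show is that a buffer goes back on the free list only from the completion
callback, never earlier.

```python
import os
import queue
from concurrent.futures import ThreadPoolExecutor


class AsyncWriter:
    """Fixed pool of buffers; each is reused only after its write completes."""

    def __init__(self, fd, nbuffers=4, bufsize=4096, workers=2):
        self.fd = fd
        self.bufsize = bufsize
        self.free = queue.Queue()
        for _ in range(nbuffers):
            self.free.put(bytearray(bufsize))
        # Stand-in for an AIO facility: writes run on worker threads.
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def _do_write(self, buf, nbytes, offset):
        # Write straight from the pooled buffer, without copying it first.
        with memoryview(buf) as view:
            return os.pwrite(self.fd, view[:nbytes], offset)

    def submit(self, offset, payload):
        if len(payload) > self.bufsize:
            raise ValueError("payload larger than a buffer")
        # Blocks only when every buffer still has a write in flight.
        buf = self.free.get()
        buf[:len(payload)] = payload
        fut = self.pool.submit(self._do_write, buf, len(payload), offset)
        # Completion notification: only now may the buffer be reused.
        fut.add_done_callback(lambda _f, b=buf: self.free.put(b))
        return fut

    def close(self):
        self.pool.shutdown(wait=True)
```

Writes submitted this way may complete in any order, which is acceptable for
datafile pages at distinct offsets but is exactly why a log would want
synchronous IO instead.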

The cool thing about buffering the datafile data yourself is that _you_
(the database engine) can control what stays in (shared) memory and what
does not.  You can add configuration options or add intelligence, so
that frequently used data (like hot indexes) can stay in memory
indefinitely.  The OS can never make that decision so specifically.  In
addition, you can keep data from table scans from overwriting hot
objects.  Of course, at the moment you are discussing the use for
logging, but there should be benefits to extending this to datafiles as
well, assuming you also implement async IO.
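
As a rough illustration of that kind of engine-side control, here is a small
Python sketch of a page cache that can pin hot pages and lets scans read
without displacing anything. The class BufferCache, the read_page callback,
and the plain LRU policy are all assumptions for the example; they do not
describe how PostgreSQL or any other database manages its buffers.

```python
from collections import OrderedDict


class BufferCache:
    """LRU page cache with pinned pages and scan reads that bypass caching."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page  # hypothetical callback: page_id -> bytes
        self.pages = OrderedDict()  # page_id -> data, least recently used first
        self.pinned = set()         # pages that must never be evicted

    def pin(self, page_id):
        """Mark a page (say, part of a hot index) as permanently resident."""
        self.pinned.add(page_id)
        self.get(page_id)

    def get(self, page_id, scan=False):
        if page_id in self.pages:
            if not scan:
                self.pages.move_to_end(page_id)  # refresh its LRU position
            return self.pages[page_id]
        data = self.read_page(page_id)
        if scan:
            # Table-scan read: hand the data back without caching it,
            # so a large scan cannot push hot pages out.
            return data
        self._make_room()
        self.pages[page_id] = data
        return data

    def _make_room(self):
        while len(self.pages) >= self.capacity:
            # Evict the least recently used page that is not pinned.
            victim = next((p for p in self.pages if p not in self.pinned), None)
            if victim is None:
                raise RuntimeError("all cached pages are pinned")
            del self.pages[victim]
```

A scan would call cache.get(page_id, scan=True) for each page it touches,
while ordinary lookups use the default and compete for the unpinned slots.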

Bottom line: if you do not implement direct/async IO so that you
optimize caching of hot database objects and minimize memory utilization
of objects used once, you are probably leaving performance on the table
for datafiles.

Daniel is on vacation, but I will ask him to confirm once he returns.
>
> -Neil
>
--
Mary Edie Meredith
maryedie@osdl.org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs

