Re: O_DIRECT for WAL writes - Mailing list pgsql-patches
From | Mary Edie Meredith |
---|---|
Subject | Re: O_DIRECT for WAL writes |
Date | |
Msg-id | 1117738168.2922.411.camel@localhost |
In response to | Re: O_DIRECT for WAL writes (Neil Conway <neilc@samurai.com>) |
Responses | Re: O_DIRECT for WAL writes |
List | pgsql-patches |
On Thu, 2005-06-02 at 11:39 +1000, Neil Conway wrote:
> On Wed, 2005-06-01 at 17:08 -0700, Mary Edie Meredith wrote:
> > I know I'm late to this discussion, and I haven't made it all the way
> > through this thread to see if your questions on Linux writes were
> > resolved. If you are still interested, I recommend reading a very good
> > one-page description of reliable writes buried in the Data Center Linux
> > Goals and Capabilities document.
>
> This suggests that on Linux a write() on a file opened with O_DIRECT has
> the same synchronization guarantees as a write() on a file opened with
> O_SYNC, which is precisely the opposite of what was concluded down
> thread. So now I'm more confused :)
>
> (Regardless of behavior on Linux, I would guess O_DIRECT doesn't behave
> this way on all platforms -- for example, FreeBSD's open(2) manpage does
> not mention I/O synchronization when referring to O_DIRECT. So even if
> we can skip the fsync() with O_DIRECT on Linux, I doubt we'll be able to
> do that on all platforms.)

My understanding is that O_DIRECT means "direct" as in "no buffering by
the OS", which implies that if you write from your buffer, the write is
not going to return until the OS thinks the write is completed (unless
you are using async IO). Otherwise you might reuse your buffer (there
_is_ no other buffer, after all), and if the first write were still
incomplete when you refilled your buffer for the next one, the first
write might go out from your buffer with the wrong data.

Now if you want to avoid _waiting_ for the write to complete, you need
to employ async IO, which is why most databases that support direct IO
for their datafiles have also implemented some form of async IO (either
via OS calls or some built-in mechanism, as is the case with SAP-DB).
With AIO you have to manage your buffers so that you reuse them only
when you are notified that the IO is completed. Historically this was
done with raw datafiles, but currently (at least for Linux) you can
also do this with files.
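
For illustration only, here is a minimal Python sketch of the
synchronous direct write described above. It is not PostgreSQL code:
open_wal, write_record and BLOCK are hypothetical names, and the
4096-byte alignment is an assumption (real code would ask the device).
Because whether O_DIRECT alone gives O_SYNC-style guarantees is exactly
what this thread is unsure about, the sketch sets O_DSYNC as well, and
it drops O_DIRECT where the platform or filesystem rejects it.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment unit; not queried from the device


def open_wal(path):
    """Open path for synchronous writes, with O_DIRECT when available.

    Returns (fd, used_direct).
    """
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)
    base = os.O_WRONLY | os.O_CREAT | sync_flag
    direct = getattr(os, "O_DIRECT", 0)  # absent on some platforms
    if direct:
        try:
            return os.open(path, base | direct, 0o600), True
        except OSError:
            pass  # some filesystems refuse O_DIRECT at open time
    return os.open(path, base, 0o600), False


def write_record(fd, payload):
    """Write payload from a page-aligned buffer padded to BLOCK bytes."""
    size = max(BLOCK, -(-len(payload) // BLOCK) * BLOCK)
    buf = mmap.mmap(-1, size)  # anonymous mapping: page-aligned, zeroed
    try:
        buf[:len(payload)] = payload
        # With synchronous IO the buffer may be reused once write() returns.
        return os.write(fd, buf)
    finally:
        buf.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "wal.seg")
    fd, used_direct = open_wal(path)
    written = write_record(fd, b"commit record")
    os.close(fd)
    print(used_direct, written)
```

The aligned, block-sized buffer is there because direct IO on Linux
generally requires aligned addresses, lengths and offsets; an anonymous
mmap is just a convenient way to get such a buffer from Python.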

For logging, though, I think you want synchronous IO to guarantee order.

The cool thing about buffering the datafile data yourself is that _you_
(the database engine) can control what stays in (shared) memory and what
does not. You can add configuration options or add intelligence, so that
frequently used data (like hot indexes) can stay in memory indefinitely.
The OS can never do that so specifically. In addition, you can avoid
having data from table scans overwrite hot objects.

Of course, at the moment you are discussing the use for logging, but
there should be benefits to extending this to datafiles as well,
assuming you also implement async IO.

Bottom line: if you do not implement direct/async IO so that you
optimize caching of hot database objects and minimize memory utilization
of objects used once, you are probably leaving performance on the table
for datafiles.

Daniel is on vacation, but I will ask him to confirm once he returns.

>
> -Neil
>
-- 
Mary Edie Meredith
maryedie@osdl.org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs