Re: [HACKERS] O_DIRECT for WAL writes - Mailing list pgsql-patches
From | Mark Wong |
---|---|
Subject | Re: [HACKERS] O_DIRECT for WAL writes |
Date | |
Msg-id | 20050806210419.GA31044@osdl.org |
In response to | Re: [HACKERS] O_DIRECT for WAL writes (Bruce Momjian <pgman@candle.pha.pa.us>) |
Responses |
Re: [HACKERS] O_DIRECT for WAL writes
|
List | pgsql-patches |
Here are comments that Daniel McNeil made earlier, which I've neglected to forward until now. I've cc'ed him and Mark Havercamp, whom some of you got to meet the other day.

Mark

-----

With O_DIRECT on Linux, when the write() returns the i/o has been transferred to the disk. Normally, this i/o will be DMAed directly from user-space to the device.

The current exception is when doing an O_DIRECT write to a hole in a file. (If a program does a truncate() or lseek()/write() that makes a file larger, the file system does not allocate space between the old end of file and the new end of file.) An O_DIRECT write to a hole like this requires the file system to allocate space, but there is a race condition between the O_DIRECT write, which allocates the space and then writes to initialize the newly allocated data, and any other process that attempts a buffered (page cache) read of the same area in the file: it was possible for the read to return data from the allocated region before the O_DIRECT write() had initialized it. The fix in Linux is for the O_DIRECT write() to fall back to buffered i/o to do the write() and then flush the data from the page cache to the disk.

A write() with O_DIRECT only means the data has been transferred to the disk. Depending on the file system and mount options, it does not mean the metadata for the file has been written to disk (see the fsync man page). fsync() will guarantee that the data and metadata have been written to disk.

Lastly, if a disk has a write back cache, an O_DIRECT write() does not guarantee that the disk has put the data on the physical media. I think some of the journaling file systems now support i/o barriers on commit, which will flush the disk write back cache. (I'm still looking at the kernel code to see how this is done.)

Conclusion: O_DIRECT + fsync() can make sense. It avoids copying the data to the page cache before it is written, and it also guarantees that the file's metadata is written to disk. It also prevents the page cache from filling up with write data that will never be read (I assume it is only read if a recovery is necessary, which should be rare). It can also help disks with a write back cache when using a journaling file system that uses i/o barriers. You would want to use large writes, since the kernel page cache won't be writing multiple pages for you.

I need to look at the kernel code more to comment on O_DIRECT with O_SYNC.

Questions:

Does the database transaction logger preallocate the log file?

Does the logger care about the order in which each write hits the disk?

Now someone else can comment on my comments.

Daniel
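
To make the O_DIRECT + fsync() combination above concrete, here is a minimal sketch, not PostgreSQL's actual code. It assumes the log file is preallocated (zero-filled) so that no write ever lands in a hole, and that log blocks are 8192 bytes. The names BLOCK, preallocate, open_log and write_block are made up for illustration, and the fallback to buffered i/o when the file system rejects O_DIRECT is my own choice. Python is used only for brevity; the os module calls map directly onto open(), pwrite() and fsync().

```python
import errno
import mmap
import os

BLOCK = 8192  # assumed log block size; a multiple of common 512/4096-byte sectors


def preallocate(path, size):
    """Zero-fill the file up front so later writes never hit a hole."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        zeros = bytes(BLOCK)
        for _ in range(size // BLOCK):
            if os.write(fd, zeros) != BLOCK:
                raise OSError("short write while preallocating")
        os.fsync(fd)  # push data and metadata (file size) to disk
    finally:
        os.close(fd)


def open_log(path):
    """Open the log with O_DIRECT if possible; return (fd, used_direct)."""
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    try:
        return os.open(path, os.O_WRONLY | direct), bool(direct)
    except OSError as e:
        if not direct or e.errno != errno.EINVAL:
            raise
        # Assumption: EINVAL here means the file system does not support
        # O_DIRECT, so fall back to ordinary buffered i/o.
        return os.open(path, os.O_WRONLY), False


def write_block(fd, block_no, payload):
    """Write one full, aligned block and fsync it."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    # O_DIRECT generally needs the buffer address, length and file offset
    # aligned to the device block size; an anonymous mmap is page-aligned
    # and zero-filled, so the rest of the block is padding.
    buf = mmap.mmap(-1, BLOCK)
    try:
        buf[:len(payload)] = payload
        if os.pwrite(fd, buf, block_no * BLOCK) != BLOCK:
            raise OSError("short write")
        os.fsync(fd)  # O_DIRECT alone does not cover file metadata
    finally:
        buf.close()
```

A caller would run preallocate(path, n * BLOCK) once, then open_log(path), then write_block(fd, i, record) for each block. As noted above, even fsync() does not by itself guarantee that a disk with a write back cache has put the data on the physical media. A real logger would also batch several blocks into one large write rather than issue one block at a time.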