Thread: O_DIRECT setting
A recent comment on this (or perhaps another?) mailing list about Sun boxen and the directio mount option has prompted me to read about O_DIRECT on the open() manpage. Has anybody tried this option? Ever taken any performance measurements? I assume the way postgres manages its buffer memory (dealing with 8kB pages) would be compatible with the restrictions: Under Linux 2.4 transfer sizes, and the alignment of user buffer and file offset must all be multiples of the logical block size of the file system. According to the manpage, O_DIRECT implies O_SYNC: File I/O is done directly to/from user space buffers. The I/O is synchronous, i.e., at the completion of the read(2) or write(2) system call, data is guaranteed to have been transferred. At the moment I am fairly interested in trying this, and I would spend some time with it, but I have my hands full with other projects. I'd imagine this is more use with the revamped buffer manager in PG8.0 than the 7.x line, but we are not using PG8.0 here yet. Would people be interested in a performance benchmark? I need some benchmark tips :) Incidentally, postgres heap files suffer really, really bad fragmentation, which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...) quite drastically. We have in-house patches that somewhat alleiviate this, but they are not release quality. Has anybody else suffered this? Guy Thornley
On Mon, 2004-09-20 at 17:57, Guy Thornley wrote: > According to the manpage, O_DIRECT implies O_SYNC: > > File I/O is done directly to/from user space buffers. The I/O is > synchronous, i.e., at the completion of the read(2) or write(2) > system call, data is guaranteed to have been transferred. This seems like it would be a rather large net loss. PostgreSQL already structures writes so that the writes we need to hit disk immediately (WAL records) are fsync()'ed -- the kernel is given more freedom to schedule how other writes are flushed from the cache. Also, my recollection is that O_DIRECT also disables readahead -- if that's correct, that's not what we want either. BTW, using O_DIRECT has been discussed a few times in the past. Have you checked the list archives? (for both -performance and -hackers) > Would people be interested in a performance benchmark? Sure -- I'd definitely be curious, although as I said I'm skeptical it's a win. > I need some benchmark tips :) Some people have noted that it can be difficult to use contrib/pgbench to get reproducible results -- you might want to look at Jan's TPC-W implementation or the OSDL database benchmarks: http://pgfoundry.org/projects/tpc-w-php/ http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/ > Incidentally, postgres heap files suffer really, really bad fragmentation, > which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...) > quite drastically. We have in-house patches that somewhat alleiviate this, > but they are not release quality. Can you elaborate on these "in-house patches"? -Neil
TODO has: * Consider use of open/fcntl(O_DIRECT) to minimize OS caching Should the item be removed? --------------------------------------------------------------------------- Neil Conway wrote: > On Mon, 2004-09-20 at 17:57, Guy Thornley wrote: > > According to the manpage, O_DIRECT implies O_SYNC: > > > > File I/O is done directly to/from user space buffers. The I/O is > > synchronous, i.e., at the completion of the read(2) or write(2) > > system call, data is guaranteed to have been transferred. > > This seems like it would be a rather large net loss. PostgreSQL already > structures writes so that the writes we need to hit disk immediately > (WAL records) are fsync()'ed -- the kernel is given more freedom to > schedule how other writes are flushed from the cache. Also, my > recollection is that O_DIRECT also disables readahead -- if that's > correct, that's not what we want either. > > BTW, using O_DIRECT has been discussed a few times in the past. Have you > checked the list archives? (for both -performance and -hackers) > > > Would people be interested in a performance benchmark? > > Sure -- I'd definitely be curious, although as I said I'm skeptical it's > a win. > > > I need some benchmark tips :) > > Some people have noted that it can be difficult to use contrib/pgbench > to get reproducible results -- you might want to look at Jan's TPC-W > implementation or the OSDL database benchmarks: > > http://pgfoundry.org/projects/tpc-w-php/ > http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/ > > > Incidentally, postgres heap files suffer really, really bad fragmentation, > > which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...) > > quite drastically. We have in-house patches that somewhat alleiviate this, > > but they are not release quality. > > Can you elaborate on these "in-house patches"? > > -Neil > > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > TODO has: > * Consider use of open/fcntl(O_DIRECT) to minimize OS caching > Should the item be removed? I think it's fine ;-) ... it says "consider it", not "do it". The point is that we could do with more research in this area, even if O_DIRECT per se is not useful. Maybe you could generalize the entry to "investigate ways of fine-tuning OS caching behavior". regards, tom lane
On Mon, Sep 20, 2004 at 07:57:34PM +1200, Guy Thornley wrote: [snip] > > Incidentally, postgres heap files suffer really, really bad fragmentation, > which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...) > quite drastically. We have in-house patches that somewhat alleiviate this, > but they are not release quality. Has anybody else suffered this? > Any chance I could give those patches a try? I'm interested in seeing how they may affect our DBT-3 workload, which execute DSS type queries. Thanks, Mark
On Thu, Sep 23, 2004 at 10:57:41AM -0400, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > TODO has: > > * Consider use of open/fcntl(O_DIRECT) to minimize OS caching > > Should the item be removed? > > I think it's fine ;-) ... it says "consider it", not "do it". The point > is that we could do with more research in this area, even if O_DIRECT > per se is not useful. Maybe you could generalize the entry to > "investigate ways of fine-tuning OS caching behavior". > > regards, tom lane > I talked to Jan a little about this during OSCon since Linux filesystems (ext2, ext3, etc) let you use O_DIRECT. He felt the only place where PostgreSQL may benefit from this now, without managing its own buffer first, would be with the log writer. I'm probably going to get this wrong, but he thought it would be interesting to try an experiment by taking X number of pages to be flushed, sort them (by age? where they go on disk?) and write them out. He thought this would be a relatively easy thing to try, a day or two of work. We'd really love to experiment with it. Mark
Mark Wong <markw@osdl.org> writes: > I talked to Jan a little about this during OSCon since Linux filesystems > (ext2, ext3, etc) let you use O_DIRECT. He felt the only place where > PostgreSQL may benefit from this now, without managing its own buffer first, > would be with the log writer. I'm probably going to get this wrong, but > he thought it would be interesting to try an experiment by taking X number > of pages to be flushed, sort them (by age? where they go on disk?) and > write them out. Hmm. Most of the time the log writer has little choice about page write order --- certainly if all your transactions are small it's not going to have any choice. I think this would mainly be equivalent to O_SYNC with the extra feature of stopping the kernel from buffering the WAL data in its own buffer cache. Which is probably useful, but I doubt it's going to make a huge difference. regards, tom lane
Sorry about the belated reply, its been busy around here. > > Incidentally, postgres heap files suffer really, really bad fragmentation, > > which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...) > > quite drastically. We have in-house patches that somewhat alleiviate this, > > but they are not release quality. Has anybody else suffered this? > > > > Any chance I could give those patches a try? I'm interested in seeing > how they may affect our DBT-3 workload, which execute DSS type queries. Like I said, the patches are not release quality... if you run them on a metadata journalling filesystem, without an 'ordered write' mode, its possible to end up with corrupt heaps after a crash because of garbage data in the extended files. If/when we move to postgres 8 I'll try to ensure the patches get re-done with releasable quality Guy Thornley
On Thu, Sep 30, 2004 at 07:02:32PM +1200, Guy Thornley wrote: > Sorry about the belated reply, its been busy around here. > > > > Incidentally, postgres heap files suffer really, really bad fragmentation, > > > which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...) > > > quite drastically. We have in-house patches that somewhat alleiviate this, > > > but they are not release quality. Has anybody else suffered this? > > > > > > > Any chance I could give those patches a try? I'm interested in seeing > > how they may affect our DBT-3 workload, which execute DSS type queries. > > Like I said, the patches are not release quality... if you run them on a > metadata journalling filesystem, without an 'ordered write' mode, its > possible to end up with corrupt heaps after a crash because of garbage data > in the extended files. > > If/when we move to postgres 8 I'll try to ensure the patches get re-done > with releasable quality > > Guy Thornley That's ok, we like to help test and proof things, we don't need patches to be release quality. Mark