Thread: O_DIRECT setting

O_DIRECT setting

From
Guy Thornley
Date:
A recent comment on this (or perhaps another?) mailing list about Sun boxen
and the directio mount option has prompted me to read about O_DIRECT on the
open() manpage.

Has anybody tried this option? Ever taken any performance measurements?
I assume the way postgres manages its buffer memory (dealing with 8kB pages)
would be compatible with the restrictions:

        Under  Linux  2.4 transfer  sizes,  and the alignment of user buffer
        and file offset must all be multiples of the logical block size of
        the file system.

According to the manpage, O_DIRECT implies O_SYNC:

        File I/O is done directly to/from user space buffers.  The I/O is
        synchronous, i.e., at the completion of the read(2) or write(2)
        system call, data is guaranteed to have been transferred.

At the moment I am fairly interested in trying this, and I would spend some
time with it, but I have my hands full with other projects. I'd imagine this
is more use with the revamped buffer manager in PG8.0 than the 7.x line, but
we are not using PG8.0 here yet.

Would people be interested in a performance benchmark? I need some benchmark
tips :)

Incidentally, postgres heap files suffer really, really bad fragmentation,
which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...)
quite drastically. We have in-house patches that somewhat alleiviate this,
but they are not release quality. Has anybody else suffered this?

Guy Thornley


Re: O_DIRECT setting

From
Neil Conway
Date:
On Mon, 2004-09-20 at 17:57, Guy Thornley wrote:
> According to the manpage, O_DIRECT implies O_SYNC:
>
>         File I/O is done directly to/from user space buffers.  The I/O is
>         synchronous, i.e., at the completion of the read(2) or write(2)
>         system call, data is guaranteed to have been transferred.

This seems like it would be a rather large net loss. PostgreSQL already
structures writes so that the writes we need to hit disk immediately
(WAL records) are fsync()'ed -- the kernel is given more freedom to
schedule how other writes are flushed from the cache. Also, my
recollection is that O_DIRECT also disables readahead -- if that's
correct, that's not what we want either.

BTW, using O_DIRECT has been discussed a few times in the past. Have you
checked the list archives? (for both -performance and -hackers)

> Would people be interested in a performance benchmark?

Sure -- I'd definitely be curious, although as I said I'm skeptical it's
a win.

> I need some benchmark tips :)

Some people have noted that it can be difficult to use contrib/pgbench
to get reproducible results -- you might want to look at Jan's TPC-W
implementation or the OSDL database benchmarks:

http://pgfoundry.org/projects/tpc-w-php/
http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/

> Incidentally, postgres heap files suffer really, really bad fragmentation,
> which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...)
> quite drastically. We have in-house patches that somewhat alleiviate this,
> but they are not release quality.

Can you elaborate on these "in-house patches"?

-Neil



Re: O_DIRECT setting

From
Bruce Momjian
Date:
TODO has:

    * Consider use of open/fcntl(O_DIRECT) to minimize OS caching

Should the item be removed?

---------------------------------------------------------------------------

Neil Conway wrote:
> On Mon, 2004-09-20 at 17:57, Guy Thornley wrote:
> > According to the manpage, O_DIRECT implies O_SYNC:
> >
> >         File I/O is done directly to/from user space buffers.  The I/O is
> >         synchronous, i.e., at the completion of the read(2) or write(2)
> >         system call, data is guaranteed to have been transferred.
>
> This seems like it would be a rather large net loss. PostgreSQL already
> structures writes so that the writes we need to hit disk immediately
> (WAL records) are fsync()'ed -- the kernel is given more freedom to
> schedule how other writes are flushed from the cache. Also, my
> recollection is that O_DIRECT also disables readahead -- if that's
> correct, that's not what we want either.
>
> BTW, using O_DIRECT has been discussed a few times in the past. Have you
> checked the list archives? (for both -performance and -hackers)
>
> > Would people be interested in a performance benchmark?
>
> Sure -- I'd definitely be curious, although as I said I'm skeptical it's
> a win.
>
> > I need some benchmark tips :)
>
> Some people have noted that it can be difficult to use contrib/pgbench
> to get reproducible results -- you might want to look at Jan's TPC-W
> implementation or the OSDL database benchmarks:
>
> http://pgfoundry.org/projects/tpc-w-php/
> http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/
>
> > Incidentally, postgres heap files suffer really, really bad fragmentation,
> > which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...)
> > quite drastically. We have in-house patches that somewhat alleiviate this,
> > but they are not release quality.
>
> Can you elaborate on these "in-house patches"?
>
> -Neil
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: O_DIRECT setting

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> TODO has:
>     * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
> Should the item be removed?

I think it's fine ;-) ... it says "consider it", not "do it".  The point
is that we could do with more research in this area, even if O_DIRECT
per se is not useful.  Maybe you could generalize the entry to
"investigate ways of fine-tuning OS caching behavior".

            regards, tom lane

Re: O_DIRECT setting

From
Mark Wong
Date:
On Mon, Sep 20, 2004 at 07:57:34PM +1200, Guy Thornley wrote:
[snip]
>
> Incidentally, postgres heap files suffer really, really bad fragmentation,
> which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...)
> quite drastically. We have in-house patches that somewhat alleiviate this,
> but they are not release quality. Has anybody else suffered this?
>

Any chance I could give those patches a try?  I'm interested in seeing
how they may affect our DBT-3 workload, which execute DSS type queries.

Thanks,
Mark

Re: O_DIRECT setting

From
Mark Wong
Date:
On Thu, Sep 23, 2004 at 10:57:41AM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > TODO has:
> >     * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
> > Should the item be removed?
>
> I think it's fine ;-) ... it says "consider it", not "do it".  The point
> is that we could do with more research in this area, even if O_DIRECT
> per se is not useful.  Maybe you could generalize the entry to
> "investigate ways of fine-tuning OS caching behavior".
>
>             regards, tom lane
>

I talked to Jan a little about this during OSCon since Linux filesystems
(ext2, ext3, etc) let you use O_DIRECT.  He felt the only place where
PostgreSQL may benefit from this now, without managing its own buffer first,
would be with the log writer.  I'm probably going to get this wrong, but
he thought it would be interesting to try an experiment by taking X number
of pages to be flushed, sort them (by age? where they go on disk?) and
write them out.  He thought this would be a relatively easy thing to try,
a day or two of work.  We'd really love to experiment with it.

Mark

Re: O_DIRECT setting

From
Tom Lane
Date:
Mark Wong <markw@osdl.org> writes:
> I talked to Jan a little about this during OSCon since Linux filesystems
> (ext2, ext3, etc) let you use O_DIRECT.  He felt the only place where
> PostgreSQL may benefit from this now, without managing its own buffer first,
> would be with the log writer.  I'm probably going to get this wrong, but
> he thought it would be interesting to try an experiment by taking X number
> of pages to be flushed, sort them (by age? where they go on disk?) and
> write them out.

Hmm.  Most of the time the log writer has little choice about page write
order --- certainly if all your transactions are small it's not going to
have any choice.  I think this would mainly be equivalent to O_SYNC with
the extra feature of stopping the kernel from buffering the WAL data in
its own buffer cache.  Which is probably useful, but I doubt it's going
to make a huge difference.

            regards, tom lane

Re: O_DIRECT setting

From
Guy Thornley
Date:
Sorry about the belated reply, its been busy around here.

> > Incidentally, postgres heap files suffer really, really bad fragmentation,
> > which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...)
> > quite drastically. We have in-house patches that somewhat alleiviate this,
> > but they are not release quality. Has anybody else suffered this?
> >
>
> Any chance I could give those patches a try?  I'm interested in seeing
> how they may affect our DBT-3 workload, which execute DSS type queries.

Like I said, the patches are not release quality... if you run them on a
metadata journalling filesystem, without an 'ordered write' mode, its
possible to end up with corrupt heaps after a crash because of garbage data
in the extended files.

If/when we move to postgres 8 I'll try to ensure the patches get re-done
with releasable quality

Guy Thornley

Re: O_DIRECT setting

From
Mark Wong
Date:
On Thu, Sep 30, 2004 at 07:02:32PM +1200, Guy Thornley wrote:
> Sorry about the belated reply, its been busy around here.
>
> > > Incidentally, postgres heap files suffer really, really bad fragmentation,
> > > which affects sequential scan operations (VACUUM, ANALYZE, REINDEX ...)
> > > quite drastically. We have in-house patches that somewhat alleiviate this,
> > > but they are not release quality. Has anybody else suffered this?
> > >
> >
> > Any chance I could give those patches a try?  I'm interested in seeing
> > how they may affect our DBT-3 workload, which execute DSS type queries.
>
> Like I said, the patches are not release quality... if you run them on a
> metadata journalling filesystem, without an 'ordered write' mode, its
> possible to end up with corrupt heaps after a crash because of garbage data
> in the extended files.
>
> If/when we move to postgres 8 I'll try to ensure the patches get re-done
> with releasable quality
>
> Guy Thornley

That's ok, we like to help test and proof things, we don't need patches to be
release quality.

Mark