Thread: Re: [PATCHES] O_DIRECT for WAL writes

Re: [PATCHES] O_DIRECT for WAL writes

From
ITAGAKI Takahiro
Date:
Hi all,
O_DIRECT for WAL writes was discussed at
http://archives.postgresql.org/pgsql-patches/2005-06/msg00064.php
but there are still some items I would like to discuss, so I am
re-posting it to HACKERS.


Bruce Momjian <pgman@candle.pha.pa.us> wrote:

> I think the conclusion from the discussion is that O_DIRECT is in
> addition to the sync method, rather than in place of it, because
> O_DIRECT doesn't have the same media write guarantees as fsync().  Would
> you update the patch to do and see if there is a performance win?

I tested two combinations,
  - fsync_direct: O_DIRECT+fsync()
  - open_direct: O_DIRECT+O_SYNC
to compare them with O_DIRECT on my Linux machine.
The pgbench results still show a performance win:

scale| DBsize | open_sync | fsync=false  | O_DIRECT only| fsync_direct | open_direct
-----+--------+-----------+--------------+--------------+--------------+---------------
  10 |  150MB | 252.6 tps | 263.5(+ 4.3%)| 253.4(+ 0.3%)| 253.6(+ 0.4%)| 253.3(+ 0.3%)
 100 |  1.5GB | 102.7 tps | 117.8(+14.7%)| 147.6(+43.7%)| 148.9(+45.0%)| 150.8(+46.8%)
    60runs * pgbench -c 10 -t 1000
    on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

O_DIRECT, fsync_direct and open_direct show the same performance tendency.
There was a win at scale=100, but no win at scale=10, which is a fully
in-memory benchmark.

The following items still need discussion:
- Are their names appropriate?  Simplify to 'direct'?
- Are both fsync_direct and open_direct necessary?
    MySQL seems to use only the O_DIRECT+fsync() combination.
- Is it OK to set the direct I/O buffer alignment to BLCKSZ?
    This is a simple way to match the alignment on many environments.
    If that is not enough, BLCKSZ would also be a problem for direct I/O.
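To make the alignment point concrete, here is a minimal sketch (my own illustration, not the actual patch; the function names `alloc_aligned_wal_buffer` and `open_wal_direct` are hypothetical) of allocating a BLCKSZ-aligned buffer and opening a file in the open_direct style. Direct I/O requires the buffer, file offset, and transfer size to be suitably aligned; aligning to BLCKSZ (8192 by default) easily satisfies the usual 512-byte sector requirement:

```c
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* platform without O_DIRECT: fall back to buffered I/O */
#endif

#define BLCKSZ 8192

/* Allocate nblocks WAL pages, aligned to BLCKSZ for direct I/O. */
void *
alloc_aligned_wal_buffer(size_t nblocks)
{
    void   *buf = NULL;

    if (posix_memalign(&buf, BLCKSZ, nblocks * BLCKSZ) != 0)
        return NULL;
    return buf;
}

/* open_direct flavor: direct, synchronous writes.  Some filesystems
 * reject O_DIRECT, so callers must be prepared for failure. */
int
open_wal_direct(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0600);
}
```

Writes through such a descriptor must then be issued in multiples of the alignment, which is why a too-small BLCKSZ could become a problem.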
 



BTW, IMHO the major benefit of direct I/O is saving memory.  O_DIRECT
gives a hint that the OS should not cache WAL files.  Without direct
I/O, the OS might make an effort to cache WAL files, which will never
be used, and might discard data-file cache in the process.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories




Re: [PATCHES] O_DIRECT for WAL writes

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> I tested two combinations,
>   - fsync_direct: O_DIRECT+fsync()
>   - open_direct: O_DIRECT+O_SYNC
> to compare them with O_DIRECT on my linux machine.
> The pgbench results still shows a performance win:

> scale| DBsize | open_sync | fsync=false  | O_DIRECT only| fsync_direct | open_direct
> -----+--------+-----------+--------------+--------------+--------------+---------------
>   10 |  150MB | 252.6 tps | 263.5(+ 4.3%)| 253.4(+ 0.3%)| 253.6(+ 0.4%)| 253.3(+ 0.3%)
>  100 |  1.5GB | 102.7 tps | 117.8(+14.7%)| 147.6(+43.7%)| 148.9(+45.0%)| 150.8(+46.8%)
>     60runs * pgbench -c 10 -t 1000
>     on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

Unfortunately, I cannot believe these numbers --- the near equality of
fsync off and fsync on means there is something very wrong with the
measurements.  What I suspect is that your ATA drives are doing write
caching and thus the "fsyncs" are not really waiting for I/O at all.
        regards, tom lane


Re: [PATCHES] O_DIRECT for WAL writes

From
Josh Berkus
Date:
Takahiro,

> scale| DBsize | open_sync | fsync=false  | O_DIRECT only| fsync_direct | open_direct
> -----+--------+-----------+--------------+--------------+--------------+---------------
>   10 |  150MB | 252.6 tps | 263.5(+ 4.3%)| 253.4(+ 0.3%)| 253.6(+ 0.4%)| 253.3(+ 0.3%)
>  100 |  1.5GB | 102.7 tps | 117.8(+14.7%)| 147.6(+43.7%)| 148.9(+45.0%)| 150.8(+46.8%)
>     60runs * pgbench -c 10 -t 1000
>     on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

This looks pretty good.   I'd like to try it out on some of our tests.   
Will get back to you on this, but it looks  to me like the O_DIRECT 
results are good enough to consider accepting the patch.

What filesystem and mount options did you use for this test?

> - Are both fsync_direct and open_direct necessary?
>     MySQL seems to use only O_DIRECT+fsync() combination.

MySQL doesn't support as many operating systems as we do.   What OSes and 
versions will support O_DIRECT?


-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: [PATCHES] O_DIRECT for WAL writes

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Unfortunately, I cannot believe these numbers --- the near equality of
> fsync off and fsync on means there is something very wrong with the
> measurements.  What I suspect is that your ATA drives are doing write
> caching and thus the "fsyncs" are not really waiting for I/O at all.

I wonder whether it would make sense to have an automatic test for this
problem. I suspect there are lots of installations out there whose admins
don't realize that their hardware is doing this to them.

It shouldn't be too hard to test a few hundred or even a few thousand fsyncs
and calculate the seek time. If it implies a rotational speed over 15kRPM then
you know the drive is lying and the data storage is unreliable.
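Such a check could be sketched roughly as follows (an illustration of the idea only; `fsync_rate` and `max_honest_rate` are hypothetical names, not an existing tool). A drive that honestly waits for the platter cannot commit faster than once per revolution, i.e. rpm/60 per second:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Upper bound on honest commits/sec: one commit per disk revolution. */
double
max_honest_rate(double rpm)
{
    return rpm / 60.0;
}

/* Measure achieved fsyncs/sec over n rewrite+fsync cycles of one block. */
double
fsync_rate(const char *path, int n)
{
    char            block[8192];
    struct timeval  start, end;
    double          elapsed;
    int             fd, i;

    memset(block, 0, sizeof(block));
    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1.0;

    gettimeofday(&start, NULL);
    for (i = 0; i < n; i++)
    {
        /* rewrite the same block, then force it out */
        if (pwrite(fd, block, sizeof(block), 0) != (ssize_t) sizeof(block) ||
            fsync(fd) != 0)
        {
            close(fd);
            return -1.0;
        }
    }
    gettimeofday(&end, NULL);
    close(fd);

    elapsed = (end.tv_sec - start.tv_sec) +
              (end.tv_usec - start.tv_usec) / 1e6;
    return (elapsed > 0.0) ? n / elapsed : -1.0;
}
```

If fsync_rate() on the WAL drive comes back much higher than max_honest_rate() for the drive's spindle speed (250/s for 15kRPM), the "fsyncs" are being absorbed by a write cache.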

-- 
greg



Re: [PATCHES] O_DIRECT for WAL writes

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Unfortunately, I cannot believe these numbers --- the near equality of
>> fsync off and fsync on means there is something very wrong with the
>> measurements.  What I suspect is that your ATA drives are doing write
>> caching and thus the "fsyncs" are not really waiting for I/O at all.

> I wonder whether it would make sense to have an automatic test for this
> problem. I suspect there are lots of installations out there whose admins
> don't realize that their hardware is doing this to them.

Not sure about "automatic", but a simple little test program to measure
the speed of rewriting/fsyncing a small test file would surely be a nice
thing to have.

The reason I question "automatic" is that you really want to test each
drive being used, if the system has more than one; but Postgres has no
idea what the actual hardware layout is, and so no good way to know what
needs to be tested.
        regards, tom lane


Re: [PATCHES] O_DIRECT for WAL writes

From
Curt Sampson
Date:
On Thu, 22 Jun 2005, Greg Stark wrote:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>> Unfortunately, I cannot believe these numbers --- the near equality of
>> fsync off and fsync on means there is something very wrong with the
>> measurements.  What I suspect is that your ATA drives are doing write
>> caching and thus the "fsyncs" are not really waiting for I/O at all.
>
> I wonder whether it would make sense to have an automatic test for this
> problem. I suspect there are lots of installations out there whose admins
> don't realize that their hardware is doing this to them.

But is it really a problem? I somewhere got the impression that some
drives, on power failure, will be able to keep going for long enough to
write out the cache and park the heads anyway. If so, the drive is still
guaranteeing the write.

But regardless, perhaps we can add some stuff to the various OSes'
startup scripts that could help with this. For example, in NetBSD you
can "dkctl <device> setcache r" for most any disk device (certainly all
SCSI and ATA) to enable the read cache and disable the write cache.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.NetBSD.org
    Make up enjoying your city life...produced by BIC CAMERA


Re: [PATCHES] O_DIRECT for WAL writes

From
Tom Lane
Date:
Curt Sampson <cjs@cynic.net> writes:
> But regardless, perhaps we can add some stuff to the various OSes'
> startup scripts that could help with this. For example, in NetBSD you
> can "dkctl <device> setcache r" for most any disk device (certainly all
> SCSI and ATA) to enable the read cache and disable the write cache.

[ shudder ]  I can see the complaints now: "Merely starting up Postgres
cut my overall system performance by a factor of 10!  I wasn't even
using it!!  What a piece of junk!!!"  I can hardly think of a better
way to drive away people with a marginal interest in the database...

This can *not* be default behavior, and unfortunately that limits its
value quite a lot.
        regards, tom lane


Re: [PATCHES] O_DIRECT for WAL writes

From
Curt Sampson
Date:
On Wed, 22 Jun 2005, Tom Lane wrote:

> [ shudder ]  I can see the complaints now: "Merely starting up Postgres
> cut my overall system performance by a factor of 10!

Yeah, quite the scenario.

> This can *not* be default behavior, and unfortunately that limits its
> value quite a lot.

Indeed. Maybe it's best just to document this stuff for the various
OSes, and let the admins deal with configuring their machines.

But you know, it might be a reasonable option switch, or something.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.NetBSD.org
    Make up enjoying your city life...produced by BIC CAMERA


Re: [PATCHES] O_DIRECT for WAL writes

From
Tom Lane
Date:
[ on the other point... ]

Curt Sampson <cjs@cynic.net> writes:
> But is it really a problem? I somewhere got the impression that some
> drives, on power failure, will be able to keep going for long enough to
> write out the cache and park the heads anyway. If so, the drive is still
> guaranteeing the write.

If the drives worked that way, we'd not be seeing any problem, but we do
see problems.  Without having a whole lot of data to back it up, I would
think that keeping the platter spinning is no problem (sheer rotational
inertia) but seeking to a lot of new tracks to write randomly-positioned
dirty sectors would require significant energy that just ain't there
once the power drops.  I seem to recall reading that the seek actuators
eat the largest share of power in a running drive...
        regards, tom lane


Re: [PATCHES] O_DIRECT for WAL writes

From
Gavin Sherry
Date:
On Thu, 23 Jun 2005, Tom Lane wrote:

> [ on the other point... ]
>
> Curt Sampson <cjs@cynic.net> writes:
> > But is it really a problem? I somewhere got the impression that some
> > drives, on power failure, will be able to keep going for long enough to
> > write out the cache and park the heads anyway. If so, the drive is still
> > guaranteeing the write.
>
> If the drives worked that way, we'd not be seeing any problem, but we do
> see problems.  Without having a whole lot of data to back it up, I would
> think that keeping the platter spinning is no problem (sheer rotational
> inertia) but seeking to a lot of new tracks to write randomly-positioned
> dirty sectors would require significant energy that just ain't there
> once the power drops.  I seem to recall reading that the seek actuators
> eat the largest share of power in a running drive...

I've seen discussion about disks behaving this way. There's no magic:
they're battery backed.

Thanks,

Gavin


Re: [PATCHES] O_DIRECT for WAL writes

From
Gregory Maxwell
Date:
On 6/23/05, Gavin Sherry <swm@linuxworld.com.au> wrote:

> > inertia) but seeking to a lot of new tracks to write randomly-positioned
> > dirty sectors would require significant energy that just ain't there
> > once the power drops.  I seem to recall reading that the seek actuators
> > eat the largest share of power in a running drive...
>
> I've seen discussion about disks behaving this way. There's no magic:
> they're battery backed.

Nah, this isn't always the case. For example, some of the IBM Deskstars
had a few tracks at the start of the disk reserved: if the power
failed, the head retracted all the way and the drive used the
rotational energy to power itself long enough to write out the cache.
At startup the drive would read it back in and finish flushing it.

... unfortunately, firmware bugs meant it didn't always wait until the
head returned to the start before it began writing...

I'm not sure what other drives do this (er, well, do it correctly :) ).


Re: [PATCHES] O_DIRECT for WAL writes

From
Tom Lane
Date:
Gavin Sherry <swm@linuxworld.com.au> writes:
>> Curt Sampson <cjs@cynic.net> writes:
>>> But is it really a problem? I somewhere got the impression that some
>>> drives, on power failure, will be able to keep going for long enough to
>>> write out the cache and park the heads anyway. If so, the drive is still
>>> guaranteeing the write.

> I've seen discussion about disks behaving this way. There's no magic:
> they're battery backed.

Oh, sure, then it's easy ;-)

The bottom line here seems to be the same as always: you can't run an
industrial strength database on piece-of-junk consumer grade hardware.
Our problem is that because the software is free, people expect to run
it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
they blame us when they don't get the same results as the guy running
Oracle on million-dollar triply-redundant server hardware.  Oh well.
        regards, tom lane


Re: [PATCHES] O_DIRECT for WAL writes

From
Gavin Sherry
Date:
On Thu, 23 Jun 2005, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> >> Curt Sampson <cjs@cynic.net> writes:
> >>> But is it really a problem? I somewhere got the impression that some
> >>> drives, on power failure, will be able to keep going for long enough to
> >>> write out the cache and park the heads anyway. If so, the drive is still
> >>> guaranteeing the write.
>
> > I've seen discussion about disks behaving this way. There's no magic:
> > they're battery backed.
>
> Oh, sure, then it's easy ;-)
>
> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.
> Our problem is that because the software is free, people expect to run
> it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
> they blame us when they don't get the same results as the guy running
> Oracle on million-dollar triply-redundant server hardware.  Oh well.

If you ever need a second job, I recommend stand up comedy :-).

Gavin


Re: [PATCHES] O_DIRECT for WAL writes

From
Curt Sampson
Date:
On Thu, 23 Jun 2005, Tom Lane wrote:

> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.

Sure you can, though it may take several bits of piece-of-junk
consumer-grade hardware. It's far more about how you set up your system
and implement recovery policies than it is about hardware.

I ran an ISP back in the '90s on old PC junk, and we had far better
uptime than most of our competitors running on expensive Sun gear. One
ISP was completely out for half a day because the tech. guy bent and
broke a hot-swappable circuit board while installing it, bringing down
the entire machine. (Pretty dumb of them to be running everything on a
single, irreplaceable "high-availability" system.)

> ...they blame us when they don't get the same results as the guy
> running Oracle on...

Now that phrase irritates me a bit. I've been using all this stuff for
a long time (Postgres on and off since QUEL, before SQL was dropped
in instead) and at this point, for the (perhaps slim) majority of
applications, I would say that PostgreSQL is a better database than
Oracle. It requires much, much less effort to get a system and its test
framework up and running under PostgreSQL than it does under Oracle,
PostgreSQL has far fewer stupid limitations, and in other areas, such
as performance, it competes reasonably well in a lot of cases. It's a
pretty impressive piece of work, thanks in large part to efforts put in
over the last few years.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.NetBSD.org
    Make up enjoying your city life...produced by BIC CAMERA


Re: [PATCHES] O_DIRECT for WAL writes

From
"Jim C. Nasby"
Date:
On Wed, Jun 22, 2005 at 03:50:04PM -0400, Tom Lane wrote:
> The reason I question "automatic" is that you really want to test each
> drive being used, if the system has more than one; but Postgres has no
> idea what the actual hardware layout is, and so no good way to know what
> needs to be tested.

Would testing in the WAL directory be sufficient? Or at least better
than nothing? Of course we could test in the database directories as
well, but you never know if stuff's been symlinked elsewhere... err, we
can test for that, no?

In any case, it seems like it'd be good to try to test and throw a
warning if the drive appears to be caching or if we think the test might
not cover everything (ie symlinks in the data directory).
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: [PATCHES] O_DIRECT for WAL writes

From
Douglas McNaught
Date:
"Jim C. Nasby" <decibel@decibel.org> writes:

> Would testing in the WAL directory be sufficient? Or at least better
> than nothing? Of course we could test in the database directories as
> well, but you never know if stuff's been symlinked elsewhere... err, we
> can test for that, no?
>
> In any case, it seems like it'd be good to try to test and throw a
> warning if the drive appears to be caching or if we think the test might
> not cover everything (ie symlinks in the data directory).

I think it would make more sense to write the test as a separate
utility program--then the sysadmin can check the disks he cares
about.  I don't personally see the need to burden the backend with
this.

-Doug


Re: [PATCHES] O_DIRECT for WAL writes

From
Bruce Momjian
Date:
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> Unfortunately, I cannot believe these numbers --- the near equality of
> >> fsync off and fsync on means there is something very wrong with the
> >> measurements.  What I suspect is that your ATA drives are doing write
> >> caching and thus the "fsyncs" are not really waiting for I/O at all.
> 
> > I wonder whether it would make sense to have an automatic test for this
> > problem. I suspect there are lots of installations out there whose admins
> > don't realize that their hardware is doing this to them.
> 
> Not sure about "automatic", but a simple little test program to measure
> the speed of rewriting/fsyncing a small test file would surely be a nice
> thing to have.
> 
> The reason I question "automatic" is that you really want to test each
> drive being used, if the system has more than one; but Postgres has no
> idea what the actual hardware layout is, and so no good way to know what
> needs to be tested.

Some folks have battery-backed cached controllers so they would appear
as not handling fsync when in fact they do.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [PATCHES] O_DIRECT for WAL writes

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> The reason I question "automatic" is that you really want to test each
>> drive being used, if the system has more than one; but Postgres has no
>> idea what the actual hardware layout is, and so no good way to know what
>> needs to be tested.

> Some folks have battery-backed cached controllers so they would appear
> as not handling fsync when in fact they do.

Right, so something like refusing to start if we think fsync doesn't
work is probably not a hot idea.  (Unless you want to provide a GUC
variable to override it...)
        regards, tom lane


Re: [PATCHES] O_DIRECT for WAL writes

From
Bruce Momjian
Date:
Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> >> Curt Sampson <cjs@cynic.net> writes:
> >>> But is it really a problem? I somewhere got the impression that some
> >>> drives, on power failure, will be able to keep going for long enough to
> >>> write out the cache and park the heads anyway. If so, the drive is still
> >>> guaranteeing the write.
> 
> > I've seen discussion about disks behaving this way. There's no magic:
> > they're battery backed.
> 
> Oh, sure, then it's easy ;-)
> 
> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.
> Our problem is that because the software is free, people expect to run
> it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
> they blame us when they don't get the same results as the guy running
> Oracle on million-dollar triply-redundant server hardware.  Oh well.

At least we have an FAQ on this:
   <H3><A name="3.7">3.7</A>) What computer hardware should I use?</H3>
   <P>Because PC hardware is mostly compatible, people tend to believe that
   all PC hardware is of equal quality.  It is not.  ECC RAM, SCSI, and
   quality motherboards are more reliable and have better performance than
   less expensive hardware.  PostgreSQL will run on almost any hardware,
   but if reliability and performance are important it is wise to
   research your hardware options thoroughly.  Our email lists can be used
   to discuss hardware options and tradeoffs.</P>



--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [PATCHES] O_DIRECT for WAL writes

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Unfortunately, I cannot believe these numbers --- the near equality of
> fsync off and fsync on means there is something very wrong with the
> measurements.  What I suspect is that your ATA drives are doing write
> caching and thus the "fsyncs" are not really waiting for I/O at all.

I think direct I/O and writeback-cache should be considered separate issues.
I guess that direct I/O can keep the OS from caching WAL files, so that it
will use more memory to cache data files.

In my previous test, I had enabled the writeback-cache of my drives
for performance.  But from the discussion I understand that the cache
should be disabled for reliable writes.  Also, my checkpoint_segments
setting might have been too large compared with the default.
So I'll post the new results:

checkpoint_ | writeback | 
segments    | cache     | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
------------+-----------+-----------+---------------+---------------+---------------+--------------
[1]  48     | on        | 109.3 tps | 125.1(+ 11.4%)| 157.3(+44.0%) | 160.4(+46.8%) | 161.1(+47.5%)
[2]   3     | on        | 102.5 tps | 136.3(+ 33.0%)| 117.6(+14.7%) |               | 
[3]   3     | off       |  38.2 tps | 138.8(+263.5%)|  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)

- 30runs * pgbench -s 100 -c 10 -t 1000
- using 2 ATA disks:
   - hda(reiserfs) includes system and wal. writeback-cache is on at [1][2] and off at [3].
   - hdc(jfs) includes database files. writeback-cache is always on.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories




Re: [PATCHES] O_DIRECT for WAL writes

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> ... So I'll post the new results:

> checkpoint_ | writeback | 
> segments    | cache     | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
> ------------+-----------+-----------+---------------+---------------+---------------+--------------
> [3]   3     | off       |  38.2 tps | 138.8(+263.5%)|  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)

Yeah, this is about what I was afraid of: if you're actually fsyncing
then you get at best one commit per disk revolution, and the negotiation
with the OS is down in the noise.

At this point I'm inclined to reject the patch on the grounds that it
adds complexity and portability issues, without actually buying any
useful performance improvement.  The write-cache-on numbers are not
going to be interesting to any serious user :-(
        regards, tom lane


Re: [PATCHES] O_DIRECT for WAL writes

From
"Jim C. Nasby"
Date:
On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > ... So I'll post the new results:
> 
> > checkpoint_ | writeback | 
> > segments    | cache     | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
> > ------------+-----------+-----------+---------------+---------------+---------------+--------------
> > [3]   3     | off       |  38.2 tps | 138.8(+263.5%)|  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
> 
> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.
> 
> At this point I'm inclined to reject the patch on the grounds that it
> adds complexity and portability issues, without actually buying any
> useful performance improvement.  The write-cache-on numbers are not
> going to be interesting to any serious user :-(

Is there anyone with a battery-backed RAID controller that could run
these tests? I suspect that in that case the differences might be closer
to 1 or 2 rather than 3, which would make the patch much more valuable.

Josh, is this something that could be done in the performance lab?
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: [PATCHES] O_DIRECT for WAL writes

From
Josh Berkus
Date:
Jim,

> Josh, is this something that could be done in the performance lab?

That's the idea.   Sadly, OSDL's hardware has been having critical failures of 
late (I'm still trying to get test results on the checkpointing thing) and 
the GreenPlum machines aren't up yet.

I need to contact those folks in Brazil ...

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: [PATCHES] O_DIRECT for WAL writes

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.

If we disable the writeback-cache and use open_sync, the per-page write
behavior in the WAL module shows up as a bad result.  O_DIRECT is
similar to O_DSYNC (at least on Linux), so its benefit disappears
behind the slow disk revolution.

In the current source, WAL is written as:
    for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
Is this intentional? Can we rewrite it as follows?
   write(&buffers[0], N * BLCKSZ);
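As a sketch of what such a gather-write could look like (my own illustration, not the actual xlog.gw.diff; `gather_write_pages` is a hypothetical name): if the dirty pages are contiguous a single write() works, and writev() covers the general case of scattered buffers:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ   8192
#define MAXPAGES 16             /* arbitrary bound for the sketch */

/* Write npages WAL pages with one writev() instead of npages write()s. */
ssize_t
gather_write_pages(int fd, char *pages[], int npages)
{
    struct iovec iov[MAXPAGES];
    int          i;

    if (npages < 1 || npages > MAXPAGES)
        return -1;
    for (i = 0; i < npages; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }
    return writev(fd, iov, npages);
}
```

With open_sync or O_DIRECT, this turns N synchronous waits per flush into one, which is where the benefit shows up.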

In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
These two patches are independent, so they can be applied either or both.


I tested them on my machine and the results are as follows.  They show
that direct I/O plus gather-write is the best choice when the
writeback-cache is off.
Are these two patches worth trying if they are used together?


            | writeback | fsync= | fdata | open_ | fsync_ | open_
patch       | cache     |  false |  sync |  sync | direct | direct
------------+-----------+--------+-------+-------+--------+---------
direct io   | off       |  124.2 | 105.7 |  48.3 |   48.3 |  48.2
direct io   | on        |  129.1 | 112.3 | 114.1 |  142.9 | 144.5
gather-write| off       |  124.3 | 108.7 | 105.4 |  (N/A) | (N/A)
both        | off       |  131.5 | 115.5 | 114.4 |  145.4 | 145.2

- 20runs * pgbench -s 100 -c 50 -t 200
   - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
- using 2 ATA disks:
   - hda(reiserfs) includes system and wal.
   - hdc(jfs) includes database files. writeback-cache is always on.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


Attachment

Re: [PATCHES] O_DIRECT for WAL writes

From
Bruce Momjian
Date:
These patches will require some refactoring and documentation, but I
will do that when I apply them.

Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


ITAGAKI Takahiro wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
> 
> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in WAL module will show up as bad result. O_DIRECT is similar
> to O_DSYNC (at least on linux), so that the benefit of it will disappear
> behind the slow disk revolution.
> 
> In the current source, WAL is written as:
>     for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
>    write(&buffers[0], N * BLCKSZ);
> 
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either or both.
> 
> 
> I tested them on my machine and the results as follows. It shows that
> direct-io and gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
> 
> 
>             | writeback | fsync= | fdata | open_ | fsync_ | open_ 
> patch       | cache     |  false |  sync |  sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io   | off       |  124.2 | 105.7 |  48.3 |   48.3 |  48.2 
> direct io   | on        |  129.1 | 112.3 | 114.1 |  142.9 | 144.5 
> gather-write| off       |  124.3 | 108.7 | 105.4 |  (N/A) | (N/A) 
> both        | off       |  131.5 | 115.5 | 114.4 |  145.4 | 145.2 
> 
> - 20runs * pgbench -s 100 -c 50 -t 200
>    - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
>    - hda(reiserfs) includes system and wal.
>    - hdc(jfs) includes database files. writeback-cache is always on.
> 
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
> 

[ Attachment, skipping... ]

[ Attachment, skipping... ]

> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [PATCHES] O_DIRECT for WAL writes

From
Mark Wong
Date:
On Fri, 24 Jun 2005 09:21:56 -0700
Josh Berkus <josh@agliodbs.com> wrote:

> Jim,
> 
> > Josh, is this something that could be done in the performance lab?
> 
> That's the idea.   Sadly, OSDL's hardware has been having critical failures of 
> late (I'm still trying to get test results on the checkpointing thing) and 
> the GreenPlum machines aren't up yet.

I'm on the verge of having a 4-way Opteron system with 4 Adaptec 2200S
SCSI controllers attached to eight 10-disk 36GB arrays ready.  I believe
there are software tools that'll let you reconfigure the LUNs from Linux,
so you wouldn't need physical access.  Anyone want time on the system?

Mark


Re: [PATCHES] O_DIRECT for WAL writes

From
"Jeffrey W. Baker"
Date:
On Fri, 2005-06-24 at 09:37 -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > ... So I'll post the new results:
> 
> > checkpoint_ | writeback | 
> > segments    | cache     | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
> > ------------+-----------+-----------+---------------+---------------+---------------+--------------
> > [3]   3     | off       |  38.2 tps | 138.8(+263.5%)|  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
> 
> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.
> 
> At this point I'm inclined to reject the patch on the grounds that it
> adds complexity and portability issues, without actually buying any
> useful performance improvement.  The write-cache-on numbers are not
> going to be interesting to any serious user :-(

You mean not interesting to people without a UPS.  Personally, I'd like
to realize a 50% boost in tps, which is what O_DIRECT buys according to
ITAGAKI Takahiro's posted results.

The batteries on a caching RAID controller can run for days at a
stretch.  It's not as dangerous as people make it sound.  And anyone
running PG on software RAID is crazy.

-jwb
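Tom's "one commit per disk revolution" ceiling can be sanity-checked with a
little arithmetic (a back-of-the-envelope model, not from the thread; seek
time is ignored and one synchronous WAL flush per commit is assumed):

```python
# Back-of-the-envelope model of Tom's "one commit per disk revolution"
# ceiling: with the write-back cache off, a synchronous WAL flush must
# wait for the platter to come around, so commits are bounded by the
# drive's rotation rate.
RPM = 7200                          # a typical ATA drive of the era
revs_per_sec = RPM / 60             # 120 revolutions per second
max_commits_per_sec = revs_per_sec  # at best one fsync'd commit per rev

print(max_commits_per_sec)  # 120.0
```

The measured 38.2 tps sits well under this 120/s ceiling, plausibly because
each pgbench transaction costs more than one synchronous write, which is
consistent with Tom's point that the O_DIRECT negotiation is lost in the
noise once real fsyncs dominate.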


Re: [PATCHES] O_DIRECT for WAL writes

From
"Jeffrey W. Baker"
Date:
On Fri, 2005-06-24 at 10:19 -0500, Jim C. Nasby wrote:
> On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
> > ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > > ... So I'll post the new results:
> > 
> > > checkpoint_ | writeback | 
> > > segments    | cache     | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
> > > ------------+-----------+-----------+---------------+---------------+---------------+--------------
> > > [3]   3     | off       |  38.2 tps | 138.8(+263.5%)|  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
> > 
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
> > 
> > At this point I'm inclined to reject the patch on the grounds that it
> > adds complexity and portability issues, without actually buying any
> > useful performance improvement.  The write-cache-on numbers are not
> > going to be interesting to any serious user :-(
> 
> Is there anyone with a battery-backed RAID controller that could run
> these tests? I suspect that in that case the differences might be closer
> to 1 or 2 rather than 3, which would make the patch much more valuable.

I applied the O_DIRECT patch to 8.0.3 and I tested this on a
battery-backed RAID controller with 128MB of cache and 5 7200RPM SATA
disks.  All caches are write-back.  The xlog and data are on the same
JFS volume.  pgbench was run with a scale factor of 1000 and 100000
total transactions.  Clients varied from 10 to 100.


Clients  |  fsync   |   open_direct
------------------------------------
  10     |    81    |    98 (+21%)
 100     |   100    |   105 ( +5%)
------------------------------------

No problems were experienced.  The patch seems to give a useful boost!

-jwb
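For readers unfamiliar with the mode being benchmarked here: fsync_direct
combines O_DIRECT with an explicit fsync(), while open_direct uses O_SYNC at
open time instead.  A minimal sketch of the fsync_direct combination, written
in Python for brevity (the function name, file name, and mmap-based buffer
are illustrative assumptions, not the patch's actual code), might look like:

```python
import mmap
import os

BLCKSZ = 8192  # PostgreSQL's default block size, assumed as the alignment unit


def write_wal_block(path, payload):
    """Write one BLCKSZ block the fsync_direct way: O_DIRECT + fsync().

    mmap hands back a page-aligned buffer, which satisfies O_DIRECT's
    alignment rule; the fsync() is still required because O_DIRECT only
    bypasses the page cache and gives no guarantee the media was written.
    """
    buf = mmap.mmap(-1, BLCKSZ)          # anonymous, page-aligned buffer
    buf.write(payload[:BLCKSZ])          # remainder stays zero-filled
    direct = getattr(os, "O_DIRECT", 0)  # flag is Linux-specific
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | direct, 0o600)
    except OSError:
        # some filesystems (e.g. tmpfs) reject O_DIRECT at open time;
        # fall back to a buffered open so the sketch still runs
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, buf)      # one full, aligned BLCKSZ transfer
        os.fsync(fd)                     # the fsync() half of fsync_direct
    finally:
        os.close(fd)
    return written
```

The aligned, block-sized transfer is the point of the "Is it ok to set the
dio buffer alignment to BLCKSZ?" question earlier in the thread: O_DIRECT
rejects writes whose buffer address, offset, or length violate the
filesystem's alignment requirements.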


Re: [PATCHES] O_DIRECT for WAL writes

From
Greg Stark
Date:
"Jeffrey W. Baker" <jwb@gghcwest.com> writes:

> The batteries on a caching RAID controller can run for days at a
> stretch.  It's not as dangerous as people make it sound.  And anyone
> running PG on software RAID is crazy.

Get back to us after your first hardware failure when your vendor says the
power supply you need is on backorder and won't be available for 48 hours...

(And what's your problem with software raid anyways?)

-- 
greg



Re: [PATCHES] O_DIRECT for WAL writes

From
"Joshua D. Drake"
Date:
Greg Stark wrote:
> "Jeffrey W. Baker" <jwb@gghcwest.com> writes:
> 
> 
>>The batteries on a caching RAID controller can run for days at a
>>stretch.  It's not as dangerous as people make it sound.  And anyone
>>running PG on software RAID is crazy.
> 
> 
> Get back to us after your first hardware failure when your vendor says the
> power supply you need is on backorder and won't be available for 48 hours...
> 
> (And what's your problem with software raid anyways?)

I would have to second that. Software raid works just fine.

Sincerely,

Joshua D. Drake




-- 
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/