Thread: Re: [PATCHES] O_DIRECT for WAL writes
Hi all,

O_DIRECT for WAL writes was discussed at
http://archives.postgresql.org/pgsql-patches/2005-06/msg00064.php
but there are some items that still need discussion, so I would like to
re-post it to HACKERS.

Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> I think the conclusion from the discussion is that O_DIRECT is in
> addition to the sync method, rather than in place of it, because
> O_DIRECT doesn't have the same media write guarantees as fsync(). Would
> you update the patch to do so and see if there is a performance win?

I tested two combinations,
  - fsync_direct: O_DIRECT + fsync()
  - open_direct:  O_DIRECT + O_SYNC
to compare them with O_DIRECT alone on my Linux machine.
The pgbench results still show a performance win:

scale| DBsize | open_sync | fsync=false  | O_DIRECT only| fsync_direct | open_direct
-----+--------+-----------+--------------+--------------+--------------+--------------
  10 | 150MB  | 252.6 tps | 263.5(+ 4.3%)| 253.4(+ 0.3%)| 253.6(+ 0.4%)| 253.3(+ 0.3%)
 100 | 1.5GB  | 102.7 tps | 117.8(+14.7%)| 147.6(+43.7%)| 148.9(+45.0%)| 150.8(+46.8%)

  60 runs * pgbench -c 10 -t 1000
  on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

O_DIRECT, fsync_direct and open_direct show the same performance tendency.
There was a win at scale=100, but no win at scale=10, which is a fully
in-memory benchmark.

The following items still need to be discussed:

- Are their names appropriate? Simplify to 'direct'?
- Are both fsync_direct and open_direct necessary?
  MySQL seems to use only the O_DIRECT + fsync() combination.
- Is it OK to set the direct I/O buffer alignment to BLCKSZ?
  This is a simple way to satisfy the alignment requirements of many
  environments. If it is not enough, BLCKSZ would also be a problem for
  direct I/O.

BTW, IMHO the major benefit of direct I/O is saving memory. O_DIRECT gives
a hint that the OS should not cache WAL files. Without direct I/O, the OS
might make an effort to cache WAL files, which will never be used, and
might discard the data file cache.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
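To make the alignment question above concrete, here is a minimal, hypothetical sketch (not taken from the patch) of a direct-I/O write on Linux: the buffer handed to write() on an O_DIRECT descriptor must itself be aligned. The 8192-byte block size, file name, and open flags are assumptions chosen only for illustration.

    /*
     * Hypothetical illustration (not from the patch) of the BLCKSZ
     * alignment requirement for direct I/O on Linux.
     */
    #define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192             /* assumed WAL/page block size */

    int
    main(void)
    {
        void   *buf;
        int     fd;

        /* align both the address and the length to BLCKSZ; this satisfies
         * the (filesystem-dependent) direct-I/O requirements in most
         * environments, which is the simplifying assumption above */
        if (posix_memalign(&buf, BLCKSZ, BLCKSZ) != 0)
        {
            perror("posix_memalign");
            return 1;
        }
        memset(buf, 0, BLCKSZ);

        /* "open_direct" flavour: O_DIRECT combined with O_SYNC */
        fd = open("testwal", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        if (write(fd, buf, BLCKSZ) != BLCKSZ)
            perror("write");

        close(fd);
        free(buf);
        return 0;
    }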
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> I tested two combinations,
>   - fsync_direct: O_DIRECT + fsync()
>   - open_direct:  O_DIRECT + O_SYNC
> to compare them with O_DIRECT on my Linux machine.
> The pgbench results still show a performance win:

> scale| DBsize | open_sync | fsync=false  | O_DIRECT only| fsync_direct | open_direct
> -----+--------+-----------+--------------+--------------+--------------+--------------
>   10 | 150MB  | 252.6 tps | 263.5(+ 4.3%)| 253.4(+ 0.3%)| 253.6(+ 0.4%)| 253.3(+ 0.3%)
>  100 | 1.5GB  | 102.7 tps | 117.8(+14.7%)| 147.6(+43.7%)| 148.9(+45.0%)| 150.8(+46.8%)

>   60 runs * pgbench -c 10 -t 1000
>   on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

Unfortunately, I cannot believe these numbers --- the near equality of
fsync off and fsync on means there is something very wrong with the
measurements. What I suspect is that your ATA drives are doing write
caching and thus the "fsyncs" are not really waiting for I/O at all.

			regards, tom lane
Takahiro,

> scale| DBsize | open_sync | fsync=false  | O_DIRECT only| fsync_direct | open_direct
> -----+--------+-----------+--------------+--------------+--------------+--------------
>   10 | 150MB  | 252.6 tps | 263.5(+ 4.3%)| 253.4(+ 0.3%)| 253.6(+ 0.4%)| 253.3(+ 0.3%)
>  100 | 1.5GB  | 102.7 tps | 117.8(+14.7%)| 147.6(+43.7%)| 148.9(+45.0%)| 150.8(+46.8%)
>   60 runs * pgbench -c 10 -t 1000
>   on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

This looks pretty good. I'd like to try it out on some of our tests. Will
get back to you on this, but it looks to me like the O_DIRECT results are
good enough to consider accepting the patch.

What filesystem and mount options did you use for this test?

> - Are both fsync_direct and open_direct necessary?
>   MySQL seems to use only the O_DIRECT + fsync() combination.

MySQL doesn't support as many operating systems as we do. What OSes and
versions will support O_DIRECT?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Unfortunately, I cannot believe these numbers --- the near equality of
> fsync off and fsync on means there is something very wrong with the
> measurements. What I suspect is that your ATA drives are doing write
> caching and thus the "fsyncs" are not really waiting for I/O at all.

I wonder whether it would make sense to have an automatic test for this
problem. I suspect there are lots of installations out there whose admins
don't realize that their hardware is doing this to them.

It shouldn't be too hard to test a few hundred or even a few thousand
fsyncs and calculate the seek time. If it implies a rotational speed over
15kRPM then you know the drive is lying and the data storage is
unreliable.

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Unfortunately, I cannot believe these numbers --- the near equality of
>> fsync off and fsync on means there is something very wrong with the
>> measurements. What I suspect is that your ATA drives are doing write
>> caching and thus the "fsyncs" are not really waiting for I/O at all.

> I wonder whether it would make sense to have an automatic test for this
> problem. I suspect there are lots of installations out there whose admins
> don't realize that their hardware is doing this to them.

Not sure about "automatic", but a simple little test program to measure
the speed of rewriting/fsyncing a small test file would surely be a nice
thing to have.

The reason I question "automatic" is that you really want to test each
drive being used, if the system has more than one; but Postgres has no
idea what the actual hardware layout is, and so no good way to know what
needs to be tested.

			regards, tom lane
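A minimal sketch of the kind of standalone timing program described here (hypothetical code, not an existing utility; the file name and loop count are arbitrary): it rewrites and fsyncs one small file repeatedly and reports the rate, which an admin can compare against the drive's rotational speed.

    /*
     * Hypothetical fsync-rate tester: on a drive that really flushes,
     * the reported rate should not exceed roughly one fsync per disk
     * revolution (~120/sec for 7200 RPM); thousands per second is a
     * strong sign of write caching.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define NLOOPS 1000

    int
    main(void)
    {
        char            buf[8192];
        struct timeval  start, stop;
        double          elapsed;
        int             fd, i;

        memset(buf, 0, sizeof(buf));
        fd = open("fsync_test_file", O_WRONLY | O_CREAT, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        gettimeofday(&start, NULL);
        for (i = 0; i < NLOOPS; i++)
        {
            /* rewrite the same block and force it to disk each time */
            if (lseek(fd, 0, SEEK_SET) < 0 ||
                write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf) ||
                fsync(fd) != 0)
            {
                perror("write/fsync");
                return 1;
            }
        }
        gettimeofday(&stop, NULL);

        elapsed = (stop.tv_sec - start.tv_sec) +
                  (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("%d fsyncs in %.3f s = %.1f fsyncs/sec\n",
               NLOOPS, elapsed, NLOOPS / elapsed);

        close(fd);
        unlink("fsync_test_file");
        return 0;
    }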
On Thu, 22 Jun 2005, Greg Stark wrote:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>> Unfortunately, I cannot believe these numbers --- the near equality of
>> fsync off and fsync on means there is something very wrong with the
>> measurements. What I suspect is that your ATA drives are doing write
>> caching and thus the "fsyncs" are not really waiting for I/O at all.
>
> I wonder whether it would make sense to have an automatic test for this
> problem. I suspect there are lots of installations out there whose admins
> don't realize that their hardware is doing this to them.

But is it really a problem? I somewhere got the impression that some
drives, on power failure, will be able to keep going for long enough to
write out the cache and park the heads anyway. If so, the drive is still
guaranteeing the write.

But regardless, perhaps we can add some stuff to the various OSes'
startup scripts that could help with this. For example, in NetBSD you
can "dkctl <device> setcache r" for most any disk device (certainly all
SCSI and ATA) to enable the read cache and disable the write cache.

cjs
-- 
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org
Make up enjoying your city life...produced by BIC CAMERA
Curt Sampson <cjs@cynic.net> writes:
> But regardless, perhaps we can add some stuff to the various OSes'
> startup scripts that could help with this. For example, in NetBSD you
> can "dkctl <device> setcache r" for most any disk device (certainly all
> SCSI and ATA) to enable the read cache and disable the write cache.

[ shudder ] I can see the complaints now: "Merely starting up Postgres
cut my overall system performance by a factor of 10! I wasn't even using
it!! What a piece of junk!!!"

I can hardly think of a better way to drive away people with a marginal
interest in the database...

This can *not* be default behavior, and unfortunately that limits its
value quite a lot.

			regards, tom lane
On Wed, 22 Jun 2005, Tom Lane wrote:

> [ shudder ] I can see the complaints now: "Merely starting up Postgres
> cut my overall system performance by a factor of 10!

Yeah, quite the scenario.

> This can *not* be default behavior, and unfortunately that limits its
> value quite a lot.

Indeed. Maybe it's best just to document this stuff for the various OSes,
and let the admins deal with configuring their machines. But you know, it
might be a reasonable option switch, or something.

cjs
-- 
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org
Make up enjoying your city life...produced by BIC CAMERA
[ on the other point... ]

Curt Sampson <cjs@cynic.net> writes:
> But is it really a problem? I somewhere got the impression that some
> drives, on power failure, will be able to keep going for long enough to
> write out the cache and park the heads anyway. If so, the drive is still
> guaranteeing the write.

If the drives worked that way, we'd not be seeing any problem, but we do
see problems. Without having a whole lot of data to back it up, I would
think that keeping the platter spinning is no problem (sheer rotational
inertia) but seeking to a lot of new tracks to write randomly-positioned
dirty sectors would require significant energy that just ain't there
once the power drops. I seem to recall reading that the seek actuators
eat the largest share of power in a running drive...

			regards, tom lane
On Thu, 23 Jun 2005, Tom Lane wrote:

> [ on the other point... ]
>
> Curt Sampson <cjs@cynic.net> writes:
> > But is it really a problem? I somewhere got the impression that some
> > drives, on power failure, will be able to keep going for long enough to
> > write out the cache and park the heads anyway. If so, the drive is still
> > guaranteeing the write.
>
> If the drives worked that way, we'd not be seeing any problem, but we do
> see problems. Without having a whole lot of data to back it up, I would
> think that keeping the platter spinning is no problem (sheer rotational
> inertia) but seeking to a lot of new tracks to write randomly-positioned
> dirty sectors would require significant energy that just ain't there
> once the power drops. I seem to recall reading that the seek actuators
> eat the largest share of power in a running drive...

I've seen discussion about disks behaving this way. There's no magic:
they're battery backed.

Thanks,

Gavin
On 6/23/05, Gavin Sherry <swm@linuxworld.com.au> wrote:
> > inertia) but seeking to a lot of new tracks to write randomly-positioned
> > dirty sectors would require significant energy that just ain't there
> > once the power drops. I seem to recall reading that the seek actuators
> > eat the largest share of power in a running drive...
>
> I've seen discussion about disks behaving this way. There's no magic:
> they're battery backed.

Nah, that isn't always the case. For example, some of the IBM Deskstars
had a few tracks at the start of the disk reserved: if the power failed,
the head retracted all the way and used the rotational energy to power
the drive long enough to write out the cache there. At startup the drive
would read it back in and finish flushing it.

... unfortunately, firmware bugs made it not always wait until the head
had returned to the start before it began writing.

I'm not sure what other drives do this (er, well, do it correctly :) ).
Gavin Sherry <swm@linuxworld.com.au> writes:
>> Curt Sampson <cjs@cynic.net> writes:
>>> But is it really a problem? I somewhere got the impression that some
>>> drives, on power failure, will be able to keep going for long enough to
>>> write out the cache and park the heads anyway. If so, the drive is still
>>> guaranteeing the write.

> I've seen discussion about disks behaving this way. There's no magic:
> they're battery backed.

Oh, sure, then it's easy ;-)

The bottom line here seems to be the same as always: you can't run an
industrial strength database on piece-of-junk consumer grade hardware.
Our problem is that because the software is free, people expect to run
it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
they blame us when they don't get the same results as the guy running
Oracle on million-dollar triply-redundant server hardware. Oh well.

			regards, tom lane
On Thu, 23 Jun 2005, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> >> Curt Sampson <cjs@cynic.net> writes:
> >>> But is it really a problem? I somewhere got the impression that some
> >>> drives, on power failure, will be able to keep going for long enough to
> >>> write out the cache and park the heads anyway. If so, the drive is still
> >>> guaranteeing the write.
>
> > I've seen discussion about disks behaving this way. There's no magic:
> > they're battery backed.
>
> Oh, sure, then it's easy ;-)
>
> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.
> Our problem is that because the software is free, people expect to run
> it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
> they blame us when they don't get the same results as the guy running
> Oracle on million-dollar triply-redundant server hardware. Oh well.

If you ever need a second job, I recommend stand-up comedy :-).

Gavin
On Thu, 23 Jun 2005, Tom Lane wrote:

> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.

Sure you can, though it may take several bits of piece-of-junk
consumer-grade hardware. It's far more about how you set up your system
and implement recovery policies than it is about hardware.

I ran an ISP back in the '90s on old PC junk, and we had far better
uptime than most of our competitors running on expensive Sun gear. One
ISP was completely out for half a day because the tech guy bent and
broke a hot-swappable circuit board while installing it, bringing down
the entire machine. (Pretty dumb of them to be running everything on a
single, irreplaceable "high-availability" system.)

> ...they blame us when they don't get the same results as the guy
> running Oracle on...

Now that phrase irritates me a bit. I've been using all this stuff for a
long time (Postgres on and off since QUEL, before SQL was dropped in
instead), and at this point, for the (perhaps slim) majority of
applications, I would say that PostgreSQL is a better database than
Oracle. It requires much, much less effort to get a system and its test
framework up and running under PostgreSQL than it does under Oracle,
PostgreSQL has far fewer stupid limitations, and in other areas, such as
performance, it competes reasonably well in a lot of cases. It's a
pretty impressive piece of work, thanks in large part to the efforts put
in over the last few years.

cjs
-- 
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org
Make up enjoying your city life...produced by BIC CAMERA
On Wed, Jun 22, 2005 at 03:50:04PM -0400, Tom Lane wrote:
> The reason I question "automatic" is that you really want to test each
> drive being used, if the system has more than one; but Postgres has no
> idea what the actual hardware layout is, and so no good way to know what
> needs to be tested.

Would testing in the WAL directory be sufficient? Or at least better
than nothing? Of course we could test in the database directories as
well, but you never know if stuff's been symlinked elsewhere... err, we
can test for that, no?

In any case, it seems like it'd be good to try to test and throw a
warning if the drive appears to be caching or if we think the test might
not cover everything (ie symlinks in the data directory).

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <decibel@decibel.org> writes: > Would testing in the WAL directory be sufficient? Or at least better > than nothing? Of course we could test in the database directories as > well, but you never know if stuff's been symlinked elsewhere... err, we > can test for that, no? > > In any case, it seems like it'd be good to try to test and throw a > warning if the drive appears to be caching or if we think the test might > not cover everything (ie symlinks in the data directory). I think it would make more sense to write the test as a separate utility program--then the sysadmin can check the disks he cares about. I don't personally see the need to burden the backend with this. -Doug
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> Unfortunately, I cannot believe these numbers --- the near equality of
> >> fsync off and fsync on means there is something very wrong with the
> >> measurements. What I suspect is that your ATA drives are doing write
> >> caching and thus the "fsyncs" are not really waiting for I/O at all.
>
> > I wonder whether it would make sense to have an automatic test for this
> > problem. I suspect there are lots of installations out there whose admins
> > don't realize that their hardware is doing this to them.
>
> Not sure about "automatic", but a simple little test program to measure
> the speed of rewriting/fsyncing a small test file would surely be a nice
> thing to have.
>
> The reason I question "automatic" is that you really want to test each
> drive being used, if the system has more than one; but Postgres has no
> idea what the actual hardware layout is, and so no good way to know what
> needs to be tested.

Some folks have battery-backed cached controllers so they would appear
as not handling fsync when in fact they do.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> The reason I question "automatic" is that you really want to test each
>> drive being used, if the system has more than one; but Postgres has no
>> idea what the actual hardware layout is, and so no good way to know what
>> needs to be tested.

> Some folks have battery-backed cached controllers so they would appear
> as not handling fsync when in fact they do.

Right, so something like refusing to start if we think fsync doesn't work
is probably not a hot idea. (Unless you want to provide a GUC variable
to override it...)

			regards, tom lane
Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> >> Curt Sampson <cjs@cynic.net> writes:
> >>> But is it really a problem? I somewhere got the impression that some
> >>> drives, on power failure, will be able to keep going for long enough to
> >>> write out the cache and park the heads anyway. If so, the drive is still
> >>> guaranteeing the write.
>
> > I've seen discussion about disks behaving this way. There's no magic:
> > they're battery backed.
>
> Oh, sure, then it's easy ;-)
>
> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.
> Our problem is that because the software is free, people expect to run
> it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
> they blame us when they don't get the same results as the guy running
> Oracle on million-dollar triply-redundant server hardware. Oh well.

At least we have an FAQ on this:

    <H3><A name="3.7">3.7</A>) What computer hardware should I use?</H3>

    <P>Because PC hardware is mostly compatible, people tend to believe that
    all PC hardware is of equal quality. It is not. ECC RAM, SCSI, and
    quality motherboards are more reliable and have better performance than
    less expensive hardware. PostgreSQL will run on almost any hardware,
    but if reliability and performance are important it is wise to research
    your hardware options thoroughly. Our email lists can be used to discuss
    hardware options and tradeoffs.</P>

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Unfortunately, I cannot believe these numbers --- the near equality of
> fsync off and fsync on means there is something very wrong with the
> measurements. What I suspect is that your ATA drives are doing write
> caching and thus the "fsyncs" are not really waiting for I/O at all.

I think direct I/O and the writeback cache should be considered separate
issues. My guess is that direct I/O keeps the OS from caching WAL files,
so that it can use more memory to cache data files.

In my previous test, I had enabled the writeback cache on my drives for
performance reasons. But I understand from the discussion that the cache
should be disabled for reliable writes. Also, my checkpoint_segments
setting might have been too large compared with the default. So I'll
post the new results:

checkpoint_ | writeback |
 segments   |   cache   | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
------------+-----------+-----------+----------------+---------------+---------------+--------------
[1]   48    |    on     | 109.3 tps | 125.1(+ 11.4%) | 157.3(+44.0%) | 160.4(+46.8%) | 161.1(+47.5%)
[2]    3    |    on     | 102.5 tps | 136.3(+ 33.0%) | 117.6(+14.7%) |               |
[3]    3    |    off    |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)

- 30 runs * pgbench -s 100 -c 10 -t 1000
- using 2 ATA disks:
  - hda (reiserfs) includes system and WAL. Writeback cache is on at
    [1][2] and off at [3].
  - hdc (jfs) includes database files. Writeback cache is always on.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> ... So I'll post the new results:

> checkpoint_ | writeback |
>  segments   |   cache   | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
> ------------+-----------+-----------+----------------+---------------+---------------+--------------
> [3]    3    |    off    |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)

Yeah, this is about what I was afraid of: if you're actually fsyncing
then you get at best one commit per disk revolution, and the negotiation
with the OS is down in the noise.

At this point I'm inclined to reject the patch on the grounds that it
adds complexity and portability issues, without actually buying any
useful performance improvement. The write-cache-on numbers are not
going to be interesting to any serious user :-(

			regards, tom lane
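For scale (assuming ordinary 7200 RPM ATA drives, which is an assumption about this hardware), the one-commit-per-revolution ceiling works out to:

    7200 rev/min / 60 = 120 rev/sec  =>  at most ~120 flushed commit groups/sec

so the ~38 tps of case [3], where commits also compete with checkpoint and data-file I/O, is roughly what honest flushes can deliver, while the near-equality of fsync on and off in the earlier run was the telltale sign of a caching drive.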
On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > ... So I'll post the new results:
>
> > checkpoint_ | writeback |
> >  segments   |   cache   | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
> > ------------+-----------+-----------+----------------+---------------+---------------+--------------
> > [3]    3    |    off    |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
>
> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.
>
> At this point I'm inclined to reject the patch on the grounds that it
> adds complexity and portability issues, without actually buying any
> useful performance improvement. The write-cache-on numbers are not
> going to be interesting to any serious user :-(

Is there anyone with a battery-backed RAID controller that could run
these tests? I suspect that in that case the differences might be closer
to cases [1] or [2] rather than [3], which would make the patch much more
valuable.

Josh, is this something that could be done in the performance lab?

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Jim,

> Josh, is this something that could be done in the performance lab?

That's the idea. Sadly, OSDL's hardware has been having critical failures
of late (I'm still trying to get test results on the checkpointing thing)
and the GreenPlum machines aren't up yet. I need to contact those folks
in Brazil ...

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.

If we disable the writeback cache and use open_sync, the per-page write
behavior in the WAL module shows up as a bad result. O_DIRECT is similar
to O_DSYNC (at least on Linux), so its benefit disappears behind the slow
disk revolution.

In the current source, WAL is written as:

    for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }

Is this intentional? Can we rewrite it as follows?

    write(&buffers[0], N * BLCKSZ);

In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
Aside from this, I'll also send the fixed direct I/O patch (xlog.dio.diff).
These two patches are independent, so they can be applied either
separately or both.

I tested them on my machine and the results are as follows. They show
that direct I/O plus gather-write is the best choice when the writeback
cache is off. Are these two patches worth pursuing if they are used
together?

            | writeback | fsync= | fdata | open_ | fsync_ | open_
patch       |   cache   | false  | sync  | sync  | direct | direct
------------+-----------+--------+-------+-------+--------+--------
direct io   |    off    | 124.2  | 105.7 |  48.3 |  48.3  |  48.2
direct io   |    on     | 129.1  | 112.3 | 114.1 | 142.9  | 144.5
gather-write|    off    | 124.3  | 108.7 | 105.4 | (N/A)  | (N/A)
both        |    off    | 131.5  | 115.5 | 114.4 | 145.4  | 145.2

- 20 runs * pgbench -s 100 -c 50 -t 200
- with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
- using 2 ATA disks:
  - hda (reiserfs) includes system and WAL.
  - hdc (jfs) includes database files. Writeback cache is always on.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
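For illustration, a self-contained sketch of the gather-write idea (hypothetical code, not the xlog.gw.diff patch itself; the file name, page count, and buffer layout are assumptions): N page-sized write()s are replaced by one large write() when the dirty pages are contiguous, or by one writev() when they are not.

    /*
     * Hypothetical gather-write illustration: the sync cost is paid
     * once per flush instead of once per page.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLCKSZ  8192
    #define NPAGES  4

    static char pages[NPAGES][BLCKSZ];  /* stand-in for contiguous WAL buffers */

    int
    main(void)
    {
        struct iovec iov[NPAGES];
        int     fd, i;

        memset(pages, 0, sizeof(pages));
        fd = open("testwal", O_WRONLY | O_CREAT, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* current style: N system calls; with O_SYNC or a per-call
         * flush, each one can cost a full disk revolution */
        for (i = 0; i < NPAGES; i++)
            if (write(fd, pages[i], BLCKSZ) != BLCKSZ)
                perror("write");

        /* gather-write style: the same bytes in one call */
        if (write(fd, pages[0], (size_t) NPAGES * BLCKSZ) < 0)
            perror("write");

        /* if the dirty pages were not adjacent in memory, writev()
         * could still gather them into a single request */
        for (i = 0; i < NPAGES; i++)
        {
            iov[i].iov_base = pages[i];
            iov[i].iov_len = BLCKSZ;
        }
        if (writev(fd, iov, NPAGES) < 0)
            perror("writev");

        close(fd);
        return 0;
    }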
These patches will require some refactoring and documentation, but I will
do that when I apply it.

Your patch has been added to the PostgreSQL unapplied patches list at:

	http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
>
> If we disable the writeback cache and use open_sync, the per-page write
> behavior in the WAL module shows up as a bad result. O_DIRECT is similar
> to O_DSYNC (at least on Linux), so its benefit disappears behind the slow
> disk revolution.
>
> In the current source, WAL is written as:
>     for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
>     write(&buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct I/O patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either
> separately or both.
>
> I tested them on my machine and the results are as follows. They show
> that direct I/O plus gather-write is the best choice when the writeback
> cache is off. Are these two patches worth pursuing if they are used
> together?
>
>             | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch       |   cache   | false  | sync  | sync  | direct | direct
> ------------+-----------+--------+-------+-------+--------+--------
> direct io   |    off    | 124.2  | 105.7 |  48.3 |  48.3  |  48.2
> direct io   |    on     | 129.1  | 112.3 | 114.1 | 142.9  | 144.5
> gather-write|    off    | 124.3  | 108.7 | 105.4 | (N/A)  | (N/A)
> both        |    off    | 131.5  | 115.5 | 114.4 | 145.4  | 145.2
>
> - 20 runs * pgbench -s 100 -c 50 -t 200
> - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
>   - hda (reiserfs) includes system and WAL.
>   - hdc (jfs) includes database files. Writeback cache is always on.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories

[ Attachment, skipping... ]

[ Attachment, skipping... ]

> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Fri, 24 Jun 2005 09:21:56 -0700
Josh Berkus <josh@agliodbs.com> wrote:

> Jim,
>
> > Josh, is this something that could be done in the performance lab?
>
> That's the idea. Sadly, OSDL's hardware has been having critical failures
> of late (I'm still trying to get test results on the checkpointing thing)
> and the GreenPlum machines aren't up yet.

I'm on the verge of having a 4-way Opteron system with 4 Adaptec 2200s
SCSI controllers attached to eight 10-disk 36GB arrays ready. I believe
there are software tools that'll let you reconfigure the LUNs from Linux,
so you wouldn't need physical access. Anyone want time on the system?

Mark
On Fri, 2005-06-24 at 09:37 -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > ... So I'll post the new results:
>
> > checkpoint_ | writeback |
> >  segments   |   cache   | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
> > ------------+-----------+-----------+----------------+---------------+---------------+--------------
> > [3]    3    |    off    |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
>
> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.
>
> At this point I'm inclined to reject the patch on the grounds that it
> adds complexity and portability issues, without actually buying any
> useful performance improvement. The write-cache-on numbers are not
> going to be interesting to any serious user :-(

You mean not interesting to people without a UPS. Personally, I'd like
to realize a 50% boost in tps, which is what O_DIRECT buys according to
ITAGAKI Takahiro's posted results.

The batteries on a caching RAID controller can run for days at a stretch.
It's not as dangerous as people make it sound. And anyone running PG on
software RAID is crazy.

-jwb
On Fri, 2005-06-24 at 10:19 -0500, Jim C. Nasby wrote:
> On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
> > ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > > ... So I'll post the new results:
> >
> > > checkpoint_ | writeback |
> > >  segments   |   cache   | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
> > > ------------+-----------+-----------+----------------+---------------+---------------+--------------
> > > [3]    3    |    off    |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
> >
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
> >
> > At this point I'm inclined to reject the patch on the grounds that it
> > adds complexity and portability issues, without actually buying any
> > useful performance improvement. The write-cache-on numbers are not
> > going to be interesting to any serious user :-(
>
> Is there anyone with a battery-backed RAID controller that could run
> these tests? I suspect that in that case the differences might be closer
> to 1 or 2 rather than 3, which would make the patch much more valuable.

I applied the O_DIRECT patch to 8.0.3 and I tested this on a
battery-backed RAID controller with 128MB of cache and 5 7200RPM SATA
disks. All caches are write-back. The xlog and data are on the same JFS
volume. pgbench was run with a scale factor of 1000 and 100000 total
transactions. Clients varied from 10 to 100.

Clients |  fsync  | open_direct
------------------------------------
     10 |      81 |   98 (+21%)
    100 |     100 |  105 ( +5%)
------------------------------------

No problems were experienced. The patch seems to give a useful boost!

-jwb
"Jeffrey W. Baker" <jwb@gghcwest.com> writes: > The batteries on a caching RAID controller can run for days at a > stretch. It's not as dangerous as people make it sound. And anyone > running PG on software RAID is crazy. Get back to us after your first hardware failure when your vendor says the power supply you need is on backorder and won't be available for 48 hours... (And what's your problem with software raid anyways?) -- greg
Greg Stark wrote:
> "Jeffrey W. Baker" <jwb@gghcwest.com> writes:
>
>> The batteries on a caching RAID controller can run for days at a
>> stretch. It's not as dangerous as people make it sound. And anyone
>> running PG on software RAID is crazy.
>
> Get back to us after your first hardware failure when your vendor says the
> power supply you need is on backorder and won't be available for 48 hours...
>
> (And what's your problem with software raid anyways?)

I would have to second that. Software raid works just fine.

Sincerely,

Joshua D. Drake

-- 
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/