Thread: further testing on IDE drives
I was testing to get some idea of how to speed up pgbench with IDE drives and write caching turned off in Linux (i.e. hdparm -W0 /dev/hdx).

The only parameter that seemed to make a noticeable difference was setting wal_sync_method = open_sync. With it set to either fsync or fdatasync, the speed with 'pgbench -c 5 -t 1000' ranged from 11 to 17 tps. With open_sync it jumped to the range of 45 to 52 tps. With the write cache on I was getting 280 to 320 tps. So, now instead of being 20 to 30 times slower, I'm only about 5 times slower. Much better.

Now I'm off to start a 'pgbench -c 10 -t 10000', pull the power cord, and see if the data gets corrupted with write caching turned on, i.e. do my hard drives have the ability to write at least some of their cache during spin down?
On Thu, 2 Oct 2003, scott.marlowe wrote:
> I was testing to get some idea of how to speed up pgbench with IDE
> drives and write caching turned off in Linux (i.e. hdparm -W0 /dev/hdx).
> [...]
> Now I'm off to start a "pgbench -c 10 -t 10000" and pull the power cord
> and see if the data gets corrupted with write caching turned on, i.e. do
> my hard drives have the ability to write at least some of their cache
> during spin down.

OK, back from testing.

Information: dual P-IV system with a pair of 80 GB IDE drives, model number ST380023A (Seagate). The file system is ext3 and is on a separate drive from the OS.

These drives DO NOT flush their write cache when they lose power. Testing was done by issuing an 'hdparm -W0 /dev/hdx' or 'hdparm -W1 /dev/hdx' command, where x is the real drive letter. Then I'd issue a 'pgbench -c 50 -t 100000000' command, wait for a few minutes, then pull the power cord.

I'm running a stock install of Red Hat Linux 9.0, kernel 2.4.20-8smp.

Three times, pulling the plug with 'hdparm -W0 /dev/hdx' resulted in a machine that would boot up, recover with the journal, and a database that came up within about 30 seconds with all the accounts still intact.

Switching the caching back on with 'hdparm -W1 /dev/hdx' and doing the same 'pgbench -c 50 -t 100000000' resulted in a corrupted database each time.

Also, I tried each of the following sync methods with write caching turned off: fsync, fdatasync, and open_sync. Each survived a power-off test with no corruption of the database.
fsync and fdatasync resulted in 11 to 17 tps with 'pgbench -c 5 -t 500', while open_sync resulted in 45 to 55 tps, as mentioned in the previous post.

I'd be interested in hearing from other folks which sync method works for them, and whether or not there are any IDE drives out there that can write their cache to the platters on power-off when caching is enabled.
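For readers who want to repeat the experiment, the procedure above can be sketched as a small shell script. The drive path and the 'bench' database name are assumptions (adjust for your system), and the destructive steps are wrapped in functions so nothing runs until invoked deliberately, as root, on real hardware:

```shell
#!/bin/sh
# Sketch of the pull-the-plug test described above. DRIVE and the
# 'bench' database name are assumptions -- adjust for your system.

DRIVE=${DRIVE:-/dev/hdc}

show_write_cache()    { hdparm -W  "$DRIVE"; }  # report current setting
disable_write_cache() { hdparm -W0 "$DRIVE"; }  # safe: cache off
enable_write_cache()  { hdparm -W1 "$DRIVE"; }  # fast but unsafe

start_load() {
    # Long-running pgbench so the plug can be pulled mid-transaction.
    pgbench -c 50 -t 100000000 bench &
}

check_integrity() {
    # After reboot and crash recovery, see whether the accounts
    # table survived.
    psql bench -c "SELECT count(*) FROM accounts;"
}
```

Intended sequence: disable_write_cache (or enable it, for the failing case), start_load, wait a few minutes, pull the plug, then run check_integrity after reboot.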
scott.marlowe wrote:
> The only parameter that seems to make a noticeable difference was setting
> wal_sync_method = open_sync. With it set to either fsync or fdatasync,
> the speed with pgbench -c 5 -t 1000 ran from 11 to 17 tps. With open_sync
> it jumped to the range of 45 to 52 tps.

Is this a reason we should switch to open_sync as a default, if it is available, rather than fsync? I think we are doing a single write before fsync a lot more often than we are doing multiple writes before fsync.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
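For reference, the knob under discussion lives in postgresql.conf. The value shown reflects the fastest method in the tests above; which values are accepted depends on the platform:

```ini
# postgresql.conf -- WAL flush method. In the IDE tests in this thread,
# open_sync was roughly 3-4x faster than fsync/fdatasync with the drive
# write cache disabled. Accepted values are platform-dependent.
wal_sync_method = open_sync   # fsync | fdatasync | open_sync | open_datasync
```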
How did this drive come configured by default? Write cache disabled?

---------------------------------------------------------------------------

scott.marlowe wrote:
> OK, back from testing.
>
> Information: dual P-IV system with a pair of 80 GB IDE drives, model
> number ST380023A (Seagate). The file system is ext3 and is on a
> separate drive from the OS.
>
> These drives DO NOT flush their write cache when they lose power.
> [...]
> Three times, pulling the plug with 'hdparm -W0 /dev/hdx' resulted in a
> machine that would boot up, recover with the journal, and a database
> that came up within about 30 seconds with all the accounts still intact.
>
> Switching the caching back on with 'hdparm -W1 /dev/hdx' and doing the
> same 'pgbench -c 50 -t 100000000' resulted in a corrupted database
> each time.
> Also, I tried each of the following sync methods with write caching
> turned off: fsync, fdatasync, and open_sync. Each survived a power-off
> test with no corruption of the database. fsync and fdatasync resulted
> in 11 to 17 tps with 'pgbench -c 5 -t 500', while open_sync resulted
> in 45 to 55 tps, as mentioned in the previous post.
>
> I'd be interested in hearing from other folks which sync method works
> for them and whether or not there are any IDE drives out there that can
> write their cache to the platters on power-off when caching is enabled.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Nope, write cache enabled by default.

On Thu, 9 Oct 2003, Bruce Momjian wrote:
> How did this drive come configured by default? Write cache disabled?
> [earlier test results snipped]
On Thu, 9 Oct 2003, Bruce Momjian wrote:
> Is this a reason we should switch to open_sync as a default, if it is
> available, rather than fsync? I think we are doing a single write
> before fsync a lot more often than we are doing multiple writes
> before fsync.

Sounds reasonable to me. Are there many / any scenarios where a plain fsync would be faster than open_sync?
scott.marlowe wrote:
> Sounds reasonable to me. Are there many / any scenarios where a plain
> fsync would be faster than open_sync?

Yes. If you were doing multiple WAL writes before the transaction's fsync, open_sync would be syncing every write, rather than doing two writes and fsync'ing them both at once. I wonder if larger transactions would find open_sync slower?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
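The trade-off described here can be sketched with GNU dd (a rough analogy, not PostgreSQL's actual WAL code): oflag=sync opens the output file O_SYNC so every write blocks until it reaches the disk, like open_sync, while conv=fsync buffers all the writes and flushes once at the end, like write-then-fsync. When a transaction writes several WAL blocks, the one-flush style does less synchronous I/O:

```shell
#!/bin/sh
# Rough analogy of the two flush styles using GNU dd. The file name is
# arbitrary; wrap each command in time(1) to compare on your hardware.

# open_sync style: file opened O_SYNC, each 8 kB write is synchronous.
dd if=/dev/zero of=walfile bs=8k count=128 oflag=sync 2>/dev/null

# fsync style: 128 buffered writes, then a single flush at the end.
dd if=/dev/zero of=walfile bs=8k count=128 conv=fsync 2>/dev/null
```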
Bruce,

> Yes. If you were doing multiple WAL writes before transaction fsync,
> you would be fsyncing every write, rather than doing two writes and
> fsync'ing them both. I wonder if larger transactions would find
> open_sync slower?

Want me to test? I've got an IDE-based test machine here, and the TPCC databases.

--
Josh Berkus
Aglio Database Solutions
San Francisco
On Fri, 10 Oct 2003, Josh Berkus wrote:
> Want me to test? I've got an IDE-based test machine here, and the
> TPCC databases.

Just make sure the drive's write cache is disabled.
Josh Berkus wrote:
> Want me to test? I've got an IDE-based test machine here, and the
> TPCC databases.

I would be interested to see if wal_sync_method = fsync is slower than wal_sync_method = open_sync. How often are we doing more than one write before an fsync anyway?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce,

> I would be interested to see if wal_sync_method = fsync is slower than
> wal_sync_method = open_sync. How often are we doing more than one
> write before an fsync anyway?

OK. I'll see if I can get to it around the other stuff I have to do this weekend.

--
Josh Berkus
Aglio Database Solutions
San Francisco
>>>>> "BM" == Bruce Momjian <pgman@candle.pha.pa.us> writes:

>> Sounds reasonable to me. Are there many / any scenarios where a plain
>> fsync would be faster than open_sync?

BM> Yes. If you were doing multiple WAL writes before transaction fsync,
BM> you would be fsyncing every write, rather than doing two writes and
BM> fsync'ing them both. I wonder if larger transactions would find
BM> open_sync slower?

Consider loading a large database from a backup dump: one big transaction during the COPY. I don't know the implications it has for this scenario, though.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD  +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/
Vivek Khera wrote:
> Consider loading a large database from a backup dump: one big
> transaction during the COPY. I don't know the implications it has for
> this scenario, though.

COPY only does fsync on COPY completion, so I am not sure there are enough fsyncs there to make a difference.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Fri, 10 Oct 2003, Josh Berkus wrote:
> Want me to test? I've got an IDE-based test machine here, and the
> TPCC databases.

OK, I decided to do a quick and dirty test of things that are big transactions in each mode my kernel supports. I did this:

  createdb dbname
  time pg_dump -O -h otherserver dbname | psql dbname

Then I would drop the db, edit postgresql.conf, and restart the server.

open_sync was WAY faster at this than the other two methods.

open_sync:
1st run:  real 11m27.107s  user 0m26.570s  sys 0m1.150s
2nd run:  real  6m5.712s   user 0m26.700s  sys 0m1.700s

fsync:
1st run:  real 15m8.127s   user 0m26.710s  sys 0m0.990s
2nd run:  real 15m8.396s   user 0m26.990s  sys 0m1.870s

fdatasync:
1st run:  real 15m47.878s  user 0m26.570s  sys 0m1.480s
2nd run:  real 15m9.402s   user 0m27.000s  sys 0m1.660s

I did the first runs in order, then started over, i.e. open_sync run 1, fsync run 1, fdatasync run 1, open_sync run 2, etc.

The machine I was restoring to was under no other load. The machine I was reading from had little or no load, but it is a production server, so it's possible the load there could have had a small effect, but probably not this big of a one.

The machine this is on is set up so that the data partition is on a drive with write cache enabled, but the pg_xlog and pg_clog directories are on a drive with write cache disabled. Same drive model as listed in my previous test: Seagate generic 80 GB IDE drives, model ST380023A.
>>>>> "BM" == Bruce Momjian <pgman@candle.pha.pa.us> writes:

BM> COPY only does fsync on COPY completion, so I am not sure there are
BM> enough fsync's there to make a difference.

Perhaps then it is part of the indexing that takes so much time with the WAL. When I applied Marc's WAL-disabling patch, it shaved nearly 50 minutes off of a 4-hour restore. I sent Tom the logs from the restores, since he was interested in figuring out where the time was saved.
"scott.marlowe" <scott.marlowe@ihs.com> writes:
> open_sync was WAY faster at this than the other two methods.

Do you not have open_datasync? That's the preferred method if available.

			regards, tom lane
On Tue, 14 Oct 2003, Tom Lane wrote:
> Do you not have open_datasync? That's the preferred method if
> available.

Nope; when I try to start PostgreSQL with wal_sync_method set to open_datasync, I get this error message:

  FATAL: invalid value for "wal_sync_method": "open_datasync"

This is on Red Hat 9, but I have the same problem on a RH 7.2 box as well.
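A likely explanation (my reading of the situation, not stated in the thread): on that era of Linux, glibc defined O_DSYNC as a mere alias for O_SYNC, and PostgreSQL only offers open_datasync when O_DSYNC is a genuinely distinct flag. A quick, non-authoritative way to look at what the system headers define:

```shell
#!/bin/sh
# Look for O_DSYNC in the system headers. On old Linux/glibc, O_DSYNC
# was simply defined as O_SYNC, which is why open_datasync was not
# offered. /usr/include is the usual glibc location, not guaranteed.
grep -rh "define O_DSYNC" /usr/include/ 2>/dev/null \
    || echo "O_DSYNC not defined in /usr/include"
```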
Bruce Momjian wrote:
> Yes. If you were doing multiple WAL writes before transaction fsync,
> you would be fsyncing every write, rather than doing two writes and
> fsync'ing them both. I wonder if larger transactions would find
> open_sync slower?

No hard numbers, but I remember testing fsync vs. open_sync some time ago on 7.3.x. open_sync was blazingly fast for pgbench, but when we switched our development database over to open_sync, things slowed to a crawl. This was some months ago, and I might be wrong, so take it with a grain of salt. It was on Red Hat 8's Linux kernel 2.4.18, I think. YMMV. I will be testing it for real tonight, if possible.
I have updated my hardware performance documentation to reflect the findings during the past few months on the performance list:

  http://candle.pha.pa.us/main/writings/pgsql/hw_performance/index.html

Thanks.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073