Thread: Raid 10 chunksize
I'm trying to pin down some performance issues with a machine where I work: we are seeing (read only) query response times blow out by an order of magnitude or more at busy times. Initially we blamed autovacuum, but after a tweak of the cost_delay it is *not* the problem. Then I looked at checkpoints... and although there was some correlation between them and the query response, I'm thinking that the raid chunksize may well be the issue.

Fortunately there is an identical DR box, so I could do a little testing. Details follow:

Sun 4140 2x quad-core opteron 2356 16G RAM, 6x 15K 140G SAS
Debian Lenny
Pg 8.3.6

The disk is laid out using software (md) raid:

4 drives raid 10 *4K* chunksize with database files (ext3 ordered, noatime)
2 drives raid 1 with database transaction logs (ext3 ordered, noatime)

The relevant non-default .conf params are:

shared_buffers = 2048MB
work_mem = 4MB
maintenance_work_mem = 1024MB
max_fsm_pages = 153600
bgwriter_lru_maxpages = 200
wal_buffers = 2MB
checkpoint_segments = 32
effective_cache_size = 4096MB
autovacuum_vacuum_scale_factor = 0.1
autovacuum_vacuum_cost_delay = 60    # This is high, but seemed to help...

I've run pgbench:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 655.335102 (including connections establishing)
tps = 655.423232 (excluding connections establishing)

Looking at iostat while it is running shows (note sda-sdd raid10, sde and sdf raid 1):

Device:  rrqm/s   wrqm/s   r/s     w/s      rMB/s   wMB/s   avgrq-sz  avgqu-sz  await    svctm   %util
sda      0.00     56.80    0.00    579.00   0.00    2.47    8.74      133.76    235.10   1.73    100.00
sdb      0.00     45.60    0.00    583.60   0.00    2.45    8.59      52.65     90.03    1.71    100.00
sdc      0.00     49.00    0.00    579.80   0.00    2.45    8.66      72.56     125.09   1.72    100.00
sdd      0.00     58.40    0.00    565.00   0.00    2.42    8.79      135.31    235.52   1.77    100.00
sde      0.00     0.00     0.00    0.00     0.00    0.00    0.00      0.00      0.00     0.00    0.00
sdf      0.00     0.00     0.00    0.00     0.00    0.00    0.00      0.00      0.00     0.00    0.00

Device:  rrqm/s   wrqm/s   r/s     w/s      rMB/s   wMB/s   avgrq-sz  avgqu-sz  await    svctm   %util
sda      0.00     12.80    0.00    23.40    0.00    0.15    12.85     3.04      103.38   4.27    10.00
sdb      0.00     12.80    0.00    22.80    0.00    0.14    12.77     2.31      73.51    3.58    8.16
sdc      0.00     12.80    0.00    21.40    0.00    0.13    12.86     2.38      79.21    3.63    7.76
sdd      0.00     12.80    0.00    21.80    0.00    0.14    12.70     2.66      90.02    3.93    8.56
sde      0.00     2546.80  0.00    146.80   0.00    10.53   146.94    0.97      6.38     5.34    78.40
sdf      0.00     2546.80  0.00    146.60   0.00    10.53   147.05    0.97      6.38     5.53    81.04

Device:  rrqm/s   wrqm/s   r/s     w/s      rMB/s   wMB/s   avgrq-sz  avgqu-sz  await    svctm   %util
sda      0.00     231.40   0.00    566.80   0.00    3.16    11.41     124.92    228.26   1.76    99.52
sdb      0.00     223.00   0.00    558.00   0.00    3.06    11.23     46.64     83.55    1.70    94.88
sdc      0.00     230.60   0.00    551.60   0.00    3.07    11.40     94.38     171.54   1.76    96.96
sdd      0.00     231.40   0.00    528.60   0.00    2.94    11.37     122.55    220.81   1.83    96.48
sde      0.00     1495.80  0.00    99.00    0.00    6.23    128.86    0.81      8.15     7.76    76.80
sdf      0.00     1495.80  0.00    99.20    0.00    6.26    129.24    0.73      7.40     7.10    70.48

Top looks like:

Cpu(s):  2.5%us,  1.9%sy,  0.0%ni, 71.9%id, 23.4%wa,  0.2%hi,  0.2%si,  0.0%st
Mem:  16474084k total, 15750384k used,   723700k free,  1654320k buffers
Swap:  2104440k total,      944k used,  2103496k free, 13552720k cached

It looks to me like we are maxing out the raid 10 array, and I suspect the chunksize (4K) is the culprit. However, as this is a pest to change (!) I'd like some opinions on whether I'm jumping to conclusions. I'd also appreciate comments about what chunksize to use (I've tended to use 256K in the past, but what are folks preferring these days?)

regards

Mark
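For reference, a minimal sketch of how a run like the one above can be reproduced, and how to confirm the current md chunk size (the database name and md device are placeholders, and the exact flags used for the run above aren't shown, so treat this as an assumption):

  pgbench -i -s 100 pgbench        # initialise at scaling factor 100
  pgbench -c 24 -t 12000 pgbench   # 24 clients, 12000 transactions each
  cat /proc/mdstat                 # reports the chunk size for each md array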
On 3/24/09 6:09 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
> I'm trying to pin down some performance issues with a machine where I
> work: we are seeing (read only) query response times blow out by an
> order of magnitude or more at busy times. [...]
>
> The disk is laid out using software (md) raid:
>
> 4 drives raid 10 *4K* chunksize with database files (ext3 ordered, noatime)
> 2 drives raid 1 with database transaction logs (ext3 ordered, noatime)
>
> [...]
>
> It looks to me like we are maxing out the raid 10 array, and I suspect
> the chunksize (4K) is the culprit. However, as this is a pest to change
> (!) I'd like some opinions on whether I'm jumping to conclusions. I'd
> also appreciate comments about what chunksize to use (I've tended to use
> 256K in the past, but what are folks preferring these days?)

md tends to work great at 1MB chunk sizes with RAID 1 or 10 for whatever reason. Unlike with a hardware raid card, smaller chunks aren't going to help random i/o, as md won't read the whole 1MB chunk or bother caching much. Make sure any partitions built on top of md are 1MB aligned if you go that route. Random I/O on files smaller than 1MB would be affected - but that's not a problem on a 16GB RAM server running a database that won't fit in RAM.

Your xlogs are occasionally close to max usage too -- which is suspicious at 10MB/sec. There is no reason for them to be on ext3: they are a transaction log that syncs its writes, so file system journaling doesn't buy anything. Ext2 there will lower the sync times and reduce i/o utilization. I also tend to use xfs if sequential access is important at all (obviously not so in pgbench). ext3 is slightly safer in a power failure with unsynced data, but Postgres has that covered with its own journal anyway, so those differences are irrelevant.
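A sketch of what moving the transaction logs onto ext2 could look like (device name and paths are assumptions, and mkfs wipes the array, so the DR box is the place to try it):

  mkfs.ext2 -L pgxlog /dev/md1             # no filesystem journal
  mkdir -p /pg_xlog
  mount -o noatime LABEL=pgxlog /pg_xlog
  # then, with the cluster shut down, move the pg_xlog contents there
  # and symlink the original pg_xlog directory to the new location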
On Tue, Mar 24, 2009 at 6:48 PM, Scott Carey <scott@richrelevance.com> wrote:
> Your xlogs are occasionally close to max usage too -- which is suspicious
> at 10MB/sec. There is no reason for them to be on ext3: they are a
> transaction log that syncs its writes, so file system journaling doesn't
> buy anything. Ext2 there will lower the sync times and reduce i/o
> utilization.

I would tend to recommend ext3 in data=writeback, and make sure that it's mounted with noatime, over using ext2 - for the sole reason that if the system shuts down unexpectedly, you don't have to worry about a long fsck when bringing it back up. Performance between the two filesystems should really be negligible for Postgres logging.

-Dave
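As a sketch, the corresponding mount options for the WAL partition might look like this in /etc/fstab (device and mount point are assumptions; data=writeback only relaxes the filesystem's own journaling, it doesn't change how PostgreSQL fsyncs the WAL):

  /dev/md1   /pg_xlog   ext3   noatime,data=writeback   0   2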
On Tue, Mar 24, 2009 at 7:09 PM, Mark Kirkwood <markir@paradise.net.nz> wrote:
> I'm trying to pin down some performance issues with a machine where I work:
> we are seeing (read only) query response times blow out by an order of
> magnitude or more at busy times. Initially we blamed autovacuum, but after a
> tweak of the cost_delay it is *not* the problem. Then I looked at
> checkpoints... and although there was some correlation with them and the
> query response - I'm thinking that the raid chunksize may well be the issue.

Sounds to me like you're mostly just running out of bandwidth on your RAID array. Whether or not you can tune it to run faster is the real issue. This problem becomes worse as you add clients and the RAID array starts to thrash. Thrashing is likely to be worse with a small chunk size, so that's definitely worth a look at fixing.

> Fortunately there is an identical DR box, so I could do a little testing.

Can you try changing the chunksize on the test box you're testing on to see if that helps?
On Tue, 24 Mar 2009, David Rees wrote:
> I would tend to recommend ext3 in data=writeback and make sure that
> it's mounted with noatime over using ext2 - for the sole reason that
> if the system shuts down unexpectedly, you don't have to worry about a
> long fsck when bringing it back up.

Well, Mark's system is already using noatime, and if you believe http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/ there's little difference between writeback and ordered on the WAL disk.

Might squeeze out some improvements with ext2 though, and if there's nothing besides the WAL on there, fsck isn't ever going to take very long anyway -- not much of a directory tree to traverse there.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
Scott Marlowe wrote:
> On Tue, Mar 24, 2009 at 7:09 PM, Mark Kirkwood <markir@paradise.net.nz> wrote:
>> I'm trying to pin down some performance issues with a machine where I work:
>> we are seeing (read only) query response times blow out by an order of
>> magnitude or more at busy times. [...]
>
> Sounds to me like you're mostly just running out of bandwidth on your
> RAID array. Whether or not you can tune it to run faster is the real
> issue. This problem becomes worse as you add clients and the RAID
> array starts to thrash. Thrashing is likely to be worse with a small
> chunk size, so that's definitely worth a look at fixing.

Yeah, I was wondering if we are maxing out the bandwidth...

>> Fortunately there is an identical DR box, so I could do a little testing.
>
> Can you try changing the chunksize on the test box you're testing on
> to see if that helps?

Yes - or I am hoping to anyway (part of posting here was to collect some outside validation for the idea). Thanks for your input!

Cheers

Mark
On Wed, 25 Mar 2009, Mark Kirkwood wrote:
> I'm thinking that the raid chunksize may well be the issue.

Why? I'm not saying you're wrong, I just don't see why that parameter jumped out as a likely cause here.

> Sun 4140 2x quad-core opteron 2356 16G RAM, 6x 15K 140G SAS

That server doesn't have any sort of write cache on it, right? That means that all the fsync's done near checkpoint time are going to thrash your disks around. One thing you can do to improve that situation is push checkpoint_segments up to the maximum you can possibly stand. You could consider double or even quadruple what you're using right now, though the recovery time after a crash will spike upwards a bunch. That will minimize the number of checkpoints and reduce the average disk I/O they produce per unit of time, due to how they're spread out in 8.3. You might bump checkpoint_completion_target up to 0.9 in order to get some improvement without increasing recovery time as badly.

Also, if you want to minimize total I/O, you might drop bgwriter_lru_maxpages to 0. That feature presumes you have some spare I/O capacity you can use to prioritize lower latency, and it sounds like you don't. You get the lowest total I/O per transaction with the background writer turned off.

You happened to catch me on a night where I was running some pgbench tests here, so I can give you something similar to compare against. Quad-core system, 8GB of RAM, write-caching controller with 3-disk RAID0 for database and 1 disk for WAL; Linux software RAID though. Here's the same data you collected, at the same scale you're testing, with similar postgresql.conf settings too (same shared_buffers and checkpoint_segments, I didn't touch any of the vacuum parameters):

number of clients: 32
number of transactions per client: 6250
number of transactions actually processed: 200000/200000
tps = 1097.933319 (including connections establishing)
tps = 1098.372510 (excluding connections establishing)

Cpu(s):  3.6%us,  1.0%sy,  0.0%ni, 57.2%id, 37.5%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:   8174288k total,  5545396k used,  2628892k free,   473248k buffers
Swap:        0k total,        0k used,        0k free,  4050736k cached

sda,b,d are the database, sdc is the WAL, here's a couple of busy periods:

Device:  rrqm/s   wrqm/s   r/s    w/s      rMB/s   wMB/s   avgrq-sz  avgqu-sz  await    svctm   %util
sda      0.00     337.26   0.00   380.72   0.00    2.83    15.24     104.98    278.77   2.46    93.55
sdb      0.00     343.56   0.00   386.31   0.00    2.86    15.17     91.32     236.61   2.46    94.95
sdd      0.00     342.86   0.00   391.71   0.00    2.92    15.28     128.36    342.42   2.43    95.14
sdc      0.00     808.89   0.00   45.45    0.00    3.35    150.72    1.22      26.75    21.13   96.02

Device:  rrqm/s   wrqm/s   r/s    w/s      rMB/s   wMB/s   avgrq-sz  avgqu-sz  await    svctm   %util
sda      0.00     377.82   0.00   423.38   0.00    3.13    15.12     74.24     175.21   1.41    59.58
sdb      0.00     371.73   0.00   423.18   0.00    3.13    15.15     50.61     119.81   1.41    59.58
sdd      0.00     372.93   0.00   414.99   0.00    3.06    15.09     60.02     144.32   1.44    59.70
sdc      0.00     3242.16  0.00   258.84   0.00    13.68   108.23    0.88      3.42     2.96    76.60

They don't really look much different from yours. I'm using software RAID and haven't touched any of its parameters; didn't even use noatime on the ext3 filesystems (you should though -- that's one of those things the write cache really helps out with in my case).

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
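As a sketch, the settings suggested above would look something like this in postgresql.conf (the checkpoint_segments figure is just the "quadruple" example; pick whatever recovery time you can stand):

  checkpoint_segments = 128            # 4x the current 32; longer crash recovery
  checkpoint_completion_target = 0.9   # spread checkpoint I/O out further
  bgwriter_lru_maxpages = 0            # background writer off, to minimize total I/O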
It sounds to me like you need to tune everything you can related to postgresql, but it is unlikely to be enough as your load continues to increase. You might want to look into moving some of the read activity off of the database. Depending on your application, memcached or ehcache could help. You could also look at using something like Tokyo Cabinet as a short-term front-end data store. Without understanding the application architecture, I can't offer much in the way of a specific suggestion.

-Jerry

Jerry Champlin
Absolute Performance Inc.

Mark Kirkwood wrote:
> Scott Marlowe wrote:
>> Sounds to me like you're mostly just running out of bandwidth on your
>> RAID array. Whether or not you can tune it to run faster is the real
>> issue. [...]
>
> Yeah, I was wondering if we are maxing out the bandwidth...
>
>> Can you try changing the chunksize on the test box you're testing on
>> to see if that helps?
>
> Yes - or I am hoping to anyway (part of posting here was to collect
> some outside validation for the idea). Thanks for your input!
On 3/25/09 1:07 AM, "Greg Smith" <gsmith@gregsmith.com> wrote:
> On Wed, 25 Mar 2009, Mark Kirkwood wrote:
>> I'm thinking that the raid chunksize may well be the issue.
>
> Why? I'm not saying you're wrong, I just don't see why that parameter
> jumped out as a likely cause here.

If postgres is random reading or writing at an 8k block size, and the raid array is set with a 4k chunk size, then every 8k random i/o will create TWO disk seeks since it gets split across two disks. Effectively, iops will be cut in half.
Mark Kirkwood wrote:
> I'm trying to pin down some performance issues with a machine where
> I work: we are seeing (read only) query response times blow out by
> an order of magnitude or more at busy times. [...]
>
> It looks to me like we are maxing out the raid 10 array, and I
> suspect the chunksize (4K) is the culprit. However, as this is a
> pest to change (!) I'd like some opinions on whether I'm jumping to
> conclusions. I'd also appreciate comments about what chunksize to
> use (I've tended to use 256K in the past, but what are folks
> preferring these days?)

Hello Mark,

Okay, so, take all of this with a pinch of salt, but I have the same config (pretty much) as you, with checkpoint_segments raised to 192. The 'test' database server is a Q8300, 8GB ram, 2 x 7200rpm SATA into the motherboard, which I then LVM striped together: lvcreate -n data_lv -i 2 -I 64 mylv -L 60G (expandable under lvm2). That gives me a stripe size of 64. Running pgbench with the same scaling factors:

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 1398.907206 (including connections establishing)
tps = 1399.233785 (excluding connections establishing)

It's also running ext4dev, but this is the 'playground' server, not the real iron (and I dread to do that on the real iron). In short, I think that chunksize/stripesize is killing you. Personally, I would go for 64 or 128... that's just my 2c... feel free to ignore/scorn/laugh as applicable ;)

Regards

Stef
Greg Smith wrote:
> On Wed, 25 Mar 2009, Mark Kirkwood wrote:
>> I'm thinking that the raid chunksize may well be the issue.
>
> Why? I'm not saying you're wrong, I just don't see why that parameter
> jumped out as a likely cause here.

See my other post; however, I agree - it wasn't clear whether split writes (from the small chunksize) were killing us or the array was simply maxed out...

>> Sun 4140 2x quad-core opteron 2356 16G RAM, 6x 15K 140G SAS
>
> That server doesn't have any sort of write cache on it, right? That
> means that all the fsync's done near checkpoint time are going to
> thrash your disks around. [...]

Yeah, no write cache at all.

> Also, if you want to minimize total I/O, you might drop
> bgwriter_lru_maxpages to 0. That feature presumes you have some spare
> I/O capacity you can use to prioritize lower latency, and it sounds like
> you don't. You get the lowest total I/O per transaction with the
> background writer turned off.

Right - but then a big, very noticeable stall when you do have to checkpoint? We want to avoid that I think, even at the cost of a little overall throughput.

> You happened to catch me on a night where I was running some pgbench
> tests here, so I can give you something similar to compare against.
> Quad-core system, 8GB of RAM, write-caching controller with 3-disk
> RAID0 for database and 1 disk for WAL; Linux software RAID though.
> [...]
> tps = 1097.933319 (including connections establishing)
> tps = 1098.372510 (excluding connections establishing)
> [...]
> They don't really look much different from yours. I'm using software
> RAID and haven't touched any of its parameters; didn't even use
> noatime on the ext3 filesystems (you should though -- that's one of
> those things the write cache really helps out with in my case).

Yeah - with 64K chunksize I'm seeing a result more congruent with yours (866 or so for 24 clients). I think another pair of disks, so we could have 3 effective disks for the database, would help get us to similar results to yours... however, for the meantime I'm trying to get the best out of what's there!

Thanks for your help

Mark
Stef Telford wrote:
> Hello Mark,
> Okay, so, take all of this with a pinch of salt, but I have the
> same config (pretty much) as you, with checkpoint_segments raised to
> 192. The 'test' database server is a Q8300, 8GB ram, 2 x 7200rpm SATA
> into the motherboard, which I then LVM striped together [...]
> Running pgbench with the same scaling factors:
>
> tps = 1398.907206 (including connections establishing)
> tps = 1399.233785 (excluding connections establishing)
>
> It's also running ext4dev, but this is the 'playground' server,
> not the real iron (and I dread to do that on the real iron). In short,
> I think that chunksize/stripesize is killing you. Personally, I would
> go for 64 or 128 .. that's just my 2c .. feel free to
> ignore/scorn/laugh as applicable ;)

Stef - I suspect that your (quite high) tps is because your SATA disks are not honoring the fsync() request for each commit. SCSI/SAS disks tend by default to flush their cache at fsync - ATA/SATA tend not to. Some filesystems (e.g. xfs) will try to work around this with write barrier support, but it depends on the disk firmware.

Thanks for your reply!

Mark
I wrote:
> Scott Marlowe wrote:
>> Can you try changing the chunksize on the test box you're testing on
>> to see if that helps?
>
> Yes - or I am hoping to anyway (part of posting here was to collect
> some outside validation for the idea). Thanks for your input!

Rebuilt with 64K chunksize:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 866.512162 (including connections establishing)
tps = 866.651320 (excluding connections establishing)

So 64K looks quite a bit better. I'll endeavor to try out 256K next week too.

Mark
On Thu, 26 Mar 2009, Mark Kirkwood wrote:
>> Also, if you want to minimize total I/O, you might drop
>> bgwriter_lru_maxpages to 0. That feature presumes you have some spare I/O
>> capacity you can use to prioritize lower latency, and it sounds like you don't.
>> You get the lowest total I/O per transaction with the background writer
>> turned off.
>
> Right - but then a big, very noticeable stall when you do have to checkpoint?
> We want to avoid that I think, even at the cost of a little overall
> throughput.

There's not really a big difference if you're running with a large value for checkpoint_segments. That spreads the checkpoint I/O over a longer period of time. The current background writer doesn't aim to reduce writes at checkpoint time, because that never really worked out like people expected it to anyway. It's aimed instead at writing out buffers that database backend processes are going to need fairly soon, so they are less likely to block because they have to write them out themselves. That leads to an occasional bit of wasted I/O, where the buffer written out gets used or dirtied again before it can be assigned to a backend. I've got a long paper expanding on the documentation here you might find useful: http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

> Yeah - with 64K chunksize I'm seeing a result more congruent with yours
> (866 or so for 24 clients)

That's good to hear. If adjusting that helped so much, you might consider aligning the filesystem partitions to the chunk size too; the partition header usually screws that up on Linux. See these two references for ideas:

http://www.vmware.com/resources/techresources/608
http://spiralbound.net/2008/06/09/creating-linux-partitions-for-clariion

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
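A sketch of one way to get an aligned data partition when the array is rebuilt (device name is a placeholder and mklabel wipes the existing partition table; alternatively, building md directly on the raw disks avoids partition offsets entirely):

  parted -s /dev/sdb mklabel msdos
  parted -s /dev/sdb mkpart primary 2048s 100%   # start at sector 2048 = 1 MiB
  parted -s /dev/sdb unit s print                # confirm the start sector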
On 3/25/09 9:43 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
> Stef Telford wrote:
>> [...] Running pgbench with the same scaling factors:
>> tps = 1398.907206 (including connections establishing)
>> [...]
>
> Stef - I suspect that your (quite high) tps is because your SATA disks
> are not honoring the fsync() request for each commit. SCSI/SAS disks
> tend by default to flush their cache at fsync - ATA/SATA tend not to.
> Some filesystems (e.g. xfs) will try to work around this with write
> barrier support, but it depends on the disk firmware.

This has not been true for a while now. SATA disks will flush their write cache when told, and properly adhere to write barriers. Of course, not all file systems send the right write barrier commands and flush commands to SATA drives (UFS for example, and older versions of ext3).

It may be the other way around: your SAS drives might have the write cache disabled for no good reason other than to protect against file systems that don't work right.
On 3/25/09 9:28 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
> Rebuilt with 64K chunksize:
>
> transaction type: TPC-B (sort of)
> scaling factor: 100
> number of clients: 24
> number of transactions per client: 12000
> number of transactions actually processed: 288000/288000
> tps = 866.512162 (including connections establishing)
> tps = 866.651320 (excluding connections establishing)
>
> So 64K looks quite a bit better. I'll endeavor to try out 256K next week
> too.

Just go all the way to 1MB, md _really_ likes 1MB chunk sizes for some reason. Benchmarks right and left on google show this to be optimal. My tests with md raid 0 over hardware raid 10's ended up with that being optimal as well.

Greg's notes on aligning partitions to the chunk are key as well.
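A sketch of rebuilding the array with a 1MB chunk as suggested (device names are placeholders, and --create destroys the existing array contents, so DR box only):

  mdadm --stop /dev/md0
  mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=1024 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  mdadm --detail /dev/md0 | grep -i chunk        # should report 1024K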
On 3/26/09 2:44 PM, "Scott Carey" <scott@richrelevance.com> wrote:
> [...]
> This has not been true for a while now. SATA disks will flush their
> write cache when told, and properly adhere to write barriers. Of course,
> not all file systems send the right write barrier commands and flush
> commands to SATA drives (UFS for example, and older versions of ext3).
>
> It may be the other way around: your SAS drives might have the write cache
> disabled for no good reason other than to protect against file systems that
> don't work right.

A little extra info here: md, LVM, and some other tools do not allow the file system to use write barriers properly... So those are on the bad list for data integrity with SAS or SATA write caches without battery back-up. However, this is NOT an issue on the postgres data partition. Data fsync still works fine; it's the file system journal that might have out-of-order writes. For xlogs, write barriers are not important, only fsync() not lying.

As an additional note, ext4 uses checksums per block in the journal, so it is resistant to out-of-order writes causing trouble. The test compared to here was on ext4, and most likely the speed increase is partly due to that.
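For what it's worth, a rough way to check whether barriers are actually in effect on a given stack (mount point is a placeholder; the barrier option applies to ext3/ext4, and whether md passes barriers down depends on the raid level and kernel version, so treat this as a sketch):

  mount -o remount,barrier=1 /var/lib/postgresql
  dmesg | tail   # if the device can't do barriers, the filesystem logs a
                 # warning about disabling them at mount/remount time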
Scott Carey wrote:
> A little extra info here: md, LVM, and some other tools do not allow the
> file system to use write barriers properly... So those are on the bad list
> for data integrity with SAS or SATA write caches without battery back-up.
> However, this is NOT an issue on the postgres data partition. Data fsync
> still works fine; it's the file system journal that might have out-of-order
> writes. For xlogs, write barriers are not important, only fsync() not
> lying.
>
> As an additional note, ext4 uses checksums per block in the journal, so it
> is resistant to out-of-order writes causing trouble. The test compared to
> here was on ext4, and most likely the speed increase is partly due to that.

[Looks at Stef's config - 2x 7200 rpm SATA RAID 0]

I'm still highly suspicious of such a system being capable of outperforming one with the same number of (effective) - much faster - disks *plus* a dedicated WAL disk pair... unless it is being a little loose about fsync! I'm happy to believe ext4 is better than ext3 - but not that much!

However, it's great to have so many different results to compare against!

Cheers

Mark
Scott Carey wrote:
> On 3/25/09 9:28 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
>> Rebuilt with 64K chunksize:
>> [...]
>> So 64K looks quite a bit better. I'll endeavor to try out 256K next week
>> too.
>
> Just go all the way to 1MB, md _really_ likes 1MB chunk sizes for some
> reason. Benchmarks right and left on google show this to be optimal. My
> tests with md raid 0 over hardware raid 10's ended up with that being
> optimal as well.
>
> Greg's notes on aligning partitions to the chunk are key as well.

Rebuilt with 256K chunksize:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 942.852104 (including connections establishing)
tps = 943.019223 (excluding connections establishing)

A noticeable improvement again. I'm not sure that we will have time (or patience from the system guys that I keep bugging to redo the raid setup!) to try 1M, but 256K gets us a 40% or so improvement over the original 4K setup - which is quite nice!

Looking on the net for md raid benchmarks, it is not 100% clear to me that 1M is the overall best - several I found had tested sizes like 64K, 128K, 512K, 1M and concluded that 1M was best, but without testing 256K! Whereas others had included ranges <=512K and decided that 256K was the best. I'd be very interested in seeing your data! (Several years ago I had carried out this type of testing - on a different type of machine, and for a different database vendor - and found that 256K seemed to give the overall best result.)

The next step is to align the raid 10 partitions, as you and Greg suggest, and see what effect that has!

Thanks again

Mark
Mark Kirkwood wrote:
> [Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still
> highly suspicious of such a system being capable of outperforming
> one with the same number of (effective) - much faster - disks
> *plus* a dedicated WAL disk pair... unless it is being a little
> loose about fsync! I'm happy to believe ext4 is better than ext3 -
> but not that much!
>
> However, it's great to have so many different results to compare
> against!

Hello Mark,

For the record, this is a 'base' debian 5 install (with openVZ, but postgreSQL is running on the base hardware, not inside a container) and I have -explicitly- enabled sync in the conf. E.g.:

fsync = on                  # turns forced synchronization on or off
synchronous_commit = on     # immediate fsync at commit
#wal_sync_method = fsync    # the default is the first option

In fact, if I turn -off- sync commit, it gets about 200 -slower- rather than faster.

Curiously, I also have an intel x25-m winging its way here for testing/benching under postgreSQL (along with a vertex 120gb). I had one of the nice lads on the OCZ forum bench against a 30gb vertex ssd, and if you think -my- TPS was crazy.. you should have seen his.

postgres@rob-desktop:~$ /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 3662.200088 (including connections establishing)
tps = 3664.823769 (excluding connections establishing)

(Nb; thread here: http://www.ocztechnologyforum.com/forum/showthread.php?t=54038 )

Curiously, I think with SSDs there may have to be an 'off' flag if you put the xlog onto an ssd. It seems to complain about 'too frequent checkpoints'.

I can't wait for -either- of the drives to arrive. I want to see in -my- system what the speed is like for SSDs. The dataset I have to work with is fairly small (30-40GB), so using an 80GB ssd (even a few raided) is possible for me. Thankfully ;)

Regards

Stef

(ps. I should note, running postgreSQL in a prod environment -without- a nice UPS is never going to happen on my watch, so turning on write-cache (to me) seems like a no-brainer really if it makes this kind of boost possible)
On Wed, 1 Apr 2009, Stef Telford wrote:
> I have -explicitly- enabled sync in the conf... In fact, if I turn -off-
> sync commit, it gets about 200 -slower- rather than faster.

You should take a look at http://www.postgresql.org/docs/8.3/static/wal-reliability.html

And check the output from "hdparm -I" as suggested there. If turning off fsync doesn't improve your performance, there's almost certainly something wrong with your setup. As suggested before, your drives probably have write caching turned on. PostgreSQL is incapable of knowing that, and will happily write in an unsafe manner even if the fsync parameter is turned on. There's a bunch more information on this topic at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm

Also: a run-to-run variation in pgbench results of +/-10% TPS is normal, so unless you saw a consistent 200 TPS gain during multiple tests, my guess is that changing fsync for you is doing nothing, rather than your suggestion that it makes things slower.

> Curiously, I think with SSDs there may have to be an 'off' flag
> if you put the xlog onto an ssd. It seems to complain about 'too
> frequent checkpoints'.

You just need to increase checkpoint_segments from the tiny default if you want to push any reasonable numbers of transactions/second through pgbench without seeing this warning. Same thing happens with any high-performance disk setup, it's not specific to SSDs.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
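As a sketch of the check being suggested (device name is a placeholder; -W 0 turns the drive's volatile write cache off, which is the safe-but-slower setting when there's no battery-backed cache):

  hdparm -I /dev/sda | grep -i 'write cache'   # a leading '*' means the cache is enabled
  hdparm -W 0 /dev/sda                         # disable the drive's write cache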
Greg Smith wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>> I have -explicitly- enabled sync in the conf... In fact, if I turn
>> -off- sync commit, it gets about 200 -slower- rather than faster.
>
> You should take a look at
> http://www.postgresql.org/docs/8.3/static/wal-reliability.html
>
> And check the output from "hdparm -I" as suggested there. [...]
>
> Also: a run-to-run variation in pgbench results of +/-10% TPS is
> normal, so unless you saw a consistent 200 TPS gain during multiple
> tests, my guess is that changing fsync for you is doing nothing, rather
> than your suggestion that it makes things slower.

Hello Greg,

Turning off fsync -does- increase the throughput noticeably; -however-, turning off synchronous_commit seemed to slow things down for me. You're right though: when I toggled sync_commit on the system, there was a small variation, with TPS coming out between 1100 and 1300. I guess I saw the initial run and thought that there was a 'loss' with sync_commit = off.

I do agree that the benefit is probably from write-caching, but I think that this is a 'win' as long as you have a UPS or BBU adaptor, and really, in a prod environment, not having a UPS is .. well. Crazy ?

>> Curiously, I think with SSDs there may have to be an 'off' flag
>> if you put the xlog onto an ssd. It seems to complain about 'too
>> frequent checkpoints'.
>
> You just need to increase checkpoint_segments from the tiny default
> if you want to push any reasonable numbers of transactions/second
> through pgbench without seeing this warning. Same thing happens
> with any high-performance disk setup, it's not specific to SSDs.

Good to know, I thought it maybe was atypical behaviour due to the nature of SSDs.

Regards

Stef
On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
> I do agree that the benefit is probably from write-caching, but I
> think that this is a 'win' as long as you have a UPS or BBU adaptor,
> and really, in a prod environment, not having a UPS is .. well. Crazy ?

You do know that UPSes can fail, right? En masse sometimes even.
Scott Marlowe wrote:
> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>> I do agree that the benefit is probably from write-caching, but I
>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>
> You do know that UPSes can fail, right? En masse sometimes even.

Hello Scott,

Well, the only time the UPS has failed in my memory was during the great Eastern Seaboard power outage of 2003. Lots of fond memories running around Toronto with a gas can looking for oil for generator power. This said though, anything could happen: the co-lo could be taken out by a meteor, and then sync on or off makes no difference.

Good UPS, a warm PITR standby, offsite backups and regular checks is "good enough" for me, and really, that's what it all comes down to: mitigating risk and factoring it into an 'acceptable' amount for each person. However, if you see over a 2x improvement from turning write-cache 'on' and have everything else in place, well, that seems like a 'no-brainer' to me, at least ;)

Regards

Stef
On Wed, 1 Apr 2009, Scott Marlowe wrote:
> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>> I do agree that the benefit is probably from write-caching, but I
>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>
> You do know that UPSes can fail, right? En masse sometimes even.

I just lost all my diary appointments and address book data on my Palm device, because of a similar attitude. The device stores all its data in RAM, and never syncs it to permanent storage (like the SD card in the expansion slot). But that's fine, right, because it has a battery, therefore it can never fail? Well, it has the failure mode that if it ever crashes hard, or the battery fails momentarily due to jogging around in a pocket, then it just wipes all its data and starts from scratch.

Computers crash. Hardware fails. Relying on un-backed-up RAM to keep your data safe does not work.

Matthew

--
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." -- Rich Cook
On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <stef@ummon.com> wrote:
> Well, the only time the UPS has failed in my memory was during the
> great Eastern Seaboard power outage of 2003. Lots of fond memories
> running around Toronto with a gas can looking for oil for generator
> power. This said though, anything could happen: the co-lo could be taken
> out by a meteor, and then sync on or off makes no difference.

Meteor strike is far less likely than a power surge taking out a UPS. I saw a whole data center go black when a power conditioner blew out, taking out the other three power conditioners, both industrial UPSes and the switch for the diesel generator. And I have friends who have seen the same type of thing before as well. The data is the most expensive part of any server.
On Wed, 1 Apr 2009, Stef Telford wrote:
> Good UPS, a warm PITR standby, offsite backups and regular checks is
> "good enough" for me, and really, that's what it all comes down to:
> mitigating risk and factoring it into an 'acceptable' amount for each person.
> However, if you see over a 2x improvement from turning write-cache 'on'
> and have everything else in place, well, that seems like a 'no-brainer'
> to me, at least ;)

In that case, buying a battery-backed-up cache in the RAID controller would be even more of a no-brainer.

Matthew

--
If pro is the opposite of con, what is the opposite of progress?
On Wed, Apr 1, 2009 at 11:01 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>> [...]
>> However, if you see over a 2x improvement from turning write-cache 'on'
>> and have everything else in place, well, that seems like a 'no-brainer'
>> to me, at least ;)
>
> In that case, buying a battery-backed-up cache in the RAID controller would
> be even more of a no-brainer.

This is especially true in that you can reduce downtime. A lot of times downtime costs as much as anything else.
Matthew Wakeling wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>> [...]
>
> In that case, buying a battery-backed-up cache in the RAID controller
> would be even more of a no-brainer.

Hey Matthew,

See about 3 messages ago.. we already have them (I did say UPS or BBU; it should have been a logical 'and' instead of a logical 'or' .. my bad ;). You're right though, that was a no-brainer as well. I am wondering how the card (3ware 9550sx) will work with SSDs, md or lvm, blocksize, ext3 or ext4 .. but.. this is the point of benchmarking ;)

Regards

Stef
On Wed, 1 Apr 2009, Scott Marlowe wrote:
> Meteor strike is far less likely than a power surge taking out a UPS.

I average having a system go down during a power outage, because the UPS it was attached to wasn't working right anymore, about once every five years. And I don't usually manage that many systems. The only real way to know if a UPS is working right is to actually detach power and confirm the battery still works, which is downtime nobody ever feels is warranted for a production system. Then, one day the power dies, the UPS battery doesn't work to spec anymore, and you're done.

Of course, I have a battery-backed cache controller in my home desktop, so that gives you an idea where I'm at as far as paranoia here goes.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Wed, 1 Apr 2009, Greg Smith wrote:
> The only real way to know if a UPS is working right is to actually detach
> power and confirm the battery still works, which is downtime nobody ever
> feels is warranted for a production system. Then, one day the power dies,
> the UPS battery doesn't work to spec anymore, and you're done.

Most decent servers have dual power supplies, and they should really be connected to two independent UPS units. You can test them one by one without much risk of bringing down your server.

Matthew

--
Okay, I'm weird! But I'm saving up to be eccentric.
On Wed, Apr 1, 2009 at 11:54 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Wed, 1 Apr 2009, Greg Smith wrote:
>> The only real way to know if a UPS is working right is to actually detach
>> power and confirm the battery still works, which is downtime nobody ever
>> feels is warranted for a production system. [...]
>
> Most decent servers have dual power supplies, and they should really be
> connected to two independent UPS units. You can test them one by one
> without much risk of bringing down your server.

Yeah, our primary DB servers have three PSes and can run on any two just fine. We have three power busses, each coming from a different UPS at the hosting center.
Stef Telford wrote:
> Mark Kirkwood wrote:
>> [Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still
>> highly suspicious of such a system being capable of outperforming
>> one with the same number of (effective) - much faster - disks
>> *plus* a dedicated WAL disk pair... unless it is being a little
>> loose about fsync! I'm happy to believe ext4 is better than ext3 -
>> but not that much!
>
> postgres@rob-desktop:~$ /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
> [...]
> tps = 3662.200088 (including connections establishing)
> tps = 3664.823769 (excluding connections establishing)
>
> (Nb; thread here: http://www.ocztechnologyforum.com/forum/showthread.php?t=54038 )

Fyi, I got my intel x25-m in the mail, and I have been benching it for the past hour or so. Here are some of the rough and ready figures. Note that I don't get anywhere near the vertex benchmark. I did hotplug it and made the filesystem using Theodore Ts'o's webpage directions ( http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/ ); the only thing is, ext3/4 seems to be fixated on a blocksize of 4k, and I am wondering if this could be part of the 'problem'. Any ideas/thoughts on tuning gratefully received.

Anyway, benchmarks (same system as previously, etc.):

(ext4dev, 4k block size, pg_xlog on 2x7.2krpm raid-0, rest on SSD)

root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 1407.254118 (including connections establishing)
tps = 1407.645996 (excluding connections establishing)

(ext4dev, 4k block size, everything on SSD)

root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 2130.734705 (including connections establishing)
tps = 2131.545519 (excluding connections establishing)

(I wanted to try and see if random_page_cost dropped down to 2.0, seq_page_cost = 2.0 would make a difference, e.g. making the planner aware that a random read costs the same as a sequential one.)

root@debian:/var/lib/postgresql/8.3/main# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 1982.481185 (including connections establishing)
tps = 1983.223281 (excluding connections establishing)

Regards

Stef
On Wed, 1 Apr 2009, Mark Kirkwood wrote: > Scott Carey wrote: >> >> A little extra info here >> md, LVM, and some other tools do not allow the >> file system to use write barriers properly.... So those are on the bad list >> for data integrity with SAS or SATA write caches without battery back-up. >> However, this is NOT an issue on the postgres data partition. Data fsync >> still works fine, its the file system journal that might have out-of-order >> writes. For xlogs, write barriers are not important, only fsync() not >> lying. >> >> As an additional note, ext4 uses checksums per block in the journal, so it >> is resistant to out of order writes causing trouble. The test compared to >> here was on ext4, and most likely the speed increase is partly due to that. >> >> > > [Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still highly > suspicious of such a system being capable of outperforming one with the same > number of (effective) - much faster - disks *plus* a dedicated WAL disk > pair... unless it is being a little loose about fsync! I'm happy to believe > ext4 is better than ext3 - but not that much! given how _horrible_ ext3 is with fsync, I can belive it more easily with fsync turned on than with it off. David Lang > However, its great to have so many different results to compare against! > > Cheers > > Mark > > >
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Stef Telford wrote: > Stef Telford wrote: >> Mark Kirkwood wrote: >>> Scott Carey wrote: >>>> A little extra info here >> md, LVM, and some other tools do >>>> not allow the file system to use write barriers properly.... >>>> So those are on the bad list for data integrity with SAS or >>>> SATA write caches without battery back-up. However, this is >>>> NOT an issue on the postgres data partition. Data fsync >>>> still works fine, its the file system journal that might have >>>> out-of-order writes. For xlogs, write barriers are not >>>> important, only fsync() not lying. >>>> >>>> As an additional note, ext4 uses checksums per block in the >>>> journal, so it is resistant to out of order writes causing >>>> trouble. The test compared to here was on ext4, and most >>>> likely the speed increase is partly due to that. >>>> >>>> >>> [Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still >>> highly suspicious of such a system being capable of >>> outperforming one with the same number of (effective) - much >>> faster - disks *plus* a dedicated WAL disk pair... unless it is >>> being a little loose about fsync! I'm happy to believe ext4 is >>> better than ext3 - but not that much! However, its great to >>> have so many different results to compare against! Cheers Mark >> postgres@rob-desktop:~$ /usr/lib/postgresql/8.3/bin/pgbench -c 24 >> -t 12000 test_db starting vacuum...end. transaction type: TPC-B >> (sort of) scaling factor: 100 number of clients: 24 number of >> transactions per client: 12000 number of transactions actually >> processed: 288000/288000 tps = 3662.200088 (including connections >> establishing) tps = 3664.823769 (excluding connections >> establishing) > > >> (Nb; Thread here; >> http://www.ocztechnologyforum.com/forum/showthread.php?t=54038 ) > Fyi, I got my intel x25-m in the mail, and I have been benching it > for the past hour or so. Here are some of the rough and ready > figures. Note that I don't get anywhere near the vertex benchmark. > I did hotplug it and made the filesystem using Theodore Ts'o > webpage directions ( > http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/ > ) ; The only thing is, ext3/4 seems to be fixated on a blocksize > of 4k, I am wondering if this could be part of the 'problem'. Any > ideas/thoughts on tuning gratefully received. > > Anyway, benchmarks (same system as previously, etc) > > (ext4dev, 4k block size, pg_xlog on 2x7.2krpm raid-0, rest on SSD) > > root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 > test_db starting vacuum...end. transaction type: TPC-B (sort of) > scaling factor: 100 number of clients: 24 number of transactions > per client: 12000 number of transactions actually processed: > 288000/288000 tps = 1407.254118 (including connections > establishing) tps = 1407.645996 (excluding connections > establishing) > > (ext4dev, 4k block size, everything on SSD) > > root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 > test_db starting vacuum...end. transaction type: TPC-B (sort of) > scaling factor: 100 number of clients: 24 number of transactions > per client: 12000 number of transactions actually processed: > 288000/288000 tps = 2130.734705 (including connections > establishing) tps = 2131.545519 (excluding connections > establishing) > > (I wanted to try and see if random_page_cost dropped down to 2.0, > sequential_page_cost = 2.0 would make a difference. 
Eg; making the > planner aware that a random was the same cost as a sequential) > > root@debian:/var/lib/postgresql/8.3/main# > /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db starting > vacuum...end. transaction type: TPC-B (sort of) scaling factor: 100 > number of clients: 24 number of transactions per client: 12000 > number of transactions actually processed: 288000/288000 tps = > 1982.481185 (including connections establishing) tps = 1983.223281 > (excluding connections establishing) > > > Regards Stef Here is the single x25-m SSD, write cache -disabled-, XFS, noatime mounted using the no-op scheduler; stef@debian:~$ sudo /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db starting vacuum...end. transaction type: TPC-B (sort of) scaling factor: 100 number of clients: 24 number of transactions per client: 12000 number of transactions actually processed: 288000/288000 tps = 1427.781843 (including connections establishing) tps = 1428.137858 (excluding connections establishing) Regards Stef -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAknT0hEACgkQANG7uQ+9D9X8zQCfcJ+tRQ7Sh6/YQImPejfZr/h4 /QcAn0hZujC1+f+4tBSF8EhNgR6q44kc =XzG/ -----END PGP SIGNATURE-----
On Wed, 1 Apr 2009, david@lang.hm wrote: > On Wed, 1 Apr 2009, Mark Kirkwood wrote: > >> Scott Carey wrote: >>> >>> A little extra info here >> md, LVM, and some other tools do not allow >>> the >>> file system to use write barriers properly.... So those are on the bad >>> list >>> for data integrity with SAS or SATA write caches without battery back-up. >>> However, this is NOT an issue on the postgres data partition. Data fsync >>> still works fine, its the file system journal that might have out-of-order >>> writes. For xlogs, write barriers are not important, only fsync() not >>> lying. >>> >>> As an additional note, ext4 uses checksums per block in the journal, so it >>> is resistant to out of order writes causing trouble. The test compared to >>> here was on ext4, and most likely the speed increase is partly due to >>> that. >>> >>> >> >> [Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still highly >> suspicious of such a system being capable of outperforming one with the >> same number of (effective) - much faster - disks *plus* a dedicated WAL >> disk pair... unless it is being a little loose about fsync! I'm happy to >> believe ext4 is better than ext3 - but not that much! > > given how _horrible_ ext3 is with fsync, I can belive it more easily with > fsync turned on than with it off. I realized after sending this that I needed to elaborate a little more. over the last week there has been a _huge_ thread on the linux-kernel list (>400 messages) that is summarized on lwn.net at http://lwn.net/SubscriberLink/326471/b7f5fedf0f7c545f/ there is a lot of information in this thread, but one big thing is that in data=ordered mode (the default for most distros) ext3 can end up having to write all pending data when you do a fsync on one file, In addition reading from disk can take priority over writing the journal entry (the IO scheduler assumes that there is someone waiting for a read, but not for a write), so if you have one process trying to do a fsync and another reading from the disk, the one doing the fsync needs to wait until the disk is idle to get the fsync completed. ext4 does things enough differently that fsyncs are relativly cheap again (like they are on XFS, ext2, and other filesystems). the tradeoff is that if you _don't_ do an fsync there is a increased window where you will get data corruption if you crash. David Lang
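In case anyone wants to see that effect directly, below is a rough C sketch of the experiment David describes: one process streams bulk writes into a scratch file while another times fsync() on a tiny unrelated file in the same filesystem. This is only a sketch under the assumption of ext3 in data=ordered mode; the file names are invented, so run it in a throwaway directory. On ext4 or XFS the small-file fsync times should stay in the single-revolution range.

/*
 * Sketch: bulk writer child vs. timed fsync() on an unrelated small file.
 * On ext3 data=ordered the fsync times balloon while the child is writing.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    pid_t writer = fork();
    if (writer == 0) {                        /* bulk writer child */
        int fd = open("bulk.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        static char buf[1 << 20];
        memset(buf, 'x', sizeof buf);
        for (long i = 0; ; i++) {
            if (fd < 0 || write(fd, buf, sizeof buf) < 0)
                _exit(1);
            if (i % 1024 == 1023)             /* wrap at ~1GB to bound disk use */
                lseek(fd, 0, SEEK_SET);
        }
    }

    int fd = open("small.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (int i = 0; i < 20; i++) {            /* timed fsyncs on the small file */
        double t0 = now();
        if (pwrite(fd, "x", 1, 0) != 1 || fsync(fd) != 0)
            perror("pwrite/fsync");
        printf("fsync %2d: %6.1f ms\n", i, (now() - t0) * 1000.0);
    }
    kill(writer, SIGTERM);
    wait(NULL);
    return 0;
}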
On 4/1/09 10:01 AM, "Matthew Wakeling" <matthew@flymine.org> wrote: > On Wed, 1 Apr 2009, Stef Telford wrote: >> Good UPS, a warm PITR standby, offsite backups and regular checks is >> "good enough" for me, and really, that's what it all comes down to. >> Mitigating risk and factors into an 'acceptable' amount for each person. >> However, if you see over a 2x improvement from turning write-cache 'on' >> and have everything else in place, well, that seems like a 'no-brainer' >> to me, at least ;) > > In that case, buying a battery-backed-up cache in the RAID controller > would be even more of a no-brainer. > > Matthew > Why? Honestly, SATA write cache is safer than a battery backed raid card. The raid card is one more point of failure, and SATA write caches with a modern file system is safe. > -- > If pro is the opposite of con, what is the opposite of progress? > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
On 4/1/09 9:54 AM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote: > On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <stef@ummon.com> wrote: >> Scott Marlowe wrote: >>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote: >>> >>>> I do agree that the benefit is probably from write-caching, but I >>>> think that this is a 'win' as long as you have a UPS or BBU adaptor, >>>> and really, in a prod environment, not having a UPS is .. well. Crazy ? >>>> >>> >>> You do know that UPSes can fail, right? En masse sometimes even. >>> >> Hello Scott, >> Well, the only time the UPS has failed in my memory, was during the >> great Eastern Seaboard power outage of 2003. Lots of fond memories >> running around Toronto with a gas can looking for oil for generator >> power. This said though, anything could happen, the co-lo could be taken >> out by a meteor and then sync on or off makes no difference. > > Meteor strike is far less likely than a power surge taking out a UPS. > I saw a whole data center go black when a power conditioner blew out, > taking out the other three power conditioners, both industrial UPSes > and the switch for the diesel generator. And I have friends who have > seen the same type of thing before as well. The data is the most > expensive part of any server. > Yeah, well I've had a RAID card die, which broke its Battery backed cache. They're all unsafe, technically. In fact, not only are battery backed caches unsafe, but hard drives. They can return bad data. So if you want to be really safe: 1: don't use Linux -- you have to use something with full data and metadata checksums like ZFS or very expensive proprietary file systems. 2: combine it with mirrored SSD's that don't use write cache (so you can have fsync perf about as good as a battery backed raid card without that risk). 4: keep a live redundant system with a PITR backup at another site that can recover in a short period of time. 3: Run in a datacenter well underground with a plutonium nuclear power supply. Meteor strikes and Nuclear holocaust, beware!
On 4/1/09 9:15 AM, "Stef Telford" <stef@ummon.com> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Greg Smith wrote: >> On Wed, 1 Apr 2009, Stef Telford wrote: >> >>> I have -explicitly- enabled sync in the conf...In fact, if I turn >>> -off- sync commit, it gets about 200 -slower- rather than >>> faster. >> >> You should take a look at >> http://www.postgresql.org/docs/8.3/static/wal-reliability.html >> >> And check the output from "hdparm -I" as suggested there. If >> turning off fsync doesn't improve your performance, there's almost >> certainly something wrong with your setup. As suggested before, >> your drives probably have write caching turned on. PostgreSQL is >> incapable of knowing that, and will happily write in an unsafe >> manner even if the fsync parameter is turned on. There's a bunch >> more information on this topic at >> http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm >> >> Also: a run to run variation in pgbench results of +/-10% TPS is >> normal, so unless you saw a consistent 200 TPS gain during multiple >> tests my guess is that changing fsync for you is doing nothing, >> rather than you suggestion that it makes things slower. >> > Hello Greg, > Turning off fsync -does- increase the throughput noticeably, > - -however-, turning off synchronous_commit seemed to slow things down > for me. Your right though, when I toggled the sync_commit on the > system, there was a small variation with TPS coming out between 1100 > and 1300. I guess I saw the initial run and thought that there was a > 'loss' in sync_commit = off > > I do agree that the benefit is probably from write-caching, but I > think that this is a 'win' as long as you have a UPS or BBU adaptor, > and really, in a prod environment, not having a UPS is .. well. Crazy ? Write caching on SATA is totally fine. There were some old ATA drives that when paried with some file systems or OS's would not be safe. There are some combinations that have unsafe write barriers. But there is a standard well supported ATA command to sync and only return after the data is on disk. If you are running an OS that is anything recent at all, and any disks that are not really old, you're fine. The notion that current SATA systems are unsafe to have write caching (or SAS for that matter) is not fully informed. You have to pair it with a file system and OS that doesn't issue the necessary cache flush commands to sync.
On 4/1/09 1:44 PM, "Stef Telford" <stef@ummon.com> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Stef Telford wrote: >> Stef Telford wrote: >> Fyi, I got my intel x25-m in the mail, and I have been benching it >> for the past hour or so. Here are some of the rough and ready >> figures. Note that I don't get anywhere near the vertex benchmark. >> I did hotplug it and made the filesystem using Theodore Ts'o >> webpage directions ( >> http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase- >> block-size/ >> ) ; The only thing is, ext3/4 seems to be fixated on a blocksize >> of 4k, I am wondering if this could be part of the 'problem'. Any >> ideas/thoughts on tuning gratefully received. >> >> Anyway, benchmarks (same system as previously, etc) >> >> (ext4dev, 4k block size, pg_xlog on 2x7.2krpm raid-0, rest on SSD) >> >> root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 >> test_db starting vacuum...end. transaction type: TPC-B (sort of) >> scaling factor: 100 number of clients: 24 number of transactions >> per client: 12000 number of transactions actually processed: >> 288000/288000 tps = 1407.254118 (including connections >> establishing) tps = 1407.645996 (excluding connections >> establishing) >> >> (ext4dev, 4k block size, everything on SSD) >> >> root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 >> test_db starting vacuum...end. transaction type: TPC-B (sort of) >> scaling factor: 100 number of clients: 24 number of transactions >> per client: 12000 number of transactions actually processed: >> 288000/288000 tps = 2130.734705 (including connections >> establishing) tps = 2131.545519 (excluding connections >> establishing) >> >> (I wanted to try and see if random_page_cost dropped down to 2.0, >> sequential_page_cost = 2.0 would make a difference. Eg; making the >> planner aware that a random was the same cost as a sequential) >> >> root@debian:/var/lib/postgresql/8.3/main# >> /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db starting >> vacuum...end. transaction type: TPC-B (sort of) scaling factor: 100 >> number of clients: 24 number of transactions per client: 12000 >> number of transactions actually processed: 288000/288000 tps = >> 1982.481185 (including connections establishing) tps = 1983.223281 >> (excluding connections establishing) >> >> >> Regards Stef > > Here is the single x25-m SSD, write cache -disabled-, XFS, noatime > mounted using the no-op scheduler; > > stef@debian:~$ sudo /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 > test_db > starting vacuum...end. > transaction type: TPC-B (sort of) > scaling factor: 100 > number of clients: 24 > number of transactions per client: 12000 > number of transactions actually processed: 288000/288000 > tps = 1427.781843 (including connections establishing) > tps = 1428.137858 (excluding connections establishing) Ok, in my experience the next step to better performance on this setup in situations not involving pg_bench is to turn dirty_background_ratio down to a very small number (1 or 2). However, pg_bench relies quite a bit on the OS postponing writes due to its quirkiness. Depending on the scaling factor to memory ratio and how big shared_buffers is, results may vary. So I'm not going to predict that that will help this particular case, but am commenting that in general I have gotten the best throughput and lowest latency with a low dirty_background_ratio and the noop scheduler when using the Intel SSDs. 
I've tried all the other scheduler and queue tunables, without much result. Increasing max_sectors_kb helped a bit in some cases, but it seemed inconsistent. The Vertex does some things differently that might be very good for postgres (but bad for some other apps) as from what I've seen it prioritizes writes more. Furthermore, it has and uses a write cache from what I've read... The Intel drives don't use a write cache at all (The RAM is for the LBA > Physical map and management). If the vertex is way faster, I would suspect that its write cache may not be properly honoring cache flush commands. I have an app where I wish to keep the read latency as low as possible while doing a large batch write with the write at ~90% disk utilization, and the Intels destroy everything else at that task so far. And in all honesty, I trust the Intel's data integrity a lot more than OCZ for now. > > Regards > Stef > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.9 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iEYEARECAAYFAknT0hEACgkQANG7uQ+9D9X8zQCfcJ+tRQ7Sh6/YQImPejfZr/h4 > /QcAn0hZujC1+f+4tBSF8EhNgR6q44kc > =XzG/ > -----END PGP SIGNATURE----- > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
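For reference, the two knobs mentioned there are plain /proc and /sys writes that you would normally do with echo as root; a minimal sketch follows. The device name (sdb) is a placeholder for whichever disk the SSD shows up as, and whether a dirty_background_ratio of 1 helps a given workload is exactly the "results may vary" caveat above.

/*
 * Sketch: set vm.dirty_background_ratio to 1 and switch one device to the
 * no-op elevator.  Equivalent to echo'ing the values as root.
 */
#include <stdio.h>

static int put(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) { perror(path); return -1; }
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    put("/proc/sys/vm/dirty_background_ratio", "1");   /* default is 10 */
    put("/sys/block/sdb/queue/scheduler", "noop");     /* sdb is a placeholder */
    return 0;
}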
On Wed, 1 Apr 2009, Scott Carey wrote: > On 4/1/09 9:54 AM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote: > >> On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <stef@ummon.com> wrote: >>> Scott Marlowe wrote: >>>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote: >>>> >>>>> I do agree that the benefit is probably from write-caching, but I >>>>> think that this is a 'win' as long as you have a UPS or BBU adaptor, >>>>> and really, in a prod environment, not having a UPS is .. well. Crazy ? >>>>> >>>> >>>> You do know that UPSes can fail, right? En masse sometimes even. >>>> >>> Hello Scott, >>> Well, the only time the UPS has failed in my memory, was during the >>> great Eastern Seaboard power outage of 2003. Lots of fond memories >>> running around Toronto with a gas can looking for oil for generator >>> power. This said though, anything could happen, the co-lo could be taken >>> out by a meteor and then sync on or off makes no difference. >> >> Meteor strike is far less likely than a power surge taking out a UPS. >> I saw a whole data center go black when a power conditioner blew out, >> taking out the other three power conditioners, both industrial UPSes >> and the switch for the diesel generator. And I have friends who have >> seen the same type of thing before as well. The data is the most >> expensive part of any server. >> > Yeah, well I?ve had a RAID card die, which broke its Battery backed cache. > They?re all unsafe, technically. > > In fact, not only are battery backed caches unsafe, but hard drives. They > can return bad data. So if you want to be really safe: > > 1: don't use Linux -- you have to use something with full data and metadata > checksums like ZFS or very expensive proprietary file systems. this will involve other tradeoffs > 2: combine it with mirrored SSD's that don't use write cache (so you can > have fsync perf about as good as a battery backed raid card without that > risk). they _all_ have write caches. a beast like you are looking for doesn't exist > 4: keep a live redundant system with a PITR backup at another site that can > recover in a short period of time. a good option to keep in mind (and when the new replication code becomes available, that will be even better) > 3: Run in a datacenter well underground with a plutonium nuclear power > supply. Meteor strikes and Nuclear holocaust, beware! at some point all that will fail but you missed point #5 (in many ways a more important point than the others that you describe) switch from using postgres to using a database that can do two-phase commits across redundant machines so that you know the data is safe on multiple systems before the command is considered complete. David Lang
On Wed, Apr 1, 2009 at 4:15 PM, Scott Carey <scott@richrelevance.com> wrote: > > On 4/1/09 9:54 AM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote: > >> On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <stef@ummon.com> wrote: >>> Scott Marlowe wrote: >>>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote: >>>> >>>>> I do agree that the benefit is probably from write-caching, but I >>>>> think that this is a 'win' as long as you have a UPS or BBU adaptor, >>>>> and really, in a prod environment, not having a UPS is .. well. Crazy ? >>>>> >>>> >>>> You do know that UPSes can fail, right? En masse sometimes even. >>>> >>> Hello Scott, >>> Well, the only time the UPS has failed in my memory, was during the >>> great Eastern Seaboard power outage of 2003. Lots of fond memories >>> running around Toronto with a gas can looking for oil for generator >>> power. This said though, anything could happen, the co-lo could be taken >>> out by a meteor and then sync on or off makes no difference. >> >> Meteor strike is far less likely than a power surge taking out a UPS. >> I saw a whole data center go black when a power conditioner blew out, >> taking out the other three power conditioners, both industrial UPSes >> and the switch for the diesel generator. And I have friends who have >> seen the same type of thing before as well. The data is the most >> expensive part of any server. >> > Yeah, well I¹ve had a RAID card die, which broke its Battery backed cache. > They¹re all unsafe, technically. That's why you use two controllers with mirror sets across them and them RAID-0 across the top. But I know what you mean. Now the mobo and memory are the single point of failure. Next stop, sequent etc. > In fact, not only are battery backed caches unsafe, but hard drives. They > can return bad data. So if you want to be really safe: > > 1: don't use Linux -- you have to use something with full data and metadata > checksums like ZFS or very expensive proprietary file systems. You'd better be running them on sequent or Sysplex mainframe type hardware. > 4: keep a live redundant system with a PITR backup at another site that can > recover in a short period of time. > 3: Run in a datacenter well underground with a plutonium nuclear power > supply. Meteor strikes and Nuclear holocaust, beware! Pleaze, such hyperbole! Everyone know it can run on uranium just as well. I'm sure these guys: http://royal.pingdom.com/2008/11/14/the-worlds-most-super-designed-data-center-fit-for-a-james-bond-villain/ can sort that out for you.
Stef Telford wrote: > > Hello Mark, > For the record, this is a 'base' debian 5 install (with openVZ but > postgreSQL is running on the base hardware, not inside a container) > and I have -explicitly- enabled sync in the conf. Eg; > > > fsync = on # turns forced > > > Infact, if I turn -off- sync commit, it gets about 200 -slower- > rather than faster. > Sorry Stef - didn't mean to doubt you....merely your disks! Cheers Mark
Greg Smith wrote: > >> Yeah - with 64K chunksize I'm seeing a result more congruent with >> yours (866 or so for 24 clients) > > That's good to hear. If adjusting that helped so much, you might > consider aligning the filesystem partitions to the chunk size too; the > partition header usually screws that up on Linux. See these two > references for ideas: > http://www.vmware.com/resources/techresources/608 > http://spiralbound.net/2008/06/09/creating-linux-partitions-for-clariion > Well I went away and did this (actually organized for the system folks to...). Retesting showed no appreciable difference (if anything slower). Then I got to thinking: For a partition created on a (hardware) raided device, sure - alignment is very important, however in my case we are using software (md) raid - which creates devices out of individual partitions (which are on individual SAS disks) e.g: md3 : active raid10 sda4[0] sdd4[3] sdc4[2] sdb4[1] 177389056 blocks 256K chunks 2 near-copies [4/4] [UUUU] I'm thinking that alignment issues do not apply here, as md will allocate chunks starting at the beginning of wherever sda4 (etc) begins - so the absolute starting position of sda4 is irrelevant. Or am I missing something? Thanks again Mark
On Wed, 1 Apr 2009, Scott Carey wrote: > Write caching on SATA is totally fine. There were some old ATA drives that > when paried with some file systems or OS's would not be safe. There are > some combinations that have unsafe write barriers. But there is a standard > well supported ATA command to sync and only return after the data is on > disk. If you are running an OS that is anything recent at all, and any > disks that are not really old, you're fine. While I would like to believe this, I don't trust any claims in this area that don't have matching tests that demonstrate things working as expected. And I've never seen this work. My laptop has a 7200 RPM drive, which means that if fsync is being passed through to the disk correctly I can only fsync <120 times/second. Here's what I get when I run sysbench on it, starting with the default ext3 configuration: $ uname -a Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52 UTC 2009 i686 GNU/Linux $ mount /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro) $ sudo hdparm -I /dev/sda | grep FLUSH * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT $ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run sysbench v0.4.8: multi-threaded system evaluation benchmark Running the test with following options: Number of threads: 1 Extra file open flags: 0 1 files, 16Kb each 16Kb total file size Block size 16Kb Number of random requests for random IO: 10000 Read/Write ratio for combined random IO test: 1.50 Periodic FSYNC enabled, calling fsync() each 1 requests. Calling fsync() at the end of test, Enabled. Using synchronous I/O mode Doing random write test Threads started! Done. Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total Read 0b Written 156.25Mb Total transferred 156.25Mb (39.176Mb/sec) 2507.29 Requests/sec executed OK, that's clearly cached writes where the drive is lying about fsync. The claim is that since my drive supports both the flush calls, I just need to turn on barrier support, right? [Edit /etc/fstab to remount with barriers] $ mount /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1) [sysbench again] 2612.74 Requests/sec executed ----- This is basically how this always works for me: somebody claims barriers and/or SATA disks work now, no really this time. I test, they give answers that aren't possible if fsync were working properly, I conclude turning off the write cache is just as necessary as it always was. If you can suggest something wrong with how I'm testing here, I'd love to hear about it. I'd like to believe you but I can't seem to produce any evidence that supports your claims here. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
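For anyone who wants to repeat that check without installing sysbench, a bare-bones probe along the same lines is below (file name invented, written to the current directory). Anything much above the drive's rotation rate, roughly 120 per second for a 7200 RPM disk, means some layer in the stack is absorbing the flush rather than waiting for the platters.

/*
 * Minimal fsync-rate probe (sketch).  Reports how many write+fsync pairs
 * per second the whole stack will acknowledge.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "fsync-probe.dat";
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror(path); return 1; }

    struct timeval t0, t1;
    int i, n = 1000;
    gettimeofday(&t0, NULL);
    for (i = 0; i < n; i++) {
        if (pwrite(fd, "x", 1, 0) != 1 || fsync(fd) != 0) {
            perror("pwrite/fsync");
            return 1;
        }
    }
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d fsyncs in %.2f s = %.0f/sec\n", n, secs, n / secs);
    return 0;
}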
On Wed, Mar 25, 2009 at 12:16 PM, Scott Carey <scott@richrelevance.com> wrote: > On 3/25/09 1:07 AM, "Greg Smith" <gsmith@gregsmith.com> wrote: >> On Wed, 25 Mar 2009, Mark Kirkwood wrote: >>> I'm thinking that the raid chunksize may well be the issue. >> >> Why? I'm not saying you're wrong, I just don't see why that parameter >> jumped out as a likely cause here. >> > > If postgres is random reading or writing at 8k block size, and the raid > array is set with 4k block size, then every 8k random i/o will create TWO > disk seeks since it gets split to two disks. Effectively, iops will be cut > in half. I disagree. The 4k raid chunks are likely to be grouped together on disk and read sequentially. This will only give two seeks in special cases. Now, if the PostgreSQL block size is _smaller_ than the raid chunk size, random writes can get expensive (especially for raid 5) because the raid chunk has to be fully read in and written back out. But this is mainly a theoretical problem I think. I'm going to go out on a limb and say that for block sizes that are within one or two 'powers of two' of each other, it doesn't matter a whole lot. SSDs might be different, because of the 'erase' block which might be 128k, but I bet this is dealt with in such a fashion that you wouldn't really notice it when dealing with different block sizes in pg. merlin
Greg Smith wrote: > OK, that's clearly cached writes where the drive is lying about fsync. > The claim is that since my drive supports both the flush calls, I just > need to turn on barrier support, right? > That's a big pointy finger you are aiming at that drive - are you sure it was sent the flush instruction? Clearly *something* isn't right. > This is basically how this always works for me: somebody claims > barriers and/or SATA disks work now, no really this time. I test, > they give answers that aren't possible if fsync were working properly, > I conclude turning off the write cache is just as necessary as it > always was. If you can suggest something wrong with how I'm testing > here, I'd love to hear about it. I'd like to believe you but I can't > seem to produce any evidence that supports your claims here. Try similar tests with Solaris and Vista? (Might have to give the whole disk to ZFS with Solaris to give it confidence to enable write cache, which might not be easy with a laptop boot drive: XP and Vista should show the toggle on the drive) James
On 4/2/09 10:58 AM, "Merlin Moncure" <mmoncure@gmail.com> wrote: > On Wed, Mar 25, 2009 at 12:16 PM, Scott Carey <scott@richrelevance.com> wrote: >> On 3/25/09 1:07 AM, "Greg Smith" <gsmith@gregsmith.com> wrote: >>> On Wed, 25 Mar 2009, Mark Kirkwood wrote: >>>> I'm thinking that the raid chunksize may well be the issue. >>> >>> Why? I'm not saying you're wrong, I just don't see why that parameter >>> jumped out as a likely cause here. >>> >> >> If postgres is random reading or writing at 8k block size, and the raid >> array is set with 4k block size, then every 8k random i/o will create TWO >> disk seeks since it gets split to two disks. Effectively, iops will be cut >> in half. > > I disagree. The 4k raid chunks are likely to be grouped together on > disk and read sequentially. This will only give two seeks in special > cases. By definition, adjacent raid blocks in a stripe are on different disks. > Now, if the PostgreSQL block size is _smaller_ than the raid > chunk size, random writes can get expensive (especially for raid 5) > because the raid chunk has to be fully read in and written back out. > But this is mainly a theoretical problem I think. This is false and a RAID-5 myth. New parity can be constructed from the old parity + the change in data. Only 2 blocks have to be accessed, not the whole stripe. Plus, this was about RAID 10 or 0 where parity does not apply. > > I'm going to go out on a limb and say that for block sizes that are > within one or two 'powers of two' of each other, it doesn't matter a > whole lot. SSDs might be different, because of the 'erase' block > which might be 128k, but I bet this is dealt with in such a fashion > that you wouldn't really notice it when dealing with different block > sizes in pg. Well, raid block size can be significantly larger than postgres or file system block size and the performance of random reads / writes won't get worse with larger block sizes. This holds only for RAID 0 (or 10), parity is the ONLY thing that makes larger block sizes bad since there is a read-modify-write type operation on something the size of one block. Raid block sizes smaller than the postgres block is always bad and multiplies random i/o. Read a 8k postgres block in a 8MB md raid 0 block, and you read 8k from one disk. Read a 8k postgres block on a md raid 0 with 4k blocks, and you read 4k from two disks.
On Thu, Apr 2, 2009 at 4:20 PM, Scott Carey <scott@richrelevance.com> wrote: > > On 4/2/09 10:58 AM, "Merlin Moncure" <mmoncure@gmail.com> wrote: > >> On Wed, Mar 25, 2009 at 12:16 PM, Scott Carey <scott@richrelevance.com> wrote: >>> On 3/25/09 1:07 AM, "Greg Smith" <gsmith@gregsmith.com> wrote: >>>> On Wed, 25 Mar 2009, Mark Kirkwood wrote: >>>>> I'm thinking that the raid chunksize may well be the issue. >>>> >>>> Why? I'm not saying you're wrong, I just don't see why that parameter >>>> jumped out as a likely cause here. >>>> >>> >>> If postgres is random reading or writing at 8k block size, and the raid >>> array is set with 4k block size, then every 8k random i/o will create TWO >>> disk seeks since it gets split to two disks. Effectively, iops will be cut >>> in half. >> >> I disagree. The 4k raid chunks are likely to be grouped together on >> disk and read sequentially. This will only give two seeks in special >> cases. > > By definition, adjacent raid blocks in a stripe are on different disks. > > >> Now, if the PostgreSQL block size is _smaller_ than the raid >> chunk size, random writes can get expensive (especially for raid 5) >> because the raid chunk has to be fully read in and written back out. >> But this is mainly a theoretical problem I think. > > This is false and a RAID-5 myth. New parity can be constructed from the old > parity + the change in data. Only 2 blocks have to be accessed, not the > whole stripe. > > Plus, this was about RAID 10 or 0 where parity does not apply. > >> >> I'm going to go out on a limb and say that for block sizes that are >> within one or two 'powers of two' of each other, it doesn't matter a >> whole lot. SSDs might be different, because of the 'erase' block >> which might be 128k, but I bet this is dealt with in such a fashion >> that you wouldn't really notice it when dealing with different block >> sizes in pg. > > Well, raid block size can be significantly larger than postgres or file > system block size and the performance of random reads / writes won't get > worse with larger block sizes. This holds only for RAID 0 (or 10), parity > is the ONLY thing that makes larger block sizes bad since there is a > read-modify-write type operation on something the size of one block. > > Raid block sizes smaller than the postgres block is always bad and > multiplies random i/o. > > Read a 8k postgres block in a 8MB md raid 0 block, and you read 8k from one > disk. > Read a 8k postgres block on a md raid 0 with 4k blocks, and you read 4k from > two disks. yep...that's good analysis...thinko on my part. merlin
On 4/2/09 1:53 AM, "Greg Smith" <gsmith@gregsmith.com> wrote: > On Wed, 1 Apr 2009, Scott Carey wrote: > >> Write caching on SATA is totally fine. There were some old ATA drives that >> when paried with some file systems or OS's would not be safe. There are >> some combinations that have unsafe write barriers. But there is a standard >> well supported ATA command to sync and only return after the data is on >> disk. If you are running an OS that is anything recent at all, and any >> disks that are not really old, you're fine. > > While I would like to believe this, I don't trust any claims in this area > that don't have matching tests that demonstrate things working as > expected. And I've never seen this work. > > My laptop has a 7200 RPM drive, which means that if fsync is being passed > through to the disk correctly I can only fsync <120 times/second. Here's > what I get when I run sysbench on it, starting with the default ext3 > configuration: > > $ uname -a > Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52 UTC > 2009 i686 GNU/Linux > > $ mount > /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro) > > $ sudo hdparm -I /dev/sda | grep FLUSH > * Mandatory FLUSH_CACHE > * FLUSH_CACHE_EXT > > $ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 > --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run > sysbench v0.4.8: multi-threaded system evaluation benchmark > > Running the test with following options: > Number of threads: 1 > > Extra file open flags: 0 > 1 files, 16Kb each > 16Kb total file size > Block size 16Kb > Number of random requests for random IO: 10000 > Read/Write ratio for combined random IO test: 1.50 > Periodic FSYNC enabled, calling fsync() each 1 requests. > Calling fsync() at the end of test, Enabled. > Using synchronous I/O mode > Doing random write test > Threads started! > Done. > > Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total > Read 0b Written 156.25Mb Total transferred 156.25Mb (39.176Mb/sec) > 2507.29 Requests/sec executed > > > OK, that's clearly cached writes where the drive is lying about fsync. > The claim is that since my drive supports both the flush calls, I just > need to turn on barrier support, right? > > [Edit /etc/fstab to remount with barriers] > > $ mount > /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1) > > [sysbench again] > > 2612.74 Requests/sec executed > > ----- > > This is basically how this always works for me: somebody claims barriers > and/or SATA disks work now, no really this time. I test, they give > answers that aren't possible if fsync were working properly, I conclude > turning off the write cache is just as necessary as it always was. If you > can suggest something wrong with how I'm testing here, I'd love to hear > about it. I'd like to believe you but I can't seem to produce any > evidence that supports you claims here. Your data looks good, and puts a lot of doubt on my previous sources of info. So I did more research, it seems that (most) drives don't lie, your OS and file system do (or sometimes drive drivers or raid card). I know LVM and MD and other Linux block remapping layer things break write barriers as well. Apparently ext3 doesn't implement fsync with a write barrier or cache flush. Linux kernel mailing lists implied that 2.6 had fixed these, but apparently not. Write barriers were fixed, but not fsync. 
Even more confusing, it looks like the behavior in some linux versions that are highly patched and backported (SUSE, RedHat, mostly) may behave differently than those closer to the kernel trunk like Ubuntu. If you can, try xfs with write barriers on. I'll try some tests using FIO (not familiar with sysbench but looks easy too) with various file systems and some SATA and SAS/SCSI setups when I get a chance. A lot of my prior evidence came from the linux kernel list and other places where I trusted the info over the years. I'll dig up more. But here is what I've learned in the past plus a bit from today: Drives don't lie anymore, and write barrier and lower level ATA commands just work. Linux fixed write barrier support in kernel 2.5. Several OS's do things right and many don't with respect to fsync. I had thought linux did fix this but it turns out they only fixed write barriers and left fsync broken: http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/987024/thread In your tests the barriers slowed things down a lot, so something is working right there. From what I can see, with ext3 metadata changes cause much more frequent write barrier activity, so 'relatime' and 'noatime' actually HURT your data integrity as a side effect of fsync not guaranteeing what you think it does. The big one, is this quote from the linux kernel list: " Right now, if you want a reliable database on Linux, you _cannot_ properly depend on fsync() or fdatasync(). Considering how much Linux is used for critical databases, using these functions, this amazes me. " Check this full post out that started that thread: http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/987024 I admit that it looks like I'm pretty wrong for Linux with ext3 at the least. Linux is often not safe with disk write caches because its fsync() call doesn't flush the cache. The root problem, is not the drives, its linux / ext3. Its write-barrier support is fine now (if you don't go through LVM or MD which don't support it), but fsync does not guarantee anything other than the write having left the OS and gone to the device. In fact POSIX fsync(2) doesn't require that the data is on disk. Interestingly, postgres would be safer on linux if it used sync_file_range instead of fsync() but that has other drawbacks and limitations -- and is broken by use of LVM or MD. Currently, linux + ext3 + postgres, you are only guaranteed when fsync() returns that the data has left the OS, not that it is on a drive -- SATA or SAS. Strangely, sync_file_range() is safer than fsync() in the presence of any drive cache at all (including battery backed raid card failure) because it at least enforces write barriers. Fsync + SATA write cache is safe on Solaris with ZFS, but not Solaris with UFS (the file system is write barrier and cache aware for the former and not the latter). Linux (a lot) and Postgres (a little) can learn from some of the ZFS concepts with regard to atomicity of changes and checksums on data and metadata. Much of the above issues would simply not exist in the presence of good checksum use. Ext4 has journal segment checksums, but no metadata or data checksums exist for ability to detect partial writes to anything but the journal. Postgres is adding checksums on data, and is already essentially copy-on-write for MVCC which is awesome -- are xlog writes protected by checksums? 
Accidental out-of-order writes become an issue that can be dealt with in a log or journal that has checksums even in the presence of OS and File Systems that don't have good guarantees for fsync like Linux + ext3. Postgres could make itself safe even if drive write cache is enabled, fsync lies, AND there is a power failure. If I'm not mistaken, block checksums on data + xlog entry checksums can make it very difficult to corrupt even if fsync is off (though data writes happening before xlog writes are still bad -- that would require external-to-block checksums --like zfs -- to fix)! http://lkml.org/lkml/2005/5/15/85 Where the "disks lie to you" stuff probably came from: http://hardware.slashdot.org/article.pl?sid=05/05/13/0529252&tid=198&tid=128 (turns out its the OS that isn't flushing the cache on fsync). http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache _on_journaled_filesystems.3F So if xfs fsync has a barrier, its safe with either: Raw device that respects cache flush + write caching on. OR Battery backed raid card + drive write caching off. Xfs fsync supposedly works right (need to test) but fdatasync() does not. What this really boils down to is that POSIX fsync does not provide a guarantee that the data is on disk at all. My previous comments are wrong. This means that fsync protects you from OS crashes, but not power failure. It can do better in some systems / implementations. > > -- > * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD >
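For reference, the sync_file_range() call mentioned above looks roughly like the sketch below on Linux 2.6.17 and later (glibc 2.6+). It is only a sketch: the call pushes the dirty pages of a byte range out of the page cache and waits for that writeback, but file metadata and the drive's own write cache are separate questions.

/*
 * Sketch of a sync_file_range() wrapper: write out the dirty pages in
 * [offset, offset+nbytes) and wait for the writeback to finish.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

int flush_range(int fd, off_t offset, off_t nbytes)
{
    unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
                         SYNC_FILE_RANGE_WRITE |
                         SYNC_FILE_RANGE_WAIT_AFTER;

    if (sync_file_range(fd, offset, nbytes, flags) != 0) {
        perror("sync_file_range");
        return -1;
    }
    return 0;
}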
On 4/2/09 1:20 PM, "Scott Carey" <scott@richrelevance.com> wrote: > > Well, raid block size can be significantly larger than postgres or file > system block size and the performance of random reads / writes won't get > worse with larger block sizes. This holds only for RAID 0 (or 10), parity > is the ONLY thing that makes larger block sizes bad since there is a > read-modify-write type operation on something the size of one block. > > Raid block sizes smaller than the postgres block is always bad and > multiplies random i/o. > > Read a 8k postgres block in a 8MB md raid 0 block, and you read 8k from one > disk. > Read a 8k postgres block on a md raid 0 with 4k blocks, and you read 4k from > two disks. > OK, one more thing. The 8k read In a 8MB block size raid array can generate two reads in the following cases: Your read is on the boundary of the blocks AND 1: your partition is not aligned with the raid blocks. This can happen if you partition _inside_ the raid but not if you raid inside the partition (the latter only being applicable to software raid). OR 2: your file system block size is smaller than the postgres block size and the file block offset is not postgres block aligned. The likelihood of the first condition is proportional to: (Postgres block size)/(raid block size) Hence, for most all setups with software raid, a larger block size up to the point where the above ratio gets sufficiently small is optimal. If the block size gets too large, then random access is more and more likely to bias towards one drive over the others and lower throughput. Obviously, in the extreme case where the block size is the disk size, you would have to randomly access 100% of all the data to get full speed.
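A quick back-of-envelope check of that proportionality claim is below; it just counts, for a few chunk sizes, how many 8K block-aligned reads cross a chunk boundary and therefore touch two member disks, once with the partition chunk-aligned and once with the classic sector-63 DOS partition offset. The chunk sizes are examples only.

/*
 * Sketch: fraction of 8K-aligned reads that straddle a RAID chunk boundary.
 * Expect ~100% for 4K chunks, and roughly (8K / chunk size) for larger
 * chunks when the partition start is not chunk-aligned.
 */
#include <stdio.h>

static double split_fraction(long chunk, long misalign)
{
    const long block = 8192;                 /* postgres block size */
    long start, reads = 0, splits = 0;

    for (start = 0; start < 1024L * 1024 * 1024; start += block) {
        long first = (start + misalign) / chunk;
        long last  = (start + misalign + block - 1) / chunk;
        reads++;
        if (first != last)
            splits++;
    }
    return (double) splits / reads;
}

int main(void)
{
    long chunks[] = { 4096, 65536, 262144 };
    int i;

    for (i = 0; i < 3; i++)
        printf("chunk %4ldK: %5.1f%% split (aligned), %5.1f%% split (sector-63 start)\n",
               chunks[i] / 1024,
               100.0 * split_fraction(chunks[i], 0),
               100.0 * split_fraction(chunks[i], 63L * 512));
    return 0;
}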
Greg Smith wrote: > On Wed, 1 Apr 2009, Scott Carey wrote: > >> Write caching on SATA is totally fine. There were some old ATA drives >> that when paried with some file systems or OS's would not be safe. There are >> some combinations that have unsafe write barriers. But there is a >> standard >> well supported ATA command to sync and only return after the data is on >> disk. If you are running an OS that is anything recent at all, and any >> disks that are not really old, you're fine. > > While I would like to believe this, I don't trust any claims in this > area that don't have matching tests that demonstrate things working as > expected. And I've never seen this work. > > My laptop has a 7200 RPM drive, which means that if fsync is being > passed through to the disk correctly I can only fsync <120 > times/second. Here's what I get when I run sysbench on it, starting > with the default ext3 configuration: I believe it's ext3 who's cheating in this scenario. Any chance you can test the program I posted here that tweaks the inode before the fsync: http://archives.postgresql.org//pgsql-general/2009-03/msg00703.php On my system with the fchmod's in that program I was getting one fsync per disk revolution. Without the fchmod's, fsync() didn't wait at all. This was the case on dozens of drives I tried, dating back to old PATA drives from 2000. Only drives from last century didn't behave that way - but I can't accuse them of lying because hdparm showed that they didn't claim to support FLUSH_CACHE. I think this program shows that practically all hard drives are physically capable of doing a proper fsync; but annoyingly ext3 refuses to send the FLUSH_CACHE commands to the drive unless the inode changed. > $ uname -a > Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52 > UTC 2009 i686 GNU/Linux > > $ mount > /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro) > > $ sudo hdparm -I /dev/sda | grep FLUSH > * Mandatory FLUSH_CACHE > * FLUSH_CACHE_EXT > > $ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 > --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run > sysbench v0.4.8: multi-threaded system evaluation benchmark > > Running the test with following options: > Number of threads: 1 > > Extra file open flags: 0 > 1 files, 16Kb each > 16Kb total file size > Block size 16Kb > Number of random requests for random IO: 10000 > Read/Write ratio for combined random IO test: 1.50 > Periodic FSYNC enabled, calling fsync() each 1 requests. > Calling fsync() at the end of test, Enabled. > Using synchronous I/O mode > Doing random write test > Threads started! > Done. > > Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total > Read 0b Written 156.25Mb Total transferred 156.25Mb (39.176Mb/sec) > 2507.29 Requests/sec executed > > > OK, that's clearly cached writes where the drive is lying about fsync. > The claim is that since my drive supports both the flush calls, I just > need to turn on barrier support, right? > > [Edit /etc/fstab to remount with barriers] > > $ mount > /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1) > > [sysbench again] > > 2612.74 Requests/sec executed > > ----- > > This is basically how this always works for me: somebody claims > barriers and/or SATA disks work now, no really this time. I test, they > give answers that aren't possible if fsync were working properly, I > conclude turning off the write cache is just as necessary as it always > was. 
If you can suggest something wrong with how I'm testing here, I'd > love to hear about it. I'd like to believe you but I can't seem to > produce any evidence that supports you claims here. > > -- > * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD >
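Ron's actual program is behind the archives link above; the trick he describes boils down to something like the sketch below, which dirties the inode with fchmod() before each fsync() so that ext3 has to commit a journal transaction (and, with barriers enabled, issue a cache flush) for it. Expect the reported rate to drop to about one per disk revolution when the flush really reaches the platters.

/*
 * Sketch of the fchmod-before-fsync probe Ron describes (not his code).
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("fchmod-probe.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval t0, t1;
    int i, n = 200;
    gettimeofday(&t0, NULL);
    for (i = 0; i < n; i++) {
        pwrite(fd, "x", 1, 0);
        fchmod(fd, (i & 1) ? 0644 : 0664);   /* touch the inode */
        fsync(fd);
    }
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d fsyncs in %.2f s = %.0f/sec\n", n, secs, n / secs);
    return 0;
}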
Ron Mayer wrote: > Greg Smith wrote: >> On Wed, 1 Apr 2009, Scott Carey wrote: >> >>> Write caching on SATA is totally fine. There were some old ATA drives >>> that when paried with some file systems or OS's would not be safe. There are >>> some combinations that have unsafe write barriers. But there is a >>> standard >>> well supported ATA command to sync and only return after the data is on >>> disk. If you are running an OS that is anything recent at all, and any >>> disks that are not really old, you're fine. >> While I would like to believe this, I don't trust any claims in this >> area that don't have matching tests that demonstrate things working as >> expected. And I've never seen this work. >> >> My laptop has a 7200 RPM drive, which means that if fsync is being >> passed through to the disk correctly I can only fsync <120 >> times/second. Here's what I get when I run sysbench on it, starting >> with the default ext3 configuration: > > I believe it's ext3 who's cheating in this scenario. I assume so too. Here the same test using XFS, first with barriers (XFS default) and then without: Linux 2.6.28-gentoo-r2 #1 SMP Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz GenuineIntel GNU/Linux /dev/sdb /data2 xfs rw,noatime,attr2,logbufs=8,logbsize=256k,noquota 0 0 # sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run sysbench 0.4.10: multi-threaded system evaluation benchmark Running the test with following options: Number of threads: 1 Extra file open flags: 0 1 files, 16Kb each 16Kb total file size Block size 16Kb Number of random requests for random IO: 10000 Read/Write ratio for combined random IO test: 1.50 Periodic FSYNC enabled, calling fsync() each 1 requests. Calling fsync() at the end of test, Enabled. Using synchronous I/O mode Doing random write test Threads started! Done. Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total Read 0b Written 156.25Mb Total transferred 156.25Mb (463.9Kb/sec) 28.99 Requests/sec executed Test execution summary: total time: 344.9013s total number of events: 10000 total time taken by event execution: 0.1453 per-request statistics: min: 0.01ms avg: 0.01ms max: 0.07ms approx. 95 percentile: 0.01ms Threads fairness: events (avg/stddev): 10000.0000/0.00 execution time (avg/stddev): 0.1453/0.00 And now without barriers: /dev/sdb /data2 xfs rw,noatime,attr2,nobarrier,logbufs=8,logbsize=256k,noquota 0 0 # sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run sysbench 0.4.10: multi-threaded system evaluation benchmark Running the test with following options: Number of threads: 1 Extra file open flags: 0 1 files, 16Kb each 16Kb total file size Block size 16Kb Number of random requests for random IO: 10000 Read/Write ratio for combined random IO test: 1.50 Periodic FSYNC enabled, calling fsync() each 1 requests. Calling fsync() at the end of test, Enabled. Using synchronous I/O mode Doing random write test Threads started! Done. Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total Read 0b Written 156.25Mb Total transferred 156.25Mb (62.872Mb/sec) 4023.81 Requests/sec executed Test execution summary: total time: 2.4852s total number of events: 10000 total time taken by event execution: 0.1325 per-request statistics: min: 0.01ms avg: 0.01ms max: 0.06ms approx. 95 percentile: 0.01ms Threads fairness: events (avg/stddev): 10000.0000/0.00 execution time (avg/stddev): 0.1325/0.00 -- Best regards, Hannes Dorbath
Mark Kirkwood wrote: > Rebuilt with 256K chunksize: > > transaction type: TPC-B (sort of) > scaling factor: 100 > number of clients: 24 > number of transactions per client: 12000 > number of transactions actually processed: 288000/288000 > tps = 942.852104 (including connections establishing) > tps = 943.019223 (excluding connections establishing) > Increasing checkpoint_segments to 96 and decreasing bgwriter_lru_maxpages to 100: transaction type: TPC-B (sort of) scaling factor: 100 number of clients: 24 number of transactions per client: 12000 number of transactions actually processed: 288000/288000 tps = 1219.221721 (including connections establishing) tps = 1219.501150 (excluding connections establishing) ... as suggested by Greg (actually he suggested reducing bgwriter_lru_maxpages to 0, but this seemed to be no better). Anyway, seeing quite a reasonable improvement (about 83% from where we started). It will be interesting to see how/if the improvements measured in pgbench translate into the "real" application. Thanks for all your help (particularly to both Scotts, Greg and Stef). regards Mark
Hannes sent this off-list, presumably via newsgroup, and it's certainly worth sharing. I've always been scared off of using XFS because of the problems outlined at http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc , with more testing showing similar issues at http://pages.cs.wisc.edu/~vshree/xfs.pdf too.

(I'm finding that old message with Ted saying "Making sure you don't lose data is Job #1" hilarious right now, considering the recent ext4 data loss debacle.)

---------- Forwarded message ----------
Date: Fri, 3 Apr 2009 10:19:38 +0200
From: Hannes Dorbath <light@theendofthetunnel.de>
Newsgroups: pgsql.performance
Subject: Re: [PERFORM] Raid 10 chunksize

Ron Mayer wrote:
> Greg Smith wrote:
>> On Wed, 1 Apr 2009, Scott Carey wrote:
>>
>>> Write caching on SATA is totally fine. There were some old ATA drives
>>> that when paired with some file systems or OS's would not be safe. There
>>> are some combinations that have unsafe write barriers. But there is a
>>> standard, well supported ATA command to sync and only return after the
>>> data is on disk. If you are running an OS that is anything recent at all,
>>> and any disks that are not really old, you're fine.
>>
>> While I would like to believe this, I don't trust any claims in this
>> area that don't have matching tests that demonstrate things working as
>> expected. And I've never seen this work.
>>
>> My laptop has a 7200 RPM drive, which means that if fsync is being
>> passed through to the disk correctly I can only fsync <120
>> times/second. Here's what I get when I run sysbench on it, starting
>> with the default ext3 configuration:
>
> I believe it's ext3 that's cheating in this scenario.

I assume so too. Here's the same test using XFS, first with barriers (the XFS default) and then without:

Linux 2.6.28-gentoo-r2 #1 SMP Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz GenuineIntel GNU/Linux

/dev/sdb /data2 xfs rw,noatime,attr2,logbufs=8,logbsize=256k,noquota 0 0

# sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
sysbench 0.4.10:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 16Kb each
16Kb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!
Done.

Operations performed:  0 Read, 10000 Write, 10000 Other = 20000 Total
Read 0b  Written 156.25Mb  Total transferred 156.25Mb  (463.9Kb/sec)
   28.99 Requests/sec executed

Test execution summary:
    total time:                          344.9013s
    total number of events:              10000
    total time taken by event execution: 0.1453
    per-request statistics:
         min:                            0.01ms
         avg:                            0.01ms
         max:                            0.07ms
         approx. 95 percentile:          0.01ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   0.1453/0.00


And now without barriers:

/dev/sdb /data2 xfs rw,noatime,attr2,nobarrier,logbufs=8,logbsize=256k,noquota 0 0

# sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
sysbench 0.4.10:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 16Kb each
16Kb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!
Done.

Operations performed:  0 Read, 10000 Write, 10000 Other = 20000 Total
Read 0b  Written 156.25Mb  Total transferred 156.25Mb  (62.872Mb/sec)
 4023.81 Requests/sec executed

Test execution summary:
    total time:                          2.4852s
    total number of events:              10000
    total time taken by event execution: 0.1325
    per-request statistics:
         min:                            0.01ms
         avg:                            0.01ms
         max:                            0.06ms
         approx. 95 percentile:          0.01ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   0.1325/0.00

--
Best regards,
Hannes Dorbath
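For anyone who wants to repeat Hannes's comparison on their own hardware, the barrier setting is just an XFS mount option. The device and mount point below are the ones from his /proc/mounts lines; everything else is a generic sketch, and the nobarrier run should only be taken seriously behind a battery-backed write cache:

  # barriers on (the XFS default)
  mount -t xfs -o noatime /dev/sdb /data2
  cd /data2
  # one-off: create the 16KB test file sysbench writes to
  sysbench --test=fileio --file-num=1 --file-total-size=16384 prepare
  sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run

  # barriers off -- reruns the same workload without flushes to the platters
  cd / && umount /data2
  mount -t xfs -o noatime,nobarrier /dev/sdb /data2
  cd /data2
  sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run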
On Thu, 2 Apr 2009, James Mansion wrote:

> Might have to give the whole disk to ZFS with Solaris to give it
> confidence to enable write cache

Confidence, sure, but not necessarily performance at the same time. The ZFS Kool-Aid gets bitter sometimes too, and I worry that its reputation causes people to just trust it when they should be wary. If there's anything this thread does, I hope it helps demonstrate how easy it is to discover reality doesn't match expectations at all in this very messy area. Trust No One! Keep Your Laser Handy!

There's a summary of the expected happy ZFS actions at http://www.opensolaris.org/jive/thread.jspa?messageID=19264& and a good cautionary tale of unhappy ZFS behavior in this area at http://blogs.digitar.com/jjww/2006/12/shenanigans-with-zfs-flushing-and-intelligent-arrays/ and its follow-up http://blogs.digitar.com/jjww/2007/10/back-in-the-sandbox-zfs-flushing-shenanigans-revisted/

Systems with a hardware write cache are pretty common on this list, which makes the situation described there not that unlikely to run into. The official word here is at http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
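For reference, the FLUSH advice in that Evil Tuning Guide boils down to telling ZFS not to issue cache-flush commands at all when every device behind the pool has a non-volatile (battery-backed) cache. A rough sketch of what that looked like on Solaris of that era; treat the exact tunable name as an assumption rather than gospel:

  # /etc/system on Solaris 10 / OpenSolaris; takes effect at the next reboot.
  # Only sane when the array cache is battery-backed -- on plain disks this
  # reintroduces exactly the lie the rest of this thread is about.
  set zfs:zfs_nocacheflush = 1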
On Thu, 2 Apr 2009, Scott Carey wrote:

> The big one is this quote from the linux kernel list:
> "Right now, if you want a reliable database on Linux, you _cannot_
> properly depend on fsync() or fdatasync(). Considering how much Linux
> is used for critical databases, using these functions, this amazes me."

Things aren't as bad as that out-of-context quote makes them seem. There are two main problem situations here:

1) You cannot trust Linux to flush data past a hard drive's write cache. Solution: turn off the write cache. Given the general poor state of targeted fsync on Linux (quoting from a downthread comment by David Lang: "in data=ordered mode, the default for most distros, ext3 can end up having to write all pending data when you do a fsync on one file"), those fsyncs were likely to blow out the drive cache anyway.

2) There are no hard guarantees about write ordering at the disk level; if you write blocks A, B and C and then fsync, you might actually get, say, only B written before power goes out. I don't believe the PostgreSQL WAL design will be corrupted by this particular situation, because until that fsync comes back saying all three are done, none of them is relied upon.

> Interestingly, postgres would be safer on linux if it used
> sync_file_range instead of fsync() but that has other drawbacks and
> limitations

I have thought about whether it would be possible to add a Linux-specific improvement into the code path that already does something custom for Windows and Mac OS X when you use wal_sync_method=fsync_writethrough.

We really should update the documentation in this area before 8.4 ships. I'm looking into moving the "Tuning PostgreSQL WAL Synchronization" paper I wrote onto the wiki and then fleshing it out with all this filesystem-specific trivia.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
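In practice, "turn off the write cache" in point 1 usually comes down to a couple of vendor tools. A minimal sketch, with /dev/sda as a placeholder device name:

  # SATA/ATA drives:
  hdparm -W /dev/sda           # show whether the volatile write cache is enabled
  hdparm -W 0 /dev/sda         # disable it (may need re-applying after a power cycle)

  # SCSI/SAS drives:
  sdparm --get=WCE /dev/sda    # query the Write Cache Enable bit
  sdparm --clear=WCE /dev/sda  # clear it

The trade-off, as the benchmarks in this thread show, is a large drop in fsync throughput unless a battery-backed controller cache takes over that role.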
On Fri, 3 Apr 2009, Greg Smith wrote:

> Hannes sent this off-list, presumably via newsgroup, and it's certainly
> worth sharing. I've always been scared off of using XFS because of the
> problems outlined at http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc ,
> with more testing showing similar issues at
> http://pages.cs.wisc.edu/~vshree/xfs.pdf too
>
> (I'm finding that old message with Ted saying "Making sure you don't lose
> data is Job #1" hilarious right now, considering the recent ext4 data loss
> debacle)

Also note that the message from Ted was back in 2004; there has been a _lot_ of work done on XFS in the last 4 years.

As for the second link, that focuses on what happens to the filesystem if the disk under it starts returning errors or garbage. With the _possible_ exception of ZFS, every filesystem around will do strange things under those conditions. And in my opinion, the way to deal with this sort of thing isn't to move to ZFS to detect the problem, it's to set up redundancy in your storage so that you can not only detect the problem, but correct it as well (it's a good thing to know that your database file is corrupt, but that's not nearly as useful as having some way to recover the data that was there).

David Lang

[remainder of the quoted message, including the forwarded sysbench output, snipped; it appears in full above]
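On the md-raid setups most people here (including the original poster) are running, the "detect and correct" part comes from a periodic scrub. A minimal sketch; md0 below is a placeholder for whichever array holds the data:

  echo check  > /sys/block/md0/md/sync_action   # read every copy and compare them
  cat /sys/block/md0/md/mismatch_cnt            # non-zero means the copies disagree
  echo repair > /sys/block/md0/md/sync_action   # make the copies consistent again

Note that md keeps no block checksums, so "repair" only makes the mirrors agree; it cannot tell which copy was the correct one, which is exactly the gap the checksumming approaches discussed next try to close.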
On Fri, 3 Apr 2009, david@lang.hm wrote:

> Also note that the message from Ted was back in 2004; there has been a
> _lot_ of work done on XFS in the last 4 years.

Sure, I know they've made progress, which is why I didn't also bring up older ugly problems like delayed allocation issues reducing files to zero length on XFS. I thought that particular issue was pretty fundamental to the logical journal scheme XFS is based on. What you'll get out of disk I/O at smaller than the block level is pretty unpredictable when there's a failure.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On 4/3/09 6:05 PM, "david@lang.hm" <david@lang.hm> wrote:

> On Fri, 3 Apr 2009, Greg Smith wrote:
> [snip]
>
> And in my opinion, the way to deal with this sort of thing isn't to move
> to ZFS to detect the problem, it's to set up redundancy in your storage so
> that you can not only detect the problem, but correct it as well (it's a
> good thing to know that your database file is corrupt, but that's not
> nearly as useful as having some way to recover the data that was there).

Not trying to spread too much Kool-Aid around, but ZFS does that. If a mirror set (which might have 2, 3 or more copies in the mirror) detects a checksum error, it reads the other copies and attempts to correct the bad block. Plus, read performance under normal conditions scales with the number of mirrors: 12 disks in RAID 10 do writes as fast as a 6-disk RAID 0, but reads as fast as a 12-disk RAID 0, since it does not have to read both sides of the mirror to detect an error, only to recover. You can even write zeros to random spots in a mirror and it will throw errors and use the other copies.

This really isn't a ZFS promotion; rather it's a promotion of the power of checksums at the filesystem and RAID level. A hardware RAID card could just as well sacrifice some space to put checksums on its blocks and get much the same result.

> David Lang
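A quick way to see that self-healing behaviour for yourself, on a scratch pool only; the pool and device names here are invented for the example, not taken from anyone's setup in this thread:

  zpool create tank mirror c1t1d0 c1t2d0        # throwaway two-way mirror
  # deliberately corrupt one side of the mirror, well past the disk label:
  dd if=/dev/zero of=/dev/dsk/c1t1d0s0 bs=1M count=64 seek=100 conv=notrunc
  zpool scrub tank                              # read and checksum every block
  zpool status -v tank                          # CKSUM errors show up on c1t1d0,
                                                # with the data repaired from c1t2d0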