Thread: RAID stripe size question
I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146GB drives, currently grouped as 10 (for WAL) + 18 (for data). There's only one controller (an Emulex), but I hope performance won't suffer too much from that. RAID level is 0+1, filesystem is ext3.
Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? In theory, a smaller stripe size (8-32K) should increase sequential write throughput at the cost of decreased positioning performance, which sounds good for WAL (assuming WAL is never "searched" during normal operation). And for the disks holding the data, a larger stripe size (>32K) should provide for more concurrent (small) reads/writes at the cost of decreased raw throughput. This is with an OLTP-type application in mind, so I'd rather have high transaction throughput than high sequential read speed. The interface is a 2Gbit FC, so I'm throttled to (theoretically) 192MB/s anyway.
So, does this make sense? Has anyone tried it and seen any performance gains from it?
Regards,
Mikael.
On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:
> Now to the interesting part: would it make sense to use different stripe
> sizes on the separate disk arrays? In theory, a smaller stripe size
> (8-32K) should increase sequential write throughput at the cost of
> decreased positioning performance, which sounds good for WAL (assuming
> WAL is never "searched" during normal operation).

For large writes (i.e. sequential write throughput), it doesn't really matter what the stripe size is; all the disks will have to both seek and write anyhow.

/* Steinar */
--
Homepage: http://www.sesse.net/
On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:
> I have finally gotten my hands on the MSA1500 that we ordered some time
> ago. It has 28 x 10K 146GB drives, currently grouped as 10 (for WAL) +
> 18 (for data). There's only one controller (an Emulex), but I hope

You've got 1.4TB assigned to the WAL, which doesn't normally have more than a couple of gigs?

Mike Stone
Someone check my math here...
And as always - run benchmarks with your app to verify
Alex.
Yeah, it seems to be a waste of disk space (spindles as well?). I was unsure how much activity the WAL disks would have compared to the data disks, so I created an array from 10 disks as the application is very write intense (many spindles / high throughput is crucial). I guess that a mirror of two disks is enough from a disk space perspective, but from a throughput perspective it will limit me to ~25MB/s (roughly calculated). A 0+1 array of 4 disks *could* be enough, but I'm still unsure how WAL activity correlates to "normal data" activity (is it 1:1, 1:2, 1:4, ...?)

Michael Stone wrote:
> You've got 1.4TB assigned to the WAL, which doesn't normally have more
> than a couple of gigs?
Hi, Mikael,

Mikael Carneholm wrote:
> A 0+1 array of 4 disks *could* be enough, but I'm still unsure how WAL
> activity correlates to "normal data" activity (is it 1:1, 1:2, 1:4, ...?)

I think the main difference is that the WAL activity is mostly linear, whereas the normal data activity is rather random access. Thus, a mirror of a few disks (or, with good controller hardware, RAID 6 on 4 disks or so) for WAL should be enough to cope with a large set of data and index disks, which spend a lot more time seeking.

Btw, it may make sense to spread different tables, or tables and indices, onto different RAID sets, as you seem to have enough spindles.

And look into the commit_delay/commit_siblings settings; they allow you to trade latency for throughput (meaning a little more latency per transaction, but many more transactions per second of throughput for the whole system).

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
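[For readers wanting to try the "different RAID sets per table/index" idea: tablespaces (PostgreSQL 8.0+) are the mechanism. A minimal sketch - the mount point, database, table and index names below are all hypothetical:

    # directory must exist and be owned by the postgres OS user
    mkdir -p /mnt/raidset2/pg_idx && chown postgres /mnt/raidset2/pg_idx
    psql -d mydb -c "CREATE TABLESPACE idxspace LOCATION '/mnt/raidset2/pg_idx'"
    psql -d mydb -c "CREATE INDEX orders_cust_idx ON orders (customer_id) TABLESPACE idxspace"

Existing objects can later be moved with ALTER TABLE/INDEX ... SET TABLESPACE.]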
> I think the main difference is that the WAL activity is mostly linear,
> whereas the normal data activity is rather random access.

That was what I was expecting, and after reading http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html I figured that a different stripe size for the WAL set could be worth investigating. I have now dropped the old sets (10+18) and created two new RAID 1+0 sets (4 for WAL, 24 for data) instead. Bonnie++ is still running, but I'll post the numbers as soon as it has finished. I did actually use different stripe sizes for the sets as well, 8K for the WAL disks and 64K for the data. It's quite painless to do these things with HBAnywhere, so it's no big deal if I have to go back to another configuration. The battery-backed cache is only 256MB though, and that bothers me; I assume a larger (512MB-1GB) cache would make quite a difference. Oh well.

> Btw, it may make sense to spread different tables, or tables and indices,
> onto different RAID sets, as you seem to have enough spindles.

This is something I'd also like to test, as a common best practice these days is to go for a SAME (stripe all, mirror everything) setup. From a development perspective it's easier to use SAME as the developers won't have to think about physical location for new tables/indices, so if there's no performance penalty with SAME I'll gladly keep it that way.

> And look into the commit_delay/commit_siblings settings; they allow you
> to trade latency for throughput (meaning a little more latency per
> transaction, but many more transactions per second of throughput for the
> whole system).

In a previous test, using cd=5000 and cs=20 increased transaction throughput by ~20%, so I'll definitely fiddle with that in the coming tests as well.

Regards,
Mikael.
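[For reference, those two settings live in postgresql.conf and only need a reload, not a restart. A minimal sketch of applying the values Mikael mentions - the data directory path is an assumption:

    # adjust the data directory for your install
    echo "commit_delay = 5000      # usec to wait so concurrent commits can share a WAL flush" >> /var/lib/pgsql/data/postgresql.conf
    echo "commit_siblings = 20     # only delay when at least this many xacts are open" >> /var/lib/pgsql/data/postgresql.conf
    pg_ctl -D /var/lib/pgsql/data reload

The delay only kicks in when commit_siblings other transactions are active, so it mainly helps under real concurrency.]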
Hi, Mikael,

Mikael Carneholm wrote:
> This is something I'd also like to test, as a common best practice
> these days is to go for a SAME (stripe all, mirror everything) setup.
> From a development perspective it's easier to use SAME as the developers
> won't have to think about physical location for new tables/indices, so
> if there's no performance penalty with SAME I'll gladly keep it that
> way.

Usually, it's not the developers' task to care about that, but the DBA's responsibility.

>> And look into the commit_delay/commit_siblings settings; they allow you
>> to trade latency for throughput (meaning a little more latency per
>> transaction, but many more transactions per second of throughput for the
>> whole system).
>
> In a previous test, using cd=5000 and cs=20 increased transaction
> throughput by ~20%, so I'll definitely fiddle with that in the coming
> tests as well.

How many parallel transactions do you have?

Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
>> This is something I'd also like to test, as a common best practice
>> these days is to go for a SAME (stripe all, mirror everything) setup.
>> From a development perspective it's easier to use SAME as the developers
>> won't have to think about physical location for new tables/indices, so
>> if there's no performance penalty with SAME I'll gladly keep it that way.
>
> Usually, it's not the developers' task to care about that, but the DBA's responsibility.

As we don't have a full-time dedicated DBA (although I'm the one who does most DBA-related tasks) I would aim for making physical location as transparent as possible, otherwise I'm afraid I won't be doing anything else than supporting developers with that - and I *do* have other things to do as well :)

>> In a previous test, using cd=5000 and cs=20 increased transaction
>> throughput by ~20%, so I'll definitely fiddle with that in the coming
>> tests as well.
>
> How many parallel transactions do you have?

That was when running BenchmarkSQL (http://sourceforge.net/projects/benchmarksql) with 100 concurrent users ("terminals"), which I assume means 100 parallel transactions at most. The target application for this DB has 3-4 times as many concurrent connections, so it's possible that one would have to find other cs/cd numbers better suited for that scenario. Tweaking bgwriter is another task I'll look into as well.

Btw, here are the bonnie++ results from two different array sets (10+18, 4+24) on the MSA1500:

LUN: WAL, 10 disks, stripe size 32K
------------------------------------
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01        32G 56139  93 73250  22 16530   3 30488  45 57489   5 477.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2458  90 +++++ +++ +++++ +++  3121  99 +++++ +++ 10469  98

LUN: WAL, 4 disks, stripe size 8K
----------------------------------
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01        32G 49170  82 60108  19 13325   2 15778  24 21489   2 266.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2432  86 +++++ +++ +++++ +++  3106  99 +++++ +++ 10248  98

LUN: DATA, 18 disks, stripe size 32K
-------------------------------------
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01        32G 59990  97 87341  28 19158   4 30200  46 57556   6 495.4   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1640  92 +++++ +++ +++++ +++  1736  99 +++++ +++ 10919  99

LUN: DATA, 24 disks, stripe size 64K
-------------------------------------
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01        32G 59443  97 118515 39 25023   5 30926  49 60835   6 531.8   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2499  90 +++++ +++ +++++ +++  2817  99 +++++ +++ 10971 100

Regards,
Mikael
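[For anyone who wants to reproduce tables in this layout, a minimal bonnie++ 1.03 invocation along these lines should do it - the mount point is an assumption, and -s (in MB) should be at least twice RAM, hence 32GB here:

    # run as root so -u can drop to an unprivileged user; point -d at the LUN under test
    bonnie++ -d /mnt/data_lun/bench -s 32768 -n 16 -u postgres -m sesell01

The -n 16 matches the "files 16" row in the create/delete tests above.]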
> From: Mikael Carneholm <Mikael.Carneholm@WirelessCar.com>
> Sent: Jul 16, 2006 6:52 PM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] RAID stripe size question
>
> I have finally gotten my hands on the MSA1500 that we ordered some time
> ago. It has 28 x 10K 146GB drives,

Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K. (unless they are old?)

I'm not just being pedantic. The correct, let alone optimal, answer to your question depends on your exact HW characteristics as well as your SW config and your usage pattern.

15Krpm HDs will have average access times of 5-6ms, 10Krpm ones of 7-8ms. Most modern HDs in this class will do ~60MB/s on inner tracks, ~75MB/s on average, and ~90MB/s on outer tracks.

If you are doing OLTP-like things, you are more sensitive to latency than most and should use the absolute lowest-latency HDs available within your budget. The current latency best case is 15Krpm FC HDs.

> currently grouped as 10 (for WAL) + 18 (for data). There's only one
> controller (an Emulex), but I hope performance won't suffer too much
> from that. RAID level is 0+1, filesystem is ext3.

I strongly suspect having only 1 controller is an I/O choke with 28 HDs.

28 HDs in the above setup as 2 RAID 10's => ~75MBps*5 = ~375MB/s and ~75MBps*9 = ~675MB/s. If both sets are to run at peak average speed, the Emulex would have to be able to handle ~1050MBps on average. It is doubtful the 1 Emulex can do this.

In order to handle this level of bandwidth, a RAID controller must aggregate multiple FC, SCSI, or SATA streams as well as do any RAID 5 checksumming etc. that is required. Very, very few RAID controllers can do >= 1GBps.

One thing that helps greatly with bursty IO patterns is to up your battery-backed RAID cache as high as you possibly can. Even multiple GBs of BBC can be worth it. Another reason to have multiple controllers ;-)

Then there is the question of the BW of the bus that the controller is plugged into. ~800MB/s is the RW max to be gotten from a 64b 133MHz PCI-X channel. PCI-E channels are usually good for 1/10 their rated speed in bps as Bps. So a PCI-E x4 10Gbps bus can be counted on for 1GBps, PCI-E x8 for 2GBps, etc. At present I know of no RAID controllers that can singly saturate a PCI-E x4 or greater bus.

...and we haven't even touched on OS, SW, and usage pattern issues. Bottom line is that the IO chain is only as fast as its slowest component.

> Now to the interesting part: would it make sense to use different stripe
> sizes on the separate disk arrays?

The short answer is Yes.

WALs are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards. Big DB pages and big RAID stripes make sense for WALs.

Tables with OLTP-like characteristics need smaller DB pages and stripes to minimize latency issues (although locality of reference can make the optimum stripe size larger). Tables with data-mining-like characteristics usually work best with larger DB page sizes and RAID stripe sizes.

OS and FS overhead can make things more complicated. So can DB layout and access pattern issues.

Side note: a 10 HD RAID 10 seems a bit much for WAL. Do you really need 375MBps IO on average to your WAL more than you need IO capacity for other tables? If WAL IO needs to be very high, I'd suggest getting an SSD or SSD-like device that fits your budget and having said device async-mirror to HD.

Bottom line is to optimize your RAID stripe sizes =after= you optimize your OS, FS, and pg design for best IO for your usage pattern(s).
Hope this helps, Ron
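[Ron's bandwidth arithmetic, spelled out so you can plug in your own numbers - the ~75MB/s per-drive streaming rate is his assumption, and a RAID 10 streams at roughly (drives/2) x per-drive rate:

    echo "10-disk RAID10: $(( 10 / 2 * 75 )) MB/s"                     # ~375 MB/s
    echo "18-disk RAID10: $(( 18 / 2 * 75 )) MB/s"                     # ~675 MB/s
    echo "combined:       $(( (10/2 + 18/2) * 75 )) MB/s"              # ~1050 MB/s
    echo "vs. one 2Gbit FC link at roughly 192 MB/s"

The point being that a single 2Gbit link, let alone a single low-end HBA, cannot carry both sets at full streaming speed.]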
On Mon, Jul 17, 2006 at 09:40:30AM -0400, Ron Peacetree wrote:
> Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity
> are 15K, not 10K. (unless they are old?)

There are still 146GB SCSI 10000rpm disks being sold here, at least.

/* Steinar */
--
Homepage: http://www.sesse.net/
These bonnie++ numbers are very worrying. Your controller should easily max out your FC interface on these tests, passing 192MB/sec with ease on anything more than a 6-drive RAID 10. This is a bad omen if you want high performance... Each mirror pair can do 60-80MB/sec. A 24-disk RAID 10 can do 12*60MB/sec, which is 720MB/sec - I have seen this performance, it's not unreachable, but time and again we see these bad perf numbers from FC and SCSI systems alike. Consider a different controller, because this one is not up to snuff. A single drive would get better numbers than your 4-disk RAID 10; 21MB/sec read speed is really pretty sorry, it should be closer to 120MB/sec. If you can't swap it out, software RAID may turn out to be your friend. The only saving grace is that this is OLTP, and perhaps, just maybe, the controller will be better at ordering IOs, but I highly doubt it.

Please people, do the numbers, benchmark before you buy: many, many HBAs really suck under Linux/FreeBSD, and you may end up paying vast sums of money for very sub-optimal performance (I'd say sub-standard, but alas, it seems that this kind of poor performance is tolerated, even though it's way off where it should be). There's no point having a 40-disk cab if your controller can't handle it.

Maximum theoretical linear throughput can be achieved in a white box for under $20k, and I have seen this kind of system outperform a server 5 times its price, even in OLTP.
> Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K.

In the spec we got from HP, they are listed as model 286716-B22 (http://www.dealtime.com/xPF-Compaq_HP_146_8_GB_286716_B22), which seems to run at 10K. Don't know how old those are, but that's what we got from HP anyway.

> 15Krpm HDs will have average access times of 5-6ms. 10Krpm ones of 7-8ms.

Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?

> 28 HDs in the above setup as 2 RAID 10's => ~75MBps*5 = ~375MB/s and ~75MBps*9 = ~675MB/s.

I guess it's still limited by the 2Gbit FC (192MB/s), right?

> Very, very few RAID controllers can do >= 1GBps. One thing that helps greatly with bursty IO
> patterns is to up your battery-backed RAID cache as high as you possibly can. Even multiple GBs
> of BBC can be worth it. Another reason to have multiple controllers ;-)

I use 90% of the raid cache for writes, don't think I could go higher than that. Too bad the Emulex only has 256MB though :/

> Then there is the question of the BW of the bus that the controller is plugged into.
> ~800MB/s is the RW max to be gotten from a 64b 133MHz PCI-X channel.
> PCI-E channels are usually good for 1/10 their rated speed in bps as Bps.
> So a PCI-E x4 10Gbps bus can be counted on for 1GBps, PCI-E x8 for 2GBps, etc.
> At present I know of no RAID controllers that can singly saturate a PCI-E x4 or greater bus.

The controller is an FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID=), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.

>> Now to the interesting part: would it make sense to use different
>> stripe sizes on the separate disk arrays?
>
> The short answer is Yes.

Ok

> WALs are basically appends that are written in bursts of your chosen log chunk size and that are
> almost never read afterwards. Big DB pages and big RAID stripes make sense for WALs.

According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.")

I guess I'll have to find out which theory holds by good ol' trial and error... :)

- Mikael
Mikael Carneholm wrote:
> Btw, here are the bonnie++ results from two different array sets (10+18,
> 4+24) on the MSA1500:
>
> LUN: DATA, 24 disks, stripe size 64K
> sesell01 32G 59443 97 118515 39 25023 5 30926 49 60835 6 531.8 1

It might be interesting to see if a 128K or 256K stripe size gives better sequential throughput, while still leaving the random performance ok.

Having said that, the seeks/s figure of 531 is not that great - for instance I've seen a 12-disk (15K SCSI) system report about 1400 seeks/s in this test.

Sorry if you mentioned this already - but what OS and filesystem are you using? (if Linux and ext3, it might be worth experimenting with xfs or jfs).

Cheers

Mark
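[If you do try xfs as Mark suggests, it can be told about the array geometry at mkfs time, which tends to matter more as the stripe grows. A sketch for the 24-disk RAID 10 with a 64K stripe - the device name is hypothetical, and su/sw must match what the controller actually exposes (su = per-disk stripe unit, sw = data-bearing spindles, i.e. 12 mirror pairs):

    mkfs.xfs -d su=64k,sw=12 /dev/sdb
    mount -o noatime /dev/sdb /var/lib/pgsql/data    # empty mount point assumed

jfs has no equivalent geometry hints, so there it's just mkfs.jfs and remount.]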
-----Original Message-----
> From: Mikael Carneholm <Mikael.Carneholm@WirelessCar.com>
> Sent: Jul 17, 2006 5:16 PM
> To: Ron Peacetree <rjpeace@earthlink.net>, pgsql-performance@postgresql.org
> Subject: RE: [PERFORM] RAID stripe size question
>
>> 15Krpm HDs will have average access times of 5-6ms. 10Krpm ones of 7-8ms.
>
> Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?

Ah, the games vendors play. "Average seek time" for a 10Krpm HD may very well be 4.9ms. However, what matters to you the user is "average =access= time". The 1st is how long it takes to position the heads to the correct track. The 2nd is how long it takes to actually find and get data from a specified HD sector.

>> 28 HDs in the above setup as 2 RAID 10's => ~75MBps*5 = ~375MB/s and ~75MBps*9 = ~675MB/s.
>
> I guess it's still limited by the 2Gbit FC (192MB/s), right?

No. A decent HBA has multiple IO channels on it. So for instance Areca's ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA II controller) has 2 4Gbps FCs in it (...and can support up to 4GB of BB cache!). Nominally, this card can push 8Gbps = 800MBps. ~600-700MBps is the RW number. Assuming ~75MBps ASTR per HD, that's ~ enough bandwidth for a 16 HD RAID 10 set per ARC-6080.

>> Very, very few RAID controllers can do >= 1GBps. One thing that helps greatly with
>> bursty IO patterns is to up your battery-backed RAID cache as high as you possibly can.
>> Even multiple GBs of BBC can be worth it. Another reason to have multiple controllers ;-)
>
> I use 90% of the raid cache for writes, don't think I could go higher than that.
> Too bad the Emulex only has 256MB though :/

If your RAID cache hit rates are in the 90+% range, you probably would find it profitable to make it greater. I've definitely seen access patterns that benefitted from increased RAID cache for any size I could actually install. For those access patterns, no amount of RAID cache commercially available was enough to find the "flattening" point of the cache percentage curve. 256MB of BB RAID cache per HBA is just not that much for many IO patterns.

> The controller is an FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID=), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.

This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on it is going to be ~320MBps. Or ~ enough for an 8 HD RAID 10 set made of 75MBps ASTR HDs. 28 such HDs are =definitely= IO choked on this HBA. The arithmetic suggests you need a better HBA or more HBAs or both.

>> WALs are basically appends that are written in bursts of your chosen log chunk size and that are
>> almost never read afterwards. Big DB pages and big RAID stripes make sense for WALs.
>
> According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.")
>
> I guess I'll have to find out which theory holds by good ol' trial and error... :)

IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal for most access patterns + SW + FS + OS + HW.
-----Original Message-----
>From: Mikael Carneholm <Mikael.Carneholm@WirelessCar.com>
>Sent: Jul 17, 2006 5:16 PM
>To: Ron Peacetree < rjpeace@earthlink.net>, pgsql-performance@postgresql.org
>Subject: RE: [PERFORM] RAID stripe size question
>
>>15Krpm HDs will have average access times of 5-6ms. 10Krpm ones of 7-8ms.
>
>Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?
>
> Ah, the games vendors play. "Average seek time" for a 10Krpm HD may very well be 4.9ms. However, what matters to you the user is "average =access= time". The 1st is how long it takes to position the heads to the correct track. The 2nd is how long it takes to actually find and get data from a specified HD sector.
>> 28HDs as above setup as 2 RAID 10's => ~75MBps*5= ~375MB/s, ~75*9= ~675MB/s.
>
>I guess it's still limited by the 2Gbit FC (192Mb/s), right?
>
> No. A decent HBA has multiple IO channels on it. So for instance Areca's ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA II controller) has 2 4Gbps FCs in it (...and can support up to 4GB of BB cache!). Nominally, this card can push 8Gbps = 800MBps. ~600-700MBps is the RW number.
> Assuming ~75MBps ASTR per HD, that's ~ enough bandwidth for a 16 HD RAID 10 set per ARC-6080.
>>Very, very few RAID controllers can do >= 1GBps One thing that help greatly with
>>bursty IO patterns is to up your battery backed RAID cache as high as you possibly
>>can. Even multiple GBs of BBC can be worth it.
>>Another reason to have multiple controllers ;-)
>
>I use 90% of the raid cache for writes, don't think I could go higher than that.
>Too bad the emulex only has 256Mb though :/
>
> If your RAID cache hit rates are in the 90+% range, you probably would find it profitable to make it greater. I've definitely seen access patterns that benefitted from increased RAID cache for any size I could actually install. For those access patterns, no amount of RAID cache commercially available was enough to find the "flattening" point of the cache percentage curve. 256MB of BB RAID cache per HBA is just not that much for many IO patterns.
90% as in 90% of the RAM, not 90% hit rate I'm imagining.
>The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID= ), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.
>
> This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on it is going to be ~320MBps. Or ~ enough for an 8 HD RAID 10 set made of 75MBps ASTR HDs.
> 28 such HDs are =definitely= IO choked on this HBA.
No they aren't. This is OLTP, not data warehousing. I already posted the math for OLTP throughput, which is on the order of 8-80MB/second actual data throughput based on maximum theoretical seeks/second.
> The arithmetic suggests you need a better HBA or more HBAs or both.
>>WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards. Big DB pages and big RAID stripes makes sense for WALs.
Unless of course you are running OLTP, in which case a big stripe isn't necessary; spend the disks on your data partition, because your WAL activity is going to be small compared with your random IO.
>
>According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.")
>
>I guess I'll have to find out which theory holds by good ol' trial and error... :)
>
> IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal for most access patterns + SW + FS + OS + HW.
New records will be posted at the end of a file, and will only increase the file by the number of blocks in the transactions posted at write time. Updated records are modified in place unless they have grown too big to be in place. If you are updating multiple tables on each transaction, a 64KB stripe size or lower is probably going to be best, as block sizes are just 8KB. How much data does your average transaction write? How many xacts per second? This will help determine how many writes your cache will queue up before it flushes, and therefore what the optimal stripe size will be. Of course, the fastest and most accurate way is probably just to try different settings and see how it works. Alas, some controllers seem to handle some stripe sizes more efficiently in defiance of any logic.
Work out how big your xacts are and how many xacts/second you can post, and you will figure out how fast WAL will be written. Allocate enough disk for peak load plus planned expansion on WAL and then put the rest to tablespace. You may well find that a single RAID 1 is enough for WAL (if you achieve theoretical performance levels, which it's clear your controller isn't).
For example, your bonnie++ benchmark shows 538 seeks/second. If on each seek one writes 8K of data (one block), then your total throughput to disk is 538*8K = 4304K, which is just 4MB/second actual throughput for WAL - about what I estimated in my calculations earlier. A single RAID 1 will easily suffice to handle WAL for this kind of OLTP xact rate. Even if you write a full stripe on every pass at 64KB, that's still only 538*64K = 34432K, or around 34MB/second, still within the capability of a correctly running RAID 1, and even with your low bonnie scores, within the capability of your 4-disk RAID 10.
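[The same back-of-the-envelope, as a quick check you can adapt to your own seek and write-size numbers - the 538 seeks/s comes from the bonnie++ run above:

    # seeks/s * bytes written per seek = rough WAL bandwidth requirement
    echo "8K blocks:   $(( 538 * 8 / 1024 )) MB/s"    # ~4 MB/s
    echo "64K stripes: $(( 538 * 64 / 1024 )) MB/s"   # ~33-34 MB/s
]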
Remember, when it comes to OLTP massive serial throughput is not gonna help you; it's low seek times that matter, which is why people still buy 15K RPM drives, and why you don't necessarily need a honking SAS/SATA controller which can harness the full 1066MB/sec of your PCI-X bus, or more for PCIe. Of course, once you have a bunch of OLTP data, people will inevitably want reports on that stuff, and what was mainly an OLTP database suddenly becomes a data warehouse in a matter of months, so don't neglect to consider that problem also.
Also, more RAM on the RAID card will seriously help bolster your transaction rate, as your controller can queue up a whole bunch of table writes and burst them all at once in a single seek, which will increase your overall throughput by as much as an order of magnitude (and you would therefore have to size WAL accordingly).
But finally - if your card/cab isn't performing, RMA it. Send the damn thing back and get something that actually can do what it should. Don't tolerate manufacturers' BS!!
Alex
> From: Alex Turner <armtuk@gmail.com>
> Sent: Jul 18, 2006 12:21 AM
> To: Ron Peacetree <rjpeace@earthlink.net>
> Cc: Mikael Carneholm <Mikael.Carneholm@wirelesscar.com>, pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] RAID stripe size question
>
> On 7/17/06, Ron Peacetree <rjpeace@earthlink.net> wrote:
>>> I use 90% of the raid cache for writes, don't think I could go higher
>>> than that. Too bad the Emulex only has 256MB though :/
>>
>> If your RAID cache hit rates are in the 90+% range, you probably would
>> find it profitable to make it greater. I've definitely seen access patterns
>> that benefitted from increased RAID cache for any size I could actually
>> install. For those access patterns, no amount of RAID cache commercially
>> available was enough to find the "flattening" point of the cache percentage
>> curve. 256MB of BB RAID cache per HBA is just not that much for many IO patterns.
>
> 90% as in 90% of the RAM, not 90% hit rate I'm imagining.

Either way, =particularly= for OLTP-like I/O patterns, the more RAID cache the better, unless the IO pattern is completely random. In which case the best you can do is cache the entire sector map of the RAID set and use as many spindles as possible for the tables involved. I've seen high end set ups in Fortune 2000 organizations that look like some of the things you read about on tpc.org: =hundreds= of HDs are used.

Clearly, completely random IO patterns are to be avoided whenever and however possible. Thankfully, most things can be designed to not have completely random IO, and stuff like WAL IO is definitely not random.

The important point here about cache size is that unless you make cache large enough that you see a flattening in the cache behavior, you probably can still use more cache. Working sets are often very large for DB applications.

>>> The controller is an FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID=),
>>> which uses PCI-E. Don't know how it compares to other controllers, haven't
>>> had the time to search for / read any reviews yet.
>>
>> This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on
>> it is going to be ~320MBps. Or ~ enough for an 8 HD RAID 10 set made of
>> 75MBps ASTR HDs.
>>
>> 28 such HDs are =definitely= IO choked on this HBA.
>
> No they aren't. This is OLTP, not data warehousing. I already posted the math
> for OLTP throughput, which is on the order of 8-80MB/second actual data
> throughput based on maximum theoretical seeks/second.

WAL IO patterns are not OLTP-like. Neither are most support or decision support IO patterns. Even in an OLTP system, there are usually only a few scenarios and tables where the IO pattern is pessimal.

Alex is quite correct that those few will be the bottleneck on overall system performance if the system's primary function is OLTP-like. For those few, you dedicate as many spindles and RAID cache as you can afford and as shows any performance benefit. I've seen an entire HBA maxed out with cache and as many HDs as would saturate the attainable IO rate dedicated to =1= table (unfortunately SSD was not a viable option in this case).

>> The arithmetic suggests you need a better HBA or more HBAs or both.
>>
>> WALs are basically appends that are written in bursts of your chosen
>> log chunk size and that are almost never read afterwards. Big DB pages and
>> big RAID stripes make sense for WALs.
>
> Unless of course you are running OLTP, in which case a big stripe isn't
> necessary; spend the disks on your data partition, because your WAL activity
> is going to be small compared with your random IO.

Or to put it another way, the scenarios and tables that have the most random-looking IO patterns are going to be the performance bottleneck on the whole system. In an OLTP-like system, WAL IO is unlikely to be your biggest performance issue. As in any other performance tuning effort, you only gain by speeding up the current bottleneck.

>>> According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it
>>> seems to be the other way around? ("As stripe size is decreased, files are
>>> broken into smaller and smaller pieces. This increases the number of drives
>>> that an average file will use to hold all the blocks containing the data of
>>> that file, theoretically increasing transfer performance, but decreasing
>>> positioning performance.")
>>>
>>> I guess I'll have to find out which theory holds by good ol' trial and error... :)
>>
>> IME, stripe sizes of 64, 128, or 256 are the most common found to be
>> optimal for most access patterns + SW + FS + OS + HW.
>
> New records will be posted at the end of a file, and will only increase the
> file by the number of blocks in the transactions posted at write time.
> Updated records are modified in place unless they have grown too big to be
> in place. If you are updating multiple tables on each transaction, a 64KB
> stripe size or lower is probably going to be best, as block sizes are just 8KB.

Here's where Theory and Practice conflict. pg does not "update" and modify in place in the true DB sense. A pg UPDATE is actually an insert of a new row or rows, !not! a modify in place. I'm sure Alex knows this and just temporarily forgot some of the context of this thread :-)

The append behavior Alex refers to is the best case scenario for pg, where a) the table is unfragmented and b) the file segment of say 2GB holding that part of the pg table is not full. VACUUM and autovacuum are your friend.

> How much data does your average transaction write? How many xacts per
> second? This will help determine how many writes your cache will queue up
> before it flushes, and therefore what the optimal stripe size will be. Of
> course, the fastest and most accurate way is probably just to try different
> settings and see how it works. Alas, some controllers seem to handle some
> stripe sizes more efficiently in defiance of any logic.
>
> Work out how big your xacts are and how many xacts/second you can post, and you
> will figure out how fast WAL will be written. Allocate enough disk for
> peak load plus planned expansion on WAL and then put the rest to
> tablespace. You may well find that a single RAID 1 is enough for WAL (if
> you achieve theoretical performance levels, which it's clear your controller isn't).

This is very good advice.

> For example, your bonnie++ benchmark shows 538 seeks/second. If on each seek
> one writes 8K of data (one block), then your total throughput to disk is
> 538*8K = 4304K, which is just 4MB/second actual throughput for WAL - about
> what I estimated in my calculations earlier. A single RAID 1 will easily
> suffice to handle WAL for this kind of OLTP xact rate. Even if you write a
> full stripe on every pass at 64KB, that's still only 538*64K = 34432K, or
> around 34MB/second, still within the capability of a correctly running RAID 1,
> and even with your low bonnie scores, within the capability of your 4-disk RAID 10.

I'd also suggest that you figure out what the max accesses per sec is for your HDs and make sure you are attaining it, since this will set the ceiling on your overall system performance. Like I've said, I've seen organizations dedicate as much HW as could make any difference on a per-table basis for important OLTP systems.

> Remember, when it comes to OLTP massive serial throughput is not gonna help
> you; it's low seek times that matter, which is why people still buy 15K RPM
> drives, and why you don't necessarily need a honking SAS/SATA controller which
> can harness the full 1066MB/sec of your PCI-X bus, or more for PCIe. Of course,
> once you have a bunch of OLTP data, people will inevitably want reports on
> that stuff, and what was mainly an OLTP database suddenly becomes a data
> warehouse in a matter of months, so don't neglect to consider that problem also.

One warning to expand on Alex's point here: DO !NOT! use the same table schema and/or DB for your reporting and OLTP. You will end up with a DBMS that is neither good at reporting nor OLTP.

> Also, more RAM on the RAID card will seriously help bolster your transaction
> rate, as your controller can queue up a whole bunch of table writes and
> burst them all at once in a single seek, which will increase your overall
> throughput by as much as an order of magnitude (and you would therefore have
> to size WAL accordingly).

*nods*

> But finally - if your card/cab isn't performing, RMA it. Send the damn thing
> back and get something that actually can do what it should. Don't tolerate
> manufacturers' BS!!

On this Alex and I are in COMPLETE agreement.

Ron
> This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on it is going to be
> ~320MBps. Or ~ enough for an 8 HD RAID 10 set made of 75MBps ASTR HDs.

Looking at http://h30094.www3.hp.com/product.asp?sku=2260908&extended=1, I notice that the controller has an Ultra160 SCSI interface, which implies that the theoretical max throughput is 160MB/s. Ouch.

However, what's more important is the seeks/s - ~530/s on a 28-disk array is quite lousy compared to the 1400/s on a 12 x 15K disk array as mentioned by Mark here: http://archives.postgresql.org/pgsql-performance/2006-07/msg00170.php. Could be the disk RPM (10K vs 15K) that makes the difference here...

I will test another stripe size (128K) for the DATA LUN (28 disks) to see what difference that makes; I think I read somewhere that Linux flushes blocks of 128K at a time, so it might be worth evaluating.

/Mikael
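[On the "128K at a time" point: what is easy to inspect on Linux is the per-device readahead, which also interacts with stripe size for sequential scans - the device name is hypothetical, and 256 sectors (128K) is the usual default:

    blockdev --getra /dev/sda        # readahead in 512-byte sectors; 256 = 128K
    blockdev --setra 1024 /dev/sda   # e.g. try 512K before re-running the sequential tests
]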
On 7/18/06 6:34 AM, "Mikael Carneholm" <Mikael.Carneholm@WirelessCar.com> wrote:
> However, what's more important is the seeks/s - ~530/s on a 28 disk
> array is quite lousy compared to the 1400/s on a 12 x 15Kdisk array
I'm getting 2500 seeks/second on a 36 disk SATA software RAID (ZFS, Solaris 10) on a Sun X4500:
=========== Single Stream ============
With a very recent update to the zfs module that improves I/O scheduling and prefetching, I get the following bonnie++ 1.03a results with a 36 drive RAID10, Solaris 10 U2 on an X4500 with 500GB Hitachi drives (zfs checksumming is off):
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
thumperdw-i-1 32G 120453 99 467814 98 290391 58 109371 99 993344 94 1801 4
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ 30850 99 +++++ +++ +++++ +++
=========== Two Streams ============
Bumping up the number of concurrent processes to 2, we get about 1.5x speed reads of RAID10 with a concurrent workload (you have to add the rates together):
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
thumperdw-i-1 32G 111441 95 212536 54 171798 51 106184 98 719472 88 1233 2
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 26085 90 +++++ +++ 5700 98 21448 97 +++++ +++ 4381 97
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
thumperdw-i-1 32G 116355 99 212509 54 171647 50 106112 98 715030 87 1274 3
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 26082 99 +++++ +++ 5588 98 21399 88 +++++ +++ 4272 97
So that’s 2500 seeks per second, 1440MB/s sequential block read, 212MB/s per character sequential read.
=======================
- Luke
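[For context, a 36-drive ZFS "RAID10" like the one Luke describes is normally built as 18 two-way mirror vdevs; a sketch with hypothetical Solaris device names:

    zpool create tank \
        mirror c0t0d0 c1t0d0 \
        mirror c0t1d0 c1t1d0 \
        mirror c0t2d0 c1t2d0
    # ...continue adding mirror pairs up to 18, then:
    zfs set checksum=off tank     # matches the "checksumming is off" caveat above
]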
This is a great testament to the fact that very often software RAID will seriously outperform hardware RAID, because the OS guys who implemented it took the time to do it right, as compared with some controller manufacturers who seem to think it's okay to provide sub-standard performance.

Based on the bonnie++ numbers coming back from your array, I would also encourage you to evaluate software RAID, as you might see significantly better performance as a result. RAID 10 is also a good candidate as it's not so heavy on the cache and CPU as RAID 5.
Alex.
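[A minimal md software-RAID 10 sketch to benchmark against the MSA's internal RAID - device names and the 64K chunk are assumptions, and /dev/sd[c-z] stands for 24 individually-exported drives:

    mdadm --create /dev/md0 --level=10 --chunk=64 --raid-devices=24 /dev/sd[c-z]
    mkfs.ext3 /dev/md0      # or xfs/jfs, as suggested elsewhere in the thread

Run the same bonnie++ command against the md device to get a like-for-like comparison.]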
On Tue, 2006-07-18 at 14:27, Alex Turner wrote:
> This is a great testament to the fact that very often software RAID
> will seriously outperform hardware RAID, because the OS guys who
> implemented it took the time to do it right, as compared with some
> controller manufacturers who seem to think it's okay to provide
> sub-standard performance.
>
> Based on the bonnie++ numbers coming back from your array, I would
> also encourage you to evaluate software RAID, as you might see
> significantly better performance as a result. RAID 10 is also a good
> candidate as it's not so heavy on the cache and CPU as RAID 5.

Also, consider testing a mix, where your hardware RAID controller does the mirroring and the OS stripes (RAID 0) over the top of it. I've gotten good performance from mediocre hardware cards doing this. It has the advantage of still being able to use the battery backed cache and its instant fsync, while not relying on some cards that have issues layering RAID levels one atop the other.
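[A sketch of the layering Scott describes, assuming the controller exports each hardware-mirrored pair as its own LUN - device names are hypothetical, with /dev/sdc through /dev/sdn standing for 12 RAID 1 LUNs:

    mdadm --create /dev/md0 --level=0 --chunk=64 --raid-devices=12 /dev/sd[c-n]
    mkfs.ext3 /dev/md0

Writes still land in the controller's battery-backed cache; only the striping is done in software.]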
Have you done any experiments implementing RAID 50 this way (HBA does RAID 5, OS does RAID 0)? If so, what were the results?

Ron

Scott Marlowe wrote:
> Also, consider testing a mix, where your hardware RAID controller does
> the mirroring and the OS stripes (RAID 0) over the top of it. I've
> gotten good performance from mediocre hardware cards doing this.
On Tue, 2006-07-18 at 14:43, Ron Peacetree wrote:
> Have you done any experiments implementing RAID 50 this way (HBA does
> RAID 5, OS does RAID 0)? If so, what were the results?

Nope, haven't tried that. At the time I was testing this I didn't even think of trying it. I'm not even sure I'd heard of RAID 50 at the time... :)

I basically had an old MegaRAID 4xx series card in a dual PPro 200 and a stack of six 9 gig hard drives. Spare parts. And even though the RAID 1+0 was relatively much faster on this hardware, the dual P IV 2800 with a pair of 15K USCSI drives and a much later model MegaRAID ate it for lunch with a single mirror set, and was plenty fast for our use at the time, so I never really had call to test it in production. But it definitely made our test server, the aforementioned PPro 200 machine, more livable.
> According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be
> the other way around? ("As stripe size is decreased, files are broken into smaller and smaller
> pieces. This increases the number of drives that an average file will use to hold all the blocks
> containing the data of that file, theoretically increasing transfer performance, but decreasing
> positioning performance.")

Mikael,

In OLTP you utterly need the best possible latency. If you decompose the response time of your physical request, you will see that positioning performance plays the dominant role in the response time (ignore for a moment caches and their effects). So, if you need really good response times for your SQL queries, choose 15Krpm disks (and add as much cache as possible to magnify the effect ;) )

Best Regards.

Milen
On 7/18/06, Alex Turner <armtuk@gmail.com> wrote: > Remember when it comes to OLTP, massive serial throughput is not gonna help > you, it's low seek times, which is why people still buy 15k RPM drives, and > why you don't necessarily need a honking SAS/SATA controller which can > harness the full 1066MB/sec of your PCI-X bus, or more for PCIe. Of course, hm. i'm starting to look seriously at SAS to take things to the next level. it's really not all that expensive, cheaper than scsi even, and you can mix/match sata/sas drives in the better enclosures. the real wild card here is the raid controller. i still think raptors are the best bang for the buck and SAS gives me everything i like about sata and scsi in one package. moving a gigabyte around/sec on the server, attached or no, is pretty heavy lifting on x86 hardware. merlin
Merlin,

> moving a gigabyte around/sec on the server, attached or no,
> is pretty heavy lifting on x86 hardware.

Maybe so, but we're doing 2GB/s plus on Sun/Thumper with software RAID and 36 disks, and 1GB/s on a HW RAID with 16 disks, all SATA.

WRT seek performance, we're doing 2500 seeks per second on the Sun/Thumper on 36 disks. You might do better with 15K RPM disks and great controllers, but I haven't seen it reported yet.

BTW - I'm curious about the HP P600 SAS host based RAID controller - it has very good specs, but is the Linux driver solid?

- Luke
On 8/3/06, Luke Lonergan <LLonergan@greenplum.com> wrote:
> Merlin,
>
>> moving a gigabyte around/sec on the server, attached or no,
>> is pretty heavy lifting on x86 hardware.
>
> Maybe so, but we're doing 2GB/s plus on Sun/Thumper with software RAID
> and 36 disks and 1GB/s on a HW RAID with 16 disks, all SATA.

that is pretty amazing, that works out to 55 mb/sec/drive, close to theoretical maximums. are you using pci-e sata controller and raptors im guessing? this is doubly impressive if we are talking raid 5 here. do you find that software raid is generally better than hardware at the highend? how much does this tax the cpu?

> WRT seek performance, we're doing 2500 seeks per second on the
> Sun/Thumper on 36 disks. You might do better with 15K RPM disks and
> great controllers, but I haven't seen it reported yet.

thats pretty amazing too. only a highly optimized raid system can pull this off.

> BTW - I'm curious about the HP P600 SAS host based RAID controller - it
> has very good specs, but is the Linux driver solid?

have no clue. i sure hope i dont go through the same headaches as with ibm scsi drivers (rebranded adaptec btw). sas looks really promising however. the adaptec sas gear is so cheap it might be worth it to just buy some and see what it can do.

merlin
> WRT seek performance, we're doing 2500 seeks per second on the Sun/Thumper on 36 disks.

Luke,

Have you had time to run benchmarksql against it yet? I'm just curious about the IO seeks/s vs. transactions/minute correlation...

/Mikael