Re: RAID stripe size question - Mailing list pgsql-performance

From Alex Turner
Subject Re: RAID stripe size question
Date
Msg-id 33c6269f0607172121i1b6610b1j3fc686d4132f880b@mail.gmail.com
In response to Re: RAID stripe size question  (Ron Peacetree <rjpeace@earthlink.net>)
Responses Re: RAID stripe size question  ("Merlin Moncure" <mmoncure@gmail.com>)
List pgsql-performance


On 7/17/06, Ron Peacetree <rjpeace@earthlink.net> wrote:
-----Original Message-----
>From: Mikael Carneholm <Mikael.Carneholm@WirelessCar.com>
>Sent: Jul 17, 2006 5:16 PM
>To: Ron Peacetree <rjpeace@earthlink.net>, pgsql-performance@postgresql.org
>Subject: RE: [PERFORM] RAID stripe size question
>
>>15Krpm HDs will have average access times of 5-6ms.  10Krpm ones of 7-8ms.
>
>Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?
>
Ah, the games vendors play.  "Average seek time" for a 10Krpm HD may very well be 4.9ms.  However, what matters to you, the user, is "average =access= time".  The first is how long it takes to position the heads over the correct track.  The second is how long it takes to actually find and get data from a specified HD sector, which adds the rotational latency on top of the seek.
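
For concreteness: average access time is roughly average seek time plus average rotational latency (about half a platter rotation).  A quick Python sketch, using the 4.9ms seek figure above and an assumed ~3.8ms average seek for a 15Krpm drive:

# Average access time ~= average seek + average rotational latency
# (about half a rotation).  The seek figures passed in are illustrative.
def avg_access_ms(avg_seek_ms, rpm):
    half_rotation_ms = 0.5 * 60_000.0 / rpm   # 60,000 ms per minute
    return avg_seek_ms + half_rotation_ms

print(avg_access_ms(4.9, 10_000))   # ~7.9ms for the 10Krpm drive above
print(avg_access_ms(3.8, 15_000))   # ~5.8ms for a typical 15Krpm drive

Those results line up with the 5-6ms and 7-8ms access-time ranges quoted earlier.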

>> 28HDs as above setup as 2 RAID 10's => ~75MBps*5= ~375MB/s,  ~75*9= ~675MB/s.
>
>I guess it's still limited by the 2Gbit FC (192MB/s), right?
>
No.  A decent HBA has multiple IO channels on it.  For instance, Areca's ARC-6080 (an 8/12/16-port 4Gbps Fibre-to-SATA II controller) has two 4Gbps FC ports in it (...and can support up to 4GB of BB cache!).  Nominally, this card can push 8Gbps = 800MBps.  ~600-700MBps is the real-world number.

Assuming ~75MBps ASTR per HD, that's roughly enough bandwidth for a 16 HD RAID 10 set per ARC-6080.
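
A rough sketch of that sizing arithmetic, assuming ~75MB/s sustained transfer per drive and counting each RAID 10 mirror pair once:

# Back-of-envelope: how many drives' worth of streaming bandwidth can one
# HBA carry?  All figures are the assumptions from the discussion above.
PER_DRIVE_MBPS = 75              # ~75MB/s average sustained transfer rate
DUAL_4G_FC_MBPS = 650            # ~600-700MB/s real-world for the ARC-6080
SINGLE_4G_FC_MBPS = 320          # ~320MB/s real-world for one 4Gb FC port

def raid10_streaming_mbps(drives):
    # A RAID 10 set delivers roughly one drive's worth of user bandwidth
    # per mirror pair.
    return (drives // 2) * PER_DRIVE_MBPS

for n in (8, 16, 28):
    print(f"{n} drives: ~{raid10_streaming_mbps(n)} MB/s of user bandwidth")
# 8 drives (~300MB/s) roughly match one 4Gb FC port, 16 drives (~600MB/s)
# roughly match the dual-port ARC-6080, and 28 drives (~1050MB/s) would need
# more than one such HBA, at least for streaming loads.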

>>Very, very few RAID controllers can do >= 1GBps.  One thing that helps greatly with
>>bursty IO patterns is to up your battery backed RAID cache as high as you possibly
>>can.  Even multiple GBs of BBC can be worth it.
>>Another reason to have multiple controllers ;-)
>
>I use 90% of the RAID cache for writes, don't think I could go higher than that.
>Too bad the Emulex only has 256MB though :/
>
If your RAID cache hit rates are in the 90+% range, you would probably find it profitable to make the cache even bigger.  I've definitely seen access patterns that benefited from increased RAID cache at every size I could actually install.  For those access patterns, no amount of RAID cache commercially available was enough to find the "flattening" point of the cache percentage curve.  256MB of BB RAID cache per HBA is just not that much for many IO patterns.

90% as in 90% of the cache RAM allocated to writes, not a 90% hit rate, I'm imagining.

>The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID= ), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.
>
This is a relatively low-end HBA with one 4Gb FC port on it.  Max sustained IO on it is going to be ~320MBps, or roughly enough for an 8 HD RAID 10 set made of 75MBps ASTR HDs.

28 such HDs are =definitely= IO choked on this HBA.


No they aren't.  This is OLTP, not data warehousing.  I already posted the math for OLTP throughput, which is on the order of 8-80MB/second of actual data throughput based on maximum theoretical seeks/second.
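
A sketch of that seek-bound arithmetic; the per-drive random IO rates are assumptions, only the 8KB block size is a Postgres given:

# Seek-bound OLTP throughput: each random IO moves one 8KB Postgres block.
BLOCK_KB = 8

def oltp_mb_per_sec(drives, random_ios_per_drive):
    return drives * random_ios_per_drive * BLOCK_KB / 1024

print(oltp_mb_per_sec(28, 100))   # ~22 MB/s at a conservative 100 IOs/s/drive
print(oltp_mb_per_sec(28, 350))   # ~77 MB/s at an optimistic 350 IOs/s/drive
# Either way it is tens of MB/s, nowhere near the HBA's streaming limit.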

The arithmetic suggests you need a better HBA, or more HBAs, or both.


>>WALs are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards.  Big DB pages and big RAID stripes make sense for WALs.

Unless, of course, you are running OLTP, in which case a big stripe isn't necessary; spend the disks on your data partition, because your WAL activity is going to be small compared with your random IO.

>
>According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.")
>
>I guess I'll have to find out which theory holds by good ol' trial and error... :)
>
IME, stripe sizes of 64KB, 128KB, or 256KB are the ones most commonly found to be optimal across most combinations of access pattern + SW + FS + OS + HW.

New records will be posted at the end of a file, and will only grow the file by the number of blocks in the transactions posted at write time.  Updated records are modified in place unless they have grown too big to fit in place.  If you are updating multiple tables in each transaction, a 64KB stripe size or lower is probably going to be best, since Postgres blocks are just 8KB.  How much data does your average transaction write?  How many xacts per second?  That will help determine how many writes your cache will queue up before it flushes, and therefore what the optimal stripe size will be.  Of course, the fastest and most accurate way is probably just to try different settings and see how they work.  Alas, some controllers seem to handle some stripe sizes more efficiently in defiance of any logic.
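
A hypothetical sketch of that sizing question; the per-transaction write and the number of writes queued per cache flush are pure assumptions:

import math

BLOCK_KB = 8                  # Postgres block size
AVG_BLOCKS_PER_XACT = 3       # assumed: a few rows across a couple of tables
XACTS_PER_FLUSH = 20          # assumed: xacts queued in BBU cache per flush

# How many stripe units (and hence spindles) does one cache flush touch?
burst_kb = XACTS_PER_FLUSH * AVG_BLOCKS_PER_XACT * BLOCK_KB   # 480KB here
for stripe_kb in (32, 64, 128, 256):
    units = math.ceil(burst_kb / stripe_kb)
    print(f"{stripe_kb:>3}KB stripe: a {burst_kb}KB flush spans ~{units} stripe units")
# 32KB -> 15 units, 64KB -> 8, 128KB -> 4, 256KB -> 2: smaller stripes spread
# a flush across more spindles, bigger ones keep each write on fewer drives.
# Only benchmarking on your own controller really settles it.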

Work out how big your xacts are and how many xacts/second you can post, and you will figure out how fast WAL will be written.  Allocate enough disk for peak load plus planned expansion on WAL, and then put the rest to tablespace.  You may well find that a single RAID 1 is enough for WAL (if you achieve theoretical performance levels, which it's clear your controller isn't).
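
A sketch of that WAL arithmetic; both input figures are assumptions for illustration:

XACTS_PER_SEC = 500                  # assumed peak commit rate
WAL_BYTES_PER_XACT = 16 * 1024       # assumed ~2 blocks of WAL per xact

wal_mb_per_sec = XACTS_PER_SEC * WAL_BYTES_PER_XACT / (1024 * 1024)
print(f"~{wal_mb_per_sec:.1f} MB/s of WAL")   # ~7.8 MB/s with these numbers
# A healthy RAID 1 streams sequential writes at tens of MB/s, so it covers
# this comfortably; size it for peak load plus growth and give the remaining
# spindles to the data tablespaces.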

For example, your bonnie++ benchmark shows 538 seeks/second.  If on each seek one writes 8KB of data (one block), then your total throughput to disk is 538*8KB = 4304KB, which is just ~4MB/second of actual throughput for WAL, about what I estimated in my calculations earlier.  A single RAID 1 will easily suffice to handle WAL for this kind of OLTP xact rate.  Even if you write a full 64KB stripe on every pass, that's still only 538*64KB = 34432KB, or around 34MB/s, still within the capability of a correctly running RAID 1 and, even with your low bonnie scores, within the capability of your 4 disk RAID 10.

Remember, when it comes to OLTP, massive serial throughput is not going to help you; what matters is low seek times, which is why people still buy 15Krpm drives, and why you don't necessarily need a honking SAS/SATA controller that can harness the full 1066MB/sec of your PCI-X bus (or more for PCIe).  Of course, once you have a bunch of OLTP data, people will inevitably want reports on that stuff, and what was mainly an OLTP database suddenly becomes a data warehouse in a matter of months, so don't neglect to consider that problem as well.

Also, more RAM on the RAID card will seriously help bolster your transaction rate, as your controller can queue up a whole bunch of table writes and burst them all at once in a single seek, which can increase your overall throughput by as much as an order of magnitude (and you would therefore have to size WAL accordingly).
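
A sketch of that coalescing effect, reusing the 538 seeks/second figure from the bonnie++ run above; the coalescing factors themselves are assumptions:

SEEKS_PER_SEC = 538      # from the bonnie++ result quoted earlier
BLOCK_KB = 8

# If the cache can merge several 8KB writes into one physical seek, the same
# seek budget moves proportionally more data.
for writes_per_seek in (1, 4, 10):
    mb_per_sec = SEEKS_PER_SEC * writes_per_seek * BLOCK_KB / 1024
    print(f"{writes_per_seek:>2} writes/seek: ~{mb_per_sec:.0f} MB/s of 8KB writes")
# 1x ~4MB/s, 4x ~17MB/s, 10x ~42MB/s: roughly the order-of-magnitude
# improvement described above.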

But finally: if your card/cab isn't performing, RMA it.  Send the damn thing back and get something that actually can do what it should.  Don't tolerate manufacturers' BS!!

Alex
