Re: Sunfire X4500 recommendations - Mailing list pgsql-performance

From Matt Smiley
Subject Re: Sunfire X4500 recommendations
Date
Msg-id 4609904A020000280001FF78@rtk_gwim1.rentrak.com
In response to Sunfire X4500 recommendations  ("Matt Smiley" <mss@rentrak.com>)
Responses Re: Sunfire X4500 recommendations  (david@lang.hm)
List pgsql-performance
Hi Dimitri,

First of all, thanks again for the great feedback!

Yes, my I/O load is mostly read operations.  There are some bulk writes done in the background periodically throughout the day, but these are not as time-sensitive.  I'll have to do some testing to find the best balance of read vs. write speed and tolerance of disk failure vs. usable disk space.

I'm looking forward to seeing the results of your OLTP tests!  Good luck!  Since I won't be doing that myself, it'll be all new to me.

About disk failure, I certainly agree that increasing the number of disks will decrease the average time between disk failures.  Apart from any performance considerations, I wanted to get a clear idea of the risk of data loss under various RAID configurations.  It's a handy reference, so I thought I'd share it:

--------

The goal is to calculate the probability of data loss when we lose a certain number of disks within a short timespan (e.g. losing a 2nd disk before replacing+rebuilding the 1st one).  For RAID 10, 50, and Z, we will lose data if any disk group (i.e. mirror or parity-group) loses 2 disks.  For RAID 60 and Z2, we will lose data if 3 disks die in the same parity group.  The parity groups can include arbitrarily many disks.  Having larger groups gives us more usable disk space but less protection.  (Naturally we're more likely to lose 2 disks in a group of 50 than in a group of 5.)

    g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
    n = total number of disks
    risk of losing any 1 disk = 1/n
    risk of losing 1 disk from a particular group = g/n
    risk of losing 2 disks in the same group = g/n * (g-1)/(n-1)
    risk of losing 3 disks in the same group = g/n * (g-1)/(n-1) * (g-2)/(n-2)
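
In case a runnable form helps, here's a minimal Python sketch of the same arithmetic (the function name is just for illustration, not from any library):

    # Probability that k failed disks all land within one group of size g,
    # out of n disks total:  g/n * (g-1)/(n-1) * ... * (g-k+1)/(n-k+1)
    def same_group_risk(g, n, k):
        risk = 1.0
        for i in range(k):
            risk *= float(g - i) / (n - i)
        return risk

    # e.g. mirror pairs (g=2) on 48 disks, 2 failures:  2/48 * 1/47 = ~0.09%
    print("%.2f%%" % (100 * same_group_risk(2, 48, 2)))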

For the x4500, we have 48 disks.  If we stripe our data across all those disks, then these are our configuration
options:

RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
              2          24           48            24              0.09%
              3          16           48            32              0.27%
              4          12           48            36              0.53%
              6           8           48            40              1.33%
              8           6           48            42              2.48%
             12           4           48            44              5.85%
             24           2           48            46             24.47%
             48           1           48            47            100.00%

RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
              2          24           48           n/a                n/a
              3          16           48            16              0.01%
              4          12           48            24              0.02%
              6           8           48            32              0.12%
              8           6           48            36              0.32%
             12           4           48            40              1.27%
             24           2           48            44             11.70%
             48           1           48            46            100.00%
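
As a quick sanity check (purely illustrative), both tables can be regenerated by reusing the same_group_risk() sketch above; usable_disks works out to total_disks minus one disk per group for mirroring/single-parity, and minus two per group for double-parity:

    # Reproduce both tables above, reusing same_group_risk() from the earlier sketch
    for g in (2, 3, 4, 6, 8, 12, 24, 48):
        groups = 48 // g
        print("RAID 10/50:", g, groups, 48 - groups,
              "%.2f%%" % (100 * same_group_risk(g, 48, 2)))
        if g >= 3:  # double-parity needs at least 3 disks per group
            print("RAID 60/Z2:", g, groups, 48 - 2 * groups,
                  "%.2f%%" % (100 * same_group_risk(g, 48, 3)))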

So, in terms of fault tolerance:
 - RAID 60 and Z2 always beat RAID 10, since they never risk data loss when only 2 disks fail.
 - RAID 10 always beats RAID 50 and Z, since it has the largest number of disk groups across which to spread the risk.
 - Having more parity groups increases fault tolerance but decreases usable disk space.

That's all assuming each disk has an equal chance of failure, which is probably true since striping should distribute the workload evenly.  And again, these probabilities only describe the case where we don't have enough time between disk failures to recover the array.

In terms of performance, I think RAID 10 should always be best for write speed.  (Since it doesn't calculate parity, writing a new block doesn't require reading the rest of the RAID stripe just to recalculate the parity bits.)  I think it's also normally just as fast for reading, since the controller can load-balance the pending read requests to both sides of each mirror.

--------


