Re: Sunfire X4500 recommendations - Mailing list pgsql-performance

From Matt Smiley
Subject Re: Sunfire X4500 recommendations
Msg-id 460B0F6302000028000202D3@rtk_gwim1.rentrak.com
In response to Sunfire X4500 recommendations  ("Matt Smiley" <mss@rentrak.com>)
Responses Re: Sunfire X4500 recommendations
List pgsql-performance
Hi David,

Thanks for your feedback!  I'm rather a newbie at this, and I do appreciate the critique.

First, let me correct myself: The formulas for the risk of losing data when you lose 2 or 3 disks shouldn't have included the first term (g/n).  I'll give the corrected formulas and tables at the end of the email.


> please explain why you are saying that the risk of loosing any 1 disk is
> 1/n. shouldn't it be probability of failure * n instead?

1/n represents the assumption that all disks have an equal probability of being the next one to fail.  This seems like a fair assumption in general for the active members of a stripe (not including hot spares).  A possible exception would be the parity disks (because reads always skip them and writes always hit them), but that's only a consideration if the RAID configuration used dedicated disks for parity instead of distributing it across the RAID 5/6 group members.  Apart from that, whether the workload is write-heavy or read-heavy, sequential or scattered, the disks in the stripe ought to handle a roughly equivalent number of iops over their lifetime.


> following this logic the risk of loosing all 48 disks in a single group of
> 48 would be 100%

Exactly.  Putting all disks in one group is RAID 0 -- no data protection.  If you lose even 1 active member of the stripe, the probability of losing your data is 100%.


> also what you are looking for is the probability of the second (and third)
> disks failing in time X (where X is the time nessasary to notice the
> failure, get a replacement, and rebuild the disk)

Yep, that's exactly what I'm looking for.  That's why I said, "these probabilities are only describing the case where we don't have enough time between disk failures to recover the array."  My goal wasn't to estimate how long time X is.  (It doesn't seem like a generalizable quantity; due partly to logistical and human factors, it's unique to each operating environment.)  Instead, I start with the assumption that time X has been exceeded, and we've lost a 2nd (or 3rd) disk in the array.  Given that assumption, I wanted to show the probability that the loss of the 2nd disk has caused the stripe to become unrecoverable.

We know that RAID 10 and 50 can tolerate the loss of anywhere between 1 and n/g disks, depending on how lucky you are.  I wanted to quantify the amount of luck required, as a risk management tool.  The duration of time X can be minimized with hot spares and attentive administrators, but the risk after exceeding time X can only be minimized (as far as I know) by configuring the RAID stripe with small enough underlying failure groups.
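(As a sanity check on that conditional-probability reasoning, here is a minimal Monte Carlo sketch -- my own illustration, not part of the original analysis; the simulate() helper and its parameters are names I made up for this example.  It randomly picks which disks have failed and counts how often they all land in one failure group.)

import random

def simulate(n_disks, group_size, failures, trials=200000):
    """Estimate P(all failed disks fall in the same group) by random sampling."""
    group_of = [d // group_size for d in range(n_disks)]  # disk index -> group id
    hits = 0
    for _ in range(trials):
        failed = random.sample(range(n_disks), failures)
        if len({group_of[d] for d in failed}) == 1:  # every failure in one group
            hits += 1
    return hits / trials

# 48 disks in mirrored pairs (g=2): a 2nd failure should be fatal about 1/47 = 2.13% of the time.
print(simulate(48, 2, 2))
# 48 disks in 6-disk double-parity groups: a 3rd failure is fatal about 0.93% of the time.
print(simulate(48, 6, 3))

The estimates should converge on the same figures as the tables at the end of this email.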


> the killer is the time needed to rebuild the disk, with multi-TB arrays
> is't sometimes faster to re-initialize the array and reload from backup
> then it is to do a live rebuild (the kernel.org servers had a raid failure
> recently and HPA mentioned that it took a week to rebuild the array, but
> it would have only taken a couple days to do a restore from backup)

That's very interesting.  I guess the rebuild time also would depend on how large the damaged failure group was.  Under RAID 10, for example, I think you'd still only have to rebuild 1 disk from its mirror, regardless of how many other disks were in the stripe, right?  So shortening the rebuild time may be another good motivation to keep the failure groups small.


> add to this the fact that disk failures do not appear to be truely
> independant from each other statisticly (see the recent studies released
> by google and cmu), and I wouldn't bother with single-parity for a

I don't think I've seen the studies you mentioned.  Would you cite them please?  This may not be typical of everyone's experience, but what I've seen during in-house load tests is an equal I/O rate for each disk in my stripe, using short-duration sampling intervals to avoid long-term averaging effects.  This is what I expected to find, so I didn't delve deeper.

Certainly it's true that some disks may be more heavily burdened than others for hours or days, but I wouldn't expect any bias from an application-driven access pattern to persist for a significant fraction of a disk's lifespan.  The only influence I'd expect to bias the cumulative I/O handled by a disk over its entire life would be its role in the RAID configuration.  Hot spares will have minimal wear-and-tear until they're activated.  Dedicated parity disks will probably live longer than data disks, unless the workload is very heavily oriented towards small writes (e.g. logging).


> multi-TB array. If the data is easy to recreate (including from backup) or
> short lived (say a database of log data that cycles every month or so) I
> would just do RAID-0 and plan on loosing the data on drive failure (this
> assumes that you can afford the loss of service when this happens). if the
> data is more important then I'd do dual-parity or more, along with a hot
> spare so that the rebuild can start as soon as the first failure is
> noticed by the system to give myself a fighting chance to save things.

That sounds like a fine plan.  In my case, downtime is unacceptable (which is, of course, why I'm interested in quantifying the probabilities of data loss).


Here are the corrected formulas:

Let:
   g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
   n = total number of disks
   risk of losing any 1 disk = 1/n
Then we have:
   risk of losing 1 disk from a particular group = g/n
   risk of losing 2 disks in the same group = (g-1)/(n-1)
   risk of losing 3 disks in the same group = (g-1)/(n-1) * (g-2)/(n-2)
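(In code form, here's a minimal sketch of those two formulas -- my own transcription, with function names I've invented for illustration.  Both are conditional probabilities: they take for granted that the first disk has already failed and the rebuild window, time X, has been exceeded.)

def risk_two_in_same_group(g, n):
    """P(the 2nd failed disk is in the same group as the 1st) = (g-1)/(n-1)."""
    return (g - 1) / (n - 1)

def risk_three_in_same_group(g, n):
    """P(the 2nd and 3rd failures both land in the 1st disk's group)."""
    return (g - 1) / (n - 1) * (g - 2) / (n - 2)

print("%.2f%%" % (100 * risk_two_in_same_group(2, 48)))    # 2.13  (RAID 10, mirrored pairs)
print("%.2f%%" % (100 * risk_three_in_same_group(6, 48)))  # 0.93  (RAID 60/Z2, 6-disk groups)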

For the x4500, we have 48 disks.  If we stripe our data across all those disks, then these are our configuration
options:

RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
             2          24           48            24              2.13%
             3          16           48            32              4.26%
             4          12           48            36              6.38%
             6           8           48            40             10.64%
             8           6           48            42             14.89%
            12           4           48            44             23.40%
            16           3           48            45             31.91%
            24           2           48            46             48.94%
            48           1           48            47            100.00%

RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
             2          24           48           n/a                n/a
             3          16           48            16              0.09%
             4          12           48            24              0.28%
             6           8           48            32              0.93%
             8           6           48            36              1.94%
            12           4           48            40              5.09%
            16           3           48            42              9.71%
            24           2           48            44             23.40%
            48           1           48            46            100.00%
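(For completeness, a short script -- again my own sketch, assuming one redundant disk per group for RAID 10/50 and two per group for RAID 60/Z2 -- reproduces both tables for the 48-disk x4500.)

n = 48

print("RAID 10/50: data loss when 2 failures land in one group")
for g in (2, 3, 4, 6, 8, 12, 16, 24, 48):
    usable = n - n // g                        # one mirror/parity disk per group
    risk = (g - 1) / (n - 1)
    print("%3d disks/group  %2d groups  usable=%2d  risk=%6.2f%%" % (g, n // g, usable, 100 * risk))

print("\nRAID 60/Z2: data loss when 3 failures land in one group")
for g in (3, 4, 6, 8, 12, 16, 24, 48):
    usable = n - 2 * (n // g)                  # two parity disks per group
    risk = (g - 1) / (n - 1) * (g - 2) / (n - 2)
    print("%3d disks/group  %2d groups  usable=%2d  risk=%6.2f%%" % (g, n // g, usable, 100 * risk))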


