Thread: Sunfire X4500 recommendations

From:
"Matt Smiley"
Date:

My company is purchasing a Sunfire x4500 to run our most I/O-bound databases, and I'd like to get some advice on
configuration and tuning.  We're currently looking at:
 - Solaris 10 + zfs + RAID Z
 - CentOS 4 + xfs + RAID 10
 - CentOS 4 + ext3 + RAID 10
but we're open to other suggestions.

From previous message threads, it looks like some of you have achieved stellar performance under both Solaris 10 U2/U3
with zfs and CentOS 4.4 with xfs.  Would those of you who posted such results please describe how you tuned the OS/fs to
yield those figures (e.g. patches, special drivers, read-ahead, checksumming, write-through cache settings, etc.)?

Most of our servers currently run CentOS/RedHat, and we have little experience with Solaris, but we're not opposed to
Solaris if there's a compelling reason to switch.  For example, it sounds like zfs snapshots may have a lighter
performance penalty than LVM snapshots.  We've heard that just using LVM (even without active snapshots) imposes a
maximum sequential I/O rate of around 600 MB/s (although we haven't yet reached this limit experimentally).

By the way, we've also heard that Solaris is "more stable" under heavy I/O load than Linux.  Have any of you
experienced this?  It's hard to put much stock in such a blanket statement, but naturally we don't want to introduce
instabilities.

Thanks in advance for your thoughts!

For reference:

Our database cluster will be 3-6 TB in size.  The Postgres installation will be 8.1 (at least initially), compiled to
use 32 KB blocks (rather than 8 KB).  The workload will be predominantly OLAP.  The Sunfire X4500 has 2 dual-core
Opterons, 16 GB RAM, 48 SATA disks (500 GB/disk * 48 = 24 TB raw -> 12 TB usable under RAID 10).
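In case it's useful, here's roughly how we build with the larger blocks (a sketch only; the 8.1-era sources have no configure switch for this, so we edit the BLCKSZ define in src/include/pg_config_manual.h before compiling -- version number below is a placeholder):

```shell
# Sketch: build PostgreSQL 8.1 with 32 KB blocks instead of the default 8 KB.
# BLCKSZ is a compile-time constant; changing it requires a fresh initdb.
cd postgresql-8.1.x                        # placeholder source tree
sed -i 's/^#define BLCKSZ.*/#define BLCKSZ 32768/' \
    src/include/pg_config_manual.h         # GNU sed; edit by hand on Solaris
./configure --prefix=/usr/local/pgsql
make && make install
```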

So far, we've seen the X4500 deliver impressive but suboptimal results using the out-of-the-box installation of Solaris
+ zfs.  The Linux testing is in the early stages (no xfs, yet), but so far it yields comparatively modest write rates
and very poor read and rewrite rates.

===============================
Results under Solaris with zfs:
===============================

Four concurrent writers:
% time dd if=/dev/zero of=/zpool1/test/50GB-zero1 bs=1024k count=51200 ; time sync
% time dd if=/dev/zero of=/zpool1/test/50GB-zero2 bs=1024k count=51200 ; time sync
% time dd if=/dev/zero of=/zpool1/test/50GB-zero3 bs=1024k count=51200 ; time sync
% time dd if=/dev/zero of=/zpool1/test/50GB-zero4 bs=1024k count=51200 ; time sync

Seq Write (bs = 1 MB):  128 + 122 + 131 + 124 = 505 MB/s

Four concurrent readers:
% time dd if=/zpool1/test/50GB-zero1 of=/dev/null bs=1024k
% time dd if=/zpool1/test/50GB-zero2 of=/dev/null bs=1024k
% time dd if=/zpool1/test/50GB-zero3 of=/dev/null bs=1024k
% time dd if=/zpool1/test/50GB-zero4 of=/dev/null bs=1024k

Seq Read (bs = 1 MB):   181 + 177 + 180 + 178 = 716 MB/s


One bonnie++ process:
% bonnie++ -r 16384 -s 32g:32k -f -n0 -d /zpool1/test/bonnie_scratch

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumper1    32G:32k           604173  98 268893  43           543389  59 519.2   3
thumper1,32G:32k,,,604173,98,268893,43,,,543389,59,519.2,3,,,,,,,,,,,,,


4 concurrent synchronized bonnie++ processes:
% bonnie++ -p4
% bonnie++ -r 16384 -s 32g:32k -y -f -n0 -d /zpool1/test/bonnie_scratch
% bonnie++ -r 16384 -s 32g:32k -y -f -n0 -d /zpool1/test/bonnie_scratch
% bonnie++ -r 16384 -s 32g:32k -y -f -n0 -d /zpool1/test/bonnie_scratch
% bonnie++ -r 16384 -s 32g:32k -y -f -n0 -d /zpool1/test/bonnie_scratch
% bonnie++ -p-1

Combined results of 4 sessions:
Seq Output:   124 + 124 + 124 + 140 = 512 MB/s
Rewrite:       93 +  94 +  93 +  96 = 376 MB/s
Seq Input:    192 + 194 + 193 + 197 = 776 MB/s
Random Seek:  327 + 327 + 335 + 332 = 1321 seeks/s


=========================================
Results under CentOS 4 with ext3 and LVM:
=========================================

% bonnie++ -s 32g:32k -f -n0 -d /large_lvm_stripe/test/bonnie_scratch
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumper1.rt 32G:32k           346595  94 59448  11           132471  12 479.4   2
thumper1.rtkinternal,32G:32k,,,346595,94,59448,11,,,132471,12,479.4,2,,,,,,,,,,,,,


============================
Summary of bonnie++ results:
============================

                           sequential  sequential    sequential  scattered
Test case                  write MB/s  rewrite MB/s  read MB/s   seeks/s
-------------------------  ----------  ------------  ----------  ---------
Sol10+zfs, 1 process              604           269         543        519
Sol10+zfs, 4 processes            512           376         776       1321
Cent4+ext3+LVM, 1 process         347            59         132        479



From:
Dimitri
Date:

On Friday 23 March 2007 03:20, Matt Smiley wrote:
> My company is purchasing a Sunfire x4500 to run our most I/O-bound
> databases, and I'd like to get some advice on configuration and tuning.
> We're currently looking at: - Solaris 10 + zfs + RAID Z
>  - CentOS 4 + xfs + RAID 10
>  - CentOS 4 + ext3 + RAID 10
> but we're open to other suggestions.
>

Matt,

for Solaris + ZFS you may find answers to all your questions here:

  http://blogs.sun.com/roch/category/ZFS
  http://blogs.sun.com/realneel/entry/zfs_and_databases

Think about measuring log (WAL) activity, and use a separate pool for the logs if needed.
Also, RAID-Z is more security-oriented than performance-oriented; RAID-10 should be
a better choice...

Rgds,
-Dimitri

From:
"Matt Smiley"
Date:

Thanks Dimitri!  That was very educational material!  I'm going to think out loud here, so please correct me if you see
any errors.

The section on tuning for OLTP transactions was interesting, although my OLAP workload will be predominantly bulk I/O
over large datasets of mostly-sequential blocks.

The NFS+ZFS section talked about the zil_disable control for making zfs ignore commits/fsyncs.  Given that Postgres'
executor does single-threaded synchronous I/O like the tar example, it seems like it might benefit significantly from
setting zil_disable=1, at least in the case of frequently flushed/committed writes.  However, zil_disable=1 sounds
unsafe for the datafiles' filesystem, and would probably only be acceptable for the xlogs if they're stored on a
separate filesystem and you're willing to lose recently committed transactions.  This sounds pretty similar to just
setting fsync=off in postgresql.conf, which is easier to change later, so I'll skip the zil_disable control.

The RAID-Z section was a little surprising.  It made RAID-Z sound just like RAID 50, in that you can customize the
trade-off between iops versus usable diskspace and fault-tolerance by adjusting the number/size of parity-protected disk
groups.  The only difference I noticed was that RAID-Z will apparently set the stripe size across vdevs (RAID-5s) to be
as close as possible to the filesystem's block size, to maximize the number of disks involved in concurrently fetching
each block.  Does that sound about right?

So now I'm wondering what RAID-Z offers that RAID-50 doesn't.  I came up with 2 things: an alleged affinity for
full-stripe writes and (under RAID-Z2) the added fault-tolerance of RAID-6's 2nd parity bit (allowing 2 disks to fail
per zpool).  It wasn't mentioned in this blog, but I've heard that under certain circumstances, RAID-Z will magically
decide to mirror a block instead of calculating parity on it.  I'm not sure how this would happen, and I don't know the
circumstances that would trigger this behavior, but I think the goal (if it really happens) is to avoid the performance
penalty of having to read the rest of the stripe required to calculate parity.  As far as I know, this is only an issue
affecting small writes (e.g. single-row updates in an OLTP workload), but not large writes (compared to the RAID's
stripe size).  Anyway, when I saw the filesystem's intent log mentioned, I thought maybe the small writes are converted
to full-stripe writes by deferring their commit until a full stripe's worth of data had been accumulated.  Does that
sound plausible?

Are there any other noteworthy perks to RAID-Z, rather than RAID-50?  If not, I'm inclined to go with your suggestion,
Dimitri, and use zfs like RAID-10 to stripe a zpool over a bunch of RAID-1 vdevs.  Even though many of our queries do
mostly sequential I/O, getting higher seeks/second is more important to us than the sacrificed diskspace.

For the record, those blogs also included a link to a very helpful ZFS Best Practices Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

To sum up, so far the short list of tuning suggestions for ZFS includes:
 - Use a separate zpool and filesystem for xlogs if your apps write often.
 - Consider setting zil_disable=1 on the xlogs' dedicated filesystem.  ZIL is the intent log, and it sounds like
disabling it may be like disabling journaling.  Previous message threads in the Postgres archives debate whether this is
safe for the xlogs, but it didn't seem like a conclusive answer was reached.
 - Make filesystem block size (zfs record size) match the Postgres block size.
 - Manually adjust vdev_cache.  I think this sets the read-ahead size.  It defaults to 64 KB.  For OLTP workload,
reduce it; for DW/OLAP maybe increase it.
 - Test various settings for vq_max_pending (until zfs can auto-tune it).  See
http://blogs.sun.com/erickustarz/entry/vq_max_pending
 - A zpool of mirrored disks should support more seeks/second than RAID-Z, just like RAID 10 vs. RAID 50.  However, no
single Postgres backend will see better than a single disk's seek rate, because the executor currently dispatches only 1
logical I/O request at a time.
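As a concrete sketch of the first points above (device names are placeholders; adapt to your controller layout):

```shell
# Sketch: RAID-10-style layout -- a zpool of mirror vdevs for the datafiles,
# plus a separate pool for pg_xlog.  Device names below are placeholders.
zpool create datapool \
  mirror c0t0d0 c1t0d0 \
  mirror c0t1d0 c1t1d0 \
  mirror c0t2d0 c1t2d0    # ...and so on across the remaining disks
zpool create logpool mirror c4t0d0 c5t0d0

zfs create datapool/pgdata
zfs create logpool/pg_xlog

# Match the zfs recordsize to the 32 KB Postgres block size for the datafiles.
zfs set recordsize=32k datapool/pgdata
```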


>>> Dimitri <> 03/23/07 2:28 AM >>>
On Friday 23 March 2007 03:20, Matt Smiley wrote:
> My company is purchasing a Sunfire x4500 to run our most I/O-bound
> databases, and I'd like to get some advice on configuration and tuning.
> We're currently looking at: - Solaris 10 + zfs + RAID Z
>  - CentOS 4 + xfs + RAID 10
>  - CentOS 4 + ext3 + RAID 10
> but we're open to other suggestions.
>

Matt,

for Solaris + ZFS you may find answers to all your questions here:

  http://blogs.sun.com/roch/category/ZFS
  http://blogs.sun.com/realneel/entry/zfs_and_databases

Think to measure log (WAL) activity and use separated pool for logs if needed.
Also, RAID-Z is more security-oriented rather performance, RAID-10 should be
a better choice...

Rgds,
-Dimitri



From:
Dimitri
Date:

On Friday 23 March 2007 14:32, Matt Smiley wrote:
> Thanks Dimitri!  That was very educational material!  I'm going to think
> out loud here, so please correct me if you see any errors.

Your mail is so long that I wasn't able to answer all your questions the same day :))

>
> The section on tuning for OLTP transactions was interesting, although my
> OLAP workload will be predominantly bulk I/O over large datasets of
> mostly-sequential blocks.

I suppose mostly READ operations, right?

>
> The NFS+ZFS section talked about the zil_disable control for making zfs
> ignore commits/fsyncs.  Given that Postgres' executor does single-threaded
> synchronous I/O like the tar example, it seems like it might benefit
> significantly from setting zil_disable=1, at least in the case of
> frequently flushed/committed writes.  However, zil_disable=1 sounds unsafe
> for the datafiles' filesystem, and would probably only be acceptible for
> the xlogs if they're stored on a separate filesystem and you're willing to
> loose recently committed transactions.  This sounds pretty similar to just
> setting fsync=off in postgresql.conf, which is easier to change later, so
> I'll skip the zil_disable control.

yes, you don't need it for PostgreSQL, it may be useful for other database
vendors, but not here.

>
> The RAID-Z section was a little surprising.  It made RAID-Z sound just like
> RAID 50, in that you can customize the trade-off between iops versus usable
> diskspace and fault-tolerance by adjusting the number/size of
> parity-protected disk groups.  The only difference I noticed was that
> RAID-Z will apparently set the stripe size across vdevs (RAID-5s) to be as
> close as possible to the filesystem's block size, to maximize the number of
> disks involved in concurrently fetching each block.  Does that sound about
> right?

Well, look at RAID-Z as just a wide RAID solution.  The more disks you have in your
system, the higher the probability that you may lose 2 disks at the same time, and
in that case a wide RAID-10 will simply lose you the whole data set (if you
lose both disks of a mirror pair).  So, RAID-Z brings you more
security, as you may use wider parity, but the price for it is I/O
performance...

>
> So now I'm wondering what RAID-Z offers that RAID-50 doesn't.  I came up
> with 2 things: an alleged affinity for full-stripe writes and (under
> RAID-Z2) the added fault-tolerance of RAID-6's 2nd parity bit (allowing 2
> disks to fail per zpool).  It wasn't mentioned in this blog, but I've heard
> that under certain circumstances, RAID-Z will magically decide to mirror a
> block instead of calculating parity on it.  I'm not sure how this would
> happen, and I don't know the circumstances that would trigger this
> behavior, but I think the goal (if it really happens) is to avoid the
> performance penalty of having to read the rest of the stripe required to
> calculate parity.  As far as I know, this is only an issue affecting small
> writes (e.g. single-row updates in an OLTP workload), but not large writes
> (compared to the RAID's stripe size).  Anyway, when I saw the filesystem's
> intent log mentioned, I thought maybe the small writes are converted to
> full-stripe writes by deferring their commit until a full stripe's worth of
> data had been accumulated.  Does that sound plausible?

The problem here is that within the same workload, you're able to do fewer I/O
operations with RAID-Z than with RAID-10.  So, whether your I/O block size is bigger
or smaller, you'll still obtain lower throughput, no? :)

>
> Are there any other noteworthy perks to RAID-Z, rather than RAID-50?  If
> not, I'm inclined to go with your suggestion, Dimitri, and use zfs like
> RAID-10 to stripe a zpool over a bunch of RAID-1 vdevs.  Even though many
> of our queries do mostly sequential I/O, getting higher seeks/second is
> more important to us than the sacrificed diskspace.

There is still one point to check: if you do mostly READs on your database,
RAID-Z will probably not be *too* bad and will give you more usable space.
However, if you need to update your data or load frequently, RAID-10 will be
better...

>
> For the record, those blogs also included a link to a very helpful ZFS Best
> Practices Guide:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

oh yes, it's a constantly growing wiki, a good start for any Solaris questions as
well as performance points :)

>
> To sum up, so far the short list of tuning suggestions for ZFS includes:
>  - Use a separate zpool and filesystem for xlogs if your apps write often.
>  - Consider setting zil_disable=1 on the xlogs' dedicated filesystem.  ZIL
> is the intent log, and it sounds like disabling it may be like disabling
> journaling.  Previous message threads in the Postgres archives debate
> whether this is safe for the xlogs, but it didn't seem like a conclusive
> answer was reached. - Make filesystem block size (zfs record size) match
> the Postgres block size. - Manually adjust vdev_cache.  I think this sets
> the read-ahead size.  It defaults to 64 KB.  For OLTP workload, reduce it;
> for DW/OLAP maybe increase it. - Test various settings for vq_max_pending
> (until zfs can auto-tune it).  See
> http://blogs.sun.com/erickustarz/entry/vq_max_pending - A zpool of mirrored
> disks should support more seeks/second than RAID-Z, just like RAID 10 vs.
> RAID 50.  However, no single Postgres backend will see better than a single
> disk's seek rate, because the executor currently dispatches only 1 logical
> I/O request at a time.

I'm currently doing an OLTP benchmark on ZFS, and quite surprisingly it's
really *doing* several concurrent I/O operations under a multi-user workload! :)
Even vacuum seems to run much faster (or probably it's just my
impression :))
But keep in mind: ZFS is a very young file system, taking only its first
steps in database workloads.  So, the current goal here is to bring ZFS performance
at least to the same level as UFS reaches in the same conditions...
Positive news: PostgreSQL seems to me to be performing much better than other
database vendors (currently I'm getting at least 80% of UFS performance)...
All the tuning points you already mentioned are correct, and I
promise to publish all other details/findings once I've finished my
tests! (it's too early to draw conclusions yet :))

Best regards!
-Dimitri

>
> >>> Dimitri <> 03/23/07 2:28 AM >>>
>
> On Friday 23 March 2007 03:20, Matt Smiley wrote:
> > My company is purchasing a Sunfire x4500 to run our most I/O-bound
> > databases, and I'd like to get some advice on configuration and tuning.
> > We're currently looking at: - Solaris 10 + zfs + RAID Z
> >  - CentOS 4 + xfs + RAID 10
> >  - CentOS 4 + ext3 + RAID 10
> > but we're open to other suggestions.
>
> Matt,
>
> for Solaris + ZFS you may find answers to all your questions here:
>
>   http://blogs.sun.com/roch/category/ZFS
>   http://blogs.sun.com/realneel/entry/zfs_and_databases
>
> Think to measure log (WAL) activity and use separated pool for logs if
> needed. Also, RAID-Z is more security-oriented rather performance, RAID-10
> should be a better choice...
>
> Rgds,
> -Dimitri

From:
"Matt Smiley"
Date:

Hi Dimitri,

First of all, thanks again for the great feedback!

Yes, my I/O load is mostly read operations.  There are some bulk writes done in the background periodically throughout
the day, but these are not as time-sensitive.  I'll have to do some testing to find the best balance of read vs. write
speed and tolerance of disk failure vs. usable diskspace.

I'm looking forward to seeing the results of your OLTP tests!  Good luck!  Since I won't be doing that myself, it'll be
all new to me.

About disk failure, I certainly agree that increasing the number of disks will decrease the average time between disk
failures.  Apart from any performance considerations, I wanted to get a clear idea of the risk of data loss under
various RAID configurations.  It's a handy reference, so I thought I'd share it:

--------

The goal is to calculate the probability of data loss when we lose a certain number of disks within a short timespan
(e.g. losing a 2nd disk before replacing+rebuilding the 1st one).  For RAID 10, 50, and Z, we will lose data if any
disk group (i.e. mirror or parity-group) loses 2 disks.  For RAID 60 and Z2, we will lose data if 3 disks die in the
same parity group.  The parity groups can include arbitrarily many disks.  Having larger groups gives us more usable
diskspace but less protection.  (Naturally we're more likely to lose 2 disks in a group of 50 than in a group of 5.)

    g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
    n = total number of disks
    risk of losing any 1 disk = 1/n
    risk of losing 1 disk from a particular group = g/n
    risk of losing 2 disks in the same group = g/n * (g-1)/(n-1)
    risk of losing 3 disks in the same group = g/n * (g-1)/(n-1) * (g-2)/(n-2)

For the x4500, we have 48 disks.  If we stripe our data across all those disks, then these are our configuration
options:

RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
              2          24           48            24              0.09%
              3          16           48            32              0.27%
              4          12           48            36              0.53%
              6           8           48            40              1.33%
              8           6           48            42              2.48%
             12           4           48            44              5.85%
             24           2           48            46             24.47%
             48           1           48            47            100.00%

RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
              2          24           48           n/a                n/a
              3          16           48            16              0.01%
              4          12           48            24              0.02%
              6           8           48            32              0.12%
              8           6           48            36              0.32%
             12           4           48            40              1.27%
             24           2           48            44             11.70%
             48           1           48            46            100.00%

So, in terms of fault tolerance:
 - RAID 60 and Z2 always beat RAID 10, since they never risk data loss when only 2 disks fail.
 - RAID 10 always beats RAID 50 and Z, since it has the largest number of disk groups across which to spread the risk.
 - Having more parity groups increases fault tolerance but decreases usable diskspace.

That's all assuming each disk has an equal chance of failure, which is probably true since striping should distribute
the workload evenly.  And again, these probabilities are only describing the case where we don't have enough time
between disk failures to recover the array.

In terms of performance, I think RAID 10 should always be best for write speed.  (Since it doesn't calculate parity,
writing a new block doesn't require reading the rest of the RAID stripe just to recalculate the parity bits.)  I think
it's also normally just as fast for reading, since the controller can load-balance the pending read requests to both
sides of each mirror.

--------



From:
david@lang.hm
Date:

On Tue, 27 Mar 2007, Matt Smiley wrote:

> --------
>
> The goal is to calculate the probability of data loss when we loose a
> certain number of disks within a short timespan (e.g. loosing a 2nd disk
> before replacing+rebuilding the 1st one).  For RAID 10, 50, and Z, we
> will loose data if any disk group (i.e. mirror or parity-group) looses 2
> disks.  For RAID 60 and Z2, we will loose data if 3 disks die in the
> same parity group.  The parity groups can include arbitrarily many
> disks.  Having larger groups gives us more usable diskspace but less
> protection.  (Naturally we're more likely to loose 2 disks in a group of
> 50 than in a group of 5.)
>
>    g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
>    n = total number of disks
>    risk of loosing any 1 disk = 1/n

please explain why you are saying that the risk of losing any 1 disk is
1/n. shouldn't it be probability of failure * n instead?

>    risk of loosing 1 disk from a particular group = g/n
>    risk of loosing 2 disks in the same group = g/n * (g-1)/(n-1)
>    risk of loosing 3 disks in the same group = g/n * (g-1)/(n-1) * (g-2)/(n-2)

following this logic the risk of losing all 48 disks in a single group of
48 would be 100%

also what you are looking for is the probability of the second (and third)
disks failing in time X (where X is the time necessary to notice the
failure, get a replacement, and rebuild the disk)

the killer is the time needed to rebuild the disk, with multi-TB arrays
it's sometimes faster to re-initialize the array and reload from backup
than it is to do a live rebuild (the kernel.org servers had a raid failure
recently and HPA mentioned that it took a week to rebuild the array, but
it would have only taken a couple days to do a restore from backup)

add to this the fact that disk failures do not appear to be truly
independent from each other statistically (see the recent studies released
by google and cmu), and I wouldn't bother with single-parity for a
multi-TB array. If the data is easy to recreate (including from backup) or
short lived (say a database of log data that cycles every month or so) I
would just do RAID-0 and plan on losing the data on drive failure (this
assumes that you can afford the loss of service when this happens). if the
data is more important then I'd do dual-parity or more, along with a hot
spare so that the rebuild can start as soon as the first failure is
noticed by the system to give myself a fighting chance to save things.


> In terms of performance, I think RAID 10 should always be best for write
> speed.  (Since it doesn't calculate parity, writing a new block doesn't
> require reading the rest of the RAID stripe just to recalculate the
> parity bits.)  I think it's also normally just as fast for reading,
> since the controller can load-balance the pending read requests to both
> sides of each mirror.

this depends on your write pattern. if you are doing sequential writes
(say writing a log archive) then RAID 5 can be faster than RAID 10. since
there is no data there to begin with, the system doesn't have to read
anything to calculate the parity, and with the data spread across more
spindles you have a higher potential throughput.

if your write pattern is more random, and especially if you are
overwriting existing data, then the reads needed to calculate the parity
will slow you down.

as for read speed, it all depends on your access pattern and stripe size.
if you are reading data that spans disks (larger than your stripe size)
you end up with a single read tying up multiple spindles. with RAID 1
(and variants) you can read from either disk of the set if you need
different data within the same stripe that's on different disk tracks (if
it's on the same track you'll get it just as fast reading from a single
drive, or so close to it that it doesn't matter). beyond that the question
is how many spindles you can keep busy reading (as opposed to seeking to
new data or sitting idle because you don't need their data).

the worst case for reading is to be jumping through your data in strides
of stripe * number of available disks (accounting for RAID type), as all
your reads will end up hitting the same disk.

David Lang

From:
"Matt Smiley"
Date:

Hi David,

Thanks for your feedback!  I'm rather a newbie at this, and I do appreciate the critique.

First, let me correct myself: The formulas for the risk of losing data when you lose 2 and 3 disks shouldn't have
included the first term (g/n).  I'll give the corrected formulas and tables at the end of the email.


> please explain why you are saying that the risk of loosing any 1 disk is
> 1/n. shouldn't it be probability of failure * n instead?

1/n represents the assumption that all disks have an equal probability of being the next one to fail.  This seems like
a fair assumption in general for the active members of a stripe (not including hot spares).  A possible exception would
be the parity disks (because reads always skip them and writes always hit them), but that's only a consideration if the
RAID configuration used dedicated disks for parity instead of distributing it across the RAID 5/6 group members.  Apart
from that, whether the workload is write-heavy or read-heavy, sequential or scattered, the disks in the stripe ought to
handle a roughly equivalent number of iops over their lifetime.


> following this logic the risk of loosing all 48 disks in a single group of
> 48 would be 100%

Exactly.  Putting all disks in one group is RAID 0 -- no data protection.  If you lose even 1 active member of the
stripe, the probability of losing your data is 100%.


> also what you are looking for is the probability of the second (and third)
> disks failing in time X (where X is the time nessasary to notice the
> failure, get a replacement, and rebuild the disk)

Yep, that's exactly what I'm looking for.  That's why I said, "these probabilities are only describing the case where
we don't have enough time between disk failures to recover the array."  My goal wasn't to estimate how long time X is.
(It doesn't seem like a generalizable quantity; due partly to logistical and human factors, it's unique to each
operating environment.)  Instead, I start with the assumption that time X has been exceeded, and we've lost a 2nd (or
3rd) disk in the array.  Given that assumption, I wanted to show the probability that the loss of the 2nd disk has
caused the stripe to become unrecoverable.

We know that RAID 10 and 50 can tolerate the loss of anywhere between 1 and n/g disks, depending on how lucky you are.
I wanted to quantify the amount of luck required, as a risk management tool.  The duration of time X can be minimized
with hot spares and attentive administrators, but the risk after exceeding time X can only be minimized (as far as I
know) by configuring the RAID stripe with small enough underlying failure groups.


> the killer is the time needed to rebuild the disk, with multi-TB arrays
> is't sometimes faster to re-initialize the array and reload from backup
> then it is to do a live rebuild (the kernel.org servers had a raid failure
> recently and HPA mentioned that it took a week to rebuild the array, but
> it would have only taken a couple days to do a restore from backup)

That's very interesting.  I guess the rebuild time also would depend on how large the damaged failure group was.  Under
RAID 10, for example, I think you'd still only have to rebuild 1 disk from its mirror, regardless of how many other
disks were in the stripe, right?  So shortening the rebuild time may be another good motivation to keep the failure
groups small.


> add to this the fact that disk failures do not appear to be truely
> independant from each other statisticly (see the recent studies released
> by google and cmu), and I wouldn't bother with single-parity for a

I don't think I've seen the studies you mentioned.  Would you cite them please?  This may not be typical of everyone's
experience, but what I've seen during in-house load tests is an equal I/O rate for each disk in my stripe, using
short-duration sampling intervals to avoid long-term averaging effects.  This is what I expected to find, so I didn't
delve deeper.

Certainly it's true that some disks may be more heavily burdened than others for hours or days, but I wouldn't expect
any bias from an application-driven access pattern to persist for a significant fraction of a disk's lifespan.  The only
influence I'd expect to bias the cumulative I/O handled by a disk over its entire life would be its role in the RAID
configuration.  Hot spares will have minimal wear-and-tear until they're activated.  Dedicated parity disks will
probably live longer than data disks, unless the workload is very heavily oriented towards small writes (e.g. logging).


> multi-TB array. If the data is easy to recreate (including from backup) or
> short lived (say a database of log data that cycles every month or so) I
> would just do RAID-0 and plan on loosing the data on drive failure (this
> assumes that you can afford the loss of service when this happens). if the
> data is more important then I'd do dual-parity or more, along with a hot
> spare so that the rebuild can start as soon as the first failure is
> noticed by the system to give myself a fighting chance to save things.

That sounds like a fine plan.  In my case, downtime is unacceptable (which is, of course, why I'm interested in
quantifying the probabilities of data loss).


Here are the corrected formulas:

Let:
   g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
   n = total number of disks
   risk of losing any 1 disk = 1/n
Then we have:
   risk of losing 1 disk from a particular group = g/n
   risk of losing 2 disks in the same group = (g-1)/(n-1)
   risk of losing 3 disks in the same group = (g-1)/(n-1) * (g-2)/(n-2)

For the x4500, we have 48 disks.  If we stripe our data across all those disks, then these are our configuration
options:

RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
             2          24           48            24              2.13%
             3          16           48            32              4.26%
             4          12           48            36              6.38%
             6           8           48            40             10.64%
             8           6           48            42             14.89%
            12           4           48            44             23.40%
            16           3           48            45             31.91%
            24           2           48            46             48.94%
            48           1           48            47            100.00%

RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
             2          24           48           n/a                n/a
             3          16           48            16              0.09%
             4          12           48            24              0.28%
             6           8           48            32              0.93%
             8           6           48            36              1.94%
            12           4           48            40              5.09%
            16           3           48            42              9.71%
            24           2           48            44             23.40%
            48           1           48            46            100.00%
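The tables above follow directly from the formulas; a short Python sketch (assuming one redundant disk per group for RAID 10/50 and two per group for RAID 60/Z2, as the usable_disks columns imply) reproduces the risk figures:

```python
# Reproduce the risk tables above, conditioned on 2 (or 3) disks having
# already failed with none replaced.  n = 48 disks total in the x4500.
n = 48

def risk_two_lost(g, n):
    """Chance the 2nd failed disk lands in the same group as the 1st."""
    return (g - 1) / (n - 1)

def risk_three_lost(g, n):
    """Chance the 2nd and 3rd failed disks both land in the 1st disk's group."""
    return (g - 1) / (n - 1) * (g - 2) / (n - 2)

for g in (2, 3, 4, 6, 8, 12, 16, 24, 48):
    groups = n // g
    usable_single = n - groups       # RAID 10/50: 1 redundant disk per group
    usable_double = n - 2 * groups   # RAID 60/Z2: 2 redundant disks per group
    print(f"g={g:2d}  groups={groups:2d}  "
          f"single: {usable_single:2d} usable, {risk_two_lost(g, n):7.2%} risk  "
          f"double: {usable_double:2d} usable, {risk_three_lost(g, n):7.2%} risk")
```

For g=2 the double-parity column prints 0 usable disks and 0% risk, matching the "n/a" rows in the table.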



From:
david@lang.hm
Date:

On Thu, 29 Mar 2007, Matt Smiley wrote:

> Hi David,
>
> Thanks for your feedback!  I'm rather a newbie at this, and I do appreciate the critique.
>
> First, let me correct myself: The formulas for the risk of losing data when you lose 2 and 3 disks shouldn't have
> included the first term (g/n).  I'll give the corrected formulas and tables at the end of the email.
>
>
>> please explain why you are saying that the risk of losing any 1 disk is
>> 1/n. shouldn't it be probability of failure * n instead?
>
> 1/n represents the assumption that all disks have an equal probability of being the next one to fail.  This seems
> like a fair assumption in general for the active members of a stripe (not including hot spares).  A possible exception
> would be the parity disks (because reads always skip them and writes always hit them), but that's only a consideration
> if the RAID configuration used dedicated disks for parity instead of distributing it across the RAID 5/6 group members.
> Apart from that, whether the workload is write-heavy or read-heavy, sequential or scattered, the disks in the stripe
> ought to handle a roughly equivalent number of iops over their lifetime.
>

only assuming that you have a 100% chance of some disk failing. if you
have 15 disks in one array and 60 disks in another array, the chances of
having _some_ failure in the 15 disk array are only 1/4 the chance of
having a failure of _some_ disk in the 60 disk array.

>
>> following this logic the risk of losing all 48 disks in a single group of
>> 48 would be 100%
>
> Exactly.  Putting all disks in one group is RAID 0 -- no data protection.  If you lose even 1 active member of the
> stripe, the probability of losing your data is 100%.

but by your math, the chance of failure with dual parity in a single 48
disk group was also 100%; this is just wrong.

>
>> also what you are looking for is the probability of the second (and third)
>> disks failing in time X (where X is the time necessary to notice the
>> failure, get a replacement, and rebuild the disk)
>
> Yep, that's exactly what I'm looking for.  That's why I said, "these
> probabilities are only describing the case where we don't have enough
> time between disk failures to recover the array."  My goal wasn't to
> estimate how long time X is.  (It doesn't seem like a generalizable
> quantity; due partly to logistical and human factors, it's unique to
> each operating environment.)  Instead, I start with the assumption that
> time X has been exceeded, and we've lost a 2nd (or 3rd) disk in the
> array.  Given that assumption, I wanted to show the probability that the
> loss of the 2nd disk has caused the stripe to become unrecoverable.

Ok, this is the chance that, if you lose N disks without replacing any of
them, you lose data in the different arrays.

> We know that RAID 10 and 50 can tolerate the loss of anywhere between 1
> and n/g disks, depending on how lucky you are.  I wanted to quantify the
> amount of luck required, as a risk management tool.  The duration of
> time X can be minimized with hot spares and attentive administrators,
> but the risk after exceeding time X can only be minimized (as far as I
> know) by configuring the RAID stripe with small enough underlying
> failure groups.

but I don't think this is the question anyone is really asking.

what people want to know isn't 'how many disks can I lose without
replacing them before I lose data' what they want to know is 'with this
configuration (including a drive replacement time of Y for the first N
drives and Z for drives after that), what are the odds of losing data'

and for the second question the chance of failure of additional disks
isn't 100%.

>
>> the killer is the time needed to rebuild the disk, with multi-TB arrays
>> it's sometimes faster to re-initialize the array and reload from backup
>> than it is to do a live rebuild (the kernel.org servers had a raid failure
>> recently and HPA mentioned that it took a week to rebuild the array, but
>> it would have only taken a couple days to do a restore from backup)
>
> That's very interesting.  I guess the rebuild time also would depend on
> how large the damaged failure group was.  Under RAID 10, for example, I
> think you'd still only have to rebuild 1 disk from its mirror,
> regardless of how many other disks were in the stripe, right?  So
> shortening the rebuild time may be another good motivation to keep the
> failure groups small.
>

correct, however you have to decide how much this speed is worth to you.
if you are building a ~20TB array you can do this with ~30 drives with
single or dual parity, or ~60 drives with RAID 10.

remember the big cost of arrays like this isn't even the cost of the
drives (although you are talking an extra $20,000 or so there), but the
cost of the power and cooling to run all those extra drives

>> add to this the fact that disk failures do not appear to be truly
>> independent of each other statistically (see the recent studies released
>> by google and cmu), and I wouldn't bother with single-parity for a
>
> I don't think I've seen the studies you mentioned.  Would you cite them
> please?

http://labs.google.com/papers/disk_failures.pdf

http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html

> This may not be typical of everyone's experience, but what I've
> seen during in-house load tests is an equal I/O rate for each disk in my
> stripe, using short-duration sampling intervals to avoid long-term
> averaging effects.  This is what I expected to find, so I didn't delve
> deeper.
>
> Certainly it's true that some disks may be more heavily burdened than
> others for hours or days, but I wouldn't expect any bias from an
> application-driven access pattern to persist for a significant fraction
> of a disk's lifespan.  The only influence I'd expect to bias the
> cumulative I/O handled by a disk over its entire life would be its role
> in the RAID configuration.  Hot spares will have minimal wear-and-tear
> until they're activated.  Dedicated parity disks will probably live
> longer than data disks, unless the workload is very heavily oriented
> towards small writes (e.g. logging).
>
>
>> multi-TB array. If the data is easy to recreate (including from backup) or
>> short lived (say a database of log data that cycles every month or so) I
>> would just do RAID-0 and plan on losing the data on drive failure (this
>> assumes that you can afford the loss of service when this happens). if the
>> data is more important then I'd do dual-parity or more, along with a hot
>> spare so that the rebuild can start as soon as the first failure is
>> noticed by the system to give myself a fighting chance to save things.
>
> That sounds like a fine plan.  In my case, downtime is unacceptable
> (which is, of course, why I'm interested in quantifying the
> probabilities of data loss).
>
>
> Here are the corrected formulas:
>
> Let:
>   g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
>   n = total number of disks
>   risk of losing any 1 disk = 1/n
> Then we have:
>   risk of losing 1 disk from a particular group = g/n

assuming you lose one disk

>   risk of losing 2 disks in the same group = (g-1)/(n-1)

assuming that you lose two disks without replacing either one (including
not having a hot-spare)

>   risk of losing 3 disks in the same group = (g-1)/(n-1) * (g-2)/(n-2)

assuming that you lose three disks without replacing any of them
(including not having a hot spare)

> For the x4500, we have 48 disks.  If we stripe our data across all those
> disks, then these are our configuration options:

> RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48            24              2.13%
>             3          16           48            32              4.26%
>             4          12           48            36              6.38%
>             6           8           48            40             10.64%
>             8           6           48            42             14.89%
>            12           4           48            44             23.40%
>            16           3           48            45             31.91%
>            24           2           48            46             48.94%
>            48           1           48            47            100.00%

however, back in the real world, the chances of losing three disks are
considerably less than the chance of losing two disks. so to compare
apples to apples you need to add the following:

chance of data loss if using double-parity: 0% in all configurations.

> RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48           n/a                n/a
>             3          16           48            16              0.09%
>             4          12           48            24              0.28%
>             6           8           48            32              0.93%
>             8           6           48            36              1.94%
>            12           4           48            40              5.09%
>            16           3           48            42              9.71%
>            24           2           48            44             23.40%
>            48           1           48            46            100.00%

again, to compare apples to apples you would need to add the following
(calculating the odds for each group, they will be scarily larger than
the 2-drive failure chart)

> RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48            24
>             3          16           48            32
>             4          12           48            36
>             6           8           48            40
>             8           6           48            42
>            12           4           48            44
>            16           3           48            45
>            24           2           48            46
>            48           1           48            47

however, since it's easy to add a hot-spare drive, you really need to
account for it. there's still a chance of enough drives going bad before
the hot-spare can be rebuilt to replace the first one, but it's a lot lower
than if you don't have a hot-spare and require the admins to notice and
replace the failed disk.

if you say that there is a 10% chance of a disk failing each year
(significantly higher than the studies listed above, but close enough)
then this works out to ~0.001% chance of a drive failing per hour (a
reasonably round number to work with)

to write 750G at ~45MB/sec takes 5 hours of 100% system throughput, or ~50
hours at 10% of the system throughput (background rebuilding)

if we cut this in half to account for inefficiencies in retrieving data
from other disks to calculate parity, it can take 100 hours (just over
four days) to do a background rebuild, or about a 0.1% chance of each
individual disk failing during the rebuild. with 48 drives this is a ~5%
chance of losing a second disk (and everything, with single-parity),
however the odds of losing two more disks during this time are ~0.25%,
so double-parity is _well_ worth it.
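That back-of-envelope chain can be retraced in a few lines of Python. The inputs (10%/year failure rate, 750 GB disks, ~45 MB/s, 10% of throughput for background rebuild, 2x overhead for parity reads) all come from the paragraphs above; the figures are deliberately round, so this only reproduces them approximately:

```python
# Rough sketch of the rebuild-window arithmetic above.
annual_failure_rate = 0.10
per_hour = annual_failure_rate / (365 * 24)   # ~0.0000114, i.e. ~0.001%/hour

full_speed_hours = 750_000 / 45 / 3600        # 750 GB at 45 MB/s: ~4.6 h at 100%
rebuild_hours = full_speed_hours * 10 * 2     # ~93 h: 10% throughput, 2x parity overhead

p_disk = per_hour * rebuild_hours             # ~0.1% per surviving disk during rebuild
p_single_parity = 47 * p_disk                 # ~5%: any 2nd failure among 47 survivors
p_double_parity = 47 * 46 * p_disk ** 2       # ~0.25%: need 2 more failures

print(f"per-disk: {p_disk:.4%}  single-parity: {p_single_parity:.2%}  "
      f"double-parity: {p_double_parity:.3%}")
```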

chance of losing data before the hotspare is finished rebuilding (assumes one
hotspare per group, you may be able to share a hotspare between multiple
groups to get slightly higher capacity)

> RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48           n/a                n/a
>             3          16           48           n/a         (0.0001% with manual replacement of drive)
>             4          12           48            12         0.0009%
>             6           8           48            24         0.003%
>             8           6           48            30         0.006%
>            12           4           48            36         0.02%
>            16           3           48            39         0.03%
>            24           2           48            42         0.06%
>            48           1           48            45         0.25%

> RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48            n/a        (~0.1% with manual replacement of drive)
>             3          16           48            16         0.2%
>             4          12           48            24         0.3%
>             6           8           48            32         0.5%
>             8           6           48            36         0.8%
>            12           4           48            40         1.3%
>            16           3           48            42         1.7%
>            24           2           48            44         2.5%
>            48           1           48            46         5%
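The per-group risks in these last two tables appear to follow (g-1)*p for single parity and (g-1)*(g-2)*p^2 for double parity, with p = 0.1% as the per-disk chance of failing during the rebuild window; that formula is my inference from the figures, not stated in the thread, and it matches the (rounded) table entries only approximately:

```python
# Per-group risk during a hot-spare rebuild, assuming each surviving
# group member has p = 0.1% chance of failing before the rebuild completes.
p = 0.001

def single_parity_risk(g):
    """Any 1 of the g-1 surviving group members failing loses data."""
    return (g - 1) * p

def double_parity_risk(g):
    """Roughly: 2 of the g-1 surviving group members failing loses data."""
    return (g - 1) * (g - 2) * p ** 2

for g in (3, 4, 6, 8, 12, 16, 24, 48):
    print(f"g={g:2d}  single-parity: {single_parity_risk(g):.2%}  "
          f"double-parity: {double_parity_risk(g):.4%}")
```

For example, g=48 gives 4.7% single-parity (table: 5%) and 0.216% double-parity (table: 0.25%).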

so if I've done the math correctly, the odds of losing data with the
worst-case double-parity (one large array including hotspare) are about
the same as the best case single parity (mirror + hotspare), but with
almost triple the capacity.

David Lang