Re: SCSI vs SATA - Mailing list pgsql-performance

From	david@lang.hm
Subject	Re: SCSI vs SATA
Date	April 7, 2007 18:37:21
Msg-id	Pine.LNX.4.64.0704071333350.28411@asgard.lang.hm
In response to	Re: SCSI vs SATA (Ron <rjpeace@earthlink.net>)
Responses	Re: SCSI vs SATA
List	pgsql-performance
On Sat, 7 Apr 2007, Ron wrote:

> The reality is that all modern HDs are so good that it's actually quite rare
> for someone to suffer a data loss event.  The consequences of such are so
> severe that the event stands out more than just the statistics would imply.
> For those using small numbers of HDs, HDs just work.
>
> OTOH, for those of us doing work that involves DBMSs and relatively large
> numbers of HDs per system, both the math and the RW conditions of service
> require us to pay more attention to quality details.
> Like many things, one can decide on one of multiple ways to "pay the piper".
>
> a= The choice made by many, for instance in the studies mentioned, is to
> minimize initial acquisition cost and operating overhead and simply accept
> having to replace HDs more often.
>
> b= For those in fields where this is not a reasonable option (financial
> services, health care, etc), or for those literally using 100's of HD per
> system (where statistical failure rates are so likely that TLC is required),
> policies and procedures like those mentioned in this thread (paying close
> attention to environment and use factors, sector remap detecting, rotating
> HDs into and out of roles based on age, etc) are necessary.
>
> Anyone who does some close variation of "b" directly above =will= see the
> benefits of using better HDs.
>
> At least in my supposedly unqualified anecdotal 25 years of professional
> experience.

Ron, why is it that you assume that anyone who disagrees with you doesn't
work in an environment where they care about the datacenter environment,
and isn't in a field like financial services? and why do you think that we
are just trying to save a few pennies? (the costs do factor in, but it's
not a matter of pennies, it's a matter of tens of thousands of dollars)

I actually work in the financial services field, and I do have a good
datacenter environment that's well cared for.

while I don't personally maintain machines with hundreds of drives each, I
do maintain hundreds of machines with a small number of drives in each,
and a handful of machines with a few dozen drives. (the database
machines are maintained by others, but I do see their failed drives)

it's also true that my experience is only over the last 10 years, so I've
only been working with a few generations of drives, but my experience is
different from yours.

my experience is that until the drives get to be 5+ years old the failure
rate seems to be about the same for the 'cheap' drives as for the 'good'
drives. I won't say that they are exactly the same, but they are close
enough that I don't believe that there is a significant difference.

in other words, these studies do seem to match my experience.

this is why, when I recently had to create some large capacity arrays, I
ended up with machines with only a few dozen drives in them instead of
hundreds. I've got two machines with 6TB of disk, one with 8TB, one with
10TB, and one with 20TB. I'm building these systems for ~$1K/TB for the
disk arrays. other departments who choose $bigname 'enterprise' disk
arrays are routinely paying 50x that price.

I am very sure that they are not getting 50x the reliability; I'm sure
that they aren't even getting 2x the reliability.

I believe that the biggest cause of data loss for people using the
'cheap' drives is that one 'cheap' drive holds the capacity of 5 or so
'expensive' drives, and since people don't realize this they don't
realize that the time to rebuild the failed drive onto a hot-spare is
correspondingly longer.

in the thread 'Sunfire X4500 recommendations' we recently had a discussion
on this topic, started by a guy who was asking the best way to configure
the drives in his Sun X4500 (48 drive) system for safety. in that
discussion I took some numbers from the CMU study and as a working figure
I said a 10% chance for a drive to fail in a year (the study said 5-7% in
most cases, but some third year drives were around 10%). combining this
with the time needed to write 750G using ~10% of the system's capacity
results in a rebuild time of about 5 days. it turns out that there is
almost a 5% chance of a second drive failing in a 48 drive array in this
time. If I were to build a single array with 142G 'enterprise' drives
instead of with 750G 'cheap' drives the rebuild time would be only 1 day
instead of 5, but you would have ~250 drives instead of 48 and so your
chance of a problem would be the same (I acknowledge that you are unlikely
to use 250 drives in a single array, and yes that does help, however if
you had 5 arrays of 50 drives each you would still have a 1% chance of a
second failure in each array)
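
The trade-off in the paragraph above can be sketched as a few lines of
Python. This is only a rough illustration: it assumes independent failures
at a constant hourly rate, uses the rounded working figure of ~0.001% per
drive per hour, and the function and variable names are made up for the
example. The results land near the 5% and 1% figures quoted above rather
than matching them exactly.

```python
# Working figure from the post: ~0.001% chance of a given drive failing
# per hour (roughly 10% per year spread over the hours in a year).
HOURLY_FAILURE_RATE = 0.00001


def second_failure_chance(drives_in_array: int, rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild.

    Assumes independent failures at a constant hourly rate, which is a
    simplification of how real drives behave.
    """
    survivors = drives_in_array - 1
    per_drive = HOURLY_FAILURE_RATE * rebuild_hours
    return 1.0 - (1.0 - per_drive) ** survivors


# (number of drives in the array, rebuild time in hours)
scenarios = {
    "48 x 750G, 5-day rebuild": (48, 5 * 24),
    "250 x 142G, 1-day rebuild": (250, 24),
    "50 x 142G, 1-day rebuild": (50, 24),
}

for label, (drives, hours) in scenarios.items():
    print(f"{label}: {second_failure_chance(drives, hours):.1%}")
```

The first two scenarios come out within a fraction of a percentage point of
each other, which is the point being made: fewer, larger drives with a long
rebuild carry about the same exposure as many small drives with a short one.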

when I look at these numbers, my reaction isn't that it's wrong to go with
the 'cheap' drives, my reaction is that single redundancy isn't good
enough. depending on how valuable the data is, you need to either replicate
the data to another system, or go with dual-parity redundancy (or both)

while drives probably won't be this bad in real life (this is, after all,
slightly worse than the studies show for their 3rd year drives, and
'enterprise' drives may be slightly better), I have to assume that they
will be for my reliability planning.

also, if you read through the CMU study, drive failures were only a small
percentage of system outages (16-25% depending on the site). you have to
make sure that you aren't so fixated on drive reliability that you fail to
account for other types of problems (down to and including the chance of
someone accidentally powering down the rack that you are plugged into,
whether by hitting a power switch or by overloading a weak circuit breaker)

In looking at these problems overall I find that in most cases I need to
have redundant systems with the data replicated anyway (with logs sent
elsewhere), so I can get away with building failover pairs instead of
having each machine with redundant drives. I've found that I can
frequently get a pair of machines for less money than other departments
spend on buying a single 'enterprise' machine with the same specs
(although the prices are dropping enough on the top-tier manufacturers
that this is less true today than it was a couple of years ago), and I
find that the failure rate is about the same on a per-machine basis, so I
end up with a much better uptime record due to having the redundancy of
the second full system (never mind things like it being easier to do
upgrades, as I can work on the inactive machine and then fail over to work
on the other, now inactive, machine). while I could ask for the budget to
be doubled to provide the same redundancy with the top-tier manufacturers,
I don't do so for several reasons, the top two being that these
manufacturers frequently won't configure a machine the way I want it
(just try to get a box with writeable media built in, either a floppy or a
CDR/DVDR; they want you to use something external), and that doing so also
exposes me to people second guessing me on where redundancy is needed
('that's only development, we don't need redundancy there', until a system
goes down for a day and the entire department is unable to work)

it's not that the people who disagree with you don't care about their
data, it's that they have different experiences than you do (experiences
that come close to matching the studies where they tracked hundreds of
thousands of drives of different types), and as a result believe that the
difference (if any) between the different types of drives isn't
significant in the overall failure rate (especially when you take the
difference of drive capacity into account)

David Lang

P.S. here is a chart from that thread showing the chances of losing data
with different array configurations.

if you say that there is a 10% chance of a disk failing each year
(significantly higher than the studies listed above, but close enough)
then this works out to ~0.001% chance of a drive failing per hour (a
reasonably round number to work with)

to write 750G at ~45MB/sec takes 5 hours of 100% system throughput, or ~50
hours at 10% of the system throughput (background rebuilding)

if we cut the throughput in half to account for inefficiencies in
retrieving data from other disks to calculate parity it can take 100 hours
(just over four days) to do a background rebuild, which gives each
remaining disk about a 0.1% chance of failing during the rebuild. with 48
drives this is ~5% chance of losing everything with single-parity, however
the odds of losing two more disks during this time are .25% so
double-parity is _well_ worth it.
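
The chain of estimates above can be written out step by step. This is a
minimal sketch, not a reliability model: it assumes independent failures
at a constant rate and decimal units (750G = 750,000 MB), and the variable
names are illustrative. It uses unrounded intermediate values, so the
results come out close to, but not exactly at, the rounded figures in the
text.

```python
ANNUAL_FAILURE_RATE = 0.10      # working figure from the post
HOURS_PER_YEAR = 365 * 24
DRIVE_SIZE_MB = 750 * 1000      # 750G drive, decimal units assumed
WRITE_SPEED_MB_S = 45           # ~45MB/sec sustained write
BACKGROUND_SHARE = 0.10         # rebuild gets ~10% of system throughput
PARITY_EFFICIENCY = 0.5         # throughput halved for parity reads
TOTAL_DRIVES = 48

# ~0.001% chance of a given drive failing in any one hour
hourly_rate = ANNUAL_FAILURE_RATE / HOURS_PER_YEAR

# ~5 hours to write the whole drive at full speed
full_speed_hours = DRIVE_SIZE_MB / WRITE_SPEED_MB_S / 3600

# ~100 hours as a background rebuild with parity overhead
rebuild_hours = full_speed_hours / BACKGROUND_SHARE / PARITY_EFFICIENCY

# ~0.1% chance that a particular remaining disk fails during the rebuild
per_disk = hourly_rate * rebuild_hours

# single parity: data is lost if any of the other 47 disks fails
single_parity_loss = 1 - (1 - per_disk) ** (TOTAL_DRIVES - 1)

# double parity: a further disk out of the remaining 46 must also fail
double_parity_loss = single_parity_loss * (
    1 - (1 - per_disk) ** (TOTAL_DRIVES - 2)
)

print(f"hourly failure rate: {hourly_rate:.5%}")
print(f"rebuild time:        {rebuild_hours:.0f} hours")
print(f"single-parity loss:  {single_parity_loss:.2%}")
print(f"double-parity loss:  {double_parity_loss:.2%}")
```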

chance of losing data before hotspare is finished rebuilding (assumes one
hotspare per group, you may be able to share a hotspare between multiple
groups to get slightly higher capacity)

> RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48           n/a                n/a
>             3          16           48           n/a         (0.0001% with manual replacement of drive)
>             4          12           48            12         0.0009%
>             6           8           48            24         0.003%
>             8           6           48            30         0.006%
>            12           4           48            36         0.02%
>            16           3           48            39         0.03%
>            24           2           48            42         0.06%
>            48           1           48            45         0.25%

> RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48            n/a        (~0.1% with manual replacement of drive)
>             3          16           48            16         0.2%
>             4          12           48            24         0.3%
>             6           8           48            32         0.5%
>             8           6           48            36         0.8%
>            12           4           48            40         1.3%
>            16           3           48            42         1.7%
>            24           2           48            44         2.5%
>            48           1           48            46         5%

so if I've done the math correctly the odds of losing data with the
worst-case double-parity (one large array including hotspare) are about
the same as the best case single parity (mirror + hotspare), but with
almost triple the capacity.
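
The usable_disks columns and the capacity comparison can be checked with a
short sketch. It assumes, as the tables do, 48 disks split into equal
groups with one hotspare per group; the function name and parameters are
made up for the example. It does not try to reproduce the
risk_of_data_loss columns.

```python
TOTAL_DISKS = 48


def usable_disks(disks_per_group: int, parity_disks: int, hotspares: int = 1) -> int:
    """Data disks left after removing parity and hotspare disks per group."""
    groups = TOTAL_DISKS // disks_per_group
    data_per_group = disks_per_group - parity_disks - hotspares
    return groups * data_per_group


# Double parity (RAID 60 / Z2): 2 parity disks + 1 hotspare per group
double = {n: usable_disks(n, parity_disks=2) for n in (4, 6, 8, 12, 16, 24, 48)}

# Single parity (RAID 50): 1 parity disk + 1 hotspare per group
single = {n: usable_disks(n, parity_disks=1) for n in (3, 4, 6, 8, 12, 16, 24, 48)}

print("double parity:", double)
print("single parity:", single)

# One 48-disk double-parity group vs. sixteen 3-disk single-redundancy groups
ratio = double[48] / single[3]
print(f"capacity ratio: {ratio:.1f}x")
```

The ratio compares the two rows named in the closing paragraph: 45 usable
disks against 16, a bit under 3x.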