Thread: Fwd: Re: SSDD reliability
> No problem with that, for a first step. ***BUT*** the failures in this
> article and many others I've read about are not in high-write db
> workloads, so they're not write wear, they're just crappy electronics
> failing.

As a (lapsed) electronics design engineer, I'm suspicious of the notion that a subassembly consisting of solid-state devices surface-mounted on an FR4 substrate will fail except in very rare circumstances (and ones of great interest to the manufacturer). And I'm especially suspicious that one product category (SSDs) happens to have a much higher failure rate than all others.

Consider that an SSD is much simpler (just considering the electronics) than a traditional disk drive, and subject to less vibration and heat. One should therefore expect to see disk drives failing at the same (or a higher) rate. Even if the owner is highly statically charged, you'd expect them to destroy all categories of electronics at roughly the same rate, rather than just SSDs.

So if someone says that SSDs have "failed", I'll assume that they suffered from Flash cell wear-out unless there is compelling proof to the contrary.
On 05/04/2011 03:24 PM, David Boreham wrote:
> So if someone says that SSDs have "failed", I'll assume that they
> suffered from Flash cell wear-out unless there is compelling proof
> to the contrary.

I've been involved in four recovery situations similar to the one described in that Coding Horror article, and zero of them were flash wear-out issues. The telling sign is that the device should fail to read-only mode if it wears out. That's not what I've seen happen, though; what reports from the field are saying is that sudden, complete failures are the more likely event.

The environment inside a PC of any sort, desktop or particularly portable, is not a predictable one. Just because the drives should be less prone to heat and vibration issues doesn't mean individual components can't slide out of spec because of them. And hard drive manufacturers have a giant head start at working out reliability bugs in that area. You can't design that sort of issue out of a new product in advance; all you can do is analyze returns from the field, see what you screwed up, and do another design rev to address it. The idea that these new devices, which are extremely complicated and based on hardware that hasn't been manufactured in volume before, should be expected to have high reliability is an odd claim. I assume that any new electronics gadget has an extremely high failure rate during its first few years of volume production, particularly from a new manufacturer of that product.

Intel claims the Annual Failure Rate (AFR) on their SSDs in IT deployments (not OEM ones) is 0.6%. Typical measured AFR rates for mechanical drives are around 2% during their first year, spiking to 5% afterwards. I suspect that Intel's numbers are actually much better than the other manufacturers' here, so an SSD from anyone else can easily be less reliable than a regular hard drive still.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On 5/4/2011 6:02 PM, Greg Smith wrote:
> On 05/04/2011 03:24 PM, David Boreham wrote:
>> So if someone says that SSDs have "failed", I'll assume that they
>> suffered from Flash cell wear-out unless there is compelling proof
>> to the contrary.
>
> I've been involved in four recovery situations similar to the one
> described in that Coding Horror article, and zero of them were flash
> wear-out issues. The telling sign is that the device should fail to
> read-only mode if it wears out. That's not what I've seen happen,
> though; what reports from the field are saying is that sudden,
> complete failures are the more likely event.

Sorry to harp on this (last time, I promise), but I somewhat do know what I'm talking about, and I'm quite motivated to get to the bottom of this "SSDs fail, but not for the reason you'd suspect" syndrome (because we want to deploy SSDs in production soon).

Here's my best theory at present: the failures ARE caused by cell wear-out, but the SSD firmware is buggy insofar as it fails to boot up and respond to host commands due to the wear-out state. So rather than the expected outcome (the SSD responds but has read-only behavior), it appears to be (and is) dead. At least to my mind, this is a more plausible explanation for the reported failures than the alternative (SSD vendors are uniquely clueless at making basic electronics subassemblies), especially considering the difficulty of testing the firmware under all possible wear-out conditions.

One question worth asking is: in the cases you were involved in, was manufacturer failure analysis performed (and if so, what was the failure cause reported)?

> The environment inside a PC of any sort, desktop or particularly
> portable, is not a predictable one. Just because the drives should be
> less prone to heat and vibration issues doesn't mean individual
> components can't slide out of spec because of them. And hard drive
> manufacturers have a giant head start at working out reliability bugs
> in that area. You can't design that sort of issue out of a new product
> in advance; all you can do is analyze returns from the field, see what
> you screwed up, and do another design rev to address it.

That's not really how it works (I was the guy responsible for this for 10 years in a prior career, so I feel somewhat qualified to argue about it). The technology and manufacturing processes are common across many different types of product. They either all work, or they all fail. In fact, I'll eat my keyboard if SSDs are not manufactured on the exact same production lines as regular disk drives, DRAM modules, and so on (manufacturing tends to be contracted to high-volume factories that make all kinds of things on the same lines). The only different thing about SSDs vs. any other electronics you'd come across is the Flash devices themselves. However, those are used in extraordinarily high volumes all over the place, and if there were a failure mode with the incidence suggested by these stories, I suspect we'd be reading about it on the front page of the WSJ.

> Intel claims the Annual Failure Rate (AFR) on their SSDs in IT
> deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
> mechanical drives are around 2% during their first year, spiking to 5%
> afterwards. I suspect that Intel's numbers are actually much better
> than the other manufacturers' here, so an SSD from anyone else can
> easily be less reliable than a regular hard drive still.

Hmm, this is speculation I don't support (that non-Intel vendors have a 10x worse early failure rate). The entire industry uses very similar processes (often the same factories). One rogue vendor with a bad process...sure, but all of them??

For the benefit of anyone reading this who may have a failed SSD: all the tier 1 manufacturers have departments dedicated to the analysis of product that fails in the field. With some persistence, you can usually get them to take a failed unit and put it through the FA process (and tell you why it failed). For example, here's a job posting for someone who would do this work:
http://www.internmatch.com/internships/4620/intel/ssd-failure-analysis-intern-592345

I'd encourage you to at least try to get your failed devices into the failure analysis pile. If units are not returned, the manufacturer never finds out what broke, and therefore can't fix the problem.
On Wed, May 4, 2011 at 6:31 PM, David Boreham <david_list@boreham.org> wrote:
> The technology and manufacturing processes are common across many
> different types of product. They either all work, or they all fail.

Most of it is. But certain parts are fairly new, i.e. the controllers. It is quite possible that all these various failing drives share some long-term (~1 year) degradation issue, like the 3Gb/s SATA ports on the early Sandy Bridge chipsets. If that's the case, then the just plain up and dying thing makes some sense.
On 5/4/2011 9:06 PM, Scott Marlowe wrote:
> Most of it is. But certain parts are fairly new, i.e. the controllers.
> It is quite possible that all these various failing drives share some
> long-term (~1 year) degradation issue, like the 3Gb/s SATA ports on
> the early Sandy Bridge chipsets. If that's the case, then the just
> plain up and dying thing makes some sense.

That Intel SATA port circuit issue was an extraordinarily rare screwup.

So ok, yeah...I said that chips don't just keel over and die mid-life and you came up with the one counterexample in the history of the industry :) When I worked in the business in the '80s and '90s we had a few things like this happen, but they're very rare and typically don't escape into the wild (as Intel's pretty much didn't). If a similar problem affected SSDs, they would have been recalled and lawsuits would be underway.

SSDs are just not that different from anything else. No special voodoo technology (besides the Flash devices themselves).
On Wed, May 4, 2011 at 9:34 PM, David Boreham <david_list@boreham.org> wrote:
> That Intel SATA port circuit issue was an extraordinarily rare screwup.
>
> So ok, yeah...I said that chips don't just keel over and die mid-life
> and you came up with the one counterexample in the history of
> the industry :) When I worked in the business in the '80s and '90s
> we had a few things like this happen, but they're very rare and
> typically don't escape into the wild (as Intel's pretty much didn't).
> If a similar problem affected SSDs, they would have been recalled
> and lawsuits would be underway.

Not necessarily. If there's a chip that has a 15% failure rate instead of the predicted <1%, it might not fail often enough for people to have noticed, since a user with a typically small sample might think he just got a bit unlucky, etc. Nvidia made GPUs that overheated and died by the thousands, but took 1 to 2 years to die. There WAS a lawsuit, and now, to settle it, they're offering to buy everybody who got stuck with the broken GPUs a nice single-core $279 Compaq computer, even if they bought a $4,000 workstation with one of those dodgy GPUs.

There are a lot of possibilities as to why some folks are seeing high failure rates; it'd be nice to know the cause. But we can't assume it's not an inherent problem with some part in them any more than we can assume that it is.
* Greg Smith:

> Intel claims the Annual Failure Rate (AFR) on their SSDs in IT
> deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
> mechanical drives are around 2% during their first year, spiking to 5%
> afterwards. I suspect that Intel's numbers are actually much better
> than the other manufacturers' here, so an SSD from anyone else can
> easily be less reliable than a regular hard drive still.

I'm a bit concerned with usage-dependent failures. Presumably, two SSDs in a RAID-1 configuration are worn down in the same way, and it would be rather inconvenient if they failed at the same point. With hard disks, this doesn't seem to happen; even bad batches fail pretty much randomly.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On 5/5/2011 2:36 AM, Florian Weimer wrote:
> I'm a bit concerned with usage-dependent failures. Presumably, two SSDs
> in a RAID-1 configuration are worn down in the same way, and it would
> be rather inconvenient if they failed at the same point. With hard
> disks, this doesn't seem to happen; even bad batches fail pretty much
> randomly.

FWIW, this _can_ happen with traditional drives: we had a bunch of WD 300G VelociRaptor drives with a firmware bug related to a 32-bit counter roll-over. It hit all the drives in a machine at exactly the same time (because the counter counted from power-up). Needless to say, this was quite frustrating!
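A rough illustration of how a shared counter roll-over lines up across drives (the post doesn't say what the WD counter measured; a 32-bit milliseconds-since-power-up counter is assumed here purely for the arithmetic):

```python
# Hypothetical: assumes a 32-bit milliseconds-since-power-up counter; the
# thread does not state the actual units or resolution of the WD firmware bug.
MS_PER_DAY = 24 * 60 * 60 * 1000

rollover_days = 2 ** 32 / MS_PER_DAY
print(f"Counter wraps after {rollover_days:.1f} days of uptime")  # ~49.7 days
# Drives powered up together reach the wrap at the same instant, which is
# how every disk in a machine can misbehave at exactly the same time.
```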
On May 4, 2011, at 9:34 PM, David Boreham wrote:
> So ok, yeah...I said that chips don't just keel over and die mid-life
> and you came up with the one counterexample in the history of
> the industry

Actually, any of us who really tried could probably come up with a dozen examples--more if we've been around for a while. Original design cutting corners on power regulation; final manufacturers cutting corners on specs; component manufacturers cutting corners on specs or selling outright counterfeit parts...

-- 
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On 5/5/2011 8:04 AM, Scott Ribe wrote:
> Actually, any of us who really tried could probably come up with a
> dozen examples--more if we've been around for a while. Original design
> cutting corners on power regulation; final manufacturers cutting
> corners on specs; component manufacturers cutting corners on specs or
> selling outright counterfeit parts...

These are excellent examples of failure causes for electronics, but they are not counter-examples. They're unrelated to the discussion about SSD early-lifetime hard failures.
On 05/05/2011 10:35 AM, David Boreham wrote:
> On 5/5/2011 8:04 AM, Scott Ribe wrote:
>> Actually, any of us who really tried could probably come up with a
>> dozen examples--more if we've been around for a while. Original design
>> cutting corners on power regulation; final manufacturers cutting
>> corners on specs; component manufacturers cutting corners on specs or
>> selling outright counterfeit parts...
>
> These are excellent examples of failure causes for electronics, but
> they are not counter-examples. They're unrelated to the discussion
> about SSD early-lifetime hard failures.

That's really optimistic. For all we know, these problems are the latest incarnation of something like the bulging-capacitor plague of about five years ago: some part unique to SSDs, other than the flash cells, that there's a giant bad batch of.

I think your faith in PC component manufacturing is out of touch with the actual field failure rates for this stuff, which is produced with enormous cost-cutting pressure driving tolerances to the bleeding edge in many cases. The equipment of the '80s and '90s you were referring to ran slower and was more expensive, so better-quality components could be justified. The quality trend at the board and component level has been toward cheap over good in almost every case for a long time now.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Thu, May 5, 2011 at 1:54 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I think your faith in PC component manufacturing is out of touch with
> the actual field failure rates for this stuff, which is produced with
> enormous cost-cutting pressure driving tolerances to the bleeding edge
> in many cases. The equipment of the '80s and '90s you were referring
> to ran slower and was more expensive, so better-quality components
> could be justified. The quality trend at the board and component level
> has been toward cheap over good in almost every case for a long time
> now.

Modern CAD tools make this more and more of an issue. You can be in a circuit design program, right-click on a component, pick from a dozen other components with lower tolerances, and get a SPICE simulation that says initial production-line failure rates will go from 0.01% to 0.02%. Multiply that by 100 components and it seems like a small change. But all it takes is one misstep and you've got a board with a theoretical production-line failure rate of 0.05% that's really 0.08%, the first-year failure rate goes from 0.5% to 2 or 3%, and the $2.00 you saved on components, times 1M units, goes right out the window.

BTW, the common term we used for things that fail due to weird and unforeseen circumstances was "P.O.M. dependent" (phase of the moon), because the failures would often cluster around operating conditions that were unobvious until you collected and collated a large enough data set. Like hard drives that have abnormally high failure rates at altitudes above 4500 ft: they seem fine until you order 1,000 for your Denver data center and they all start failing. It could be anything like that. SSDs that operate fine until they're in an environment with constant humidity below 15%, and boom, they start failing like mad. It's impossible to test for all conditions in the field, and it's quite possible that environmental factors affect some of the SSDs we've heard about. More research is necessary to say why someone would see such clustering, though.
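A minimal sketch of the yield arithmetic Scott describes, using the per-component figures from his example (the 100-part count and the assumption of independent failures are illustrative):

```python
# Board-level initial failure rate from a per-component rate, assuming
# independent failures; figures follow the example in the post above.
def board_failure_rate(per_component_rate: float, n_components: int) -> float:
    """Probability that at least one of n components fails."""
    return 1.0 - (1.0 - per_component_rate) ** n_components

for rate in (0.0001, 0.0002):  # 0.01% vs. 0.02% per component
    print(f"{rate:.2%} per part x 100 parts -> "
          f"{board_failure_rate(rate, 100):.2%} per board")
# 0.01% per part x 100 parts -> 1.00% per board
# 0.02% per part x 100 parts -> 1.98% per board
```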
On 05/04/2011 08:31 PM, David Boreham wrote:
> Here's my best theory at present: the failures ARE caused by cell
> wear-out, but the SSD firmware is buggy insofar as it fails to boot
> up and respond to host commands due to the wear-out state. So rather
> than the expected outcome (the SSD responds but has read-only
> behavior), it appears to be (and is) dead. At least to my mind, this
> is a more plausible explanation for the reported failures than the
> alternative (SSD vendors are uniquely clueless at making basic
> electronics subassemblies), especially considering the difficulty of
> testing the firmware under all possible wear-out conditions.
>
> One question worth asking is: in the cases you were involved in, was
> manufacturer failure analysis performed (and if so, what was the
> failure cause reported)?

Unfortunately not. Many of the people I deal with, particularly the ones with budgets to be early SSD adopters, are not the sort to return things that have failed to the vendor. In some of these shops, if the data can't be securely erased first, it doesn't leave the place. The idea that some trivial fix at the hardware level might bring the drive back to life, data intact, is terrifying to many businesses when drives fail hard. Your bigger point, that these could just as easily be software failures due to unexpected corner cases rather than hardware issues, is both fair to raise and even more scary.

>> Intel claims the Annual Failure Rate (AFR) on their SSDs in IT
>> deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
>> mechanical drives are around 2% during their first year, spiking to 5%
>> afterwards. I suspect that Intel's numbers are actually much better
>> than the other manufacturers' here, so an SSD from anyone else can
>> easily be less reliable than a regular hard drive still.
>
> Hmm, this is speculation I don't support (that non-Intel vendors have
> a 10x worse early failure rate). The entire industry uses very similar
> processes (often the same factories). One rogue vendor with a bad
> process...sure, but all of them??

I was postulating that you only have to be 4X as bad as Intel to reach 2.4%, and then be worse than a mechanical drive for early failures. If you look at http://labs.google.com/papers/disk_failures.pdf you can see there's a 5:1 ratio in first-year AFR just between light and heavy usage of the drive. So a 4:1 ratio between the best and worst SSD manufacturers seemed possible. Plenty of us have seen particular drive models that were much more than 4X as bad as average among regular hard drives.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
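For reference, the arithmetic behind the "4X as bad as Intel" point, using only the AFR figures quoted in this thread:

```python
# AFR figures quoted earlier in the thread.
intel_ssd_afr = 0.006      # Intel's claimed SSD AFR (0.6%)
hdd_first_year_afr = 0.02  # typical mechanical drive, first year (2%)

print(f"4x Intel's AFR: {4 * intel_ssd_afr:.1%}")                    # 2.4%
print(f"Multiple needed just to match a first-year mechanical drive: "
      f"{hdd_first_year_afr / intel_ssd_afr:.1f}x")                  # 3.3x
```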
On 05/05/11 18:36, Florian Weimer wrote:
> * Greg Smith:
>
>> Intel claims the Annual Failure Rate (AFR) on their SSDs in IT
>> deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
>> mechanical drives are around 2% during their first year, spiking to 5%
>> afterwards. I suspect that Intel's numbers are actually much better
>> than the other manufacturers' here, so an SSD from anyone else can
>> easily be less reliable than a regular hard drive still.
>
> I'm a bit concerned with usage-dependent failures. Presumably, two SSDs
> in a RAID-1 configuration are worn down in the same way, and it would
> be rather inconvenient if they failed at the same point. With hard
> disks, this doesn't seem to happen; even bad batches fail pretty much
> randomly.

Actually, I think it'll be the same as with hard disks: a batch of drives with sequential serial numbers will have a fairly similar average lifetime, but they won't pop their clogs all on the same day. (Unless there is an outside influence - see note 1.)

The wearing-out of SSDs is not as exact as people seem to think. If the drive is rated for 10,000 erase cycles, that is meant to be a MINIMUM. So most blocks will last more than that many cycles, and maybe a small number will die before reaching it. I guess it's a probability curve, engineered so that 95% or some other high percentage will outlast that count. (And SSDs have reserved blocks which are brought in to take over from failing blocks, invisibly to the end user, since the drive can always still read from a failing-to-erase block.)

Note 1:
I have seen an array that was powered on continuously for about six years, which killed half the disks when it was finally powered down, left to cool for a few hours, then started up again.
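A back-of-the-envelope sketch of the erase-cycle wear model described above. Only the 10,000-cycle rating comes from the thread; the capacity, write rate, and write-amplification figures are assumptions for illustration:

```python
# Rough wear-out lifetime estimate, assuming wear leveling spreads erases
# evenly across all blocks. Only the 10,000-cycle rating is from the thread;
# the other numbers are assumed for illustration.
capacity_gb = 160            # assumed drive capacity
rated_erase_cycles = 10_000  # MLC minimum rating mentioned above
write_amplification = 2.0    # assumed controller overhead
daily_host_writes_gb = 500   # assumed heavy database write load

total_host_writes_gb = capacity_gb * rated_erase_cycles / write_amplification
lifetime_years = total_host_writes_gb / daily_host_writes_gb / 365
print(f"~{total_host_writes_gb / 1000:.0f} TB of host writes, "
      f"about {lifetime_years:.1f} years at {daily_host_writes_gb} GB/day")
# ~800 TB of host writes, about 4.4 years at 500 GB/day
```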
BTW, I saw a news article today about a brand of SSD that claims to have the price effectiveness of MLC-type chips, but with a lifetime of 4TB of writes per day over 5 years:
http://www.storagereview.com/anobit_unveils_genesis_mlc_enterprise_ssds

which also links to:
http://www.storagereview.com/sandforce_and_ibm_promote_virtues_mlcbased_ssds_enterprise

which is similar tech - much improved erase-cycle counts on MLC. No doubt this'll be common in all SSDs in a year or so, then!
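For scale, a quick check of what that endurance claim adds up to over the rated life:

```python
# Total writes implied by the "4TB/day over 5 years" claim.
tb_per_day = 4
years = 5
total_tb = tb_per_day * 365 * years
print(f"{total_tb} TB total, i.e. roughly {total_tb / 1000:.1f} PB of writes")
# 7300 TB total, i.e. roughly 7.3 PB of writes
```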
> Note 1:
> I have seen an array that was powered on continuously for about six
> years, which killed half the disks when it was finally powered down,
> left to cool for a few hours, then started up again.

Recently we rebooted about 6 machines that had uptimes of 950+ days. The last time fsck had run on the file systems was 2006.

When stuff gets that old and has been online and under heavy load all that time, you actually get paranoid about reboots. In my newly reaffirmed opinion, at that stage reboots are at best a crap shoot. We lost several hours more to that gamble than we had budgeted for. HP is getting more of their gear back than in a usual month.

Maybe that is just life with HP.

-M
What is this talk about replicating your primary database to secondary nodes in the cloud... or is cloud computing still marketing hype?
Martin
On 19/05/11 10:50, mark wrote:
>> Note 1:
>> I have seen an array that was powered on continuously for about six
>> years, which killed half the disks when it was finally powered down,
>> left to cool for a few hours, then started up again.
>
> Recently we rebooted about 6 machines that had uptimes of 950+ days.
> The last time fsck had run on the file systems was 2006.
>
> When stuff gets that old and has been online and under heavy load all
> that time, you actually get paranoid about reboots. In my newly
> reaffirmed opinion, at that stage reboots are at best a crap shoot.
> We lost several hours more to that gamble than we had budgeted for.
> HP is getting more of their gear back than in a usual month.

I worked at one place, years ago, which had an odd policy: they had automated hard resets hit all their servers on a Friday night, every week. I thought they were mad at the time! But it does mean that people design and test their systems so that they can survive unattended resets reliably. (No one wants to get a support call at 11pm on a Friday because their server didn't come back up.)

It still seems a bit messed up, though - even if Friday night is a low-use period, it still means causing a small amount of disruption to customers, especially if a developer or sysadmin messed up and a server *doesn't* come back up.
On 05/19/2011 08:57 AM, Martin Gainty wrote:
> What is this talk about replicating your primary database to secondary
> nodes in the cloud...

Slow. You'd have to do async replication with unbounded slave lag. It'd also be very easy to get to the point where the load on the master meant that the slave could never, ever catch up, because there just wasn't enough bandwidth.

-- 
Craig Ringer
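The "never catch up" scenario is just a rate comparison; here is a minimal sketch with made-up numbers for the master's WAL generation rate and the link to the cloud replica:

```python
# If the master generates WAL faster than the link can ship it, replica lag
# grows without bound. Both rates are assumed, illustrative values.
wal_generation_mb_per_s = 20.0  # assumed write-heavy master
link_bandwidth_mb_per_s = 5.0   # assumed usable bandwidth to the cloud node

deficit = wal_generation_mb_per_s - link_bandwidth_mb_per_s
if deficit > 0:
    backlog_gb_per_hour = deficit * 3600 / 1024
    print(f"Replica falls ~{backlog_gb_per_hour:.0f} GB further behind per hour")
else:
    print("Link keeps up; replication lag stays bounded")
```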