Thread: Reports from SSD purgatory

Reports from SSD purgatory

From
Greg Smith
Date:
News update for anyone else who's trapped like me, waiting for a fix to
the Intel 320 SSD bug where they can truncate themselves to 8MB.  Over
the weekend Intel announced that a firmware fix for the problem is done
and is due to ship "within the next two weeks":
http://communities.intel.com/thread/24121

On the larger SSD reliability front, Tom's Hardware surveyed heavy SSD
users they're friendly with who use Intel drives.  The most interesting
data came from Softlayer:
http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923-6.html

This supports two claims I made before based on my private data that
were controversial:

-Annualized SSD failure rates are not significantly lower than
traditional drives in the first couple of years.  Jury is still out on
whether they will spike upwards starting at 3 years as mechanical ones do.

-The most common source of dead drives is sudden, catastrophic
electronics failure.  These are not predicted by SMART, and have nothing
to do with hitting the drive's wear limits.
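
For anyone comparing their own fleet against the survey, an annualized failure rate is just failures divided by drive-years of service. A minimal Python sketch follows; the function name and the sample counts are invented for illustration and are not taken from the Softlayer data.

```python
def annualized_failure_rate(failures: int, drives: int, months_in_service: float) -> float:
    """Return failures per drive-year, as a fraction (0.02 means 2% per year)."""
    drive_years = drives * months_in_service / 12.0
    if drive_years <= 0:
        raise ValueError("need a positive number of drive-years")
    return failures / drive_years


# Hypothetical fleet: 1000 drives, all in service for 24 months, 40 failures.
afr = annualized_failure_rate(failures=40, drives=1000, months_in_service=24)
print(f"{afr:.1%}")  # prints 2.0%
```

This assumes every drive was in service for the same length of time; a real fleet would need to sum the actual service time per drive.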

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


Re: Reports from SSD purgatory

From
David Boreham
Date:
<dons flameproof underpants once more...>

I think this comment by the article's author tends to support my theory that most of the
failures seen are firmware related (and not due to actual hardware failures, which,
as I mentioned in the previous thread, are very rare and should occur roughly equally
often in hard drives and SSDs):

As we explained in the article, write endurance is a spec'ed failure. That won't happen in the first year, even at enterprise level use. That has nothing to do with our data. We're interested in random failures. The stuff people have been complaining about... BSODs with OCZ drives, LPM stuff with m4s, the SSD 320 problem that makes capacity disappear... etc... Mostly "soft" errors. Any hard error that occurs is subject to the "defective parts per million" problem that any electrical component also suffers from.

and from the main article body:

Firmware is the most significant, and we see its impact in play almost every time an SSD problem is reported.

(Hard drives also suffer from firmware bugs of course)

I think I'm generally encouraged by this article because it suggests that once the firmware bugs are fixed (or if you buy from a vendor less likely to ship with bugs in the first place), then SSD reliability will be much better than it is perceived to be today.



 

Re: Reports from SSD purgatory

From
Greg Smith
Date:
On 08/15/2011 07:49 PM, Greg Smith wrote:
> News update for anyone else who's trapped like me, waiting for a fix
> to the Intel 320 SSD bug where they can truncate themselves to 8MB.
> Over the weekend Intel has announced a firmware fix for the problem is
> done, and is due to ship "within the next two weeks":
> http://communities.intel.com/thread/24121

http://communities.intel.com/thread/24205
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=18363

I can't believe I'm going to end up using FreeDOS to fix this problem.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


Re: Reports from SSD purgatory

From
gnuoytr@rcn.com
Date:

---- Original message ----
>Date: Mon, 15 Aug 2011 19:49:52 -0400
>From: pgsql-performance-owner@postgresql.org (on behalf of Greg Smith <greg@2ndQuadrant.com>)
>Subject: [PERFORM] Reports from SSD purgatory
>To: "pgsql-performance@postgresql.org" <pgsql-performance@postgresql.org>
>
>News update for anyone else who's trapped like me, waiting for a fix to
>the Intel 320 SSD bug where they can truncate themselves to 8MB.  Over
>the weekend Intel has announced a firmware fix for the problem is done,
>and is due to ship "within the next two weeks":
>http://communities.intel.com/thread/24121
>
>On the larger SSD reliability front, Tom's Hardware surveyed heavy SSD
>users they're friendly with who use Intel drives.  The most interesting
>data came from Softlayer:
>http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923-6.html
>
>This supports two claims I made before based on my private data that
>were controversial:
>
>-Annualized SSD failure rates are not significantly lower than
>traditional drives in the first couple of years.  Jury is still out on
>whether they will spike upwards starting at 3 years as mechanical ones do.
>
>-The most common source of dead drives is sudden, catastrophic
>electronics failure.  These are not predicted by SMART, and have nothing
>to do with hitting the drive's wear limits.

It's worth knowing exactly what that means.  It turns out that NAND quality is price specific: there are gooduns and
baduns.  Are these failures in the controller(s) or in the NAND?

Also, PG is *nix centric while support for TRIM is Windows centric, and having TRIM makes a big difference in
performance.


>
>--
>Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
>PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
>
>
>--
>Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance

Re: Reports from SSD purgatory

From
Merlin Moncure
Date:
On Wed, Aug 24, 2011 at 1:48 PM,  <gnuoytr@rcn.com> wrote:
>
>
> ---- Original message ----
>>Date: Mon, 15 Aug 2011 19:49:52 -0400
>>From: pgsql-performance-owner@postgresql.org (on behalf of Greg Smith <greg@2ndQuadrant.com>)
>>Subject: [PERFORM] Reports from SSD purgatory
>>To: "pgsql-performance@postgresql.org" <pgsql-performance@postgresql.org>
>>
>>News update for anyone else who's trapped like me, waiting for a fix to
>>the Intel 320 SSD bug where they can truncate themselves to 8MB.  Over
>>the weekend Intel has announced a firmware fix for the problem is done,
>>and is due to ship "within the next two weeks":
>>http://communities.intel.com/thread/24121
>>
>>On the larger SSD reliability front, Tom's Hardware surveyed heavy SSD
>>users they're friendly with who use Intel drives.  The most interesting
>>data came from Softlayer:
>>http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923-6.html
>>
>>This supports two claims I made before based on my private data that
>>were controversial:
>>
>>-Annualized SSD failure rates are not significantly lower than
>>traditional drives in the first couple of years.  Jury is still out on
>>whether they will spike upwards starting at 3 years as mechanical ones do.
>>
>>-The most common source of dead drives is sudden, catastrophic
>>electronics failure.  These are not predicted by SMART, and have nothing
>>to do with hitting the drive's wear limits.
>
> It's worth knowing exactly what that means.  Turns out that NAND quality is price specific.  There's gooduns and
> baduns. Is this a failure in the controller(s) or the NAND?
>
> Also, given that PG is *nix centric and support for TRIM is win centric, having that makes a big difference in
> performance.

One point about TRIM -- no RAID controller that I know of supports
it, which suggests it might not even be possible to support.  How
much does it really help? Probably not as much as you would think,
because newer SSDs have very sophisticated controllers that make
it at least partially obsolete.

merlin

Re: Reports from SSD purgatory

From
david@lang.hm
Date:
On Wed, 24 Aug 2011, Merlin Moncure wrote:

> On Wed, Aug 24, 2011 at 1:48 PM,  <gnuoytr@rcn.com> wrote:
>>
>>
>>
>> Also, given that PG is *nix centric and support for TRIM is win
>> centric, having that makes a big difference in performance.
>
> one point about TRIM -- no raid controller that I know of supports
> trim, which suggests it might not even be possible to support.  How
> much does it help really? Probably not as much as you would think
> because newer SSD drives have very sophisticated controllers that make
> it at least partially obsolete.

if the SSD can know that the user doesn't care about data in a particular
block, the SSD can overwrite that block with new data.

Since the SSDs do their writing to new blocks and erase old blocks later,
the more empty blocks you have available, the less likely you are to hit a
garbage collection pause when you try to write to the drive.

if you are careful to never write temporary files to the drive and only
use it for database-like 'update in place' type of things (no 'write a new
file and then rename it over the old one' tricks), then TRIM won't make
any difference because every block that you have ever written to is one
that you care about (or close enough to this for practical purposes)

but if you don't take this care, the drive works to preserve all the data
blocks that you have ever written to, even if the filesystem has freed
them and doesn't care about them. The worst case would be a log-structured
filesystem (btrfs for example) where every write is to a new block and
then the old block is freed later.
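
The difference between the two workloads above can be sketched with a toy model in Python. This is not how any real SSD firmware works: `ToyDrive`, the page counts, and the assumption that the filesystem never reuses a freed block address are all simplifications made up for illustration. It only counts how many pages the drive still believes it must preserve.

```python
class ToyDrive:
    """Tracks which logical blocks the drive believes hold data it must keep."""

    def __init__(self, physical_pages: int):
        self.physical_pages = physical_pages
        self.live = set()  # logical block addresses the drive considers in use

    def write(self, lba: int) -> None:
        self.live.add(lba)

    def trim(self, lba: int) -> None:
        # TRIM tells the drive the block's contents no longer matter.
        self.live.discard(lba)

    def spare_pages(self) -> int:
        # Pages the drive is free to erase ahead of time.
        return self.physical_pages - len(self.live)


def write_new_then_free(drive: ToyDrive, rounds: int, file_blocks: int, use_trim: bool) -> int:
    """Each round writes a new copy to fresh blocks, then frees the old copy."""
    old, next_lba = [], 0
    for _ in range(rounds):
        new = list(range(next_lba, next_lba + file_blocks))
        next_lba += file_blocks
        for lba in new:
            drive.write(lba)
        if use_trim:
            for lba in old:
                drive.trim(lba)
        old = new
    return drive.spare_pages()


def update_in_place(drive: ToyDrive, rounds: int, file_blocks: int) -> int:
    """Each round rewrites the same blocks, so nothing is ever freed."""
    for _ in range(rounds):
        for lba in range(file_blocks):
            drive.write(lba)
    return drive.spare_pages()


print(write_new_then_free(ToyDrive(1000), 8, 100, use_trim=False))  # prints 200
print(write_new_then_free(ToyDrive(1000), 8, 100, use_trim=True))   # prints 900
print(update_in_place(ToyDrive(1000), 8, 100))                      # prints 900
```

In this toy the update-in-place workload leaves the same number of spare pages with or without TRIM, while the write-new-then-free workload only keeps its spare pages if the freed blocks are trimmed.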

David Lang

Re: Reports from SSD purgatory

From
"Tomas Vondra"
Date:
On 24 August 2011, 20:48, gnuoytr@rcn.com wrote:

> It's worth knowing exactly what that means.  Turns out that NAND quality
> is price specific.  There's gooduns and baduns.  Is this a failure in the
> controller(s) or the NAND?

Why is that important? It's simply a failure of electronics and it has
nothing to do with the wear limits. It simply fails without prior warning
from the SMART.

> Also, given that PG is *nix centric and support for TRIM is win centric,
> having that makes a big difference in performance.

Windows specific? What do you mean? TRIM is a low-level way to tell the
drive 'this block is empty and may be used for something else' - it's just
another command sent to the drive. It has to be supported by the
filesystem, though (e.g. ext4/btrfs support it).

Tomas


Re: Reports from SSD purgatory

From
Merlin Moncure
Date:
On Wed, Aug 24, 2011 at 2:32 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 24 August 2011, 20:48, gnuoytr@rcn.com wrote:
>
>> It's worth knowing exactly what that means.  Turns out that NAND quality
>> is price specific.  There's gooduns and baduns.  Is this a failure in the
>> controller(s) or the NAND?
>
> Why is that important? It's simply a failure of electronics and it has
> nothing to do with the wear limits. It simply fails without prior warning
> from the SMART.
>
>> Also, given that PG is *nix centric and support for TRIM is win centric,
>> having that makes a big difference in performance.
>
> Windows specific? What do you mean? TRIM is a low-level way to tell the
> drive 'this block is empty and may be used for something else' - it's just
> another command sent to the drive. It has to be supported by the
> filesystem, though (e.g. ext4/btrfs support it).

Well, it's a fair point that TRIM support is probably more widespread
on Windows.

merlin

Re: Reports from SSD purgatory

From
gnuoytr@rcn.com
Date:

---- Original message ----
>Date: Wed, 24 Aug 2011 21:32:16 +0200
>From: pgsql-performance-owner@postgresql.org (on behalf of "Tomas Vondra" <tv@fuzzy.cz>)
>Subject: Re: [PERFORM] Reports from SSD purgatory
>To: gnuoytr@rcn.com
>Cc: pgsql-performance@postgresql.org
>
>On 24 August 2011, 20:48, gnuoytr@rcn.com wrote:
>
>> It's worth knowing exactly what that means.  Turns out that NAND quality
>> is price specific.  There's gooduns and baduns.  Is this a failure in the
>> controller(s) or the NAND?
>
>Why is that important? It's simply a failure of electronics and it has
>nothing to do with the wear limits. It simply fails without prior warning
>from the SMART.

It matters because if it's the controller, there's nothing one can do about it (the vendor).  If it's the NAND, then
the vendor/customer can get drives with gooduns rather than baduns.  Not necessarily a quick fix, but knowing the
quality of the NAND in the SSD you're planning to buy matters.
>
>> Also, given that PG is *nix centric and support for TRIM is win centric,
>> having that makes a big difference in performance.
>
>Windows specific? What do you mean? TRIM is a low-level way to tell the
>drive 'this block is empty and may be used for something else' - it's just
>another command sent to the drive. It has to be supported by the
>filesystem, though (e.g. ext4/btrfs support it).

My point.  The firmware and MS have been faster to support TRIM than *nix, linux in particular.  Those that won't/can't
move to a recent kernel don't get TRIM.

>
>Tomas
>
>

Re: Reports from SSD purgatory

From
David Boreham
Date:
On 8/24/2011 1:32 PM, Tomas Vondra wrote:
> Why is that important? It's simply a failure of electronics and it has
> nothing to do with the wear limits. It simply fails without prior
> warning from the SMART.

In the cited article (actually in all articles I've read on this
subject), the failures were not properly analyzed*. Therefore the
conclusion that the failures were of electronic components is invalid.
People have pointed to the most recent article as confirming
electronics failures, but the article actually states that the majority
of failures were suspected to be firmware-related.

We know a) that there have been failures, but b) not their cause.

We don't even know for sure that the cause was not cell wear.
That's because all we know is that the drives did not report
wear before failing. The wear reporting mechanism could be broken for
all we know.

--
*A "proper" analysis would involve either the original manufacturer's FA
lab, or a qualified independent analysis lab.



Re: Reports from SSD purgatory

From
"Tomas Vondra"
Date:
On 24 August 2011, 21:41, Merlin Moncure wrote:
> On Wed, Aug 24, 2011 at 2:32 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 24 August 2011, 20:48, gnuoytr@rcn.com wrote:
>>> Also, given that PG is *nix centric and support for TRIM is win
>>> centric,
>>> having that makes a big difference in performance.
>>
>> Windows specific? What do you mean? TRIM is a low-level way to tell the
>> drive 'this block is empty and may be used for something else' - it's
>> just
>> another command sent to the drive. It has to be supported by the
>> filesystem, though (e.g. ext4/btrfs support it).
>
> Well, it's a fair point that TRIM support is probably more widespread
> on windows.

AFAIK the only versions that support it natively are Windows 7 and
Windows Server 2008 R2 - with other versions you're stuck with
command-line tools equivalent to wiper.sh or hdparm. So I don't see a
significant difference here - with a reasonably new system (at least
kernel 2.6.33), the support is about the same.

Obviously there are more machines with Windows, especially in the field of
desktops/laptops, but that does not make TRIM Windows-specific, I guess.
Most of them run old versions (without TRIM support) anyway.
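
On the Linux side, a rough way to check a given box is to compare the kernel release against 2.6.33 and to look at what the block layer reports for the device. The Python sketch below is only a heuristic under stated assumptions: the helper names are invented here, distribution kernels may backport features so the version comparison can be wrong in either direction, and the sysfs file `queue/discard_max_bytes` is assumed to be present on kernels that expose discard limits.

```python
import re
from pathlib import Path

MIN_TRIM_KERNEL = (2, 6, 33)  # version mentioned above, used as a rough threshold


def parse_kernel_release(release: str) -> tuple:
    """Turn a release string such as '2.6.18-238.el5' into (2, 6, 18)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def kernel_may_support_trim(release: str) -> bool:
    """Version-only heuristic; vendor backports are not detected."""
    return parse_kernel_release(release) >= MIN_TRIM_KERNEL


def device_advertises_discard(device: str) -> bool:
    """Return True if /sys/block/<device>/queue/discard_max_bytes is non-zero."""
    path = Path("/sys/block") / device / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip()) > 0
    except (OSError, ValueError):
        # Missing file (old kernel, non-Linux system, unknown device) or odd content.
        return False
```

For example, `kernel_may_support_trim(platform.release())` after `import platform` checks the running kernel, and `device_advertises_discard("sda")` checks one device; the device name is a placeholder.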

Tomas


Re: Reports from SSD purgatory

From
"Tomas Vondra"
Date:
On 24 August 2011, 21:42, gnuoytr@rcn.com wrote:
>
>
> ---- Original message ----
>>Date: Wed, 24 Aug 2011 21:32:16 +0200
>>From: pgsql-performance-owner@postgresql.org (on behalf of "Tomas Vondra"
>> <tv@fuzzy.cz>)
>>Subject: Re: [PERFORM] Reports from SSD purgatory
>>To: gnuoytr@rcn.com
>>Cc: pgsql-performance@postgresql.org
>>
>>On 24 August 2011, 20:48, gnuoytr@rcn.com wrote:
>>
>>> It's worth knowing exactly what that means.  Turns out that NAND
>>> quality
>>> is price specific.  There's gooduns and baduns.  Is this a failure in
>>> the
>>> controller(s) or the NAND?
>>
>>Why is that important? It's simply a failure of electronics and it has
>>nothing to do with the wear limits. It simply fails without prior warning
>>from the SMART.
>
> It matters because if it's the controller, there's nothing one can do
> about it (the vendor).  If it's the NAND, then the vendor/customer can get
> drives with gooduns rather than baduns.  Not necessarily a quick fix, but
> knowing the quality of the NAND in the SSD you're planning to buy matters.

OK, now I see the difference. Still, it'll be quite difficult to find out
which NAND manufacturers are good, especially when the drive manufacturer
may use several of them at the same time. And as David Boreham pointed out,
we don't know why the drives actually failed :-(

>>> Also, given that PG is *nix centric and support for TRIM is win
>>> centric,
>>> having that makes a big difference in performance.
>>
>>Windows specific? What do you mean? TRIM is a low-level way to tell the
>>drive 'this block is empty and may be used for something else' - it's
>> just
>>another command sent to the drive. It has to be supported by the
>>filesystem, though (e.g. ext4/btrfs support it).
>
> My point.  The firmware and MS have been faster to support TRIM than *nix,
> linux in particular.  Those that won't/can't move to a recent kernel don't
> get TRIM.

Faster? Windows 7 was released in October 2009; Linux has supported TRIM
since February 2010. That's a difference of about 3 or 4 months - given
that it may easily take a year to put a new OS / kernel into production,
it's a negligible difference. For example, most of the corporations / banks
I work for are still using Windows XP.

Don't get me wrong - I'm not blindly fighting against Windows, I just
don't see how this makes TRIM a Windows-specific feature.

Tomas


Re: Reports from SSD purgatory

From
david@lang.hm
Date:
On Wed, 24 Aug 2011, Tomas Vondra wrote:

> On 24 August 2011, 21:42, gnuoytr@rcn.com wrote:
>>
>>
>>
>> My point.  The firmware and MS have been faster to support TRIM than *nix,
>> linux in particular.  Those that won't/can't move to a recent kernel don't
>> get TRIM.
>
> Faster? Windows 7 was released on October 2009, Linux supports TRIM since
> February 2010. That's about 3 or 4 months difference - given that it may
> easily take a year to put a new OS / kernel into a production, it's
> negligible difference. For example most of the corporations / banks I'm
> working for are still using Windows XP.
>
> Don't get me wrong - I'm not blindly fighting against Windows, I just
> don't see how this makes the TRIM a windows-specific feature.

The thing is that many people using Linux are using Red Hat Enterprise
Linux 5, which was released several years prior to that, and TRIM is not
one of the things that Red Hat has backported to its ancient kernel. So
for those people it doesn't exist prior to RHEL 6.0, which was released
much more recently.

David Lang