Thread: OT (slightly) testing for data loss on an SSD drive due to power failure

From: John Rouillard
Date:
Hi all:

I realize this is slightly off topic, but it is an issue of concern
with the use of SSDs. We are setting up a storage server under Solaris
using ZFS. We have a pair of SSDs (2 x 160GB Intel X25-M MLC SATA)
acting as the ZIL (write journal), and we are trying to determine
whether it is safe to use in a power-failure situation.

Our testing (10 runs) hasn't shown any data loss, but I am not sure
our testing has run long enough or is valid, so I hoped that people
here who have tested an SSD for data loss might have some guidance.

The testing method is to copy a bunch of files over NFS to the server
with the ZIL. While the copy is running, pull the power to the server.
The NFS client will stall, and if the client got a message that block
X was written safely to the ZIL, it will continue writing with block
X+1. After the server comes back up and the copies resume/finish, the
files are checksummed. If block X went missing, the checksums will
fail and we will have our proof.
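The ack-then-verify idea above can be sketched in a few lines. This is
a minimal illustration, not the actual NFS harness from the test: the
file names, block size, and block count are arbitrary assumptions, and
the "ack" is modeled as recording a block number only after fsync
returns.

```python
import os

BLOCK = 4096                     # illustrative block size
DATA_FILE = "testdata.bin"       # hypothetical file names
ACK_LOG = "acked.log"

def write_blocks(n):
    """Write n blocks, fsync after each, and record each acknowledged
    block number -- mirroring the 'block X was written safely' rule."""
    with open(DATA_FILE, "wb") as f, open(ACK_LOG, "w") as log:
        for i in range(n):
            f.write(bytes([i % 256]) * BLOCK)
            f.flush()
            os.fsync(f.fileno())   # durability barrier: the "ack"
            log.write(f"{i}\n")
            log.flush()

def verify():
    """After the crash, every block the log claims was acked must be
    intact; any mismatch is the data-loss proof."""
    with open(ACK_LOG) as log:
        acked = [int(line) for line in log]
    with open(DATA_FILE, "rb") as f:
        for i in acked:
            f.seek(i * BLOCK)
            if f.read(BLOCK) != bytes([i % 256]) * BLOCK:
                return False
    return True

write_blocks(16)
print(verify())   # True on a clean run; False if an acked block vanished
```

On a real power-pull test, write_blocks runs until the plug is pulled
and verify runs after the machine comes back up.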

We are looking at how to max out writes to the SSD, on the theory that
we need to fill the DRAM buffer on the SSD and saturate it enough that
it can't flush data to permanent storage as fast as the data is coming
in (i.e., lengthen the writeback delay so it is more likely to drop
data).
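A bare-bones write-pressure generator for that saturation theory might
look like the sketch below. The path and sizes are illustrative
assumptions; the point is simply that nothing ever calls fsync, so the
data is free to sit in the drive's volatile cache when power is pulled.

```python
import os

def flood(path, mb=64, chunk=1 << 20):
    """Write `mb` megabytes as fast as possible with no fsync, leaving
    flushing entirely to the drive's writeback cache."""
    buf = os.urandom(chunk)
    with open(path, "wb") as f:
        for _ in range(mb):
            f.write(buf)   # no fsync: this data may sit in drive DRAM
    return os.path.getsize(path)

print(flood("flood.bin", mb=4))   # 4 * 1048576 = 4194304 bytes written
```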

Does anybody have any comments, or testing methodologies that don't
involve using an actual PostgreSQL instance?

Thanks for your help.
--
                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

On 04/22/2011 10:04 AM, John Rouillard wrote:
> We have a pair of SSDs (2 x 160GB Intel X25-M MLC SATA) acting as
> the ZIL (write journal), and we are trying to determine whether it
> is safe to use in a power-failure situation.
>

Well, the quick answer is "no".  I've lost several weekends of my life
to recovering information from databases stored on those drives, after
they were corrupted in a crash.

> The testing method is to copy a bunch of files over NFS to the
> server with the ZIL. While the copy is running, pull the power to
> the server. The NFS client will stall, and if the client got a
> message that block X was written safely to the ZIL, it will continue
> writing with block X+1. After the server comes back up and the
> copies resume/finish, the files are checksummed. If block X went
> missing, the checksums will fail and we will have our proof.
>

Interestingly, you have reinvented parts of the standard script for
testing for data loss, diskchecker.pl:
http://brad.livejournal.com/2116715.html

You can get a few thousand commits per second using that program, which
is enough to fill the drive buffer so that a power pull should
sometimes lose something.  I don't think you can do a proper test here
over NFS; you really need something that executes fsync calls directly,
in the same pattern a database server does.
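The commit pattern being described -- a small write followed by an
fsync, repeated as fast as the drive will acknowledge it -- can be
sketched as below. This is a rough illustration of the pattern
diskchecker.pl drives, not the script itself; the probe file name,
record size, and duration are arbitrary assumptions.

```python
import os
import time

def commits_per_second(path="fsync_probe.bin", seconds=1.0):
    """Issue small write+fsync pairs, like database commits, and
    report the sustained rate the drive acknowledges."""
    n = 0
    with open(path, "wb") as f:
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            f.write(b"x" * 512)
            f.flush()
            os.fsync(f.fileno())   # each fsync is one durable "commit"
            n += 1
    os.remove(path)
    return n / seconds

print(f"{commits_per_second(seconds=0.2):.0f} commits/sec")
```

A rate in the thousands per second on an SSD whose cache cannot be
flushed that fast is itself a hint that acknowledged data is sitting
in volatile memory.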

ZFS is more resilient than most filesystems as far as avoiding file
corruption in this case.  But you should still be able to find some
missing transactions that were sitting in the drive cache.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books