On Mon, Oct 24, 2011 at 11:37 PM, David Boreham <david_list@boreham.org> wrote:
>> What about redundancy?
>>
>> How do you swap an about-to-die SSD?
>>
>> Software RAID-1?
>
> The approach we take is that we use 710 series devices which have predicted
> reliability similar to all the other components in the machine, therefore
> the unit of replacement is the entire machine. We don't use trays for
> example (which saves quite a bit on data center space). If I were running
> short endurance devices such as 320 series I would be interested in
> replacing the drives before the machine itself is likely to fail, but I'd do
> so by migrating the data and load to another machine for the replacement to
> be done offline. Note that there are other operations procedures that need
> to be done and can not be done without downtime (e.g. OS upgrade), so some
> kind of plan to deliver service while a single machine is down for a while
> will be needed regardless of the storage device situation.
Interesting.
But what about unexpected failures? Faulty electronics, stuff like that?
I really don't think a production server can work without at least RAID-1.