On Wed, 2008-03-19 at 07:44 +0900, Craig Ringer wrote:
Gregory Youngblood wrote:
> Also, a very informative read:
> http://research.google.com/archive/disk_failures.pdf
> In short, best thing to do is watch SMART and be prepared to try and
> swap a drive out before it fails completely. :)
>
I currently have four brand new 1TB disks (7200RPM SATA - they're for
our backup server). Two of them make horrible clicking noises - they're
rapidly parking and unparking or doing constant seeks. One of those two
also spins up very loudly, and on spin down rattles and buzzes.
Their internal SMART "health check" reports the two problem drives to be just
fine, and both pass a short SMART self test (smartctl -d ata -t short).
Both have absurdly huge seek_error_rate values, but the SMART thresholds
see nothing wrong with this.
-->8 snip 8<--
One of the conclusions in that Google report was that, after its first scan error, a drive was 39 times more likely to fail within the next 60 days than a drive with no scan errors. First reallocations and similar first errors also correlated with higher failure probabilities.
What I was referring to above is using those error conditions to identify drives to replace before they fail completely. Yes, that means changing drives before SMART itself reports a failure, and you may have to fight to get a drive replaced under warranty when you do so, but you are protecting your data.
I agree with you completely that waiting for SMART to indicate a true failure is pointless, given the thresholds set by manufacturers. But using SMART for early warning signs still has value, IMO.
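
To make the early-warning idea concrete, here is a rough Python sketch that ignores the vendor thresholds and simply flags any non-zero raw count on a few error attributes in the attribute table printed by `smartctl -A`. The attribute names in WATCHED are my own assumption about which counters are worth watching (names vary by vendor and drive database, and the Google report does not say which attribute its "scan errors" map to), and early_warnings is just a hypothetical helper name, so treat this as a starting point rather than a finished monitor.

```python
# Sketch only: flag early-warning SMART counters by raw value,
# regardless of whether the vendor threshold has been crossed.

# Assumed attribute names; adjust for what your drives actually report.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def early_warnings(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes whose raw value is > 0.

    Expects the text of the attribute table from `smartctl -A`, where each
    row has at least ten whitespace-separated columns: ID, name, flag, value,
    worst, thresh, type, updated, when_failed, raw_value.
    """
    warnings = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; skip headers and blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        # Skip raw values that are not plain integers (some vendors pack several numbers).
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            warnings[name] = int(raw)
    return warnings
```

You would feed it the captured output of `smartctl -A` for each drive and treat any non-empty result as a reason to look closer, or to start arguing about a warranty replacement. Seek_Error_Rate is left out on purpose, since its raw value is vendor-specific and hard to interpret.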