[NCLUG] sw raid, recovery after install

Michael Milligan milli at acmeps.com
Thu Jan 17 04:35:31 MST 2013


On 01/16/2013 11:06 PM, Bob Proulx wrote:
> Michael Milligan wrote:
>> Yup.  Caveat being though that by the time you have a RAID mismatch,
>> you've exhausted the drive's (or drives'!) bad-block remapping capability
>> and are now teetering on the precipice of disaster.
> 
> I disagree because I think you are confusing array mismatches with
> drive errors.  Those aren't related.  Once a drive starts to give you
> block errors then of course what you say about exhausting the drive's
> spare blocks is true.  But if there aren't any drive errors then the
> drive isn't having problems and the issue is the layer above them.

I've never seen that happen.  Any time I've seen a read or write
error logged (which would cause the situation you mention), the md driver
has kicked the "bad" drive out and degraded the array.  I've seen
transient problems do this, e.g., marginal SATA cables.  md is very
intolerant of (sensitive to?) read or write failures of any kind that
impact data integrity.
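
A rough sketch of how one might spot an array that md has degraded
after kicking a drive out.  It assumes the usual /proc/mdstat layout,
where each array's status line ends in a map such as [UU] or [U_].
The function name and the sample text are made up for illustration
and are not taken from a real machine.

```python
import re

# Illustrative sample only (hypothetical, not output from a real system).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      1953382400 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays with a missing member.

    A '_' in the trailing [UU]-style map marks a slot with no working drive.
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Start of a new array stanza; remember its name.
            current = header.group(1)
            continue
        member_map = re.search(r"\[([U_]+)\]\s*$", line)
        if current and member_map and "_" in member_map.group(1):
            degraded[current] = member_map.group(1)
    return degraded
```

On a live system you would pass in the contents of /proc/mdstat, e.g.
degraded_arrays(open("/proc/mdstat").read()).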

> Array mismatches occur during normal operation of the array even with
> perfect drives.  It is annoying that the check produces chatter but it

Huh?  If that's true then you should run screaming for the door...
seriously.  That's violating the data integrity and redundancy that RAID
is supposedly giving you.

> isn't really a problem.  It is just reporting raw information.  Raw
> information that should be ignored.

If your mirror or parity blocks aren't correct, you've got data
corruption.  No question about it.  You should be concerned and not
ignore that.
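
For anyone who wants to look at the number being argued about: after
a "check" pass, md exposes a per-array counter in sysfs at
/sys/block/mdX/md/mismatch_cnt.  Here is a minimal sketch that collects
it for every md array.  The function name is hypothetical, and the
sys_block parameter exists only so the code can be pointed at a test
directory.

```python
import os


def mismatch_counts(sys_block="/sys/block"):
    """Return {array_name: mismatch_cnt} for every md array found.

    Block devices without an md/mismatch_cnt file (plain disks) are skipped.
    """
    counts = {}
    for name in sorted(os.listdir(sys_block)):
        path = os.path.join(sys_block, name, "md", "mismatch_cnt")
        if os.path.isfile(path):
            with open(path) as f:
                counts[name] = int(f.read().strip())
    return counts
```

A nonzero value only says the copies (or parity) disagreed somewhere
during the last check; it does not say which copy is the right one.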

> 
>> S.M.A.R.T.  monitoring is a better tool to watch for drives going
>> bad.
> 
> Better?  Yes.  Good at it?  I don't think so.  And I am zeroing in on
> the word "going" meaning prediction of drive problems in the future
> tense.
> 
> I haven't seen SMART be a great predictor of drive failure.

It works for me.  As soon as a drive starts throwing "Uncorrectable"
messages, it's gonna snowball.  In the 20-some disk failures I've seen
in the last 15 years, the count of uncorrectable sectors starts
to mount exponentially after that.  The drive has to be replaced NOW.
Ignore those messages at your peril.  ;-)
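
A small sketch of watching for those counts, assuming the attribute
table that "smartctl -A" prints, with the raw value in the tenth
column.  Attribute names vary by vendor, so the WATCHED set is an
assumption to adjust per drive, and the sample table is made up for
illustration.

```python
# Attribute names commonly tied to unreadable or remapped sectors.
# Assumption: your drive reports these names; many do, some do not.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Illustrative sample only (hypothetical, not from a real drive).
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""


def nonzero_error_attrs(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found
```

Logging the result once a day and comparing it with the previous day's
numbers is enough to see whether the counts have started to climb.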

Regards,
Mike



