[NCLUG] SCSI disk spares management utility

John L. Bass jbass at dmsd.com
Tue Apr 15 12:35:23 MDT 2003


I made the previous statements because many young IT guys do not have a
clue about error recovery and data corruption issues on disk drives. There
is frequently an assumption that the tools and drives just do the right thing.

Mike Loseke writes:
>  I see a couple conflicting statements. Namely "software raid array" and
> "critical business applications". If the former is needed, and indeed being
> used, in support of the latter then one would hope that the person who is
> in fear of being terminated proposed the project with the caveat of
> something like this happening to cover themself - downtime *will* occur.

I spent 4 years working in banking and medical data centers, where data
corruption was absolutely unacceptable - and failing to follow established
logging and recovery procedures at any failure or disruption was a termination
offense. Over the past 30-some years, I have worked in or with dozens of other
data centers where it should have been - where IT staff cost other
organizations hundreds of hours recovering from data corruption that occurred
on IT-managed servers and desktops. As such, "auto-magical" data corruption is
a personal hot button - as the corruption typically isn't identified until
months later, when it's very difficult to fully identify or correct, even
with excellent backup history.

I have personally spent far too many hours doing filesystem audits against
backup tapes to recover from "magical" IT-induced filesystem corruption, both
as a storage driver/filesystem developer and as a consultant.
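
As a rough illustration of what such an audit involves, here is a minimal
Python sketch. It assumes the backup has already been restored to a scratch
directory, and it flags files whose contents differ from the backup even
though size and modification time are unchanged, which is the usual
signature of silent corruption rather than a legitimate edit. The function
names are hypothetical, not taken from any existing tool.

```python
import hashlib
import os


def build_manifest(root):
    """Map each relative file path under root to (size, mtime, sha256 hex)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            st = os.stat(path)
            rel = os.path.relpath(path, root)
            manifest[rel] = (st.st_size, int(st.st_mtime), digest.hexdigest())
    return manifest


def suspect_files(backup, live):
    """Return paths whose content changed while size and mtime did not."""
    suspects = []
    for rel, (size, mtime, digest) in backup.items():
        if rel not in live:
            continue  # deleted files need a separate report
        live_size, live_mtime, live_digest = live[rel]
        if (live_size, live_mtime) == (size, mtime) and live_digest != digest:
            suspects.append(rel)
    return sorted(suspects)
```

Typical use would be suspect_files(build_manifest(restored_dir),
build_manifest(live_dir)). Files with a newer mtime are deliberately not
flagged here, since they may simply have been edited since the backup.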

> Should said situation arise, replacing hardware and restoring from backups
> would, one would hope, be the next documented logical steps.

One would certainly hope, or there is some IT manager who should be looking
for a job or mandated training in how to do the job function. Frequently
it's not the filesystem restoration that is tricky, but rolling forward
from the restoration to recover the workload since the backup. Most online
database systems have transaction journals to help with this; most other
systems do not, which is a failing of many application system designers that
IT cannot recover from by itself.
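
To make the roll-forward idea concrete, here is a minimal Python sketch. It
assumes a toy key/value store, a backup that records the sequence number of
the last transaction it contains, and a journal of (seq, op, key, value)
entries. All of these names and the record layout are hypothetical; real
database journals are considerably more involved.

```python
def roll_forward(snapshot, snapshot_seq, journal):
    """Replay journal entries newer than the backup onto the restored state.

    snapshot     -- dict restored from backup
    snapshot_seq -- sequence number of the last transaction in the backup
    journal      -- iterable of (seq, op, key, value), op is "set" or "delete"
    Returns (state, last_applied_seq).
    """
    state = dict(snapshot)
    applied = snapshot_seq
    for seq, op, key, value in sorted(journal, key=lambda entry: entry[0]):
        if seq <= applied:
            continue  # already reflected in the backup
        if seq != applied + 1:
            # A missing entry means the result cannot be trusted.
            raise ValueError(f"journal gap: expected {applied + 1}, got {seq}")
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown journal operation: {op!r}")
        applied = seq
    return state, applied
```

Refusing to continue past a gap in the sequence is a deliberate choice in
this sketch: a partial replay that looks complete is exactly the kind of
corruption that goes unnoticed for months.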

>  In any case, if badblocks is not a viable solution, it sounds like you
> are uniquely qualified to write such a utility. That's what the rsync guys
> tell me whenever I ask a question on that list anyway.

Having done so several times in the past, I was definitely hoping not to,
as other things on my TODO list have a higher priority. Sadly, the Seagate
utility didn't have the particular function I was looking for, but did
have edit capability on all the drive options - including auto remapping.
I haven't taken the time to check Matt's other suggestions yet.

In the short term I simply replaced the drive, but I may go back, turn on the
read error remap flag, run a drive test on it, and then turn the flag
back off - or simply move it to a Solaris/IRIX system and use their utilities.
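
For reference, the read error remap flag is assumed here to be the ARRE
(Automatic Read Reallocation Enabled) bit, which sits in bit 6 of byte 2 of
the SCSI Read-Write Error Recovery mode page (page code 01h). The Python
sketch below only flips that bit in an in-memory copy of the page. Fetching
the page with MODE SENSE and writing it back with MODE SELECT are left out,
and the helper names are hypothetical.

```python
PAGE_CODE_RW_ERROR_RECOVERY = 0x01  # Read-Write Error Recovery mode page
ARRE_BIT = 0x40                     # byte 2, bit 6


def _check_page(page):
    """Reject buffers that are not a Read-Write Error Recovery page."""
    if len(page) < 3 or (page[0] & 0x3F) != PAGE_CODE_RW_ERROR_RECOVERY:
        raise ValueError("not a Read-Write Error Recovery mode page")


def arre_enabled(page):
    """Return True if automatic read reallocation is enabled in the page."""
    _check_page(page)
    return bool(page[2] & ARRE_BIT)


def set_arre(page, enabled):
    """Return a copy of the page with the ARRE bit set or cleared."""
    _check_page(page)
    buf = bytearray(page)
    if enabled:
        buf[2] |= ARRE_BIT
    else:
        buf[2] &= ~ARRE_BIT & 0xFF
    return bytes(buf)
```

The sequence described above would then be: set_arre(page, True), run the
drive test, then set_arre(page, False), with the page written back to the
drive after each change.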

John


