[NCLUG] SCSI disk spares management utility

Tue Apr 15 10:58:02 MDT 2003

Thus spake John L. Bass:
> 
> Sean Reifschneider <jafo-nclug at tummy.com> writes:
> >  On Thu, Apr 10, 2003 at 02:13:27PM -0600, John L. Bass wrote:
> >  >Have a couple bad sectors on a drive inside a software raid array that
> >  >forced reconstruction to fail. Really need a scsi drive level sparing
> >  >utility, not a filesystem level.
> > 
> >  Well, badblocks is at the disc level, not file-system...  The point is
> >  that most drives will automatically do the block relocation when they
> >  detect a bad block, so reading/writing to the disc a few times will
> >  often do the trick...
> > 
> >  Sean
> 
> Just another case where the industry does the easy thing for support, and
> quietly leaves the customer with corrupt files. For critical business
> applications this is VERY WRONG.  This "may" work fine for most low-end
> desktop applications, depending on the customer's data and backups.
> 
> In my nearly 30 years of doing storage drivers, just about the only case
> where large systems vendors leave automatic remapping enabled is for low-tech
> end users, to minimize service calls and end-user frustration with persistent
> bad-block errors.  Particularly since, when the drive automatically remaps a
> hard read error, there is certainly going to be a corrupted file in the
> filesystem that the end user will have to muddle their way through. In nearly
> every managed facility, allowing corrupted files to exist without immediate
> intervention is probably close to a termination offense for incompetence.
> As such, managed facilities turn off auto remapping, manually isolate and
> recover corrupted files, and manually remap the bad sectors in the process,
> IFF the drive is worth trusting in production. If the drive has reached
> 80% or more of its life cycle, it's generally pulled on the spot to avoid
> the risk of being at the end of the bathtub curve for the drive.
> 
> Besides that, in high-end RAID and JBOD applications, auto sparing is almost
> always turned off to prevent filling up the drive's spares table with "normal"
> transient errors, which are a natural artifact of the basic BERR and continuous
> use, especially power supply ripple induced errors caused by drive cycling
> and synchronization, as well as external power line transients.

 I see a couple of conflicting statements, namely "software raid array" and
"critical business applications". If the former is needed, and indeed being
used, in support of the latter, then one would hope that the person who is
in fear of being terminated proposed the project with the caveat that
something like this could happen, to cover themselves: downtime *will* occur.
Should said situation arise, replacing hardware and restoring from backups
would, one would hope, be the next documented logical steps.

 In any case, if badblocks is not a viable solution, it sounds like you
are uniquely qualified to write such a utility. That's what the rsync guys
tell me whenever I ask a question on that list anyway.
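
 For what it's worth, the "read the disc and see what fails" idea Sean
describes can be sketched in a few lines of Python. This is not badblocks
and it does nothing at the SCSI level: it only reads each block once and
reports which reads returned an I/O error, without writing or remapping
anything. The name scan_blocks and the 4096-byte default block size are my
own assumptions for illustration, and reads may be satisfied from the page
cache rather than the platter, so treat it as a rough sketch only.

```python
import os


def scan_blocks(path, block_size=4096):
    """Read every block of `path` once (read-only).

    Returns (total_blocks, bad_block_indices). Hypothetical helper for
    illustration; it reports read errors but never writes or remaps.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a file or block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        total = (size + block_size - 1) // block_size
        for index in range(total):
            try:
                os.lseek(fd, index * block_size, os.SEEK_SET)
                os.read(fd, block_size)
            except OSError:
                # The read failed; record the block and keep going.
                bad.append(index)
    finally:
        os.close(fd)
    return total, bad
```

 Calling scan_blocks() on a device node or an image file gives the number of
blocks read and a list of the ones that failed. What to do with that list
(recover the affected files, reassign the sectors, or pull the drive) is the
policy question argued above.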

-- 
                  | Death is only a state of mind. Only
   Mike Loseke    | it doesn't leave you much time to
 mike at verinet.com | think about anything else.