[NCLUG] SCSI disk spares management utility

John L. Bass jbass at dmsd.com
Tue Apr 15 10:15:48 MDT 2003


Sean Reifschneider <jafo-nclug at tummy.com> writes:
>  On Thu, Apr 10, 2003 at 02:13:27PM -0600, John L. Bass wrote:
>  >Have a couple bad sectors on a drive inside a software raid array that
>  >forced reconstruction to fail. Really need a scsi drive level sparing
>  >utility, not a filesystem level.
> 
>  Well, badblocks is at the disc level, not file-system...  The point is
>  that most drives will automatically do the block relocation when they
>  detect a bad block, so reading/writing to the disc a few times will
>  often do the trick...
> 
>  Sean

This is just another case where the industry does the easy thing for
support and quietly leaves the customer with corrupt files. For critical
business applications this is VERY WRONG.  It "may" work fine for most
low-end desktop applications, depending on the customer's data and backups.

In nearly 30 years of writing storage drivers, just about the only case I
have seen where large systems vendors leave it enabled is for low-tech end
users, to minimize service calls and end-user frustration with persistent
bad-block errors.  The catch is that when the drive automatically remaps a
hard read error, there is certainly going to be a corrupted file in the
filesystem that the end user will have to muddle their way through. In
nearly every managed facility, allowing corrupted files to exist without
immediate intervention is probably close to a termination offense for
incompetence. As such, managed facilities turn off auto remapping, manually
isolate and recover the corrupted files, and manually remap the bad sectors
in the process, if and only if the drive is worth trusting in production.
If the drive has reached 80% or more of its life cycle, it's generally
pulled on the spot to avoid the risk of being at the wear-out end of the
bathtub curve.
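
For anyone wondering what "turning off auto remapping" looks like at the
SCSI level, here is a rough sketch. It assumes the standard Read-Write Error
Recovery mode page (page code 01h), where byte 2 carries the AWRE and ARRE
bits (automatic write/read reallocation enable). The function names are
hypothetical, and the code only edits a copy of the page bytes; fetching the
page with MODE SENSE and writing it back with MODE SELECT is left out.

```python
# Sketch only: inspect and clear the auto-reallocation bits in a copy of the
# SCSI Read-Write Error Recovery mode page (page code 0x01).
# Function names are illustrative, not part of any real tool.

PAGE_CODE_RW_ERROR_RECOVERY = 0x01
AWRE = 0x80  # byte 2, bit 7: automatic write reallocation enabled
ARRE = 0x40  # byte 2, bit 6: automatic read reallocation enabled


def auto_realloc_flags(page: bytes) -> dict:
    """Report whether AWRE/ARRE are set in the given mode page bytes."""
    # Byte 0 holds the page code in its low six bits.
    if len(page) < 3 or (page[0] & 0x3F) != PAGE_CODE_RW_ERROR_RECOVERY:
        raise ValueError("not a Read-Write Error Recovery mode page")
    flags = page[2]
    return {"AWRE": bool(flags & AWRE), "ARRE": bool(flags & ARRE)}


def disable_auto_realloc(page: bytes) -> bytes:
    """Return a copy of the page with AWRE and ARRE cleared."""
    auto_realloc_flags(page)  # validates the page, result unused
    out = bytearray(page)
    out[2] &= ~(AWRE | ARRE) & 0xFF  # leave the other recovery bits alone
    return bytes(out)
```

With both bits cleared, the drive reports the hard error instead of quietly
reallocating, which leaves the decision to remap with the operator.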

Besides that, in high-end RAID and JBOD applications, auto sparing is almost
always turned off to prevent filling up the drive's spares table with
"normal" transient errors, which are a natural artifact of the basic bit
error rate (BER) and continuous use. This applies especially to errors
induced by power supply ripple from drive cycling and synchronization, and
to errors from external power line transients.
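
The distinction between transient and hard errors can be sketched as a
simple re-read policy. This is only an illustration under assumptions: the
`read_block` callable and the retry count are hypothetical, and a real tool
would have to bypass caches and talk to the drive directly.

```python
# Sketch only: classify a block by re-reading it a few times, so that
# transient errors are not treated as candidates for sparing.
from typing import Callable


def classify_block(read_block: Callable[[int], bytes], lba: int,
                   retries: int = 3) -> str:
    """Return 'ok', 'transient', or 'hard' for one logical block.

    read_block is a hypothetical caller-supplied function that reads one
    block and raises OSError on a read failure.
    """
    failures = 0
    for _ in range(1 + retries):
        try:
            read_block(lba)
        except OSError:
            failures += 1
            continue
        # The read succeeded; earlier failures mean the error was transient.
        return "ok" if failures == 0 else "transient"
    # Every attempt failed: treat it as a hard error for manual handling.
    return "hard"
```

Only blocks that come back as 'hard' would be candidates for manual file
recovery and remapping; 'transient' ones would be left out of the spares
table.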

John


