[NCLUG] SCSI disk spares management utility

John L. Bass jbass at dmsd.com
Tue Apr 15 14:38:51 MDT 2003


Sean Reifschneider, <jafo at tummy.com> writes:
> For critical business applications, you shouldn't rely on the storage
> sub-system being perfect.  For example, one of our image archive
> solutions includes, effectively, tripwire in it.  The images that
> haven't been archived off to other media are daily checked for
> corruption and an alert is generated if any on-disc data doesn't match
> the appropriate hashes.

Actually, for most critical business applications, without a near-perfect
storage solution you do not have an application at all. If you cannot
trust the data in a machine, you cannot trust the application. The storage
system may well be augmented with external agents to improve trust, but
unless the augmented system can be trusted with a high degree of
confidence, you do not have a critical business solution.

Consider that for tripwire to work, two critical assumptions are made:

	1) The file wasn't corrupted before the first tripwire scan.

	2) The file is archival (i.e., it does not dynamically change).
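
To make the two assumptions concrete, here is a minimal sketch of a
tripwire-style check. It is not Tripwire's actual implementation; the
function names (build_manifest, verify_manifest) and the choice of SHA-256
are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Baseline scan: record a digest for every regular file under root.

    Assumption 1 applies here: whatever is on disk at this moment is
    recorded as "good", whether or not it is already corrupted.
    """
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Later scan: return relative paths whose content changed or vanished.

    Assumption 2 applies here: a legitimate update to a file looks
    exactly like corruption, so only archival files can be checked.
    """
    root = Path(root)
    bad = []
    for rel, digest in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != digest:
            bad.append(rel)
    return bad
```

A daily job would call verify_manifest() against the stored baseline and
raise an alert if the returned list is not empty.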

For dynamic files, e.g. major databases and transaction journals, very
explicit design work in the application system is necessary so that the
system can be validated at any point and storage system faults identified
should they occur. Otherwise the storage system itself needs to be
internally augmented, for example with a complete RAID solution.
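
One way an application can validate dynamic data is to carry a checksum
with every record it writes, so that corruption is caught on read rather
than by comparison against a static baseline. The following is a minimal
sketch under stated assumptions: the length-prefixed record layout and the
names append_record, read_records, and JournalCorruption are hypothetical,
invented here for illustration, and not taken from any particular database.

```python
import struct
import zlib

# Hypothetical record layout: 4-byte big-endian payload length,
# 4-byte big-endian CRC-32 of the payload, then the payload itself.
HEADER = struct.Struct(">II")


class JournalCorruption(Exception):
    """Raised when a record is truncated or fails its checksum."""


def append_record(f, payload: bytes) -> None:
    """Append one checksummed record to an open binary file."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    f.write(payload)


def read_records(f):
    """Yield payloads in order, raising JournalCorruption on a bad record."""
    while True:
        header = f.read(HEADER.size)
        if not header:
            return  # clean end of journal
        if len(header) < HEADER.size:
            raise JournalCorruption("truncated header")
        length, crc = HEADER.unpack(header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) != crc:
            raise JournalCorruption("payload does not match its checksum")
        yield payload
```

Note that a checksum stored alongside the data only detects a fault; it
cannot repair one. Recovery still needs redundant data from somewhere.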

> When I've worked with "critical business applications", we've usually
> just let the read error take the drive off-line, then reconstructed that
> data from the redundant data, relying on the bad block mapping to make
> the disc happy again.  Of course, this would usually be after pumping a
> bunch of data to/from the drive...

Of the critical business servers I am directly aware of, the fraction that
have RAID is too small to measure; most have a single drive, and some have
a dual non-redundant drive configuration. This was particularly true of the
more than 30,000 point of sale systems, for 20 different retail chains, that
I have worked with as a consultant to point of sale system vendors using
Xenix, SCO Unix, Solaris, FreeBSD, and Linux.

A good RAID solution allows one to place a lot of trust in the storage
system, but at the same time its limitations and its actions in the face of
common errors must also be well understood.

> I've never been in, or even heard of, an environment where they specified
> that you could be fired for not doing low-level analysis of a drive
> read failure...  This includes working with healthcare data and billing
> data.

Hmm .... what I precisely said was:

	"I spent 4 years working in Banking and Medical data centers, where data
	corruption was absolutely unacceptable - and failing to follow established
	logging and recovery procedures at any failure or disruption was a termination
	offense."

The real point was that well thought out and documented procedures existed to
avoid known causes of data corruption ... not following them meant mandatory
termination.  That might well include "low-level analysis of a drive read failure"
in some facilities, if the operational procedure required it, but it certainly
was not what I said.

During my stay at United California Bank, there were a lot of other mandatory
termination offenses as well. The two hospital data centers I consulted for
also had mandatory termination clauses in several areas that had clear life
and death issues .... in particular the pharmacy database, partly because of
life and death issues surrounding automated dose management, partly because
of the mandatory audit requirements in handling certain controlled drugs.

Many small businesses can make up for poor documentation of error recovery
procedures by hiring well trained, experienced staff. Most larger businesses
eventually reach a point where critical systems can easily cost $100,000/hr
to have off-line ... at which point employees who don't figure out their
jobs are too expensive to keep around.

Nearly every major point of sale system I have worked on falls in that class:
the losses from having the registers shut down exceed IT staff salaries, even
for small failures. The labor and store closure needed to re-inventory a store
after loss or corruption of the store database cost more than the IT staff
salary. Screwups do not keep their jobs, even without a mandatory policy.

John


