[NCLUG] Re: parallel processing users?

John L. Bass jbass at dmsd.com
Tue Oct 18 12:33:16 MDT 2005


	Chad Perrin writes:
	That's . . . really weird.  Thank you for the information.  I'll
	definitely keep that in mind as I start looking at new systems we'll be
	acquiring in the future.

Intel has been scrambling to clean up this problem in the IA64 implementations,
but it's a tough problem to nail down 100%. The on-chip caches, implemented as
static RAM cells, really suffer in the radiation testing done by/for NASA,
which is just a precursor to understanding the problem at 5,000 ft (here) or
10,000 ft (Leadville/Aspen) ... and nobody has yet explored in detail the ground-level
problems in the trouble spots of eastern South America, the South Atlantic, or the
polar belts. It's 100X worse if you are trying to use your notebook on a flight
at 35,000 ft and expect to actually trust the data in it once you get back to
the office.

In the radiation testing, Intel fares significantly better than AMD processors,
so there is a significant difference in system-level FITs between Intel and AMD clusters
when you add up all the chips at risk. When you adjust for failure rates at 5,000 ft,
modest-sized clusters can expect to see events in time spans ranging from minutes
to a few days, depending on the number of nodes, the amount of memory, and other chips
like FPGAs that are in the system.
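For a rough feel of the scale, the arithmetic is simple: soft-error rates are quoted in
FIT (failures per 10^9 device-hours), and FITs add roughly linearly across the chips in a
cluster. Here is a minimal sketch in Python, with a purely hypothetical per-node number
chosen only for illustration, not a measured rate:

    # Back-of-envelope soft-error arithmetic for a cluster.
    # FIT = expected upsets per 10^9 device-hours; FITs add across devices.
    def mean_hours_between_upsets(fit_per_node, nodes):
        total_fit = fit_per_node * nodes        # aggregate FIT for the cluster
        return 1e9 / total_fit                  # mean hours between events

    # Example: assume (hypothetically) 50,000 FIT per node -- DRAM, caches and
    # FPGAs combined, already scaled up for altitude -- across a 512-node cluster.
    hours = mean_hours_between_upsets(50000, 512)
    print("~%.0f hours between expected soft-error events" % hours)

With those made-up numbers a 512-node cluster sees an event roughly every day and a half;
raise the node count, the memory per node, or the per-node FIT and the interval shrinks
accordingly.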

From the IBM papers from the 1990s referenced earlier, the huge worry, which is only getting
worse, is the risk of undetectable multi-bit errors: current systems with relatively weak
ECC designs can detect at most double-bit errors, and anything larger carries a significant
risk of going undetected.

Now, that is all based on typical background levels, and doesn't even take into
account what happens during particularly bad solar storms/flares, which can increase
the risk in real time by 1-3 orders of magnitude.

So just how valuable is the data in your computer systems? Can you risk non-recoverable
encrypted or compressed data files that are corrupted before ever getting to disk/tape?
Can you risk corrupted production data files for your design or chemical formula
as it's sent out to production? Can you risk your core business assets suffering mild
and cumulative (growing) data corruption over time?

Designing for this really takes some serious thought: first in researching complete
machine designs which have the lowest possible FIT rate as a system, then in knowing
just where the risks are and making your system design failure-aware so it can
tolerate the errors that WILL occur over time. This generally means designing your
core applications around data that carries your own error detection and correction
layered on top of what the hardware does, and then structuring your processing so
that it can be both checked and corrected on the fly, or at any time in the future.
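As a concrete (and deliberately simplified) illustration of that layering, here is a sketch
in Python of a record format that carries its own CRC32 plus a redundant copy, so reads can
be verified and, crudely, corrected on the fly. The record layout and function names are
hypothetical, invented for this example, not taken from the IBM papers or any product:

    import zlib
    from typing import Optional

    def seal(payload: bytes) -> bytes:
        """Append a CRC32 so later corruption is detectable independently of hardware ECC."""
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def check(record: bytes) -> Optional[bytes]:
        """Return the payload if its CRC32 still verifies, otherwise None."""
        payload, stored = record[:-4], record[-4:]
        return payload if zlib.crc32(payload).to_bytes(4, "big") == stored else None

    def read_with_repair(primary: bytes, replica: bytes) -> bytes:
        """Check both copies and hand back whichever one still verifies."""
        for record in (primary, replica):
            payload = check(record)
            if payload is not None:
                return payload
        raise IOError("both copies failed verification -- unrecoverable corruption")

A real design would also re-seal data after every transformation step, so that any
corruption can be pinned to the stage where it happened instead of being discovered
months later.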

High availability simply isn't a reality when implemented poorly with unreliable
subsystems -- i.e., six nines of uptime doesn't mean anything with corrupt data. Using
COTS PCs with non-ECC memory, or poor ECC mismatched to the memory devices and architecture,
in a large "high availability" cluster just means that you now have crash-free
data corruption, unless your application and operating system design specifically
address these problems.
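On the operating system side, one small but useful step is simply watching the memory
controller's corrected/uncorrected error counters, so silent corruption at least leaves a
trail. On Linux kernels that carry the EDAC (formerly bluesmoke) drivers for your chipset,
the counts are exposed through sysfs; the sketch below assumes that layout and should be
checked against your own kernel before you rely on it:

    import glob, os

    def edac_counts():
        """Read corrected (ce) and uncorrected (ue) error counts per memory controller."""
        counts = {}
        for mc in glob.glob("/sys/devices/system/edac/mc/mc*"):
            with open(os.path.join(mc, "ce_count")) as f:
                ce = int(f.read())
            with open(os.path.join(mc, "ue_count")) as f:
                ue = int(f.read())
            counts[os.path.basename(mc)] = (ce, ue)
        return counts

    for mc, (ce, ue) in sorted(edac_counts().items()):
        print("%s: %d corrected, %d uncorrected" % (mc, ce, ue))

A steadily climbing corrected-error count is your early warning that the uncorrectable
(and possibly undetected) ones are coming.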

Linux has a number of faults here. By design it aggressively caches read data, which
on very large memory configurations can stay in DRAM for as long as the machine is
running. Once corrupted (on a non-ECC machine, or via an undetected multi-bit
error in a poor ECC design), the corrupted copy in the cache will seldom, if ever,
be replaced by the good copy on disk. Minimizing the lifetime of objects in memory is one
key way to reduce the risk of running with corrupted data, and this includes
executable objects as well. A true high-availability design would strive to
minimize cache residency and exercise cache validation and scrubbing operations
to maximize the integrity of the system over time.
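As one small, concrete example of minimizing cache residency on Linux, an application can
ask the kernel to drop a file's cached pages once it is done with (or has re-verified) the
data, so the next reader pulls the on-disk copy instead of a long-lived, possibly bit-flipped
DRAM copy. This is only an advisory hint, assumes a POSIX system with Python 3.3+, and is an
illustrative fragment rather than the full validation/scrubbing machinery described above:

    import os

    def drop_page_cache(path):
        """Hint the kernel to evict this file's pages from the page cache (advisory only)."""
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)  # flush any dirty pages first so nothing in flight is lost
            # offset=0, length=0 means "the whole file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        finally:
            os.close(fd)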

Anyway ... reliable system design is more than an art; it's a very detail-oriented
process that leaves NOTHING to chance. There is no easy recipe where you just drop in
some packaged cluster management tools and the problems all go away.

I hope this message -- not politically correct in the Linux cluster community -- gets
through to those who have mission-critical enterprise data. Few, if any,
PCs are your friend here.

John


