[NCLUG] Background data on neutron SEU's and clusters

Mon Oct 17 22:01:43 MDT 2005

For background, start with this, and note the observations on pages 13 & 14:

     www-1.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf

And this excellent paper by Cisco:

     www.cisco.com/warp/public/779/largeent/learn/technologies/ina/IncreasingNetworkAvailability-WhitePaper.pdf

And then read this to start getting a handle on SEUs and clusters:

     www.tezzaron.com/about/papers/Soft%20Errors%201_1%20secure.pdf

Now stop and consider that all of this is just the tip of the iceberg, as many other
VLSI parts also contain RAMs, generally SRAMs, which are likely to have high and
uncharacterized SEU rates. Parts which contain large SRAMs (FPGAs, for example) are
particularly at risk. As a result, it's not only the processor/memory data path that
is at risk of corrupting data, but the storage and network data paths as well. Even
small controllers frequently have SRAMs at the core of the design, holding data with
no parity protection at all.
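
To make "non-parity data" concrete, here is a minimal sketch of what a single
parity bit buys you. The function names and the example byte are made up for
illustration and don't describe any particular controller. It also shows the
limit of plain parity: an odd number of flipped bits is caught, an even number
slips through.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def flip_bits(word: int, *positions: int) -> int:
    """Simulate upsets by inverting the given bit positions."""
    for pos in positions:
        word ^= 1 << pos
    return word


def upset_detected(stored_parity: int, readback: int) -> bool:
    """True if the readback no longer matches the parity saved at write time."""
    return parity_bit(readback) != stored_parity


if __name__ == "__main__":
    word = 0b10110010          # arbitrary example byte (hypothetical data)
    saved = parity_bit(word)   # parity recorded when the word was written
    print("single flip detected:", upset_detected(saved, flip_bits(word, 3)))
    print("double flip detected:", upset_detected(saved, flip_bits(word, 0, 3)))
```

Without even that one extra bit, a flipped bit in an SRAM buffer reads back as
perfectly valid data.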

So, random reboots and data corruption in non-critical applications (informational
web servers) are one thing, but corruption of large data sets, ranging from IC mask
designs to financial data, can cost millions of dollars, and a particularly critical
project failure could put your employer out of business. Even corruption of
distribution binaries and build source backups which are compressed or encrypted can
cause major losses.

For more info, have fun with a web search ... there are lots of hits :)

There is a huge difference between big iron servers, which are designed for a high degree
of data robustness and reliability in the face of expected soft error rates, and COTS
PCs, which frequently have absolutely NO data checking ANYWHERE in critical high risk
data paths. Even traditional (AKA high cost) workstations are built with full robustness
in mind, as they are frequently designed with high end server technology and architectures.
Someone who mistakenly calls a COTS PC a workstation, and proceeds to build a NOW cluster
from COTS parts, has made a potentially critical mistake for the sake of cost savings,
one that WILL corrupt data, possibly mission critical data.

I always let my clients know of the risks when the COTS advocates are proposing
moving mission critical data to an unreliable cluster. But there are some who
just know this cannot be correct, as everybody is doing it. FIT rates just
don't lie in the long run, and it's just a matter of when a large cluster will
lose the SEU data inverse lottery. Dang statistics :)
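
For anyone who wants to run the numbers, here is a rough sketch of the
arithmetic. One FIT is one failure per 10^9 device-hours. Everything else here
is an assumption: the per-node FIT figure is a placeholder, not a measured
rate, the node counts are arbitrary, and upsets are treated as independent
Poisson events. Plug in figures from your own parts.

```python
import math

FIT_DEVICE_HOURS = 1e9      # 1 FIT = 1 failure per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365


def expected_upsets(fit_per_node: float, nodes: int, hours: float) -> float:
    """Expected number of upsets across the whole cluster in `hours`."""
    return fit_per_node * nodes * hours / FIT_DEVICE_HOURS


def prob_at_least_one(fit_per_node: float, nodes: int, hours: float) -> float:
    """Chance of one or more upsets, assuming independent Poisson arrivals."""
    return 1.0 - math.exp(-expected_upsets(fit_per_node, nodes, hours))


if __name__ == "__main__":
    assumed_fit = 5000.0    # HYPOTHETICAL per-node soft error rate, not measured
    for nodes in (1, 64, 1024):
        n = expected_upsets(assumed_fit, nodes, HOURS_PER_YEAR)
        p = prob_at_least_one(assumed_fit, nodes, HOURS_PER_YEAR)
        print(f"{nodes:5d} nodes: {n:8.3f} expected upsets/year, P(>=1) = {p:.3f}")
```

The expected count scales linearly with node count, which is the whole point:
a rate that looks negligible for one box stops being negligible for a large
cluster. Not every upset lands on live data, so treat this as a bound on
exposure rather than a prediction of corrupted files.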

have fun,
John