[NCLUG] HA clustering software input?

Mon Jun 23 16:31:23 MDT 2003

On Mon, Jun 23, 2003 at 02:35:38PM -0700, Erich wrote:
>Hello - I would like  to cluster a pair of servers which
>host a postgresql database on top of KRUD-8.  Having taken
>a quick look, it seems that SGI has a developed product
>called failsafe.  Commercial software is not out of the
>question.

I'm not familiar with the linux-failsafe stuff (despite the fact that we
host the mailing lists ;-).  I have really only used Alan's heartbeat
stuff and DRBD as a cheap way of doing shared storage (RAID-1 across
drives in different machines).  Both of these are available from the
linux-ha.org web-site.

Before working with Heartbeat, I had used HP's (purchased) technology
called "ServiceGuard/UX".  Heartbeat is incredibly similar, despite
Alan's insistence that he had never seen it before starting Heartbeat.

One thing I will say is that it is fairly easy to set up an HA cluster
that has one or more weak points which negate your attempts at
increasing availability, or even decrease the reliability of the
cluster.  For example, without redundant communication paths between the
nodes and STONITH (a way for one node to forcibly power off the other),
you can get unnecessary fail-overs or data corruption.
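
To make that concrete, here is a rough Python sketch of the take-over
decision.  It is not Heartbeat's actual code: the function names, the
30-second dead time, and the fence_peer callback are all invented for
illustration.  The idea is just that the peer is only declared dead when
every heartbeat path has gone quiet, and the resources are only taken
over once the peer has been confirmed powered off.

```python
def peer_is_dead(last_heard, now, deadtime=30.0):
    """Return True only if EVERY heartbeat path has been silent too long.

    last_heard maps a path name (hypothetical, e.g. "eth1" or "ttyS0")
    to the timestamp of the last heartbeat received on that path.
    """
    if not last_heard:
        raise ValueError("need at least one heartbeat path")
    return all(now - t > deadtime for t in last_heard.values())


def should_take_over(last_heard, now, fence_peer, deadtime=30.0):
    """Decide whether this node may take over the peer's resources.

    fence_peer is a hypothetical callback that powers off the peer
    (STONITH) and returns True only if the power-off was confirmed.
    """
    if not peer_is_dead(last_heard, now, deadtime):
        # At least one path still hears the peer: do not fail over.
        return False
    # Never touch the shared data until the peer is known to be off.
    return bool(fence_peer())
```

With only one path in last_heard, a single unplugged cable looks exactly
like a dead peer, which is where the unnecessary fail-overs come from.
Skipping the fence_peer step is where the data corruption comes from,
since both machines may end up writing to the same data.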

HP used to have a pretty good book on everything that's involved in
setting up a reliable cluster, but AFAIK it was only available to people
on the HP technical publications list.  That was back in '95 though, and
I don't know if it's still published, even for HP customers.  It was
around 300 pages, and I can't recall the title at all.

One of the typical methods of doing HA clustering involves keeping all
the HA data on its own dedicated mount-point; on fail-over the surviving
machine mounts it and starts the applications.  This can lead to a
maintenance nightmare and problems during fail-over, because users must
be created on both machines, updates must be installed in both places,
and you can't just throw together a crontab -- you have to install it on
both machines and set up a wrapper script that detects whether it is
running on the machine that has acquired the app.
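
A minimal sketch of such a wrapper, in Python, might look like the
following.  It assumes that "has acquired the app" can be detected by
checking whether the shared mount-point is currently mounted on this
machine; that test, and the names used here, are my assumptions for the
example rather than anything Heartbeat provides.

```python
import os
import subprocess
import sys


def node_is_active(mount_point):
    """True if the shared HA mount-point is mounted on this machine."""
    return os.path.ismount(mount_point)


def run_if_active(mount_point, command):
    """Run command (a list of arguments) only on the active node.

    Returns the command's exit status, or None if it was skipped.
    """
    if not node_is_active(mount_point):
        return None
    return subprocess.call(command)


def main(argv):
    # argv: wrapper name, mount-point, then the real command and its args
    if len(argv) < 3:
        sys.stderr.write("usage: %s MOUNTPOINT COMMAND [ARGS...]\n" % argv[0])
        return 2
    status = run_if_active(argv[1], argv[2:])
    # Skipping on the standby node is not an error, so report success.
    return 0 if status is None else status

# When saved as a script, finish with: sys.exit(main(sys.argv))
```

The same crontab entry then goes on both machines, with the wrapper, the
mount-point, and the real job as its arguments (the paths are whatever
you use locally).  On the standby machine the job quietly does nothing.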

I have recently built up a procedure that allows HA clusters to be set
up such that shared data is a full bootable system, and a fail-over
involves (effectively) booting into the system.  This works well because
the "failed over" system can almost entirely be maintained as if it were
a stand-alone system...

Of course, that setup is much more complicated initially to get going,
but the ongoing maintenance is much easier.  The dedicated mount-point
approach is much easier to set up, but harder to maintain...

Those are some of the pointers I have on the subject.  I will point out
that Alan (the author of the Heartbeat code) works for IBM, and they
have a consulting arm.  I'll also point out that tummy.com does
consulting.  :-)

Enjoy,
Sean
-- 
 "Fixing Unix is easier than living with NT."  -- Jonathan Gilpin
Sean Reifschneider, Member of Technical Staff <jafo at tummy.com>
tummy.com, ltd. - Linux Consulting since 1995.  Qmail, Python, SysAdmin
      Back off man. I'm a scientist.   http://HackingSociety.org/