[NCLUG] sw raid, recovery after install

Bob Proulx bob at proulx.com
Sat Jan 19 18:54:05 MST 2013


Sean Reifschneider wrote:
> ... 32h27m ...

I think you might be running into the problem of just having a huge
amount of data to move.  Disks have been getting larger faster than
buses have been getting faster.

> 4.25TB, 67% utilized, no dedup.  The home system is ~14x 500GB drives, the
> backup systems are 8x 2TB drives.  That may have an impact.

Having the extra drive mechanisms should give you much better
performance than having everything bundled onto one mechanism.
Parallel head operations and all of that.

> Bob Proulx wrote:
> > I was very clear that it was a "preference" and not a hard black and
> 
> I guess my concern was that someone who didn't fully understand the
> nuances of this decision would decide it was the one true way.
> There are things that the more experienced people may do, that I
> would recommend people in general stay away from...

It is really no different than having a 2x 250G system and then
upgrading them to 2x 500G disks and adding in the extra space as
additional PVs to the lvm volume group.  You end up with the same
thing by evolution that I would put into place by design.

The cool thing about RAID is that I can do a disk space increase in
place on the current system.  Just replace one disk at a time and sync
and walk the data over to the new disks in place.  Then add in the
extra space.  Then expand the filesystems to use the extra space.
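A rough sketch of that sequence with mdadm, assuming a two-disk RAID-1
(md0 on sda1/sdb1) sitting under LVM; all device, volume group, and
filesystem names here are placeholder assumptions, and the commands
are only echoed rather than executed:

```shell
# Hypothetical in-place capacity upgrade of a two-disk RAID-1.
# All device/VG/LV names are assumptions for illustration.
# Commands are echoed (dry run); swap run() for "$@" to do it live.
CMDS=
run() { CMDS="${CMDS}$*;"; echo "+ $*"; }

# Replace the first disk and let the array resync onto it.
run mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
run mdadm /dev/md0 --add /dev/sda1    # new, larger disk, partitioned first
# ...wait for the resync to finish: watch cat /proc/mdstat

# Same again for the second disk, then claim the new space.
run mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
run mdadm /dev/md0 --add /dev/sdb1
run mdadm --grow /dev/md0 --size=max  # expand the array onto the new disks

# Push the growth up through LVM and the filesystem.
run pvresize /dev/md0
run lvextend -l +100%FREE /dev/vg0/home
run resize2fs /dev/vg0/home
```

The reboots Bob mentions would slot in between the replace steps.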

Because it is software raid I always reboot at certain points; it
makes me feel better.  But in theory it is possible to do this upgrade
by hot swapping, without ever taking the machine offline.  It hasn't
been worth it to me to work through the non-reboot process, and a
reboot at that point is quick.  This is the low-tier-reliability
software raid price point anyway.  If you need 99.999% five-nines
uptime then you really want a hardware controller system.

> > I will note that it is interesting to me that the bug Mike reported
> > referenced the reboot at the end of an install that might interrupt a
> > long running sync and restart it causing the process to restart back
> > at the beginning again.  That sentiment plays right into my thinking.
> 
> Well, personally, I'm not that concerned about losing a couple of minutes
> RAID build time.  :-)  Our typical server install takes around 2, maybe 2.5
> minutes to do.  Desktop, especially Ubuntu, can take longer, but gosh...  I
> honestly don't, while installing or using the newly installed system,
> notice the background RAID array.  Either in software RAID or on hardware
> controllers.

It is probably ten minutes or so for me from install to finish.  It
really depends upon how large of a system I am installing.  The
desktop can take ten minutes just by itself.  But I never install a
desktop on a server machine.

But I agree that the few minutes of raid sync'ing there don't matter.
Just reboot and restart the sync.  It isn't enough of an investment to
worry about.

But the case I was talking about is usually because I have been
tinkering around trying to do something and finally get to the point
where I want to reboot it some hours later.  Hate losing that 51% of
the work already done.  Perhaps I just need more patience.
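For what it's worth, an internal write-intent bitmap limits how much
of that work gets thrown away: after an unclean stop, md only resyncs
the regions marked dirty instead of walking the whole array again.  A
dry-run sketch, with the device name assumed:

```shell
# Adding a write-intent bitmap so an interrupted sync does not have
# to restart from zero; /dev/md0 is a placeholder device name.
# Commands are echoed (dry run) rather than executed.
CMDS=
run() { CMDS="${CMDS}$*;"; echo "+ $*"; }

run mdadm --grow /dev/md0 --bitmap=internal
run cat /proc/mdstat    # look for a "bitmap:" line in the output
```

The bitmap costs a little write performance, which is the usual
trade-off against full resyncs.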

> >> My primary worry here is that if a drive fails and you go to replace
> >> it by pulling the drive and putting a new one in, you'd better make
> >> sure that the failed slices among all the arrays are on the same
> >> drive.
> > 
> > Well...  If the drive hard fails it usually isn't a problem because
> 
> I've definitely seen cases where the drive only fails to a certain
> extent.  This week.  :-) Software RAID sometimes doesn't kill the
> drive fully, the system basically hangs until the drive recovers.
> 
> This system hung for a few minutes as the drive was generating
> errors.  It then recovered and the drive has been able to do 2 full
> array rebuilds.  This is a RAID-1.

It takes the Linux kernel SATA driver something like two minutes or so
to detect a disk failure even if it is a hard failure such as yanking
a cable out.  And during that time the system is not happy.  Things
are usually blocked waiting behind that stuck drive.  And then when it
finally times out everything breaks free, runs, and catches up.
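That couple of minutes is roughly the per-command SCSI timeout
multiplied by the retries the block layer makes before giving up on
the device.  A back-of-envelope, with the retry count assumed purely
for illustration:

```shell
# Rough arithmetic behind the multi-minute stall on a dying disk.
# 30s is the usual default in /sys/block/sdX/device/timeout; the
# retry count here is an assumption for illustration only.
timeout_s=30
retries=4
stall_s=$((timeout_s * retries))
echo "worst-case stall: ${stall_s}s"    # on the order of two minutes
```

Lowering that sysfs timeout is one knob for making a sick drive get
ejected faster, at the cost of being less forgiving of slow recoveries.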

> I suspect that if the array was split up as you were proposing, that
> one of the sub-components would have turned up as failed, but the
> others may have been fine.  Again, speculation...  But if you didn't
> notice it and then later did a replacement of another drive, it
> could be surprising...  You're expecting a simple drive replacement
> and instead have to do a recovery from backups.

(me makes a face) Well... This means that there is definitely
something wrong, enough that someone is going to replace the drive,
but they don't look to see which partitions are which and on which
drive?  Please tell me that anyone replacing the disk would look to
see which disk to replace!  Otherwise I will lose all of my faith in
humanity.

This is where the enterprise grade hardware raid disk controllers such
as the HP Compaq SmartArray controllers are really nice.  LED
indicator lights on the drive.  All green means all good.  A drive
fails and it displays a red light.  Grab the handle and pull it out
and replace it with a replacement disk.  The controller detects this
and automatically syncs as needed.  The light turns green.  The host
operating system using it never sees a failure and keeps running.  No
keyboard or display needed.  Just LEDs on the drive caddies and someone
to walk through the data center looking for red lights every day.  It
is pretty foolproof when used that way.

> Note that in this case, the SMART self-tests are showing fine, but smart is
> showing some errors on the drive.

I *hate* that!

> > The majority of soft failures I have seen have been associated with
> > power failures.  If the machine drops out the status of the arrays is
> 
> Unlikely to be the case in the environment I'm referring to, our server
> space is pretty solid on the power side of things...  It requires all 3
> generators and UPSs that power the two transfer switches to fail before a
> cabinet loses power.  :-)  A+B redundant feeds with no more than one
> device shared among circuits...

Nice.

> > one of them will have a problem, at any given problem opportunity
> > point.  A twin engine airplane is twice as likely to have an engine
> > problem than a single engine airplane.
> 
> Agreed.  But you still want two of them when you're over the ocean.  :-)

If I am over the ocean then I want a seaplane.  Then if I got into
real trouble I could taxi the rest of the way.  :-)

> > 1. Not regularly running the consistency checks.  Being extremely
> > stale and this very long stale time being a consistency problem.
> 
> I meant you are relying on manually checking /proc/mdstat, not that you are
> running a RAID verify operation.  All too many people don't set up
> automatic array checking and alerting when they set up a RAID array.  Sad
> but true...

Oh...  I was assuming that the notification of a failure was a given
with any type of raid.  It didn't occur to me that someone would set
up a raid but then not receive failure notifications from it.
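For anyone setting this up, the notification side of md is a one-line
config plus the monitor daemon most distributions already run for you;
the mail address and the file path below are placeholders, and the
real config lives at /etc/mdadm/mdadm.conf on Debian-ish systems:

```shell
# Minimal mdadm alerting sketch; address and path are placeholders,
# not the live config file.
conf=/tmp/mdadm.conf.example
printf 'MAILADDR root@example.com\n' > "$conf"

# One-shot test alert for every array, to prove mail actually
# arrives before a real failure does.  Echoed here as a dry run.
CMDS=
run() { CMDS="${CMDS}$*;"; echo "+ $*"; }
run mdadm --monitor --scan --oneshot --test
```

The --test flag makes mdadm send a test message per array, which is
the easy way to verify the whole alert path end to end.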

> > 2. Having a split due to some soft failure with some partitions on one
> > device and some on the other device.
> 
> Sure, I wouldn't be surprised to see that happen in the power failure case
> as you mentioned.
> 
> > But only with soft failures such as due to a power drop on a running
> > machine.  Since it is not a hardware failure I will simply re-add
> > the opposite device back and let it sync.  That is no different from
> 
> I also consider drive read failures for whatever reason and these "lockups"
> like I saw on the machine this week to be soft failures too.

I think that falls into the category of "stuff happens".  Cosmic rays
from space.  Poltergeists.  Gremlins.  We have all seen odd
unexplained things happen.

> In our production environment we almost always consider this a "warning
> sign" and replace the drive.  Since we've been running burn-in testing
> before deploying drives, the number of drives falling out of arrays has
> dropped dramatically.

If you don't mind saying, how long do you burn drives in?

> I think some of them are simply marginal sectors, and exercising
> them causes the drive to remap them.  Cull the weak ones out...  :-)

Whips.  Massive whips!  :-)

> > when there is only one partition per disk and it is not a hardware
> > failure.  If there is only one partition it is the same thing and the
> > same action is needed.  So I don't see this as any significant
> > difference at all.
> 
> I'm probably not explaining this very well here, but I think there is an
> opportunity for one partition to fail and not be noticed until another
> failure happens and you think it's one drive before realizing that one of
> the 10 slices of this array that is failed is on another drive.

And I am assuming that there will be a notification seen when the
first slice is ejected.

The problem isn't when we have been maintaining a system from birth
through middle age and it is starting to get geriatric with old age
problems.  The problem is when someone says that they have a system
that has been running fine but that now has fallen down and hurt
itself.  Can you please come over and take a look at it?  It is like
an ER situation then and anything might be possible.

> > I didn't read the discussion that way.  For me a raid failure is
> > mostly a yawn, need to replace the disk, affair.
> 
> We very much agree on that case.  But I've seen so many people in lists
> like this say "My array failed and I have to recover the data off it
> because I have no backups."

I am sorry for their loss.

> > likely will have mismatches.  Those mismatches are reported during the
> > check.  This looks concerning.  But they are unavoidable and normal
> 
> The more important thing to me is the exercising of both drives, and
> recognizing any that, if a rebuild were necessary couldn't be read, as
> early as possible, hopefully while the other drive still works.

Yes.  And a good backup _should_ exercise this just by the nature of
needing to read the data for backup.
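The explicit version of that exercising is md's check action, which is
what Debian's monthly checkarray cron job triggers; the array name is
a placeholder and the commands are echoed rather than executed:

```shell
# Kicking off a manual consistency check on md0 (a placeholder name)
# and reading the mismatch count afterward.  Echoed as a dry run.
CMDS=
run() { CMDS="${CMDS}$*;"; echo "+ $*"; }

run sh -c 'echo check > /sys/block/md0/md/sync_action'
# After the check finishes, a nonzero count is worth investigating,
# though as noted above some mismatches are normal on swap/RAID-1:
run cat /sys/block/md0/md/mismatch_cnt
```

Progress of a running check shows up in /proc/mdstat just like a
resync does.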

> We had a client that had a Windows system in their data center that was
> degraded.  This machine had been badly mismanaged before we got involved,
> and no array validation or even monitoring was done.  Oh, and it wasn't
> backed up.  But it was critical for a client of theirs, and so I proposed
> that during our maintenance window we replace the failed drive.
> 
> Turned out that the other drives had read errors in some places, but those
> places were parts of the OS that weren't actively used.  But they had data
> on them, so a file-system level backup would fail as well.
> 
> What a nightmare.

Life in the ER.  Get the crash cart!

> > Fourteen drives!  I think your electric bill is probably like mine.  I
> > have electric heat in my gas furnace house.  But most "casual" admins
> > don't have that many local storage devices to manage.
> 
> Yeah, yeah, I know.  Not very green...  3 4TB 5400RPM drives would be way
> better...  Not sure how I live with myself.  :-)

But then you would have data bottleneck problems pushing that much
data through fewer mechanisms.

> > I know you have moved away from BackupPC but one of the interesting
> > features it had was a configurable variable that even if a file were
> > up to date in the backup and unchanged that at some low random
> > percentage of the time it would decide it needed to read it again to
> 
> Yep.  I've done that before and after BackupPC, using a "--checksum" option
> to rsync, something like:
> 
>    [ "$[RANDOM%30]" == 0 ] && CHECKSUM=--checksum
>    rsync $CHECKSUM [...]

I like it!

But that is bash/ksh not python?  (shock!)  [Since $RANDOM and "=="
are both ksh/bashisms. :-) ]
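For the POSIX purists, the same 1-in-30 trick can be done without
$RANDOM by drawing the number from awk; the source and destination
paths below are placeholders, and rsync is echoed rather than run:

```shell
# Portable variant of Sean's trick: roughly one run in thirty adds
# --checksum so rsync re-reads and verifies even "unchanged" files.
# awk supplies the random draw since $RANDOM is a bash/ksh-ism.
CHECKSUM=
if [ "$(awk 'BEGIN { srand(); print int(rand() * 30) }')" -eq 0 ]; then
    CHECKSUM=--checksum
fi
# /src/ and /backup/dst/ are placeholder paths; echo = dry run.
echo rsync -a $CHECKSUM /src/ /backup/dst/
```

Leaving $CHECKSUM unquoted is deliberate so that an empty value
expands to nothing rather than an empty argument.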

Bob



More information about the NCLUG mailing list