[NCLUG] sw raid, recovery after install

Sean Reifschneider jafo at tummy.com
Wed Jan 16 23:01:25 MST 2013


On 01/16/2013 08:19 PM, Bob Proulx wrote:
> otherwise it is too easy to put too much weight into 0.00001% or even
> get things backward.  I haven't been able to benchmark it yet.

I agree with you: show me the numbers.  Without the numbers, talk of
performance is unreliable to anything finer than an order of magnitude or
two.

> I don't know.  Does anyone program networking using the official ISO
> 7 layer model?  Just because it makes better documentation for humans

I'm generally considered to have a pretty good grasp of networking, and I
am pretty dismissive of the 7-layer model in general.  I can't quote it,
and the only layers I encounter are 2, 3, and 7, mostly in discussions of
networking gear features.  "Does this switch include routing
functionality?"  "Are you load-balancing based on TCP headers or payload
data?"  That's when I use it most.

> That is all well and good on a new filesystem or rebuilding an empty
> filesystem.  But I must assume that a 95% full and large filesystem
> would still take quite a while to sync when replacing a failed disk.

I wasn't trying to imply that it was magically fast on everything, just to
demonstrate that it understood what data needed to be rebuilt and what was
irrelevant...

It's a bit worse than that, I'm afraid.  On a 95% full array, ZFS is
almost certainly slower than a normal RAID-1 or RAID-5 rebuild, because
instead of just streaming the data across, it has to walk the metadata
tree and reconstruct things.  It's basically a massive set of database
operations, depending on how the data was written.

So, usually when you start a verify or rebuild operation you can see it run
very slowly, maybe 5MB/sec, for some period of time, then it will jump up
as it starts transferring data rather than poking around the database...
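If you want to watch that transition yourself, something like this works
(a sketch; "tank" is a hypothetical pool name):

```shell
# Start a scrub, then poll its rate: on a fragmented pool the scan rate
# typically starts low and jumps once it's past the metadata-heavy phase.
zpool scrub tank
zpool status tank | grep scrub
```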

Usually this is not so bad, and well worth it for what ZFS gives you.  But,
I have one system that seems to be quite fragmented or something:

   root@backup1:~# zpool status | grep scrub
    scrub: scrub in progress for 37h19m, 20.79% done, 142h11m to go

That's on "only" 7.5TB of data, 50% utilized.

Its sister system is much happier:

   root@backup2:~# zpool status | grep scrub
    scrub: scrub completed after 32h27m with 0 errors on Wed Jan 16 16:27:58 2013

That's 9.6TB used, 64% utilized, also seems to have a fair bit of deduped
data.

My home storage system is:

    scrub: scrub completed after 5h35m with 0 errors on Sun Jan 13 11:35:15 2013

4.25TB, 67% utilized, no dedup.  The home system is ~14x 500GB drives, the
backup systems are 8x 2TB drives.  That may have an impact.

The rebuild rates are all over the map...

> Same thing for RAID-1 too.  Because the raid layer doesn't know
> anything about the filesystem layer it must just get the raw data in
> sync between the disks.

Indeed, that is true.  It just "feels worse" under RAID-5 because there is
all that parity data that should be correct, and software RAID-1 under
Linux never verifies fully anyway (as I mentioned in my previous
message)...

> I am deeply flattered.  But I am also often wrong.  I am not going to

Yeah, but you know that you are sometimes wrong, which is valuable.  :-)

> I was very clear that it was a "preference" and not a hard black and

I guess my concern was that someone who didn't fully understand the nuances
of this decision would decide it was the one true way.  There are things
that more experienced people may do that I would recommend people in
general stay away from...

> I will note that it is interesting to me that the bug Mike reported
> referenced the reboot at the end of an install that might interrupt a
> long running sync and restart it causing the process to restart back
> at the beginning again.  That sentiment plays right into my thinking.

Well, personally, I'm not that concerned about losing a couple of minutes
of RAID build time.  :-)  Our typical server install takes around 2, maybe
2.5 minutes.  Desktop, especially Ubuntu, can take longer, but gosh...
While installing or using the newly installed system, I honestly don't
notice the background RAID rebuild, either with software RAID or on
hardware controllers.

>> My primary worry here is that if a drive fails and you go to replace
>> it by pulling the drive and putting a new one in, you'd better make
>> sure that the failed slices among all the arrays are on the same
>> drive.
> 
> Well...  If the drive hard fails it usually isn't a problem because

I've definitely seen cases where the drive only fails to a certain extent.
This week.  :-)  Software RAID sometimes doesn't kick the drive out fully;
instead the system basically hangs until the drive recovers.

This system hung for a few minutes as the drive was generating errors.  It
then recovered and the drive has been able to do 2 full array rebuilds.
This is a RAID-1.

I suspect that if the array were split up as you were proposing, one of
the sub-components would have turned up as failed, but the others may have
been fine.  Again, speculation...  But if you didn't notice it and later
did a replacement of another drive, it could be surprising...  You're
expecting a simple drive replacement and instead have to do a recovery from
backups.

Note that in this case, the SMART self-tests pass, but SMART is still
logging some errors on the drive.
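The two views I mean, sketched with smartmontools and a hypothetical drive
at /dev/sda:

```shell
smartctl -l selftest /dev/sda   # self-test log: can show every test passing
smartctl -l error /dev/sda      # ATA error log: may still contain errors
smartctl -A /dev/sda            # attributes, e.g. Reallocated_Sector_Ct
```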

> The majority of soft failures I have seen have been associated with
> power failures.  If the machine drops out the status of the arrays is

Unlikely to be the case in the environment I'm referring to, our server
space is pretty solid on the power side of things...  It requires all 3
generators and UPSs that power the two transfer switches to fail before a
cabinet loses power.  :-)  A+B redundant feeds with no more than one
device shared among circuits...

> one of them will have a problem, at any given problem opportunity
> point.  A twin engine airplane is twice as likely to have an engine
> problem than a single engine airplane.

Agreed.  But you still want two of them when you're over the ocean.  :-)

> 1. Not regularly running the consistency checks.  Being extremely
> stale and this very long stale time being a consistency problem.

I meant relying on manually checking /proc/mdstat, not running a RAID
verify operation.  All too many people don't set up automatic array
checking and alerting when they set up a RAID array.  Sad but true...
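For what it's worth, a minimal sketch of what that automatic checking could
look like (my own illustration, not tummy.com's tooling; "mdadm --monitor"
with MAILADDR set in mdadm.conf is the standard mechanism):

```shell
#!/bin/sh
# md_degraded FILE: succeed (exit 0) if any array in an mdstat-format FILE
# shows a missing member, i.e. an "_" inside its "[UU...]" status field.
md_degraded() {
    grep -q '\[U*_[U_]*\]' "$1"
}

# On a real system you'd run this from cron and mail/alert on it; the
# check is guarded so the script is harmless on machines without md.
if [ -r /proc/mdstat ] && md_degraded /proc/mdstat; then
    echo "md array degraded on $(hostname)" >&2
fi
```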

But RAID verifies are also extremely important; we schedule them at least
once a month.
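Under Linux md that verify is just a sysfs write (sketch; "md0" is a
hypothetical array name, and Debian's mdadm package already ships a
monthly "checkarray" cron job that does essentially this):

```shell
echo check > /sys/block/md0/md/sync_action   # start a full read/compare pass
cat /proc/mdstat                             # progress shows up as "check"
cat /sys/block/md0/md/mismatch_cnt           # mismatches found so far
```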

> 2. Having a split due to some soft failure with some partitions on one
> device and some on the other device.

Sure, I wouldn't be surprised to see that happen in the power failure case
as you mentioned.

> But only with soft failures such as due to a power drop on a running
> machine.  Since it is not a hardware failure I will simply re-added
> the opposite device back and let it sync.  That is no different from

I also consider drive read failures, for whatever reason, and "lockups"
like the one I saw on the machine this week to be soft failures too.

In our production environment we almost always consider this a "warning
sign" and replace the drive.  Since we've been running burn-in testing
before deploying drives, the number of drives falling out of arrays has
dropped dramatically.  I think some of them are simply marginal sectors,
and exercising them causes the drive to remap them.  Cull the weak ones
out...  :-)

> when there is only one partition per disk and it is not a hardware
> failure.  If there is only one partition it is the same thing and the
> same action is needed.  So I don't see this as any significant
> difference at all.

I'm probably not explaining this very well, but I think there is an
opportunity for one partition to fail and not be noticed until another
failure happens, and you think it's all one drive before realizing that
one of the 10 slices of the failed array is on another drive.

> I didn't read the discussion that way.  For me a raid failure is
> mostly a yawn, need to replace the disk, affair.

We very much agree on that case.  But I've seen so many people in lists
like this say "My array failed and I have to recover the data off it
because I have no backups."

> any security upgrades in all of that time.  There was a backup.  But
> it needed to have FC9 installed and nothing more recent.  I am not
> really set up to install every random out of date system and so that

Usually I'll just boot to rescue and then stream the data in over the
network from the backup or similar.  The tricky situation, though, is when
you don't have rescue media for the exact distro on the system: you pick
something close, and its "mke2fs" has set some option on the file-system
that the OS you are streaming in doesn't recognize, so it won't mount it.
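One way to see that mismatch coming, sketched with e2fsprogs and a
hypothetical /dev/sdb1:

```shell
# Compare the filesystem's feature flags against what the target kernel
# understands; unknown "incompat" features make the mount fail outright.
dumpe2fs -h /dev/sdb1 | grep -i features
```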

> likely will have mismatches.  Those mismatches are reported during the
> check.  This looks concerning.  But they are unavoidable and normal

The more important thing to me is exercising both drives, and recognizing
as early as possible any drive that couldn't be read if a rebuild were
necessary, hopefully while the other drive still works.

We had a client that had a Windows system in their data center that was
degraded.  This machine had been badly mismanaged before we got involved,
and no array validation or even monitoring was done.  Oh, and it wasn't
backed up.  But it was critical for a client of theirs, and so I proposed
that during our maintenance window we replace the failed drive.

Turned out that the other drives had read errors in some places, but those
places were parts of the OS that weren't actively used.  They still had
data on them, though, so a file-system-level backup would fail as well.

What a nightmare.

> Fourteen drives!  I think your electric bill is probably like mine.  I
> have electric heat in my gas furnace house.  But most "casual" admins
> don't have that many local storage devices to manage.

Yeah, yeah, I know.  Not very green...  Three 4TB 5400RPM drives would be
way better...  Not sure how I live with myself.  :-)

> I know you have moved away from BackupPC but one of the interesting
> features it had was a configurable variable that even if a file were
> up to date in the backup and unchanged that at some low random
> percentage of the time it would decide it needed to read it again to

Yep.  I've done that before and after BackupPC, using a "--checksum" option
to rsync, something like:

   CHECKSUM=
   [ "$((RANDOM % 30))" -eq 0 ] && CHECKSUM=--checksum
   rsync $CHECKSUM [...]

Sean


