[NCLUG] sw raid, recovery after install

Bob Proulx bob at proulx.com
Wed Jan 16 20:19:21 MST 2013


Sean Reifschneider wrote:
> Just to expand a bit on what was said before...

And it was a long posting!  Good information.

> Linux RAID works at the block level.  It has no understanding of what parts
> of the drive are used.  In order to be able to do a future verify
> operation, which I would recommend, the checksums for every stripe of the
> array needs to be calculated.

Or instead of recomputing checksums across the whole array it can use
a scoreboard.  I think that is what the mdadm write-intent bitmap
feature provides.  I haven't seen any performance issues with the
bitmap feature, although in a different discussion someone griped to
me that it lowered performance.  I always like to see benchmark data
when it comes to performance issues because otherwise it is too easy
to put too much weight on a 0.00001% difference or even get things
backward.  I haven't been able to benchmark it yet.
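
For anyone who wants to measure it, the write-intent bitmap can be
added to and removed from an existing array on the fly, which makes
before-and-after benchmarking easy.  A minimal sketch, with /dev/md0
standing in for whatever array you have:

    # add an internal write-intent bitmap to a running array
    mdadm --grow --bitmap=internal /dev/md0

    # remove it again if the numbers say it hurts
    mdadm --grow --bitmap=none /dev/md0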

> ZFS, in contrast, combines the RAID and file-system such that it knows what
> blocks are used and only computes the checksums and error correction data
> on used blocks.
> ...
> Yes, ZFS is a "rampant layering violation", but this isn't just to piss off
> the Linux kernel developers, there's a good reason for it.

Instead of being two different layers that don't know about each other
it is one layer that has both features.  So it can cross the barriers
and do the right thing with the combined knowledge of both.  That is
spiffy.

I don't know.  Does anyone actually program networking strictly along
the official ISO/OSI seven-layer model?  Just because it makes better
documentation for humans to understand doesn't make it a better
programming model.  There needs to be some judgement there too.

> So creating a new ZFS takes very little time and is immediately
> consistent.  And better, a RAID rebuild or verify on a freshly
> created ZFS takes almost no time at all.

That is all well and good on a new filesystem or when rebuilding an
empty one.  But I must assume that a large filesystem that is 95% full
would still take quite a while to resync when replacing a failed disk.
There is no getting around the need to copy the data between the
physical devices in order to restore the mirror.
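
For what it is worth, the ZFS side of that is a resilver that only
walks the allocated blocks.  A rough sketch, with the pool and device
names made up for illustration:

    # swap the failed disk for the new one and watch the resilver
    zpool replace tank sdb sdc
    zpool status tank

On a nearly full pool the resilver still has to copy nearly the whole
disk, so the win is mostly on empty or lightly used pools.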

> But, back to the md RAID-5, this is why it does a rebuild on a freshly
> created array.  It takes every stripe, looks at what is on the data discs
> and computes and writes the checksum/error correcting stripe...  If you
> used the data that was already on the discs, a future validate or rebuild
> operation would fail on any stripes that hadn't been written to yet...

Same thing for RAID-1 too.  Because the raid layer doesn't know
anything about the filesystem layer it must just get the raw data in
sync between the disks.
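
That initial sync is visible just like any other rebuild.  A quick way
to watch it, assuming the usual /proc interface:

    # progress, speed, and estimated finish time for every md array
    cat /proc/mdstat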

> Now, I'll prefix this next bit by saying that I have the utmost respect for
> Bob.

I am deeply flattered.  But I am also often wrong.  I am not going to
have my feelings hurt if someone disagrees with me.  Through the
discussion there will be learning and I try to always be learning
something. :-)

Let me say that I also have great respect for Sean and his extensive
real world experience with a lot of data and servers.

> But, I don't agree with the idea of splitting a drive up into small chunks
> and making the array on them.

I was very clear that it was a "preference" and not a hard black and
white answer.  (Although depending upon how strongly you feel you
might think it is strongly black or white.)  There isn't anything
wrong with putting the entire disk all together in one chunk.  I just
listed that it was my preference not to do so.

I will note that it is interesting to me that the bug Mike reported
mentioned the reboot at the end of an install interrupting a
long-running sync and forcing it to start over from the beginning.
That sentiment plays right into my thinking.  I just decided to split
the problem up into smaller chunks so as to lose less at any one time,
as sketched below.
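
As a sketch of the layout I mean, with partition and md names that are
only illustrative, it is several small RAID-1 arrays over matched
partition pairs rather than one big one:

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

If a long sync gets interrupted then only the one array that was
mid-sync starts over, not the whole disk.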

> My primary worry here is that if a drive fails and you go to replace
> it by pulling the drive and putting a new one in, you'd better make
> sure that the failed slices among all the arrays are on the same
> drive.

Well...  If the drive hard fails it usually isn't a problem because
all of the partitions on that drive will be failed out of their arrays
together.  And even if a drive fails in a flaky way, the flakiness
comes from that single drive and the other disk isn't affected.

If you have two flaky devices then you are in a world of hurt either way.
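
And checking which physical drive the failed slices actually live on
before pulling anything is quick.  With md0 only as an example:

    # per-array detail, including which member devices are faulty
    mdadm --detail /dev/md0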

> Most of the RAID failures I've seen have been transient, where a drive
> drops out of an array, but the drive is still running.

Agreed.  Most of the time the drive hardware is okay and can be
re-sync'd okay.  The hard failure is when something physical needs to
be replaced.

The soft failure mode can be more "interesting".

> Usually, I can remove and re-add the drive and the rebuild will
> complete successfully and continue operating for an extended period
> of time.

Sometimes for a very long time.  Years.  Potentially the rest of the
life of the drive, if the drive itself was never the problem.

The majority of soft failures I have seen have been associated with
power failures.  If the machine loses power, the state of the arrays
depends upon what was being written at the moment the power went away.
Arrays can split at that time through no hardware fault.  And there
are other soft failure modes too.

> My guess would be that if I had a bunch of arrays, that
> only some of them would be faulty at this point.

Agreed.  The more arrays you have, the higher the probability that
one of them will have a problem at any given opportunity.  A twin
engine airplane is twice as likely to have an engine problem as a
single engine airplane.

> BUT, if you don't check for consistency regularly, I could see where it may
> have been weeks or months since the last check, and you've had a slice fail
> on one drive in one array and on another drive in another array.

Those are two separate issues.

1. Not regularly running the consistency checks, so the last check is
extremely stale, and that long stale period is itself a consistency
problem.

2. Having a split due to some soft failure, with some partitions
failed on one device and some on the other device.

I think we are all agreed that #1 is bad.  Regular consistency checks
are needed to ensure that errors have not crept in.
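
On Debian and its derivatives the mdadm package already schedules a
monthly check (see /etc/cron.d/mdadm and /usr/share/mdadm/checkarray).
The same check can be kicked off by hand through sysfs; md0 here is
just an example:

    # start a full consistency check on one array
    echo check > /sys/block/md0/md/sync_action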

And #2 is interesting because it does happen.  I have seen it.  I even
think it is likely.  At any given power drop it is roughly a coin flip
whether sda or sdb holds the most recent write and therefore comes out
on top after the reboot.  With many partitions it becomes more likely
that the coin flips will land differently across them and the split
will straddle both devices.

But only with soft failures such as a power drop on a running machine.
Since it is not a hardware failure I simply re-add the opposite device
and let it sync.  That is no different from the case of one partition
per disk: the same action is needed either way.  So I don't see this
as any significant difference at all.
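
And the re-add after that kind of soft failure is the same one-liner
no matter how many partitions are involved.  Device names here are
only placeholders:

    # put the dropped member back and let it resync
    mdadm /dev/md0 --re-add /dev/sdb1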

[Not to say that every power drop causes a split array.  That also
either happens or doesn't depending upon exactly what happens.  And
there are other things that will cause a soft failure too.  Cosmic
rays are my favorite scapegoat.  A ray flipped a bit and caused a
mismatch failure.  Of course there are many better BOFH reasons.]

> But largely it's just about simplicity.  That old axiom about "make things
> as simple as possible, but no simpler"?  To me, in this case, I don't see a
> compelling reason for splitting them up.

I definitely want to be clear that it was only a preference.  If it
isn't compelling for you then by all means don't do it.  I like it.
That is the beginning and end of it.

> Much of this discussion seems to me to be rooted in the idea that a RAID
> array failure is a near-world-ending-big deal.  Which to me says that RAID
> may be being relied on a bit too much.

I didn't read the discussion that way.  For me a raid failure is
mostly a yawn-and-replace-the-disk affair.

(It was only a really bad problem for me when I was recovering someone
else's Fedora 9 system.  Only about five years out of date and without
any security upgrades in all of that time.  There was a backup.  But
it needed to have FC9 installed and nothing more recent.  I am not
really set up to install every random out of date system and so that
was quite painful to me.  But I don't let *my* systems get into that
state of being so out of date.  Knock on wood.)

> Never, ever, believe that a RAID array prevents the need for having backups
> of the data on it...

Strongly agree!  RAID is not backup.

> As far as RAID-level choice, I wouldn't necessarily steer away from RAID-5
> to the extent that Bob indicated.  I will admit that I usually deploy
> RAID-10 for our client server machines, because its worst case performance
> is closer to its average performance, particularly for small random writes.
> 
> However, for a home storage server, a workload that has large writes rather
> than small random writes, or that is read-heavy, RAID-5 can be a very good
> choice.

Okay.  Sure.  But this is a case where I would say that simpler is
better, and RAID-5 and RAID-6 just are not simple.  At least not as
simple as RAID-1 or RAID-10.  For the casual participant I would
definitely steer away from them.

> For my home storage server, I use the ZFS equivalent of RAID-6 (RAID-5 but
> with two parity instead of one).
> 
> Another issue with RAID-1 with md software RAID under Linux is that you
> can't do RAID verifications, or more precisely the RAID verifications
> generates a lot of errors.  When a short-lived file is written, both of the
> mirrored writes get queued, but if the file is then deleted after one drive
> has written the data but before the other has, that second write is removed
> from the queue and not done.  So the two drives do not have the same data.
> This is usually ok because that is not active data and it doesn't really
> matter what is in there.
>
> Something to consider, but I don't think it's a deal breaker.  I'd love for
> my verification to show no errors, but have come to terms with the fact
> that it almost never will on a software RAID-1 array.

Summary: You get a lot of noise from the consistency check about a
non-zero mismatch count and that information must be ignored.  It is
useless.

Yes.  And the reason you mentioned also applies to swap space.  In
either case, if the kernel knows the data is a don't-care it optimizes
away one of the writes to disk as a performance improvement.  Since
that write never happened the mirrors may have mismatches.  Very
likely will have mismatches.  Those mismatches are reported during the
check.  This looks concerning.  But they are unavoidable and normal
and shouldn't really be reported.  Hopefully at some point that will
be improved.
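
The number in question shows up after a check, for example:

    # non-zero here on a RAID-1 holding swap or busy filesystems is
    # common and, as described above, usually harmless
    cat /sys/block/md0/md/mismatch_cnt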

> My understanding is that RAID-5 does not have the same issue.

I don't know.  It is different code and will have a different checker
module so will have different quirks.

> I, admittedly, haven't read any of the "why you shouldn't use RAID-5"
> references you pointed at.  However, there are practical limits to the
> "RAID-10 instead of RAID-5", beyond just buying another drive of 7.  For
> example, my primary home storage system has 14 drives.

Fourteen drives!  I think your electric bill is probably like mine.  I
have electric heat in my gas furnace house.  But most "casual" admins
don't have that many local storage devices to manage.

> This is largely because it was built years ago using the inexpensive
> used drives that come out of our hosting servers.  I can't really go
> up to 24 drives (2x12, because this system is running RAID-6).  The
> machine that backs it up is running RAID-5 because the system I have
> doing it is a 1U which has 4 drive bays...

Or use 1U servers in their home.  :-)

> But, remember the absolute requirements for RAID: You must set up
> monitoring to alert on a drive failure, and you must regularly validate the
> data on it.
>
> I have all too frequently been called in when a RAID array was set up but
> had no monitoring, had one drive fail and then weeks or months later had a
> second drive fail.  If you aren't going to set up monitoring and alerting
> on array degradation, you might as well just set up striping (RAID-0),
> you're just as safe.

Agreed.  Strongly.
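
Setting that up with md is just monitor mode plus a mail destination.
A minimal sketch; the address is a placeholder:

    # in /etc/mdadm/mdadm.conf on Debian (or /etc/mdadm.conf elsewhere)
    MAILADDR admin@example.com

    # or run the monitor directly, and send a test mail per array
    mdadm --monitor --scan --daemonise --mail admin@example.com
    mdadm --monitor --scan --oneshot --test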

> As far as the validation goes, sometimes a block on a drive will "rot" over
> time, and if you get too many of these built up, especially across
> different parts of multiple drives, you can find that some of the data you
> need to perform a rebuild is no longer readable.

Probably a good idea to have some process read the cooked files
periodically.

I know you have moved away from BackupPC, but one of its interesting
features was a configurable setting such that, even if a file was up
to date in the backup and unchanged, some low random percentage of the
time it would decide to read the file again anyway just to verify that
it still could.  That way it could detect errors early and react to
them.
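
A crude stand-in for that is just a periodic read pass over the
archive with the output thrown away.  A sketch only, with a made-up
path:

    # read every file end to end; unreadable sectors will surface in
    # the kernel log and SMART counters
    find /srv/backups -type f -print0 | xargs -0 cat > /dev/null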

> On the subject of green drives.  Yes, you can get a huge power savings by
> spinning down a drive when it's not used.  However, many green drives are
> primarily green in that they run at a lower spindle speed and consume less
> power because of that.  You can spin down a 15K RPM disc too.  But IMHO
> much of the "green" drives are just marketing to get people to not look at
> 5400RPM discs as "not as good as 7200RPM discs".  :-)
> 
> Which is perfectly acceptable, in my home backup system I use green
> drives.

It probably depends upon the drive.  YMMV and that type of thing.  :-)
I have a very small sample size of green drives and I must have hit
the bad ones.  I could hear the drive spinning back up again, so I
know it had spun down.  Once bitten, twice shy.  But I think they are
probably completely fine used standalone.  My comments were only
concerning their use in raid arrays.
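
Whether a particular green drive really is spinning down underneath a
raid array is easy to check.  The device name is a placeholder:

    # reports active/idle versus standby
    hdparm -C /dev/sdb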

> > These words triggered alarm bells in my head.  Because identical
> > drives are much more likely to fail about at the same time.  I always
> 
> You know, I've had the same thought, but I have never experienced drives
> bought from the same batch failing very close to each-other except during
> initial burn-in testing on my bench.  Sure, I want to replace a drive
> quickly if it fails, to reduce the window where a second drive could fail,
> but I have never seen it.  Of course, my sample size is probably only in
> the thousands of drives range...

You have definitely handled a lot of drives!  I have handled far
fewer.  And yet I have seen this twice.  (shrug)

Bob


