[NCLUG] sw raid, recovery after install

Sean Reifschneider jafo at tummy.com
Wed Jan 9 20:33:26 MST 2013


Just to expand a bit on what was said before...

Linux RAID works at the block level.  It has no understanding of what
parts of the drive are used.  In order to be able to do a future verify
operation, which I would recommend, the checksums for every stripe of the
array need to be calculated.
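
For example (a minimal sketch; the array name /dev/md0 is just a
placeholder), a verify pass is kicked off through sysfs and its progress
shows up in /proc/mdstat:

    echo check > /sys/block/md0/md/sync_action   # start a consistency check
    cat /proc/mdstat                              # shows check progress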

ZFS, in contrast, combines the RAID and file-system layers, so it knows
which blocks are in use and only computes the checksums and error
correction data on those blocks.  So creating a new ZFS pool takes very
little time and is immediately consistent.  Better still, a RAID rebuild
or verify on a freshly created pool takes almost no time at all.
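
The ZFS name for a verify is a scrub; a rough sketch (the pool name
"tank" is just a placeholder):

    zpool scrub tank     # walks only the allocated blocks, verifying checksums
    zpool status tank    # shows scrub progress and any errors found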

Yes, ZFS is a "rampant layering violation", but this isn't just to piss
off the Linux kernel developers; there's a good reason for it.

But, back to md RAID-5: this is why it does a rebuild on a freshly
created array.  It takes every stripe, looks at what is on the data
discs, and computes and writes the parity (checksum/error-correcting)
block for that stripe...  If you just trusted the data that was already
on the discs, a future validate or rebuild operation would fail on any
stripes that hadn't been written to yet...
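
As a sketch (device names are placeholders), creating a three-disc
RAID-5 starts that initial parity build immediately, and you can watch
it run:

    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/sdb1 /dev/sdc1 /dev/sdd1
    cat /proc/mdstat    # shows "resync" progress on the brand-new array

(mdadm does have an --assume-clean flag that skips the initial build,
but then a later check will report mismatches on stripes that have never
been written.)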

Now, I'll prefix this next bit by saying that I have the utmost respect for
Bob.

But, I don't agree with the idea of splitting a drive up into small
chunks and making separate arrays on them.  My primary worry here is
that if a drive fails and you go to replace it by pulling the drive and
putting a new one in, you'd better make sure that the failed slices
across all the arrays are on that same drive.
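
If you do go that route, at the very least check which slices are marked
faulty in which arrays before pulling anything (array name below is a
placeholder):

    cat /proc/mdstat          # failed components show up with an (F) marker
    mdadm --detail /dev/md0   # lists each component partition and its state
    lsblk                     # maps those partitions back to physical drives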

Most of the RAID failures I've seen have been transient, where a drive
drops out of an array but the drive is still running.  Usually, I can
remove and re-add the drive, the rebuild completes successfully, and the
array continues operating for an extended period of time.  My guess
would be that if I had a bunch of arrays, only some of them would be
faulty at that point.
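
That recovery looks something like this (a sketch; device names are
placeholders):

    mdadm /dev/md0 --remove /dev/sdb1   # drop the failed component from the array
    mdadm /dev/md0 --re-add /dev/sdb1   # put it back; md kicks off a rebuild
    cat /proc/mdstat                    # watch the rebuild progress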

BUT, if you don't check for consistency regularly, I could see a
situation where it has been weeks or months since the last check, and
you've had a slice fail on one drive in one array and on another drive
in another array.
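
A single cron entry is enough to avoid that; Debian-ish systems ship a
checkarray script that does essentially this once a month.  A
hypothetical hand-rolled version:

    # /etc/cron.d/md-check -- check /dev/md0 at 02:00 on the 1st of the month
    0 2 1 * *  root  echo check > /sys/block/md0/md/sync_action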

But largely it's just about simplicity.  That old axiom about "make things
as simple as possible, but no simpler"?  To me, in this case, I don't see a
compelling reason for splitting them up.

Much of this discussion seems to me to be rooted in the idea that a
RAID array failure is a near-world-ending big deal.  Which to me says
that RAID is perhaps being relied on a bit too much.

Never, ever, believe that a RAID array prevents the need for having backups
of the data on it...

As far as RAID-level choice, I wouldn't necessarily steer away from
RAID-5 to the extent that Bob indicated.  I will admit that I usually
deploy RAID-10 for our clients' server machines, because its worst-case
performance is closer to its average performance, particularly for small
random writes.

However, for a home storage server with a workload of large writes
rather than small random writes, or one that is read-heavy, RAID-5 can
be a very good choice.
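
The back-of-the-envelope arithmetic behind that (a sketch, not a
benchmark):

    RAID-5, small random write:  read old data, read old parity,
                                 write new data, write new parity  -> 4 disc I/Os
    RAID-10, small random write: write the block to both mirrors   -> 2 disc I/Os

Large or sequential writes fill whole RAID-5 stripes, so the parity can
be computed without the extra reads, which is why that workload fares
much better.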

For my home storage server, I use the ZFS equivalent of RAID-6 (like
RAID-5, but with two parity blocks per stripe instead of one).
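
In ZFS terms that's a raidz2 vdev; creating one looks roughly like this
(pool and device names are placeholders):

    zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
    # six discs, any two of which can fail without data loss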

Another issue with md software RAID-1 under Linux is that you can't do
useful RAID verifications; or, more precisely, the verifications
generate a lot of errors.  When a short-lived file is written, both of
the mirrored writes get queued, but if the file is then deleted after
one drive has written the data but before the other has, that second
write is removed from the queue and never done.  So the two drives do
not hold identical data.  This is usually OK, because those blocks no
longer hold active data and it doesn't really matter what is in them.
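
You can see this for yourself: after a check pass like the one sketched
above, the mismatch count on an md RAID-1 holding an active file-system
or swap is typically non-zero (array name is a placeholder):

    cat /sys/block/md0/md/mismatch_cnt   # sectors that differed between the mirrors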

My understanding is that RAID-5 does not have the same issue.

Something to consider, but I don't think it's a deal breaker.  I'd love for
my verification to show no errors, but have come to terms with the fact
that it almost never will on a software RAID-1 array.

I, admittedly, haven't read any of the "why you shouldn't use RAID-5"
references you pointed at.  However, there are practical limits to
"RAID-10 instead of RAID-5", beyond just buying another drive or seven.
For example, my primary home storage system has 14 drives.  This is
largely because it was built years ago using the inexpensive used drives
that come out of our hosting servers.  I can't really go up to the 24
drives (2x12) that RAID-10 would need to match the usable capacity,
because this system is running RAID-6.  The machine that backs it up is
running RAID-5, because the system I have doing it is a 1U with 4 drive
bays...

But, remember the absolute requirements for RAID: You must set up
monitoring to alert on a drive failure, and you must regularly validate the
data on it.
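
For md, a minimal monitoring setup (the mail address and config path are
placeholders) is just a MAILADDR line plus the mdadm monitor daemon,
which most distributions start for you:

    # in /etc/mdadm/mdadm.conf (or /etc/mdadm.conf)
    MAILADDR you@example.com

    # run the monitor by hand if your distribution doesn't already
    mdadm --monitor --scan --daemonise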

I have all too frequently been called in when a RAID array was set up
with no monitoring, had one drive fail, and then weeks or months later
had a second drive fail.  If you aren't going to set up monitoring and
alerting on array degradation, you might as well just set up striping
(RAID-0); you're just as safe.

As far as the validation goes, sometimes a block on a drive will "rot" over
time, and if you get too many of these built up, especially across
different parts of multiple drives, you can find that some of the data you
need to perform a rebuild is no longer readable.

On the subject of green drives: yes, you can get a huge power savings
by spinning down a drive when it's not in use.  However, many green
drives are primarily green in that they run at a lower spindle speed and
consume less power because of that.  You can spin down a 15K RPM disc
too.  But IMHO much of the "green" labeling is just marketing to keep
people from seeing 5400RPM discs as "not as good as 7200RPM discs".  :-)

Which is perfectly acceptable; in my home backup system I use green
drives.
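
And if the spin-down savings are what you're after, any drive can do
that; for example (device name is a placeholder):

    hdparm -S 241 /dev/sdb   # set the standby (spin-down) timeout; 241 = 30 minutes
    hdparm -y /dev/sdb       # or put the drive into standby right now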

> These words triggered alarm bells in my head.  Because identical
> drives are much more likely to fail about at the same time.  I always

You know, I've had the same thought, but I have never experienced
drives bought from the same batch failing very close to each other,
except during initial burn-in testing on my bench.  Sure, I want to
replace a failed drive quickly, to reduce the window in which a second
drive could fail, but I have never seen it happen.  Of course, my sample
size is probably only in the thousands-of-drives range...

While I haven't personally used them, the Drobo systems do some fairly
nice things to reduce the opportunity for failures...  When a drive
fails, the array will automatically reduce capacity if possible and
return to an optimal state.  As long as your data still fits on the
drives that are still working, you should have no data loss.  And their
new low-end boxes can even take advantage of SSDs to give you a read
cache.  Only their enterprise system will completely migrate hot blocks
off to the SSDs, unfortunately...

Sean


