[NCLUG] sw raid, recovery after install
Michael Milligan
milli at acmeps.com
Thu Jan 17 04:20:36 MST 2013
On 01/16/2013 11:01 PM, Sean Reifschneider wrote:
> On 01/16/2013 08:19 PM, Bob Proulx wrote:
>
> The rebuild rates are all over the map...
I've noticed the same thing, even with ARC tuning. Scrub speeds don't
seem to be related to array size or the number of drives in any
predictable way.
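(For anyone who wants to watch one as it runs, something like

    zpool status tank

shows how much has been scanned, the current rate, and an estimated
time to completion; the pool name is just a placeholder. In my
experience that reported rate wanders quite a bit over the life of a
scrub.)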
>
>> I will note that it is interesting to me that the bug Mike reported
>> referenced the reboot at the end of an install that might interrupt a
>> long running sync and restart it causing the process to restart back
>> at the beginning again. That sentiment plays right into my thinking.
Way back when I reported that, the resync process (RAID5 on 3 drives)
had a non-negligible impact on the length of an install session. It's
supposed to give current reads/writes priority, but it didn't do that
very well. The install took close to double the time compared to a
second, identical system where I created the array manually with
'mdadm --create ... --assume-clean ...'.
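Roughly, the manual version looked something like this (device names
and partition layout here are only illustrative, and note that
--assume-clean skips the initial parity build, so the array isn't
really consistent until every stripe has been rewritten or a repair has
been run):

    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
          --assume-clean /dev/sda1 /dev/sdb1 /dev/sdc1

The other option is to leave the initial sync alone and just throttle
it during the install, e.g.:

    echo 10000 > /proc/sys/dev/raid/speed_limit_max    # KB/s per device

which slows the background rebuild down instead of skipping it.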
> Well, personally, I'm not that concerned about losing a couple of minutes
> RAID build time. :-) Our typical server install takes around 2, maybe 2.5
It was on the order of an extra 30 minutes for me. Very annoying.
> minutes to do. Desktop, especially Ubuntu, can take longer, but gosh... I
> honestly don't, while installing or using the newly installed system,
> notice the background RAID array. Either in software RAID or on hardware
> controllers.
Fair enough.
>
>>> My primary worry here is that if a drive fails and you go to replace
>>> it by pulling the drive and putting a new one in, you'd better make
>>> sure that the failed slices among all the arrays are on the same
>>> drive.
>>
>> Well... If the drive hard fails it usually isn't a problem because
>
> I've definitely seen cases where the drive only fails to a certain extent.
> This week. :-) Software RAID sometimes doesn't kill the drive fully, the
> system basically hangs until the drive recovers.
One of those two cases of a double drive failure that happened to me was
caused by having two RAID5 arrays across sets of partitions on the
same drives, i.e., a /dev/md0 using partition 1 across 3 drives and a
/dev/md1 using partition 2 across the same 3 drives. I got stuck in the
situation where one of the arrays went degraded due to bad blocks on
drive A (in partition 1), and the second array went degraded due to bad
blocks on drive B (in partition 2) while I was looking into the first
failure. Drive C was fine. But what a nightmare. I was stuck needing to
decide which array I was going to trash and rebuild from backup, because
replacing either failed drive would push one array or the other into a
second drive failure and lose it. I was able to cheat and work around
this with a temporary spare drive that was big enough to hold the
contents of one of the partitions, but it was not as simple as
hot-replacing the bad drive and re-syncing.
That's the danger of not using full drives for RAID5 (or even RAID6) arrays.
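(For reference, before pulling anything in a situation like that, it's
worth mapping each failed slice back to a physical drive; something
like the following works, with the md device names just as examples:

    cat /proc/mdstat
    mdadm --detail /dev/md0
    mdadm --detail /dev/md1

/proc/mdstat marks failed members with (F), and 'mdadm --detail' shows
which partition, and therefore which physical drive, each member sits
on.)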
> I'm probably not explaining this very well here, but I think there is an
> opportunity for one partition to fail and not be noticed until another
> failure happens and you think it's one drive before realizing that one of
> the 10 slices of this array that has failed is on another drive.
Exactly what happened to me. I don't do that anymore. (Duh.)
Regards,
Mike