[NCLUG] sw raid, recovery after install
Michael Milligan
milli at acmeps.com
Thu Jan 17 04:20:36 MST 2013
On 01/16/2013 11:01 PM, Sean Reifschneider wrote:
> On 01/16/2013 08:19 PM, Bob Proulx wrote:
>
> The rebuild rates are all over the map...
I've noticed the same thing, even with ARC tuning. Scrub speeds don't
seem to be related to array size or the number of drives in any
predictable way.
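(For anyone who wants to watch one as it runs, something like

    zpool status tank

shows how much has been scanned, the current rate, and an estimated
time to completion; the pool name is just a placeholder. In my
experience that reported rate wanders quite a bit over the life of a
scrub.)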
>
>> I will note that it is interesting to me that the bug Mike reported
>> referenced the reboot at the end of an install that might interrupt a
>> long running sync and restart it causing the process to restart back
>> at the beginning again. That sentiment plays right into my thinking.
Way back when I reported that, the resync process (RAID5 on 3 drives)
had a non-negligible impact on the length of an install session. It's
supposed to give current reads/writes priority, but it didn't do that
very well. The install took close to double the time compared to a
second, identical system where I created the array manually with
'mdadm --create ... --assume-clean ...'.
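Roughly, the manual version looked something like this (device names
and partition layout here are only illustrative, and note that
--assume-clean skips the initial parity build, so the array isn't
really consistent until every stripe has been rewritten or a repair has
been run):

    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
          --assume-clean /dev/sda1 /dev/sdb1 /dev/sdc1

The other option is to leave the initial sync alone and just throttle
it during the install, e.g.:

    echo 10000 > /proc/sys/dev/raid/speed_limit_max    # KB/s per device

which slows the background rebuild down instead of skipping it.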
> Well, personally, I'm not that concerned about losing a couple of minutes
> RAID build time. :-) Our typical server install takes around 2, maybe 2.5
It was on the order of an extra 30 minutes for me. Very annoying.
> minutes to do. Desktop, especially Ubuntu, can take longer, but gosh... I
> honestly don't, while installing or using the newly installed system,
> notice the background RAID array. Either in software RAID or on hardware
> controllers.
Fair enough.
>
>>> My primary worry here is that if a drive fails and you go to replace
>>> it by pulling the drive and putting a new one in, you'd better make
>>> sure that the failed slices among all the arrays are on the same
>>> drive.
>>
>> Well... If the drive hard fails it usually isn't a problem because
>
> I've definitely seen cases where the drive only fails to a certain extent.
> This week. :-) Software RAID sometimes doesn't kill the drive fully, the
> system basically hangs until the drive recovers.
One of those two cases of a double drive failure that happened to me was
caused by having two RAID5 arrays across sets of partitions on the
same drives, i.e., a /dev/md0 using partition 1 across 3 drives and a
/dev/md1 using partition 2 across the same 3 drives. I got stuck in the
situation where one of the arrays went degraded due to bad blocks on
drive A (in partition 1), and the second array went degraded due to bad
blocks on drive B (in partition 2) while I was looking into the first
failure. Drive C was fine. But what a nightmare. I was stuck needing to
decide which array I was going to trash and rebuild from backup, because
replacing either failed drive would push one array or the other into a
second drive failure and lose it. I was able to cheat and work around
this with a temporary spare drive that was big enough to hold the
contents of one of the partitions, but it was not as simple as
hot-replacing the bad drive and re-syncing.
That's the danger of not using full drives for RAID5 (or even RAID6) arrays.
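(For reference, before pulling anything in a situation like that, it's
worth mapping each failed slice back to a physical drive; something
like the following works, with the md device names just as examples:

    cat /proc/mdstat
    mdadm --detail /dev/md0
    mdadm --detail /dev/md1

/proc/mdstat marks failed members with (F), and 'mdadm --detail' shows
which partition, and therefore which physical drive, each member sits
on.)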
> I'm probably not explaining this very well here, but I think there is an
> opportunity for one partition to fail and not be noticed until another
> failure happens and you think it's one drive before realizing that one of
> the 10 slices of this array that has failed is on another drive.
Exactly what happened to me. I don't do that anymore. (Duh.)
Regards,
Mike