[NCLUG] sw raid, recovery after install
Stephen Warren
swarren at wwwdotorg.org
Wed Jan 2 17:00:04 MST 2013
On 01/02/2013 04:52 PM, Bob Proulx wrote:
> Stephen Warren wrote:
>> Bob Proulx wrote:
>>> Also I think you put all of the disk space in one large array. At 40%
>>> your machine is estimating 235.8 minutes remaining. Or 590 minutes,
>>> ten hours, for the full raid sync. That is fine. But if you need to
>>> reboot or have a power failure then the sync will restart ...
>>
>> At least with RAID-1 on recent kernels, the kernel maintains some kind
>> of checkpoint history, so at least a graceful reboot doesn't restart the
>> sync at the start of the array, but rather roughly where it left off.
>
> It is available but you have to enable it. Something like:
>
> mdadm /dev/md1 --grow --bitmap=internal
>
> Then you will see the additional bitmap line displayed in /proc/mdstat.
>
> md2 : active raid1 sda6[0] sdb6[1]
>       312496256 blocks [2/2] [UU]
>       bitmap: 0/3 pages [0KB], 65536KB chunk
I'm pretty sure that's something else; I just did an install of Ubuntu
12.10 a little while back and saw the incremental RAID resync, but
didn't ever use the --bitmap option, nor do I see the bitmap line in
/proc/mdstat. I do have "super 1.2", so perhaps this is a feature of
the version 1.2 superblock format.
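For reference, something like the following should show what a given
array is actually using (the device names are just placeholders for
your own md device and member partitions):

    cat /proc/mdstat
    mdadm --detail /dev/md0 | grep -E 'Version|Bitmap'
    mdadm --examine /dev/sda6 | grep -E 'Version|Bitmap'

--detail/--examine print a "Version" line for the metadata format,
and, at least on the mdadm versions I've used recently, a bitmap line
when a write-intent bitmap is configured. Adding or removing one on an
existing array is just:

    mdadm --grow /dev/md0 --bitmap=internal
    mdadm --grow /dev/md0 --bitmap=none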
>>> These days I create partitions of about 250G per partition. Probably
>>> 30 minutes per 250G partition by memory on my machines. Being able to
>>> check off smaller partitions like that is nicer when doing a large
>>> data recovery.
>>
>> I have occasionally thought about doing this, but the problem is
>> that you end up with a bunch of "tiny" storage devices. What happens
>> when there's some kind of weird RAID initialization issue on reboot
>> and all the arrays come up degraded, but with half of them using one
>> physical disk and the other half using the other physical disk?
>
> How often are you seeing "some kind of weird RAID initialization issue
> on reboot" that causes an array to hard fail degrade to one disk?
> That actually sounds pretty scary. I'm not seeing those.
I've never seen it. However, RAID is all about covering your bases,
about introducing safety measures to protect you when something goes
wrong. Making the safety measures as simple as possible, and as unlikely
as possible to fail in painful ways, sure seems like a good idea.
W.r.t. drive failures, a slightly-flakey-during-boot drive (which is
what might cause such a failure scenario) sure seems like a plausible
failure mode. For reference, I've only had a drive in a RAID array (or
anywhere) fail once, perhaps twice, over the last 9+8+5 years (3
arrays) of running SW RAID, so 0 and 1-or-2 occurrences are fairly
comparable:-) One of the failure modes was definitely along the lines
of the drive sometimes working OK and sometimes not.
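If that scenario ever did happen, I'd expect the recovery to look
something like this for each affected array (again, device names are
just placeholders):

    cat /proc/mdstat           # which arrays are degraded, and on which disk
    mdadm --detail /dev/md0    # shows which member was dropped
    mdadm /dev/md0 --re-add /dev/sdb1
    mdadm /dev/md0 --add /dev/sdb1

With a write-intent bitmap, --re-add only resyncs the chunks that
changed while the member was missing; without one it will typically
refuse, and a plain --add triggers a full resync of that member.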