[NCLUG] RAID array not started on re-boot

Mon Sep 9 12:21:56 MDT 2013

Greetings!

At work, we are suffering from a strange issue, and I am hoping the
collective wisdom of the group can provide insight.

We are running a box with CentOS 6.4, fully updated (well, perhaps minus
anything in the last week). This machine has two software RAID arrays
created with mdadm. One is a RAID1, and the other is a RAID0. In the normal course
of events, the RAID1 runs on /dev/md1 and the RAID0 on /dev/md2. The UUID
of each RAID is in /etc/fstab, and mount works when the devices are running.

Due to a strange combination of effects, in the past two weeks the machine
has twice lost power. In each instance, when power was restored to the
machine, the RAID0 was properly built with its two devices, but for some
reason the RAID1 was not assembled, and instead each of its two disks was
made into an independent RAID1 array running in degraded mode.

[root@dta ~]$ cat /proc/mdstat
Personalities : [raid1] [raid0]
md126  : active (auto-read-only) raid1 sdb[0]
     976551872 blocks [2/1] [U_]

md127 : active (auto-read-only) raid1 sda[0]
    976551872 blocks [2/1] [U_]

md2 : active raid0 sde1[0] sdf1[1]
   1953518592 blocks super 1.2 512k chunks

unused devices: <none>
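
For anyone who wants to check for the same state from a script, here is a
rough sketch. It assumes the member map in /proc/mdstat (the [U_] part above)
shows an underscore for each missing member; the function name list_degraded
is my own invention, not part of mdadm.

```shell
#!/bin/sh
# list_degraded: read /proc/mdstat-style text on stdin and print the name
# of every md array whose member map (e.g. [U_]) contains an underscore.
# Illustrative helper only; the name is hypothetical.
list_degraded() {
    awk '
        # An array stanza starts at column 0, e.g. "md126 : active raid1 ..."
        /^md[0-9]+ *:/ {
            name = $1
            sub(/:$/, "", name)
        }
        # A bracket group made only of U and _ with at least one _
        # means at least one member is missing.
        /\[[U_]*_[U_]*\]/ && name != "" {
            print name
            name = ""
        }
    '
}
```

Usage would be something like `list_degraded < /proc/mdstat`, which on the
output above should name md126 and md127 but not md2.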

I stopped md126 and md127 (via mdadm --stop), and then used mdadm to
assemble /dev/md1:

[root@dta ~]$ mdadm --assemble /dev/md1
mdadm: /dev/md1 has been started with 2 drives.
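
Spelled out, the whole manual recovery was roughly the following. This is a
sketch rather than a transcript: the MDADM variable and the recover_md1
function are names I made up so the sequence can be dry-run, and by default
it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the manual recovery described above.
# MDADM defaults to "echo mdadm", so the commands are only printed.
# Set MDADM=mdadm (as root) to actually run them. Both the variable
# and the function name are illustrative, not part of mdadm.
MDADM="${MDADM:-echo mdadm}"

recover_md1() {
    # Stop the two stray degraded arrays so their disks are released.
    for stray in /dev/md126 /dev/md127; do
        $MDADM --stop "$stray" || return 1
    done
    # Assemble the real mirror under its usual name.
    $MDADM --assemble /dev/md1
}
```

Calling `recover_md1` with the default setting just lists the three mdadm
invocations in order.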

[root@dta ~]$ cat /proc/mdstat
Personalities : [raid1] [raid0]
md1  : active raid1 sda1[0] sdb1[1]
    976419904 blocks super 1.2 [2/2] [UU]

md2  : active raid0 sde1[0] sdf1[1]
    1953518592 blocks super 1.2 512k chunks

unused devices: <none>

So, clearly the system has some information about what is supposed to be in
/dev/md1.

The RAID information is listed in /etc/mdadm.conf. Since I have to retype
this by hand, I won't reproduce all of it, but it is essentially:

[root@dta ~]$ cat /etc/mdadm.conf
ARRAY /dev/md2 metadata=1.2 name=dta.example.com:2 UUID=A_UUID
Array /dev/md1 metadata=1.2 name=dta.example.com UUID=ANOTHER_UUID
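
If it would help, I can compare the UUIDs in the config file against what the
running arrays report. A minimal sketch of how I would do that, assuming both
sources use ARRAY lines with a UUID= field like the ones above (conf_uuids is
a hypothetical helper name, and it accepts the keyword in either case since I
typed it both ways):

```shell
#!/bin/sh
# conf_uuids: read mdadm.conf-style text on stdin and print
# "<device> <uuid>" for every ARRAY line that carries a UUID= field.
# Illustrative helper only; the name is hypothetical.
conf_uuids() {
    awk 'toupper($1) == "ARRAY" {
        # Fields after the device name are key=value pairs; find UUID=.
        for (i = 3; i <= NF; i++)
            if (toupper(substr($i, 1, 5)) == "UUID=")
                print $2, substr($i, 6)
    }'
}
```

Running `conf_uuids < /etc/mdadm.conf` and, as root,
`mdadm --detail --scan | conf_uuids` should give two short lists that can be
compared by eye or with diff.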

My questions are:
* Why was /dev/md1 not properly assembled when the machine restarted?
* Why did the system decide to create /dev/md126 and /dev/md127?
* And most importantly: what do I need to do to prevent this situation from
happening again (aside from the power problem, which is out of my
control)? Is there a setting I need to change, something in the
initialization, etc.?

Thank you for any guidance or insight!

Kevin