Monday, December 5, 2016

RAID configuration and troubleshooting steps

1. Create a new RAID array

Create mode (mdadm --create) is used to create a new array:
mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
or using the compact notation:
mdadm -Cv /dev/md0 -l1 -n2 /dev/sd[ab]1
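The same create mode handles other RAID levels by adjusting the level and the number of devices. As a purely illustrative sketch (the device names and md number here are assumptions, not taken from the examples above), a three-disk RAID5 array could be created with:
mdadm -Cv /dev/md1 -l5 -n3 /dev/sd[abc]1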

2. /etc/mdadm.conf

/etc/mdadm.conf or /etc/mdadm/mdadm.conf (on Debian) is the main configuration file for mdadm. After we create our RAID arrays, we add them to this file using:
mdadm --detail --scan >> /etc/mdadm.conf
or on Debian:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
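The lines appended by --detail --scan look roughly like this (the name and UUID below are purely illustrative):
ARRAY /dev/md0 metadata=1.2 name=myhost:0 UUID=f0e1d2c3:a4b5c6d7:e8f90a1b:2c3d4e5f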

3. Remove a disk from an array

We can't remove a disk directly from the array unless it is marked as failed, so we first have to fail it (if the drive actually died, it is usually already in the failed state and this step is not needed):
mdadm --fail /dev/md0 /dev/sda1
and now we can remove it:
mdadm --remove /dev/md0 /dev/sda1
This can be done in a single step using:
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1

4. Add a disk to an existing array

We can add a new disk to an array (usually to replace a failed one):
mdadm --add /dev/md0 /dev/sdb1
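Putting the previous steps together, replacing a failed member of a RAID1 array could look like this rough sketch (the device names are assumptions; here /dev/sda is the failed disk and /dev/sdb the healthy one, with the physical drive swap happening between the remove and the sfdisk copy):
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
sfdisk -d /dev/sdb | sfdisk /dev/sda
mdadm --add /dev/md0 /dev/sda1
watch cat /proc/mdstat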

5. Verifying the status of the RAID arrays

We can check the status of the arrays on the system with:
cat /proc/mdstat
or
mdadm --detail /dev/md0
The output of cat /proc/mdstat will look something like this:
cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md1 : active raid1 sdb3[1] sda3[0]
19542976 blocks [2/2] [UU]

md2 : active raid1 sdb4[1] sda4[0]
223504192 blocks [2/2] [UU]
Here we can see that both drives are in use and working fine (U). A failed drive will show as (F), while a degraded array will show an underscore in place of the missing disk, for example [_U].
Note: while monitoring the status of a RAID rebuild operation, using watch can be useful:
watch cat /proc/mdstat
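Beyond one-off checks, mdadm can also monitor the arrays itself and report failure events. A minimal sketch (the mail address is an assumption, and many distributions already start such a monitor for you):
mdadm --monitor --scan --daemonise --mail=root@localhost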

6. Stop and delete a RAID array

If we want to completely remove a RAID array, we have to stop it first and then remove it:
mdadm --stop /dev/md0
mdadm --remove /dev/md0
and finally we can even delete the RAID superblock from the individual member devices (the partitions that were part of the array):
mdadm --zero-superblock /dev/sda1
Finally, when using RAID1 arrays, where we create identical partitions on both drives, it can be useful to copy the partitions from sda to sdb:
sfdisk -d /dev/sda | sfdisk /dev/sdb
(this dumps the partition table of sda onto sdb, completely removing the existing partitions on sdb, so be sure you want this before running the command, as it will not warn you at all).
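A more cautious variant (the file name is only illustrative) is to save the dump to a file first, so the original layout can be restored if something goes wrong:
sfdisk -d /dev/sda > /root/sda.parttable
sfdisk /dev/sdb < /root/sda.parttable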
There are many other uses of mdadm, particular to each RAID level, and I would recommend the manual page (man mdadm) or the built-in help (mdadm --help) if you need more details on its usage. Hopefully these quick examples will put you on the fast track with how mdadm works.

Check the status of a RAID device


[root@bcane ~]# mdadm --detail /dev/md10  
/dev/md10:  
 Version : 1.2  
 Creation Time : Sat Jul 2 13:56:38 2011  
 Raid Level : raid1  
 Array Size : 26212280 (25.00 GiB 26.84 GB)  
 Used Dev Size : 26212280 (25.00 GiB 26.84 GB)  
 Raid Devices : 2  
 Total Devices : 2  
 Persistence : Superblock is persistent  

 Update Time : Sat Jul 2 13:56:47 2011  
 State : clean, resyncing  
Active Devices : 2  
Working Devices : 2  
Failed Devices : 0  
 Spare Devices : 0  

Rebuild Status : 10% complete  

 Name : bcane.virtuals.local:10 (local to host bcane.virtuals.local)  
 UUID : 10a96ed5:92dc48e6:04b2bf43:3539e089  
 Events : 1  

 Number Major Minor RaidDevice State  
 0 8 33 0 active sync /dev/sdc1  
 1 8 49 1 active sync /dev/sdd1

In order to remove a drive it must first be marked as faulty. A drive can be marked faulty either through an actual failure or, if you want to do it manually, with the -f/--fail flag.
[root@bcane ~]# mdadm /dev/md10 -f /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md10  

[root@bcane ~]# mdadm --detail /dev/md10
/dev/md10:  
 Version : 1.2  
 Creation Time : Sat Jul 2 13:56:38 2011  
 Raid Level : raid1  
 Array Size : 26212280 (25.00 GiB 26.84 GB)  
 Used Dev Size : 26212280 (25.00 GiB 26.84 GB)  
 Raid Devices : 2  
 Total Devices : 2  
 Persistence : Superblock is persistent  

 Update Time : Sat Jul 2 14:00:18 2011  
 State : active, degraded  
Active Devices : 1  
Working Devices : 1  
Failed Devices : 1  
 Spare Devices : 0  

 Name : bcane.virtuals.local:10 (local to host bcane.virtuals.local)  
 UUID : 10a96ed5:92dc48e6:04b2bf43:3539e089  
 Events : 19  

 Number Major Minor RaidDevice State  
 0 0 0 0 removed  
 1 8 49 1 active sync /dev/sdd1  

 0 8 33 - faulty spare /dev/sdc1

Now that the drive is marked as failed/faulty you can remove it using the -r/--remove flag.
[root@bcane ~]# mdadm /dev/md10 -r /dev/sdc1
mdadm: hot removed /dev/sdc1 from /dev/md10  

[root@bcane ~]# mdadm --detail /dev/md10
/dev/md10:  
 Version : 1.2  
 Creation Time : Sat Jul 2 13:56:38 2011  
 Raid Level : raid1  
 Array Size : 26212280 (25.00 GiB 26.84 GB)  
 Used Dev Size : 26212280 (25.00 GiB 26.84 GB)  
 Raid Devices : 2  
 Total Devices : 1  
 Persistence : Superblock is persistent  

 Update Time : Sat Jul 2 14:02:04 2011  
 State : active, degraded  
Active Devices : 1  
Working Devices : 1  
Failed Devices : 0  
 Spare Devices : 0  

 Name : bcane.virtuals.local:10 (local to host bcane.virtuals.local)  
 UUID : 10a96ed5:92dc48e6:04b2bf43:3539e089  
 Events : 20  

 Number Major Minor RaidDevice State  
 0 0 0 0 removed  
 1 8 49 1 active sync /dev/sdd1

If you want to re-add the device you can do so with the -a flag.
[root@bcane ~]# mdadm /dev/md10 -a /dev/sdc1
mdadm: re-added /dev/sdc1  

[root@bcane ~]# mdadm --detail /dev/md10
/dev/md10:  
 Version : 1.2  
 Creation Time : Sat Jul 2 13:56:38 2011  
 Raid Level : raid1  
 Array Size : 26212280 (25.00 GiB 26.84 GB)  
 Used Dev Size : 26212280 (25.00 GiB 26.84 GB)  
 Raid Devices : 2  
 Total Devices : 2  
 Persistence : Superblock is persistent  

 Update Time : Sat Jul 2 18:02:21 2011  
 State : clean, degraded, recovering  
Active Devices : 1  
Working Devices : 2  
Failed Devices : 0  
 Spare Devices : 1  

Rebuild Status : 4% complete  

 Name : bcane.virtuals.local:10 (local to host bcane.virtuals.local)  
 UUID : 10a96ed5:92dc48e6:04b2bf43:3539e089  
 Events : 23  

 Number Major Minor RaidDevice State  
 0 8 33 0 spare rebuilding /dev/sdc1  
 1 8 49 1 active sync /dev/sdd1

One thing to keep an eye out for is that you need to specify the RAID device when running these commands. If the same flags are used without naming the array first, mdadm is not in manage mode and the short options take on a different meaning (a brief illustration follows).
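For instance (the device names are assumptions matching the examples above), the same -f letter means different things depending on the mode:
mdadm /dev/md10 -f /dev/sdc1                  # manage mode: -f is --fail, marks the member faulty
mdadm --assemble /dev/md10 -f /dev/sd[cd]1    # assemble mode: -f is --force, assembles despite stale metadata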

Hotplug

mdadm versions < 3.1.2

In older versions of mdadm, hotplug and hot-unplug support is present, but for fully automatic functionality we need to employ some scripting. First of all, look at what mdadm provides by manually trying its features from the command line:

Hot-unplug from command line

  • If the physical disk is still alive:
mdadm --fail /dev/mdX /dev/sdYZ
mdadm --remove /dev/mdX /dev/sdYZ 
Do this for all RAID arrays containing partitions of the failed disk (a loop sketch follows after this list). Then the disk can be hot-unplugged without any problems.
  • If the physical disk is dead or unplugged, just do
mdadm /dev/mdX --fail detached --remove detached
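A rough sketch of the "fail and remove everywhere" step for a still-present but dying disk (the DISK value is an assumption; adjust it to the device being pulled):
#!/bin/bash
# Walk /proc/mdstat and drop every partition of $DISK from the arrays that still list it.
DISK=sdc
grep '^md' /proc/mdstat | while read -r md _ _ _ members; do
    for m in $members; do
        part=${m%%\[*}                      # strip the role/state suffix, e.g. sdc1[2](F) -> sdc1
        if [[ $part == ${DISK}* ]]; then
            mdadm "/dev/$md" --fail "/dev/$part" --remove "/dev/$part"
        fi
    done
done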

Fully automated hotplug and hot-unplug using UDEV rules

If you need fully automatic handling of hot-plug and hot-unplug events, the UDEV "add" and "remove" events can be used for this.
Note: the following code was validated on Debian 5 (Lenny), with kernel 2.6.26 and udevd version 125.
Important notes:
  • The rule for the "add" event MUST be placed in a file that sorts after the "persistent_storage.rules" file, because it uses the ENV{ID_FS_TYPE} condition, which is set by persistent_storage.rules during "add" event processing.
  • The rule for the "remove" event can reside in any file in the UDEV rules chain, but let's keep it together with the "add" rule :-)
For this reason, in Debian Lenny I placed the mdadm hotplug rules in the file /etc/udev/rules.d/66-mdadm-hotplug.rules. This is the content of the file:
SUBSYSTEM!="block", GOTO="END_66_MDADM"
ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="END_66_MDADM"
ACTION=="add",  RUN+="/usr/local/sbin/handle-add-old $env{DEVNAME}"
ACTION=="remove", RUN+="/usr/local/sbin/handle-remove-old $name"
LABEL="END_66_MDADM"
(these rules are based on the UDEV rules contained in the hot-unplug patches by Doug Ledford)
And here are the scripts which are called from these rules:
#!/bin/bash
#This is /usr/local/sbin/handle-add-old; udev passes the device name (e.g. /dev/sdXN) as $1
MDADM=/sbin/mdadm
LOGGER=/usr/bin/logger
mdline=`mdadm --examine --scan $1` #mdline contains something like "ARRAY /dev/mdX level=raid1 num-devices=2 UUID=..."
mddev=${mdline#* }                 #delete "ARRAY " and return the result as mddev
mddev=${mddev%% *}                 #delete everything after /dev/mdX
$LOGGER "$0 $1"
if [ -n "$mddev" ]; then
   $LOGGER "Adding $1 into RAID device $mddev"
   log=`$MDADM -a $mddev $1 2>&1`
   $LOGGER "$log"
fi
#!/bin/bash
#This is /usr/local/sbin/handle-remove-old; udev passes the kernel name (e.g. sdXN) as $1
MDADM=/sbin/mdadm
LOGGER=/usr/bin/logger
$LOGGER "$0 $1"
mdline=`grep $1 /proc/mdstat`  #mdline contains something like "md0 : active raid1 sda1[0] sdb1[1]"
mddev=${mdline% :*}            #delete everything from " :" to the end of the line and return the result as mddev
$LOGGER "$0: Trying to remove $1 from $mddev"
log=`$MDADM /dev/$mddev --fail detached --remove detached 2>&1`
$LOGGER "$log"

mdadm versions >= 3.1.2

The hot-unplug support introduced in mdadm version 3.1.2 removed the need for the scripting shown above. If your Linux distribution ships this or a later version of mdadm, you should have fully automatic hotplug and hot-unplug without any hassle.
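To check which of the two cases applies on a given system, the installed version can be printed with:
mdadm --version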

Examples of behavior WITHOUT the automatic hotplug/hot-unplug

Let's have the following RAID configuration:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      3903680 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      224612672 blocks [2/2] [UU]
md0 contains the system; md1 is for data (but is not used yet).

Hot-unplug

If we hot-unplug the disk /dev/sda, the /proc/mdstat will show:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[2](F) sdb1[1]
      3903680 blocks [2/1] [_U]

md1 : active raid1 sda2[0] sdb2[1]
      224612672 blocks [2/2] [UU]
We see that sda1 now has role [2]. Since RAID1 needs only two components, [0] and [1], role [2] means "spare disk", and it is marked as (F)ailed.
But why does the system think that /dev/sda2 in /dev/md1 is still OK? Because my system hasn't tried to access /dev/md1 yet (I have no data on /dev/md1). /dev/sda2 will be marked as faulty automatically as soon as I try to access /dev/md1:
# dd if=/dev/md1 of=/dev/null bs=1 count=1
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.0184819 s, 0.1 kB/s
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[2](F) sdb1[1]
      3903680 blocks [2/1] [_U]

md1 : active raid1 sda2[2](F) sdb2[1]
      224612672 blocks [2/1] [_U]
At any point after the disk has been unplugged, we can remove its partitions from an array with a single command:
# mdadm /dev/md0 --fail detached --remove detached
mdadm: hot removed 8:1

Removing a RAID Device

To remove an existing RAID device, first deactivate it by running the following command as root:
mdadm --stop raid_device
Once deactivated, remove the RAID device itself:
mdadm --remove raid_device
Finally, zero superblocks on all devices that were associated with the particular array:
mdadm --zero-superblock component_device

Example 6.5. Removing a RAID device
Assume the system has an active RAID device, /dev/md3, with the following layout (that is, the RAID device created in Example 6.4, “Extending a RAID device”):
~]# mdadm --detail /dev/md3 | tail -n 4
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
In order to remove this device, first stop it by typing the following at a shell prompt:
~]# mdadm --stop /dev/md3
mdadm: stopped /dev/md3
Once stopped, you can remove the /dev/md3 device by running the following command:
~]# mdadm --remove /dev/md3
Finally, to remove the superblocks from all associated devices, type:
~]# mdadm --zero-superblock /dev/sda1 /dev/sdb1 /dev/sdc1
