
6.2. Managing RAID

Redundant Array of Inexpensive Disks (RAID) is a technology for boosting storage performance and reducing the risk of data loss due to disk error. It works by storing data on multiple disk drives and is well supported by Fedora. It's a good idea to configure RAID on any system used for serious work.

6.2.1. How Do I Do That?

RAID can be managed by the kernel, by the kernel working with the motherboard BIOS, or by a separate computer on an add-in card. RAID managed by the BIOS is called dmraid; while supported by Fedora Core, it does not provide any significant benefits over RAID managed solely by the kernel on most systems, since all the work is still performed by the main CPU.

Using dmraid can thwart data-recovery efforts if the motherboard fails and another motherboard of the same model (or a model with a compatible BIOS dmraid implementation) is not available.


Add-in cards that contain their own CPU and battery-backed RAM can reduce the load of RAID processing on the main CPU. However, on a modern system, RAID processing takes at most 3 percent of the CPU time, so the expense of a separate, dedicated RAID processor is wasted on all but the highest-end servers. So-called RAID cards without a CPU simply provide additional disk controllers, which are useful because each disk in a RAID array should ideally have its own disk-controller channel.

There are six "levels" of RAID that are supported by the kernel in Fedora Core, as outlined in Table 6-3.

Table 6-3. RAID levels supported by Fedora Core
Linear
    Description: Linear/append. Devices are concatenated together to make one large storage area (deprecated; use LVM instead).
    Protection against drive failure: No.
    Write performance: Normal.
    Read performance: Normal.
    Number of drives: 2.
    Capacity: Sum of all drives.

0
    Description: Striped. The first block of data is written to the first block on the first drive, the second block of data is written to the first block on the second drive, and so forth.
    Protection against drive failure: No.
    Write performance: Normal to normal multiplied by the number of drives, depending on the application.
    Read performance: Multiplied by the number of drives.
    Number of drives: 2 or more.
    Capacity: Sum of all drives.

1
    Description: Mirroring. All data is written to two (or more) drives.
    Protection against drive failure: Yes. As long as one drive is working, your data is safe.
    Write performance: Normal.
    Read performance: Multiplied by the number of drives.
    Number of drives: 2 or more.
    Capacity: Equal to one drive.

4
    Description: Dedicated parity. Data is striped across all drives except that the last drive gets parity data for each block in that "stripe."
    Protection against drive failure: Yes. One drive can fail (but any more than that will cause data loss).
    Write performance: Reduced: two reads and one write for each write operation. The parity drive is a bottleneck.
    Read performance: Multiplied by the number of drives minus one.
    Number of drives: 3 or more.
    Capacity: Sum of all drives except one.

5
    Description: Distributed parity. Like level 4, except that the drive used for parity is rotated from stripe to stripe, eliminating the bottleneck on the parity drive.
    Protection against drive failure: Yes. One drive can fail.
    Write performance: Like level 4, but with no parity bottleneck.
    Read performance: Multiplied by the number of drives minus one.
    Number of drives: 3 or more.
    Capacity: Sum of all drives except one.

6
    Description: Distributed error-correcting code. Like level 5, but with redundant information on two drives.
    Protection against drive failure: Yes. Two drives can fail.
    Write performance: Same as level 5.
    Read performance: Multiplied by the number of drives minus two.
    Number of drives: 4 or more.
    Capacity: Sum of all drives except two.


For many desktop configurations, RAID level 1 (RAID 1) is appropriate because it can be set up with only two drives. For servers, RAID 5 or 6 is commonly used.

Although Table 6-3 specifies the number of drives required by each RAID level, the Linux RAID system is usually used with disk partitions, so a partition from each of several disks can form one RAID array, and another set of partitions from those same drives can form another RAID array.
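As a sketch of such a layout (the device names are hypothetical, and the mdadm --create options shown here are explained later in this lab), two mirrors could be built from two partitions on each of two disks:

# mdadm --create -l raid1 -n 2 /dev/md0 /dev/sda1 /dev/sdb1
# mdadm --create -l raid1 -n 2 /dev/md1 /dev/sda2 /dev/sdb2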

RAID arrays should ideally be set up during installation, but it is possible to create them after the fact. The mdadm command is used for all RAID administration operations; no graphical RAID administration tools are included in Fedora.

6.2.1.1. Displaying Information About the Current RAID Configuration

The fastest way to see the current RAID configuration and status is to display the contents of /proc/mdstat:

$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdc1[1] hda1[0]
      102144 blocks [2/2] [UU]

md1 : active raid1 hdc2[1] hda3[0]
      1048576 blocks [2/2] [UU]

md2 : active raid1 hdc3[1]
      77023232 blocks [2/1] [_U]

This display indicates that only the raid1 (mirroring) personality is active, managing three device nodes:


md0

This is a two-partition mirror, incorporating /dev/hda1 (device 0) and /dev/hdc1 (device 1). The total size is 102,144 blocks (about 100 MB). Both devices are active.


md1

This is another two-partition mirror, incorporating /dev/hda3 as device 0 and /dev/hdc2 as device 1. It's 1,048,576 blocks long (1 GB), and both devices are active.


md2

This is yet another two-partition mirror, but only one partition (/dev/hdc3) is present. The size is about 75 GB.

The designations md0, md1, and md2 refer to multidevice nodes that can be accessed as /dev/md0, /dev/md1, and /dev/md2.
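For a quick one-line summary of any of these devices, you can also use mdadm's query mode (the exact wording of the output varies with the mdadm version):

# mdadm --query /dev/md0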

You can get more detailed information about RAID devices using the mdadm command with the -D (detail) option. Let's look at md0 and md2:

# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Aug  9 02:16:43 2004
     Raid Level : raid1
     Array Size : 102144 (99.75 MiB 104.60 MB)
    Device Size : 102144 (99.75 MiB 104.60 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Mar 28 04:04:22 2006
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : dd2aabd5:fb2ab384:cba9912c:df0b0f4b
         Events : 0.3275

    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1      22        1        1      active sync   /dev/hdc1
# mdadm -D /dev/md2
/dev/md2:
        Version : 00.90.03
  Creation Time : Mon Aug  9 02:16:19 2004
     Raid Level : raid1
     Array Size : 77023232 (73.46 GiB 78.87 GB)
    Device Size : 77023232 (73.46 GiB 78.87 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Tue Mar 28 15:36:04 2006
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 31c6dbdc:414eee2d:50c4c773:2edc66f6
         Events : 0.19023894

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1      22        3        1      active sync   /dev/hdc3

Note that md2 is marked as degraded because one of the devices is missing.

6.2.1.2. Creating a RAID array

To create a RAID array, you will need two block devices, usually two partitions on different disk drives.

If you want to experiment with RAID, you can use two USB flash drives; in these next examples, I'm using some 64 MB flash drives that I have lying around. If your USB drives are auto-mounted when you insert them, unmount them before using them for RAID, either by right-clicking on them on the desktop and selecting Unmount Volume or by using the umount command.
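For example, assuming the flash drives appear as /dev/sdb1 and /dev/sdc1 (the device names used in the examples below), they can be unmounted by device name:

# umount /dev/sdb1 /dev/sdc1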


The mdadm option --create is used to create a RAID array:

# mdadm --create -n 2 -l raid1 /dev/md0 /dev/sdb1 /dev/sdc1
mdadm: array /dev/md0 started.

There are a lot of arguments used here:


--create

Tells mdadm to create a new disk array.


-n 2

The number of block devices in the array.


-l raid1

The RAID level.


/dev/md0

The name of the md device.


/dev/sdb1 /dev/sdc1

The two devices to use for this array.

/proc/mdstat shows the configuration of /dev/md0:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
      63872 blocks [2/2] [UU]

unused devices: <none>

If you have three or more devices, you can use RAID 5, and if you have four or more, you can use RAID 6. This example creates a RAID 5 array:

# mdadm --create -n 3 -l raid5 /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdf1
mdadm: largest drive (/dev/sdb1) exceed size (62464K) by more than 1%
Continue creating array? y
mdadm: array /dev/md0 started.

Note that RAID expects all of the devices to be the same size. If they are not, the array will use only an amount of storage equal to the smallest partition on each of the devices; for example, if given partitions that are 50 GB, 47.5 GB, and 52 GB in size, the RAID system will use 47.5 GB of each of the three partitions, wasting a total of 7 GB of disk space. If the variation between devices is more than 1 percent, as in this case, mdadm will prompt you to confirm that you're aware of the difference (and therefore the wasted storage space).
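If you want to compare the element sizes before creating an array, one way is to query each block device with blockdev (part of util-linux); the sizes are reported in bytes:

# blockdev --getsize64 /dev/sdb1
# blockdev --getsize64 /dev/sdc1
# blockdev --getsize64 /dev/sdf1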

Once the RAID array has been created, make a filesystem on it, as you would with any other block device:

# mkfs -t ext3 /dev/md0
mke2fs 1.38 (30-Jun-2005)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
16000 inodes, 63872 blocks
3193 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=65536000
8 block groups
8192 blocks per group, 8192 fragments per group
2000 inodes per group
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345

Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

Then mount it and use it:

# mkdir /mnt/raid
# mount /dev/md0 /mnt/raid

Alternately, you can use it as a PV under LVM. In this example, a new VG test is created, containing the LV mysql:

# pvcreate /dev/md0
Physical volume "/dev/md0" successfully created
# vgcreate test /dev/md0
Volume group "test" successfully created
# lvcreate test --name mysql --size 60M
Logical volume "mysql" created
# mkfs -t ext3 /dev/test/mysql
mke2fs 1.38 (30-Jun-2005)
...(Lines skipped)...
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# mkdir /mnt/mysql
# mount /dev/test/mysql /mnt/mysql

6.2.1.3. Handling a drive failure

You can simulate the failure of a RAID array element using mdadm:

# mdadm --fail /dev/md0 /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md0

The "failed" drive is marked with the symbol (F) in /proc/mdstat:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[2](F) sdb1[0]
      63872 blocks [2/1] [U_]

unused devices: <none>

To place the "failed" element back into the array, remove it and add it again:

# mdadm --remove /dev/md0 /dev/sdc1
mdadm: hot removed /dev/sdc1

# mdadm --add /dev/md0 /dev/sdc1
mdadm: re-added /dev/sdc1
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
      63872 blocks [2/1] [U_]
      [>....................]  recovery =  0.0% (928/63872) finish=3.1min speed=309K/sec

unused devices: <none>

If the drive had really failed (instead of being subject to a simulated failure), you would replace the drive after removing it from the array and before adding the new one.
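One common way to prepare a replacement drive is to copy the partition layout from a surviving member of the array; for example, assuming /dev/sdb is a healthy member and /dev/sdc is the blank replacement (double-check the device names, since writing to the wrong target will destroy its partition table):

# sfdisk -d /dev/sdb | sfdisk /dev/sdc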

Do not hot-plug disk drives (i.e., physically remove or add them with the power turned on) unless the drive, disk controller, and connectors are all designed for this operation. If in doubt, shut down the system, switch the drives while the system is turned off, and then turn the power back on.


If you check /proc/mdstat a short while after re-adding the drive to the array, you can see that the RAID system automatically rebuilds the array by copying data from the good drive(s) to the new drive:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
      63872 blocks [2/1] [U_]
      [=============>.......]  recovery = 65.0% (42496/63872) 
            finish=0.8min speed=401K/sec

unused devices: <none>

The mdadm command shows similar information in a more verbose form:

# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu Mar 30 01:01:00 2006
     Raid Level : raid1
     Array Size : 63872 (62.39 MiB 65.40 MB)
    Device Size : 63872 (62.39 MiB 65.40 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Mar 30 01:48:39 2006
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 65% complete

           UUID : b7572e60:4389f5dd:ce231ede:458a4f79
         Events : 0.34

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      spare rebuilding   /dev/sdc1

6.2.1.4. Stopping and restarting a RAID array

A RAID array can be stopped anytime it is not in use, which is useful if you have built an array incorporating removable or external drives that you want to disconnect. If you're using the RAID device as an LVM physical volume, you'll need to deactivate the volume group so that the device is no longer considered to be in use:

# vgchange test -an
  0 logical volume(s) in volume group "test" now active

The -an argument here means activated: no. (Alternately, you can remove the PV from the VG using vgreduce.)

To stop the array, use the --stop option to mdadm:

# mdadm --stop /dev/md0

The two steps above will automatically be performed when the system is shut down.

To restart the array, use the --assemble option:

# mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
mdadm: /dev/md0 has been started with 2 drives.

To configure the automatic assembly of this array at boot time, obtain the array's UUID (unique ID number) from the output of mdadm -D:

# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu Mar 30 02:09:14 2006
     Raid Level : raid1
     Array Size : 63872 (62.39 MiB 65.40 MB)
    Device Size : 63872 (62.39 MiB 65.40 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Mar 30 02:19:00 2006
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 5fccf106:d00cda80:daea5427:1edb9616
         Events : 0.18

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1

Then create the file /etc/mdadm.conf if it doesn't exist, or add an ARRAY line to it if it does:

DEVICE partitions
MAILADDR root
ARRAY /dev/md0 uuid=5fccf106:d00cda80:daea5427:1edb9616

In this file, the DEVICE line identifies the devices to be scanned (all partitions of all storage devices in this case), and the ARRAY lines identify each RAID array that is expected to be present. This ensures that the RAID arrays identified by scanning the partitions will always be assigned the same md device numbers, which is useful if more than one RAID array exists in the system. In the mdadm.conf files created during installation by Anaconda, the ARRAY lines contain optional level= and num-devices= entries (see the next section).
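Rather than typing the UUIDs by hand, you can have mdadm scan all devices and print matching ARRAY lines, appending them to the file; review the appended lines afterward to make sure they are what you expect:

# mdadm --examine --scan >> /etc/mdadm.conf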

If the device is a PV, you can now reactivate the VG:

# vgchange test -ay
1 logical volume(s) in volume group "test" now active

6.2.1.5. Monitoring RAID arrays

The mdmonitor service uses the monitor mode of mdadm to monitor and report on RAID drive status.

The method used to report drive failures is configured in the file /etc/mdadm.conf. To send email to a specific email address, add or edit the MAILADDR line:

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR raid-alert
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=dd2aabd5:fb2ab384:cba9912c:df0b0f4b
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=2b0846b0:d1a540d7:d722dd48:c5d203e4
ARRAY /dev/md2 level=raid1 num-devices=2 uuid=31c6dbdc:414eee2d:50c4c773:2edc66f6

When mdadm.conf is configured by Anaconda, the email address is set to root. It is a good idea to set this to an email alias, such as raid-alert, and configure the alias in the /etc/aliases file to send mail to whatever destinations are appropriate:

raid-alert: chris, 4165559999@msg.telus.com

In this case, email will be sent to the local mailbox chris, as well as to a cell phone.
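After editing /etc/aliases, rebuild the alias database so the new alias takes effect (with a sendmail-based mail setup, as installed by default, this is done with newaliases):

# newaliases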

When an event occurs, such as a drive failure, mdadm sends an email message like this:

From root@bluesky.fedorabook.com  Thu Mar 30 09:43:54 2006
Date: Thu, 30 Mar 2006 09:43:54 -0500
From: mdadm monitoring <root@bluesky.fedorabook.com>
To: chris@bluesky.fedorabook.com
Subject: Fail event on /dev/md0:bluesky.fedorabook.com

This is an automatically generated mail message from mdadm
running on bluesky.fedorabook.com

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdc1.

Faithfully yours, etc.

I like the "Faithfully yours" bit at the end!


If you'd prefer that mdadm run a custom program when an event is detected (perhaps to set off an alarm or other notification), add a PROGRAM line to /etc/mdadm.conf:

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR raid-alert
PROGRAM /usr/local/sbin/mdadm-event-handler
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=dd2aabd5:fb2ab384:cba9912c:df0b0f4b
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=2b0846b0:d1a540d7:d722dd48:c5d203e4
ARRAY /dev/md2 level=raid1 num-devices=2 uuid=31c6dbdc:414eee2d:50c4c773:2edc66f6

Only one program name can be given. When an event is detected, that program will be run with three arguments: the event, the RAID device, and (optionally) the RAID element. If you wanted a verbal announcement to be made, for example, you could use a script like this:

#!/bin/bash
#
# mdadm-event-handler :: announce RAID events verbally
#

# Set up the phrasing for the optional element name
if [ "$3" ]
then
        E=", element $3"
fi

# Separate words (RebuildStarted -> Rebuild Started)
T=$(echo $1|sed "s/\([A-Z]\)/ \1/g")

# Make the voice announcement and then repeat it
echo "Attention! RAID event: $T on $2 $E"|festival --tts
sleep 2
echo "Repeat: $T on $2 $E"|festival --tts

When a drive fails, this script will announce something like "Attention! RAID event: Fail on /dev/md0, element /dev/sdc1" using the Festival speech synthesizer. It will also announce the start and completion of array rebuilds and other important milestones (make sure you keep the volume turned up).
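To verify that the program (and the volume level) works without failing a real drive, mdadm's monitor mode can generate a test event for each array listed in /etc/mdadm.conf and then exit:

# mdadm --monitor --scan --oneshot --test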

6.2.1.6. Setting up a hot spare

When a system with RAID 1 or higher experiences a disk failure, the data on the failed drive will be recalculated from the remaining drives. However, data access will be slower than usual, and if any other drives fail, the array will not be able to recover. Therefore, it's important to replace a failed disk drive as soon as possible.

When a server is heavily used or is in an inaccessible location, such as an Internet colocation facility, it makes sense to equip it with a hot spare. The hot spare is installed but unused until another drive fails, at which point the RAID system automatically uses it to replace the failed drive.

To create a hot spare when a RAID array is initially created, use the -x argument to indicate the number of spare devices:

# mdadm --create -l raid1 -n 2 -x 1 /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdf1
mdadm: array /dev/md0 started.
$ cat /proc/mdstat
Personalities : [raid1] [raid5] [raid4]
md0 : active raid1 sdf1[2](S) sdc1[1] sdb1[0]
      62464 blocks [2/2] [UU]

unused devices: <none>

Notice that /dev/sdf1 is marked with the symbol (S) indicating that it is the hot spare.

If an active element in the array fails, the hot spare will take over automatically:

$ cat /proc/mdstat
Personalities : [raid1] [raid5] [raid4]
md0 : active raid1 sdf1[2] sdc1[3](F) sdb1[0]
      62464 blocks [2/1] [U_]
      [=>...................]  recovery =  6.4% (4224/62464) finish=1.5min speed=603K/sec

unused devices: <none>

When you remove, replace, and re-add the failed drive, it will become the hot spare:

# mdadm --remove /dev/md0 /dev/sdc1 
mdadm: hot removed /dev/sdc1
...(Physically replace the failed drive)...
# mdadm --add /dev/md0 /dev/sdc1
mdadm: re-added /dev/sdc1
# cat /proc/mdstat
Personalities : [raid1] [raid5] [raid4]
md0 : active raid1 sdc1[2](S) sdf1[1] sdb1[0]
      62464 blocks [2/2] [UU]

unused devices: <none>

Likewise, to add a hot spare to an existing array, simply add an extra drive:

# mdadm --add /dev/md0 /dev/sdh1
mdadm: added /dev/sdh1

Since hot spares are not used until another drive fails, it's a good idea to spin them down (stop the motors) to prolong their life. This command will program all of your drives to stop spinning after 15 minutes of inactivity (on most systems, only the hot spares will ever be idle for that length of time):

# hdparm -S 180 /dev/[sh]d[a-z]

Add this command to the end of the file /etc/rc.d/rc.local to ensure that it is executed every time the system is booted:

#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.

touch /var/lock/subsys/local
hdparm -S 180 /dev/[sh]d[a-z]

6.2.1.7. Monitoring drive health

Self-Monitoring, Analysis, and Reporting Technology (SMART) is built into most modern disk drives. It provides access to drive diagnostic and error information and failure prediction.

Fedora provides smartd for SMART disk monitoring. The configuration file /etc/smartd.conf is configured by the Anaconda installer to monitor each drive present in the system and to report only imminent (within 24 hours) drive failure to the root email address:

/dev/hda -H -m root
/dev/hdb -H -m root
/dev/hdc -H -m root

(I've left out the many comment lines that are in this file.)

It is a good idea to change the email address to the same alias used for your RAID error reports:

/dev/hda -H -m raid-alert
/dev/hdb -H -m raid-alert
/dev/hdc -H -m raid-alert

If you add additional drives to the system, be sure to add additional entries to this file.
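You can also query a drive's SMART status manually with smartctl, from the same smartmontools package that provides smartd; for example:

# smartctl -H /dev/hda
# smartctl -a /dev/hda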

6.2.2. How Does It Work?

Fedora's RAID levels 4 and 5 use parity information to provide redundancy. Parity is calculated using the exclusive-OR function, as shown in Table 6-4.

Table 6-4. Parity calculation for two drives
Bit from drive A    Bit from drive B    Parity bit on drive C
0                   0                   0
0                   1                   1
1                   0                   1
1                   1                   0


Notice that the total number of 1 bits in each row is an even number. You can determine the contents of any column based on the values in the other two columns (A = B XOR C and B = A XOR C); in this way, the RAID system can determine the content of any one failed drive. This approach will work with any number of drives.
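As a quick illustration of the recovery math (a toy sketch, not part of the md driver), the same XOR relationship can be checked in a shell:

#!/bin/bash
# XOR "parity" across two data bytes; any one value can be rebuilt from the other two.
A=0xA5                 # byte from drive A
B=0xC3                 # byte from drive B
C=$(( A ^ B ))         # parity byte stored on drive C

# Simulate losing drive B and recover its byte from A and C
B_RECOVERED=$(( A ^ C ))
printf "parity=0x%02X recovered_B=0x%02X\n" $C $B_RECOVERED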

Parity calculations are performed using the CPU's vector instructions (MMX/3DNow/SSE/AltiVec) whenever possible. Even an old 400 MHz Celeron processor can calculate RAID 5 parity at a rate in excess of 2 GB per second.

RAID 6 uses a similar but more advanced error-correcting code (ECC) that stores redundancy information on two drives in each stripe. This code permits recovery from the failure of any two drives, but the calculations run about one-third slower than the parity calculations. In a high-performance context, it may be better to use RAID 5 with a hot spare instead of RAID 6; the protection will be almost as good and the performance will be slightly higher.

6.2.3. What About...

6.2.3.1. ...booting from a RAID array?

During the early stages of the boot process, no RAID driver is available. However, in a RAID 1 (mirroring) array, each element contains a full and complete copy of the data in the array and can be used as though it were a simple volume. Therefore, only RAID 1 can be used for the /boot filesystem.

The GRUB boot record should be written to each drive that contains the /boot filesystem (see Lab 10.5, "Configuring the GRUB Bootloader").
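A commonly used sequence for the second drive is the interactive grub shell; this sketch assumes the mirrored /boot is the first partition of /dev/hdc (adjust the device names to match your system). Mapping the second drive as (hd0) ensures the boot record it carries will work if that drive ever has to boot the system on its own:

# grub
grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)
grub> quit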


6.2.3.2. ...mixing and matching USB flash drives, USB hard disks, SATA, SCSI, and IDE/ATA drives?

RAID can combine drives of different types into an array. This can be very useful at times; for example, you can use a USB hard disk to replace a failed SATA drive in a pinch.

6.2.3.3. ...mirroring to a remote drive as part of a disaster-recovery plan?

Daily disk or tape backups can be up to 24 hours out of date, which can hamper recovery when your main server is subject to a catastrophic disaster such as fire, circuit-frying power-supply-unit failure, or theft. Up-to-the-minute data backup for rapid disaster recovery requires the use of a remote storage mirror.

iSCSI (SCSI over TCP/IP) is a storage area network technology that is an economical alternative to Fibre Channel and other traditional SAN technologies. Since it is based on TCP/IP, it is easy to route over long distances, making it ideal for remote mirroring.

Fedora Core includes an iSCSI initiator, the software necessary to remotely access a drive using the iSCSI protocol. The package name is iscsi-initiator-utils. Obviously, you'll need a remote iSCSI drive in order to do remote mirroring, and you'll need to know the portal IP address or hostname on the remote drive.

Create the file /etc/initiatorname.iscsi, containing one line:

InitiatorName=iqn.2006-04.com.fedorabook:bluesky

This configures an iSCSI Qualified Name (IQN) that is globally unique. The IQN consists of the letters iqn, a period, the year and month in which your domain was registered (2006-04), a period, your domain name with the elements reversed, a colon, and a string that you make up (which must be unique within your domain).

Once the initiator name has been set up, start the iscsi service:

# service iscsi start

You may see some error messages the first time you start the iscsi daemon; these can be safely ignored.

Next, use the iscsiadm command to discover the volumes (targets) available on the remote system:

# iscsiadm -m discovery -tst -p 172.16.97.2
[f68ace] 172.16.97.2:3260,1 iqn.2006-04.com.fedorabook:remote1-volume1

If the remote drive requires a user ID and password for connection, edit /etc/iscsid.conf.


The options indicate discovery mode, sendtargets (st) discovery type, and the portal address or hostname. The result that is printed shows the IQN of the remote target, including a node record ID at the start of the line (f68ace). The discovered target information is stored in a database for future reference, and the node record ID is the key to accessing this information.
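To review the stored node records later, you should be able to list them with iscsiadm's node mode and no other options, which prints each record ID and its target:

# iscsiadm -m node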

To connect to the remote system, use iscsiadm to log in:

# iscsiadm -m node --record f68ace --login

The details of the connection are recorded in /var/log/messages:

Mar 30 22:05:18 blacktop kernel: scsi1 : iSCSI Initiator over TCP/IP, v.0.3
Mar 30 22:05:19 blacktop kernel:   Vendor: IET       Model: VIRTUAL-DISK      Rev: 0
Mar 30 22:05:19 blacktop kernel:   Type:   Direct-Access                      ANSI SCSI revision: 04
Mar 30 22:05:19 blacktop kernel: SCSI device sda: 262144 512-byte hdwr sectors (134 MB)
Mar 30 22:05:19 blacktop kernel: sda: Write Protect is off
Mar 30 22:05:19 blacktop kernel: SCSI device sda: drive cache: write back
Mar 30 22:05:19 blacktop kernel: SCSI device sda: 262144 512-byte hdwr sectors (134 MB)
Mar 30 22:05:19 blacktop kernel: sda: Write Protect is off
Mar 30 22:05:19 blacktop kernel: SCSI device sda: drive cache: write back
Mar 30 22:05:19 blacktop kernel:  sda: sda1
Mar 30 22:05:19 blacktop kernel: sd 14:0:0:0: Attached scsi disk sda
Mar 30 22:05:19 blacktop kernel: sd 14:0:0:0: Attached scsi generic sg0 type 0
Mar 30 22:05:19 blacktop iscsid: picking unique OUI for the same target node name iqn.2006-04.com.fedorabook:remote1-volume1
Mar 30 22:05:20 blacktop iscsid: connection1:0 is operational now

This shows that the new device is accessible as /dev/sda and has one partition (/dev/sda1).

You can now create a local LV that is the same size as the remote drive:

# lvcreate main --name database --size 128M
Logical volume "database" created

And then you can make a RAID mirror incorporating the local LV and the remote drive:

# mdadm --create -l raid1 -n 2 /dev/md0 /dev/main/database /dev/sdi1 
mdadm: array /dev/md0 started.

Next, you can create a filesystem on the RAID array and mount it:

# mkfs -t ext3 /dev/md0
mke2fs 1.38 (30-Jun-2005)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
32768 inodes, 130944 blocks
6547 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
16 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729

Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 27 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# mkdir /mnt/database
# mount /dev/md0 /mnt/database

Any data you write to /mnt/database will be written to both the local volume and the remote drive.

Do not use iSCSI directly over the Internet: route iSCSI traffic through a private TCP/IP network or a virtual private network (VPN) to maintain the privacy of your stored data.


To shut down the remote mirror, reverse the steps:

# umount /mnt/database
# mdadm --stop /dev/md0
# iscsiadm -m node --record f68ace --logout
               

A connection will be made to the remote node whenever the iSCSI daemon starts. To prevent this, edit the file /etc/iscsid.conf:

#
# Open-iSCSI default configuration.
# Could be located at /etc/iscsid.conf or ~/.iscsid.conf
#
node.active_cnx = 1
node.startup = automatic
#node.session.auth.username = dima
#node.session.auth.password = aloha
node.session.timeo.replacement_timeout = 0
node.session.err_timeo.abort_timeout = 10
node.session.err_timeo.reset_timeout = 30
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Wait = 0
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.MaxConnections = 0
node.cnx[0].iscsi.HeaderDigest = None
node.cnx[0].iscsi.DataDigest = None
node.cnx[0].iscsi.MaxRecvDataSegmentLength = 65536
#discovery.sendtargets.auth.authmethod = CHAP
#discovery.sendtargets.auth.username = dima
#discovery.sendtargets.auth.password = aloha

Change the node.startup line to read:

node.startup = manual

Once the remote mirror has been configured, you can create a simple script file with the setup commands:

#!/bin/bash
iscsiadm -m node --record f68ace --login
mdadm --assemble /dev/md0 /dev/main/database /dev/sdi1
mount /dev/md0 /mnt/database

And another script file with the shutdown commands:

#!/bin/bash
umount /mnt/database
mdadm --stop /dev/md0
iscsiadm -m node --record f68ace --logout

Save these scripts into /usr/local/sbin and enable read and execute permission for both of them:

# chmod u+rx /usr/local/sbin/remote-mirror-start 
# chmod u+rx /usr/local/sbin/remote-mirror-stop

You can also install these as init scripts (see Lab 4.6, "Managing and Configuring Services," and Lab 4.12, "Writing Simple Scripts").
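A minimal SysV-style wrapper might look like this (a sketch only; the chkconfig priorities are arbitrary, and the script names match the ones suggested above):

#!/bin/bash
#
# remote-mirror :: start/stop the iSCSI remote mirror
# chkconfig: - 99 01
# description: Assembles and mounts the remote RAID mirror

case "$1" in
    start)
        /usr/local/sbin/remote-mirror-start
        ;;
    stop)
        /usr/local/sbin/remote-mirror-stop
        ;;
    *)
        echo "Usage: $0 {start|stop}"
        exit 1
        ;;
esac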

6.2.3.4. ...using more than one RAID array, but configuring one hot spare to be shared between them?

This can be done through /etc/mdadm.conf. In each ARRAY line, add a spare-group option:

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 spare-group=red uuid=5fccf106:d00cda80:daea5427:1edb9616
ARRAY /dev/md1 spare-group=red uuid=aaf3d1e1:6f7231b4:22ca60f9:00c07dfe

The name of the spare-group does not matter as long as all of the arrays sharing the hot spare have the same value; here I've used red. Ensure that at least one of the arrays has a hot spare and that the size of the hot spare is not smaller than the largest element that it could replace; for example, if each device making up md0 was 10 GB in size, and each element making up md1 was 5 GB in size, the hot spare would have to be at least 10 GB in size, even if it was initially a member of md1.

6.2.3.5. ...configuring the rebuild rate for arrays?

Array rebuilds will usually be performed at a rate of 1,000 to 200,000 KB per second per drive, scheduled in such a way that the impact on application storage performance is minimized. Adjusting the rebuild rate lets you adjust the trade-off between application performance and rebuild duration.

The settings are accessible through two pseudofiles in /proc/sys/dev/raid, named speed_limit_max and speed_limit_min. To view the current values, simply display the contents:

$ cat /proc/sys/dev/raid/speed_limit*
200000
1000

To change a setting, place a new number in the appropriate pseudo-file:

# echo 40000 >/proc/sys/dev/raid/speed_limit_max

6.2.3.6. ...simultaneous drive failure?

Sometimes a drive manufacturer just makes a bad batch of disks, and this has happened more than once. For example, a few years ago, one drive maker used defective plastic to encapsulate the chips on the drive electronics; drives with the defective plastic failed at around the same point in their life cycles, so several elements of RAID arrays built using these drives would fail within a period of days or even hours. Since most RAID levels provide protection against a single drive failure but not against multiple drive failures, data was lost.

For greatest safety, it's a good idea to buy disks of similar capacity from different drive manufacturers (or at least different models or batches) when building a RAID array, in order to reduce the likelihood of near-simultaneous drive failure.

6.2.4. Where Can I Learn More?

The man pages for mdadm, mdadm.conf, md, hdparm, smartd, smartd.conf, smartctl, and iscsiadm
