[ Problems of RAID5 Arrays | Structure of a RAID5 Array | Recovery Tools | Download ]
This page is about recovery of data from crashed RAID5 arrays. You need a basic understanding of how RAID5 arrays work to make the best use of this page. I don't give any warranty for the correctness of the information or programs below. If that makes you nervous, you should pay money to have your data salvaged by a professional disaster recovery company.
Before attempting to do data recovery yourself, take a copy of the disks in the array and work on the copies; otherwise you risk making a bad situation much, much worse. If you have hardware RAID, attach the disks to a normal SCSI or IDE controller so that you can access all of the disks.
If you have working backups, don't bother with this page at all, unless you are in it for the challenge.
It's very tempting to assume that a RAID5 array is rock-solid reliable, and in certain ways it is: it can cope with a single disk failure, a common cause of data loss. On the other hand, a failing drive controller can cause the entire array to fail. Data on an array is also susceptible to damage from bad software (not forgetting the firmware on a hardware RAID controller), bad RAM and human error.
RAID is NOT a substitute for backups.
Unfortunately, it's easier to get the time and funding for providing on-line storage than it is for the hardware to back up that data.
It is latent errors, the ones that go unnoticed until the affected part of the disk is next read, that tend to cause difficulties.
Suppose that you have a latent disk error on a part of a disk that is either unused by the filesystem or is a parity block for an area that rarely changes. During normal operation, the existence of this error won't cause any difficulties. If a second disk error causes a different disk to be marked as bad, that disk's data will be reconstructed onto a spare disk. The reconstruction will read every disk block on the active disks, so it will detect the latent disk error, marking that disk off-line too. There are no longer enough active disks, so the entire array fails.
Regular surface scans are essential to detect latent disk errors.
If you haven't been doing surface scans, your RAID5 array has failed and you don't have a hot spare disk, you can take the opportunity to back up the data now, while the array is operating in degraded mode.
A latent error in an unused part of the disk shouldn't affect the backup; however, replacing the faulty disk may crash the array.
If you are doing RAID with multiple disks on one SCSI or RAID controller, you can get a bad effect where a controller problem results in a request to the disk timing out. This looks to the controller like a problem with the disk device, so the disk is marked off-line.
The faulty controller is still driving the disks, so the same problem can then happen to another disk. There are then insufficient disks to drive the array.
If there wasn't really a problem with a disk and you know the structure of the metadata, you can put the metadata back as it was when the second drive "failed".
To do this, examine the metadata of all the disks. You should find:
Chances are that the timestamps in the metadata are mere seconds apart. (If they're not, think harder about what went wrong before attempting a recovery.) The metadata block that records only one failure is from the second drive to fail. If you generate new metadata blocks for the other disks corresponding to this one, you'll end up with a degraded array with most of your data intact. You should force an fsck after doing this.
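On Linux software RAID, one way to sort out which drive failed when is to compare the Update Time that `mdadm --examine` reports for each member disk. The sketch below parses that field from sample output; the device names and timestamps are illustrative, not captured from a real array.

```python
import re
from datetime import datetime

def update_time(examine_output):
    """Extract the Update Time field from `mdadm --examine` output."""
    m = re.search(r"Update Time\s*:\s*(.+)", examine_output)
    return datetime.strptime(m.group(1).strip(), "%a %b %d %H:%M:%S %Y")

# Illustrative excerpts of `mdadm --examine` output, one per member disk
reports = {
    "/dev/sda": "Update Time : Mon Jan  6 10:14:30 2003",
    "/dev/sdb": "Update Time : Mon Jan  6 10:15:02 2003",
    "/dev/sdc": "Update Time : Mon Jan  6 10:15:09 2003",
    "/dev/sdd": "Update Time : Mon Jan  6 10:15:09 2003",
}

ordered = sorted(reports, key=lambda d: update_time(reports[d]))
# The oldest metadata belongs to the first drive kicked out.  The next
# oldest, seconds behind the survivors, is the second drive to "fail";
# its metadata still describes a working (degraded) array.
assert ordered[0] == "/dev/sda" and ordered[1] == "/dev/sdb"
```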
To construct a RAID5 array, first split your data into blocks of equal length. The size of the blocks will normally be larger than the block size of the underlying disk, 64K being a typical figure.
Now arrange these blocks on all but one of the disks, rotating through the disks in sequence. Compute each bit in the parity blocks (P) on the remaining disk by taking the one's complement sum (xor) of the corresponding bits on the other disks. The one's complement sum over all the disks will be 0. In the picture below, each line corresponds to one disk.
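The parity construction can be sketched in a few lines of Python (the function name and sample data are mine, chosen for illustration):

```python
def parity_block(data_blocks):
    """XOR corresponding bytes of equal-length blocks together."""
    out = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data blocks on three disks; the fourth disk holds the parity
data = [b"\x0f\x0f", b"\xf0\xf0", b"\xff\x00"]
p = parity_block(data)

# The one's complement sum over all disks, parity included, is zero
assert parity_block(data + [p]) == b"\x00\x00"
```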
To read a block from the array, calculate its position and issue a read to the appropriate disk.
To write to a block, you calculate the position of the data block and the corresponding parity block. You read the data (OD) and parity (OP) blocks into memory. The new data block (ND) is written in the normal way. The new parity block is calculated as NP = OP^ND^OD, where ^ is the one's complement addition operator (xor). The new parity block is then written to disk.
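A sketch of that read-modify-write update, checking that NP = OP^ND^OD agrees with recomputing the parity from scratch (all names are mine):

```python
def xor(a, b):
    """One's complement sum (xor) of two equal-length byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity(blocks):
    p = bytes(len(blocks[0]))
    for blk in blocks:
        p = xor(p, blk)
    return p

# A stripe with three data blocks; the parity block lives elsewhere
stripe = [b"\x11", b"\x22", b"\x44"]
op = parity(stripe)              # old parity (OP)

od, nd = stripe[1], b"\x99"      # old data (OD) and new data (ND)
np_ = xor(xor(op, nd), od)       # NP = OP ^ ND ^ OD

stripe[1] = nd
assert np_ == parity(stripe)     # matches a full recompute of the stripe
```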
To read a block from a failed disk, calculate its position, then read the corresponding blocks from all the other disks. The data block will be the one's complement sum (xor) of these blocks.
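A sketch of that reconstruction, assuming a three-disk stripe plus parity (names and data are illustrative):

```python
def xor_blocks(blocks):
    """XOR corresponding bytes of equal-length blocks together."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

stripe = [b"A", b"B", b"C"]
p = xor_blocks(stripe)

# Disk 1 has failed; its block is the xor of everything that survives
recovered = xor_blocks([stripe[0], stripe[2], p])
assert recovered == stripe[1]
```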
If you stop here, you have RAID4. This suffers performance problems during writes: every write requires an update to the parity disk, so that disk is a bottleneck. (Some systems use RAID4 so that they can grow an array by adding extra disks in parallel with the others. This is much, much harder with RAID5.)
RAID5 avoids the bottleneck by interleaving the parity blocks with the data blocks.
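One way to see the interleaving is to map a logical block number to its disk. The sketch below assumes the left-symmetric layout that Linux md uses by default; other controllers rotate the parity differently, and the function name is mine.

```python
def locate(block_no, ndisks):
    """Map a logical block to (data disk, parity disk, stripe) under
    the left-symmetric RAID5 layout (Linux md default).  Other
    controllers use different rotations."""
    stripe = block_no // (ndisks - 1)
    # Parity starts on the last disk and moves one disk left per stripe
    parity_disk = (ndisks - 1) - (stripe % ndisks)
    # Data blocks follow the parity disk, wrapping around the array
    data_disk = (parity_disk + 1 + block_no % (ndisks - 1)) % ndisks
    return data_disk, parity_disk, stripe

# Four disks: stripe 0 puts parity on disk 3, stripe 1 on disk 2, ...
assert locate(0, 4) == (0, 3, 0)   # first data block, disk 0
assert locate(3, 4) == (3, 2, 1)   # data continues just after parity
```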
To recover a crashed RAID array you'll need to know the following parameters:
Look for a data structure on disk that is larger than stripe size × number of disks. You'll be able to deduce the stripe size and parity layout by looking for discontinuities in the data. If you are using LVM1, the PE allocation table is excellent for this purpose.
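Whatever the stripe size and rotation, corresponding sectors across all member disks of a consistent RAID5 array xor to zero, so a quick invariant check can confirm a candidate set of disks and start offset before you hunt for discontinuities. A minimal sketch (the function name is mine):

```python
def xor_is_zero(sectors):
    """True if corresponding sectors across all member disks xor to
    zero, as they must on a consistent RAID5 array (any layout)."""
    acc = bytearray(len(sectors[0]))
    for s in sectors:
        for i, b in enumerate(s):
            acc[i] ^= b
    return not any(acc)

# Hypothetical check: read the same offset from each disk image; many
# stripes failing the invariant suggest a wrong start offset, or a disk
# that isn't really a member of the array.
d0, d1 = b"\x12", b"\x34"
d2 = bytes([0x12 ^ 0x34])            # consistent parity for d0, d1
assert xor_is_zero([d0, d1, d2])
assert not xor_is_zero([d0, d1, b"\x27"])
```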
When you examine the raw data, don't be surprised to see apparently real data in the parity blocks. This is normal if
Reconstructions that fail due to latent errors do so because they operate on whole disks at a time, marking the entire disk as bad because of one bad sector. A more pragmatic approach works on single stripes or single disk blocks at a time, preferring the believed working disks whenever possible. I don't have a program that automates the recovery process, but this tool will help you do the recovery manually.
The parity stripe is generated by xoring the data stripes, but note that the algorithm works the other way too - you can generate a data stripe by xoring the other data blocks with the parity stripe. Consequently you can generate a disk block on one disk from the corresponding block on the other disks without knowing anything about the stripe length or which disk has the parity stripe.
This program takes a bunch of disks and splices them together in the same way that a RAID controller would. You can then dd the output to another disk.
If you use raidextract to read a filesystem image, you can write the output to a file and run all the standard filesystem tools (tune2fs, e2fsck) on the file. Once you've finished, you can then mount the filesystem with mount -o loop.
./raidextract --window 1024 --stripe 16 --rotate 6 \
    --start $((0x41C6E79A00)) --length $((4096*1024*64000)) \
    --failed 5 /dev/sd[a-g] | ssh othermachine dd of=RecoveredFilesystem
This program is like raidextract, except that it only prints what raidextract would do. It is useful for identifying which disk holds the parity at a given point in the output stream.
These programs will not work on Linux kernels older than 2.4. If you want to run them on older systems, or on other Unix platforms, please read the advice on porting the programs.
tweak -l lets you look at the data without accidentally editing it. At the time of writing, it cannot cope with files larger than 2GB, so you can't examine the disk directly. You can work around this limitation by using dd to copy chunks of the disk into a file.
Peter Benie <email@example.com>