
It looks like my system is suffering from some sort of catastrophic failure, and I'm panicking at the moment, not sure what to do.

I had a 3-drive RAID 10 array. I noticed this morning that I was having trouble accessing the array (all my photos are on there). I checked mdadm and it said that one drive had been dropped from the array (drive 2). I think this may have happened because the computer was shut down unexpectedly (there was a blackout), and the drive was kicked as a result.

I tried adding the drive back, and that worked. Then I checked the progress of the rebuild in /proc/mdstat, and it was rebuilding at 10 kB/s. Yes. It would take years to rebuild a 3 TB drive at that speed. I checked dmesg and saw a bunch of I/O errors for drive 3, so it seems that drive 3 has some kind of hardware failure. I checked that drive's health with the Disks tool (I think it's from GNOME, gnome-disk-utility or something), and it reported almost 6000 bad sectors, but apart from that it said the drive was OK.
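
For reference, the commands I ran were roughly along these lines (from memory, so the array and device names here are only illustrative):

    mdadm --detail /dev/md0          # showed drive 2 dropped from the array
    mdadm --add /dev/md0 /dev/sdc1   # re-add the dropped drive
    cat /proc/mdstat                 # rebuild crawling along at ~10 kB/s
    dmesg | tail                     # I/O errors for the third drive (sdd)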

So now I'm panicking, thinking that drive 2 was actually still good, and now that I've re-added it, the resync is probably destroying the good data it had. I then tried turning off the computer (I read it was safe to do so even while the array is resyncing).

Unfortunately, the shutdown isn't completing. Pressing Esc brought up the console, which is displaying a stream of "print_req_error: I/O error, dev sdd, sector ...." errors. I don't know what to do. Just wait? It's been going on like this for 30 minutes.

Any advice?

  • Power off the entire system NOW until you get some (good) advice. Don't try to shut down cleanly. Just pull the plug. Commented Jul 7, 2019 at 21:56
  • Do you have a backup of your data? Commented Jul 7, 2019 at 21:56
  • Ok, just pulled the plug. I only have old backups (about two years old). I moved house and never got around to backing everything up again. The irony is that I was actually looking to do this now. I thought I was relatively safe since I'd notice if one drive failed and could deal with it. Commented Jul 7, 2019 at 21:59
  • OK. Please edit your question to explain how you're running RAID 10 with (only) three drives. Commented Jul 7, 2019 at 21:59
  • I'm about to head to bed so I can't follow this through to its conclusion, but one possible option is to boot a system rescue disk with only one disk of your RAID array installed. DO NOT let it try to start your array. Then carefully, and using something like mount -o ro,noload /dev/... mount each partition's filesystem in turn. DON'T START THE RAID - I'm referring to the filesystem inside the RAID that you're never supposed to see. Find out which disk has the most complete set of filesystems (i.e. boot three times, once for each disk)... Commented Jul 7, 2019 at 22:03
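
A minimal sketch of the read-only check the last comment describes, assuming an ext4 filesystem and hypothetical device and mount-point names (whether a lone member is directly mountable also depends on the RAID layout and metadata version):

    mkdir -p /mnt/check
    mount -o ro,noload /dev/sdb1 /mnt/check   # ro + noload: no journal replay, nothing gets written
    ls /mnt/check                             # judge how complete this copy looks
    umount /mnt/check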

2 Answers


It's a miracle. Somehow I got the array back up and running. Here's what I did:

  1. As mentioned in the original post, the system wasn't shutting down because it was still trying to write something to the faulty drive. I followed the advice of user361233 and pulled the plug.
  2. I stopped panicking. With the computer shut off, I could think about the next steps.
  3. I went and bought two new 3TB drives.
  4. I slept on it, and today I booted the computer from a live USB session (Manjaro) with only a single drive plugged in at a time (so I rebooted three times, once for each disk). I checked the health of the disks with KDE Partition Manager; the SMART status said all the disks were fine. I then hoped that whatever hardware failure was going on with disk 3 of the array had, at least temporarily, subsided.
  5. I plugged all three disks in and restarted (again from the live USB session). In retrospect, Manjaro wasn't the best choice for a recovery environment, since it already had mdadm installed and had already tried to start the array itself (as /dev/md127). I discovered this when I tried to start the array manually:

    mdadm --assemble --scan
    

    When I did this, it complained that there was already an active array (or something to that effect). I remembered that /dev/md127 is sometimes started automatically, so I stopped that array and tried to start mine manually:

    mdadm --stop /dev/md127
    mdadm --assemble --scan
    

    This didn't work either. I then tried explicitly specifying the partitions on each disk to use when assembling the array:

    mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1
    

    This worked! Then I checked the status of the array with mdadm --detail /dev/md0. Strangely enough, it said the array was up with 3/3 disks. When I checked cat /proc/mdstat, there was no indication that drive 2 of the array was rebuilding (remember, drive 2 was the one initially kicked out of the array after the power outage). By some kind of miracle, the drive 2 that had been rebuilding at a glacial pace when I shut the computer down must have actually been fine, and somehow mdadm accepted it into the array this time.

  6. I then tried accessing the array to copy my data over to the new disk I bought. It didn't work: just listing the contents of a directory made the ls command hang, and there were once again a bunch of I/O errors in dmesg, specifically relating to disk 3 (/dev/sdd).

  7. I tried cancelling the ls command, and it took a few Ctrl-C attempts and a few minutes of waiting before I got the command prompt back. At this point I checked the array once again with mdadm --detail /dev/md0. By now it had recognized disk 3 as having a hardware failure and kicked it from the array (a sketch of the equivalent manual commands follows the list below). The array now has only disk 1 (/dev/sdb, the completely healthy drive) and disk 2 (/dev/sdc, the drive that was initially kicked from the array at the very start of all this).

  8. I tried accessing the array once again, and this time it worked! I was able to list all my files with ls, and even browse them with the file manager. At this point I started copying all my important files over to the extra drive I bought; I'm now almost done with that process.
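
As mentioned in step 7: if mdadm had not kicked the failing member on its own, the equivalent manual step would look roughly like this (a sketch, assuming the failing member is /dev/sdd1 in /dev/md0, as in my case):

    mdadm /dev/md0 --fail /dev/sdd1     # mark the member as faulty
    mdadm /dev/md0 --remove /dev/sdd1   # drop it from the array
    cat /proc/mdstat                    # the array should now show up as degraded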

In the end, this is a good reminder to regularly back up my files to a separate device. I used to be in the habit of doing so, but I've been negligent for the past year or two. Sorry if this post got long and isn't the most specific; I don't remember the exact output of each command.

TL;DR I turned off the computer and stopped panicking. I then had time to make a plan to approach the problem. It's a good reminder to keep up-to-date backups.

  • Really pleased to see you've recovered it. And yes, take those backups - especially as you now have some spare disks :-) Commented Jul 9, 2019 at 9:38
  • Could you also explain what kind of RAID setup you had in reality? A RAID 10 array is a striped mirror, and as such it will always have an even number of drives unless it's already degraded. Logically, the minimum number of drives for a healthy RAID 10 is 4 (two pairs of mirrored disks, with the mirrors striped). Were you actually running a 3-disk RAID 5? If you're unsure, grep ^md /proc/mdstat will tell the correct answer. Commented Sep 16, 2022 at 9:36

It's difficult to answer your question, and this is too long for a comment, so just some general pointers.

So now I'm panicking, thinking that drive 2 was actually still good, and now that I've re-added it, the resync is probably destroying the good data it had.

Unless there is a kernel bug, re-adding a disk (in the same role and at the same offset it had before) does not "destroy" data. It just rewrites most of the same data that was already there; no harm done.

  • the role might change if more than one drive was missing from the array
  • the offset usually only changes if you add sdx when it was sdx1 before
  • if you're very unlucky, the offset might also change if the array was in a weird state before

The main problem with a kicked drive, even if the drive itself was innocent, is that it's no longer part of the array. As soon as the array is mounted read-write, data on the array is modified, but the data on the kicked drive is not updated along with it, so it becomes outdated and as such is no longer "good".
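
For example, one way to see how far out of date a kicked member has become is to compare the event counters in the member superblocks (a sketch; the partition names are hypothetical):

    mdadm --examine /dev/sdb1 | grep -E 'Events|Update Time'
    mdadm --examine /dev/sdc1 | grep -E 'Events|Update Time'
    # a member whose event count lags behind the others holds stale data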

I checked that drive's health with the Disks tool (I think it's from GNOME, gnome-disk-utility or something), and it reported almost 6000 bad sectors, but apart from that it said the drive was OK.

You can't do data recovery if your drives have issues. If those 6000 bad sectors didn't appear overnight, you should have replaced that drive a long time ago. RAIDs die if you don't self-test, monitor, and promptly replace any drives that are going bad.
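
For example (a sketch; the device name is hypothetical and the mail address is a placeholder):

    smartctl -t long /dev/sdd      # run an extended SMART self-test
    smartctl -a /dev/sdd           # check attributes (reallocated/pending sectors) and the self-test log
    # have mdadm warn you as soon as an array degrades
    mdadm --monitor --scan --daemonise --mail=admin@example.com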

Get new drives, use ddrescue to copy what you can from the old drives, then use copy-on-write overlays for your data recovery experiments. With overlays you can write without modifying the original, so you don't have to redo the disk copy and don't need a copy of the copy either. But overlays, too, require drives that work; you can't use them on drives that are throwing errors.
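
A sketch of that workflow, with hypothetical device names (the failing disk as /dev/sdd, a new disk of at least the same size as /dev/sde):

    # 1) copy whatever is readable from the failing disk onto the new one
    ddrescue -d -f /dev/sdd /dev/sde rescue.map

    # 2) put a copy-on-write overlay on top of the *copy*, so experiments never modify it
    truncate -s 10G overlay.img                     # sparse file that absorbs the writes; grow as needed
    loopdev=$(losetup -f --show overlay.img)
    dmsetup create sde_overlay --table "0 $(blockdev --getsz /dev/sde) snapshot /dev/sde $loopdev N 8"

    # experiment on /dev/mapper/sde_overlay instead of /dev/sde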

  • Thank you for your feedback. It's very useful to know that re-adding a disk in the same role and at the same offset as before does not destroy any data; I think that this is what actually saved me. When I rebooted into a live session, I was able to manually re-assemble the array with mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1, and this time it considered all the disks up to date. Then mdadm kicked out the actually faulty disk, leaving me with a degraded but functional array. Commented Jul 8, 2019 at 21:23
  • If you have a failing drive and you need to save data from it, consider the value of the data. If the data is absolutely critical, disconnect the power immediately and take the device to a professional data recovery service. It will be insanely expensive, but that's your best bet. If you are willing to risk the data, start by imaging the broken drive using ddrescue and then proceed with copy-on-write experiments on the drive that contains the data ddrescue was able to read. Do not use the broken drive for anything other than as a read-only ddrescue source. Commented Sep 16, 2022 at 9:42
