I have heard that RAID 1+0 is more fault-tolerant than RAID 0+1, because a second drive failure is more likely to cause data loss in RAID 0+1 than in RAID 1+0.

RAID 0+1 Example

In the above image, if "Disk 1" fails, which other disk failures will cause data loss? What I have read seems to indicate that the loss of any drive in "Group 2" will cause data loss, but the reasoning behind this is unclear to me. If we lose "Disk 5", why would that cause data loss? It seems to me that there is still sufficient information to recover the full state of the data: the combination of "Disk 4" + "Disk 2" + "Disk 3", for instance, should hold everything needed to keep functioning without data loss.

In this case, why would the loss of "Disk 1" and "Disk 5" cause data loss?
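
To make my reasoning concrete, here is a minimal sketch of the layout I am assuming (the column numbering and the disk-to-column mapping are my own guesses, not taken from any specific implementation): chunk column c of every stripe sits on Disk 1+c in Group 1 and on Disk 4+c in Group 2.

    # Assumed RAID 0+1 layout: Group 1 = Disks 1-3 (a stripe set),
    # Group 2 = Disks 4-6 (a mirror of that stripe set).
    column_on_disk = {1: 0, 2: 1, 3: 2,   # Group 1
                      4: 0, 5: 1, 6: 2}   # Group 2

    def columns_still_present(healthy_disks):
        """Stripe columns that still have at least one surviving copy."""
        return {column_on_disk[d] for d in healthy_disks}

    # Disks 4 + 2 + 3 cover every column:
    print(columns_still_present({4, 2, 3}))      # {0, 1, 2}
    # Even with Disk 1 and Disk 5 gone, every column survives somewhere:
    print(columns_still_present({2, 3, 4, 6}))   # {0, 1, 2}

So, purely in terms of which bits still exist on healthy platters, nothing seems to be missing.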

Thanks in advance!

3 Comments
  • en.wikipedia.org/wiki/Mdadm :: A single-drive failure in a RAID 10 configuration results in one of the lower-level mirrors entering degraded mode, but the top-level stripe performing normally (except for the performance hit). A single-drive failure in a RAID 0+1 configuration results in one of the lower-level stripes completely failing, and the top-level mirror entering degraded mode. Commented Jul 4, 2024 at 7:29
  • If you put RAID on RAID (using Linux mdadm), then mdadm is not aware that this is a 0+1 setup; so while it may be possible to recover, mdadm won't know how and simply enters the failed state if both underlying RAID 0 sets fail. In the case of RAID 1+0, this only happens if both sides of the same RAID 1 fail. So you could also say it comes down to implementation. Commented Jul 4, 2024 at 7:34
  • The problem is that once a single drive fails, causing the entire group to fail, that group no longer receives any updates. The data on the remaining drives, even if those drives are still working normally, will be unusable because it is already outdated. An implementation that could recover would have to keep mirroring data onto the remaining drives of an already failed group; it could be done that way, but would that still be considered 0+1? Commented Jul 4, 2024 at 7:42
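
The mdadm behaviour described in the comments above can be captured by a small model (a sketch of the semantics only, not of any real mdadm code; the disk numbering follows the question, and the RAID 10 pairing shown is just one possible layout):

    # Nested RAID 0+1: two RAID 0 stripe sets, mirrored by a top-level RAID 1
    # that only sees the two stripe-set devices, not the member disks.
    GROUP1 = {1, 2, 3}
    GROUP2 = {4, 5, 6}

    def stripe_set_ok(group, failed):
        # RAID 0 has no redundancy: any member failure fails the whole set.
        return not (group & failed)

    def nested_0_plus_1_ok(failed):
        return stripe_set_ok(GROUP1, failed) or stripe_set_ok(GROUP2, failed)

    print(nested_0_plus_1_ok({1}))     # True  - top-level mirror degraded
    print(nested_0_plus_1_ok({1, 5}))  # False - both stripe sets are "failed"

    def raid10_ok(failed, pairs=({1, 4}, {2, 5}, {3, 6})):
        # RAID 1+0: mirror pairs striped together; a pair only dies
        # when both of its members die.
        return all(pair - failed for pair in pairs)

    print(raid10_ok({1, 5}))           # True  - each pair kept one member
    print(raid10_ok({1, 4}))           # False - one pair lost both members

The same group-level check applies to the hardware-stripe + software-mirror configuration described in the answer below.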

1 Answer

This comes down to implementation details of the RAID layer(s).

For example, if you use a hardware RAID to do the striping, and then software RAID on top to mirror the stripe groups, the software RAID doing the mirroring won't see the individual disks, only the two devices presented by the hardware RAID, one for each stripe group.

In such a configuration, when Disk 1 fails, the hardware RAID controller will return errors on the Group 1 device. The software RAID on top must then consider the entire Group 1 as failed, as it won't see the individual disks.

If Disk 5 then fails, the Group 2 device will start returning errors too, and as far as the software RAID doing the mirroring is concerned, that's a double failure - data is lost. Game over.

If you try using Disks 4, 2 and 3 for recovery, there's the problem that after Disk 1 failed, the software RAID will have stopped updating the entire Group 1. So Disk 4 would have newer data than Disks 2 and 3, unless Disk 1 and Disk 5 failed at the exact same time... which is unlikely. And because of striping, any contiguous piece of data longer than one stripe risks incorporating parts of both the "older" and "newer" sets of stripes, resulting in a corrupted mess.
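
A tiny illustration of that staleness problem, with made-up chunk contents and an invented timeline (nothing here corresponds to a real on-disk format):

    # Each group holds the three chunk columns of some piece of data.
    group1 = {"c0": "v1", "c1": "v1", "c2": "v1"}   # Disks 1, 2, 3
    group2 = {"c0": "v1", "c1": "v1", "c2": "v1"}   # Disks 4, 5, 6

    # Disk 1 fails, so the mirroring layer stops writing to Group 1 at all.
    # Later writes therefore land only on Group 2.
    group2.update({"c0": "v2", "c1": "v2", "c2": "v2"})

    # Disk 5 fails. "Recovering" from Disks 4 + 2 + 3 mixes the two eras:
    recovered = {"c0": group2["c0"],   # Disk 4: current data
                 "c1": group1["c1"],   # Disk 2: stale data
                 "c2": group1["c2"]}   # Disk 3: stale data
    print(recovered)   # {'c0': 'v2', 'c1': 'v1', 'c2': 'v1'} - a corrupted mix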

If both striping and mirroring are done by the same RAID implementation, i.e. either just a hardware RAID controller that can do "RAID 10" or "RAID 0+1", or just a software RAID implementation that can do the same, then the implementation might be smart enough to keep updating Disks 2 and 3 after Disk 1 fails, even though the Group 1 stripe set will no longer be complete. If Disk 5 then fails too, the controller may be smart enough to see that Disks 4 + 2 + 3 together form a valid set, and keep on running.

Whenever the same RAID implementation handles both the striping and the mirroring, modern implementations usually work the way you seem to be thinking - they track the health of each copy of each set of stripes and will keep working as long as a complete set of stripes can be found.
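
A minimal sketch of that bookkeeping, assuming the six-disk layout from the question (this is not any vendor's actual algorithm): the array stays usable as long as every stripe column still has at least one healthy, up-to-date copy.

    # Each stripe column exists on one disk in each group.
    copies = {0: {1, 4}, 1: {2, 5}, 2: {3, 6}}   # column -> disks holding it

    def array_usable(failed_disks):
        """Usable iff every column keeps at least one healthy copy."""
        return all(disks - failed_disks for disks in copies.values())

    print(array_usable({1}))      # True
    print(array_usable({1, 5}))   # True  - columns 0 and 1 each keep a copy
    print(array_usable({1, 4}))   # False - both copies of column 0 are gone

This only works if the implementation really did keep every healthy disk up to date after the first failure; otherwise the staleness problem described above applies.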

However, this is not something you should blindly trust: you should carefully research your RAID implementation in advance, so that when (not "if"!) disks start failing, you'll know what your RAID implementation can and cannot do.

And if you layer different RAID implementations on top of each other (for example, if you use the OS's built-in software RAID to mirror data between two large SAN storage systems located in different buildings for disaster tolerance), you should carefully think through the failure scenarios - in the design phase, before you even start implementing your setup.

2 Comments
  • Also notice that the design phase mentioned above is the place where you'd really think hard about what the requirements of your system are. For example, mirroring slows writes down to the speed of the slowest mirror. This becomes really significant with random access on spinning media: when seeks cannot wait, seek times become highly variable, and anything that buffers write data fills up very quickly. You won't win much with striping in that case. Read-heavy use cases, on the other hand, might be fine. Commented Jul 4, 2024 at 9:54
  • You also need to think about which failure modes you're really protecting against. I find simple mirroring to be consistently insufficient without parity/checksums: great, you have two copies of the same data, now one has suffered a bit flip - which one is right? You seem to have enough disks; this might really call for the more complex RAID 5 or RAID 6 setups, or for something less classical like ZFS RAID-Z. But if you are just insuring yourself against complete failure of whole disks, sure, mirroring is good. Commented Jul 4, 2024 at 9:55
