How to protect against both bit rot and device failure with Btrfs

Question

How can you protect simultaneously against bit rot and device failure with Btrfs, given that Btrfs only verifies data integrity on files when it reads them?

The only solution I can think of is using two drives, each of which has two instances of the same data, a total of four instances of the same data. If one drive fails, the remaining one still has duplicated data.

But this profile seems not to exist.

Another impractical solution is to use RAID1 and execute a scrub after every write operation.

Yet another option is to have two drives, each partitioned in half, and use a RAID profile to have the four copies. But this is probably not good, for the same reasons why dup is preferable to RAID1 on two partitions of the same drive. I don't know those reasons, but there must be some, otherwise, people wouldn't have come up with dup as an alternative to using RAID1 on two partitions of the same drive.

Please make the effort to understand the question before posting an answer which doesn't help. Because data integrity is only verified when files are read, it seems that RAID-1 alone provides no mitigation against data corruption that has occurred since the last time the file was read or a scrub was run.

= As of 16 June I still have no satisfactory answer =

recommendation, user324831: Consider trying to understand the problem field before proposing a solution in your question! cas' answer is pretty on point, and your question betrays a lack of problem modelling. Modern enterprise SSDs have bit error probabilities on the order of 10⁻¹⁷ uncorrectable blocks per bit read. "Bit rot" striking while the other disk has failed and before you can replace it doesn't happen in practice, not unless you read many petabytes before you replace the failed disk (which you don't).
– Marcus Müller, Commented Jun 12 at 17:57
you do; cas' answer addresses the problem, you just don't have proper monitoring it seems. A scrub should never be what you need here.
– Marcus Müller, Commented Jun 12 at 18:12
How can anything know that bits are rotting, without trying to read them (and verifying a checksum or second copy)? If you care about your data, you need backups (following the 3-2-1 rule), not device redundancy. Device redundancy reduces downtime, it’s not a comprehensive data protection strategy. The path you’re going down is basically quantifying risk of failure during given time windows; you can only reduce the time window (at a cost), not eliminate the risk.
– Stephen Kitt, Commented Jun 12 at 18:22
@user324831 it's really neither a solution nor good, see cas' answer. I'll stop piling on this.
– Marcus Müller, Commented Jun 12 at 18:34

cas · Accepted Answer · 2025-06-12 16:25:09Z

You are over-thinking this and imagining a problem that doesn't exist.

If you have two drives, you can set up a btrfs RAID-1 filesystem. That gives you error detection (from the btrfs checksums) and error correction (because RAID-1 keeps a redundant copy). This is the protection against bitrot you're asking for. Protection against drive failure comes from each drive holding a copy of your data: as long as one of the two drives survives, the data will be fine.

Every block written is hashed and if an error is detected in the data on one of the drives, it will be repaired from the good copy on the other drive.
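To make that mechanism concrete, here is a toy model in Python. It is only a sketch, not how btrfs is implemented: the `ToyMirror` class and its methods are made up for illustration, and `zlib.crc32` stands in for the real checksum (btrfs defaults to CRC32C, which is a different polynomial). It shows a read that verifies both copies and rewrites a bad one, and a scrub that is nothing more than reading every block.

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in checksum (plain CRC-32), an assumption made for this sketch.
    return zlib.crc32(block)


class ToyMirror:
    """Hypothetical model: two copies of every block plus one checksum each."""

    def __init__(self, blocks):
        self.copies = [list(blocks), list(blocks)]  # "drive 0" and "drive 1"
        self.sums = [checksum(b) for b in blocks]

    def _verify_and_repair(self, index):
        """Return (good data, number of bad copies rewritten) for one block."""
        good = None
        bad_drives = []
        for drive, copy in enumerate(self.copies):
            if checksum(copy[index]) == self.sums[index]:
                if good is None:
                    good = copy[index]
            else:
                bad_drives.append(drive)
        if good is None:
            raise IOError(f"block {index}: no valid copy left")
        for drive in bad_drives:
            self.copies[drive][index] = good  # repair from the good copy
        return good, len(bad_drives)

    def read(self, index) -> bytes:
        """Normal read: verification and repair happen as a side effect."""
        return self._verify_and_repair(index)[0]

    def scrub(self) -> int:
        """Read every block; return how many bad copies were rewritten."""
        return sum(self._verify_and_repair(i)[1] for i in range(len(self.sums)))


mirror = ToyMirror([b"alpha", b"bravo", b"charlie"])
mirror.copies[0][1] = b"brav0"  # silent corruption on drive 0, never read
repaired = mirror.scrub()       # the periodic scrub finds and fixes it
```

In this model the corrupted copy is repaired by the scrub without anyone reading the file, so if drive 1 failed afterwards, drive 0 would still hold good data. How long corruption can sit undetected is bounded by how often everything gets read, which is the scrub interval.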

Your data will be safe unless both drives die at the same time.

This is pretty much the point of using filesystems like btrfs or ZFS, although other features like subvolumes, snapshots, send and receive, transparent compression, etc. are very nice too.

You do not need multiple copies on each drive. In fact, you'd be better off with a three-drive array to reduce the chance of all drives dying at the same time (on btrfs that means the raid1c3 profile, since plain raid1 keeps only two copies of each block however many drives are in the array). dup is certainly NOT preferable to RAID-1: it adds some redundancy to cope with bitrot (at the cost of halving your effective storage space, as does RAID-1...combined, you'd have only a quarter of the storage) but does not help at all if your drive dies.
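The space arithmetic above can be checked with a few lines of Python. This is only the arithmetic, under the assumption that every block is stored a fixed number of times; the function name and the 4 TB drive size are made up for the example, and it says nothing about whether a filesystem offers a given layout.

```python
def usable_fraction(copies: int) -> float:
    """Fraction of raw capacity left when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1 / copies


raid1_only = usable_fraction(2)       # one copy on each of two drives
dup_only = usable_fraction(2)         # two copies on the same drive
four_copies = usable_fraction(2 * 2)  # hypothetical: two copies on each of two drives

raw_tb = 2 * 4                        # assumed: two 4 TB drives
usable_tb = raw_tb * four_copies      # space left with four copies of everything
```

With the assumed pair of 4 TB drives, four copies of everything would leave 2 TB usable out of 8 TB raw.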

And you do not need to run scrub after every write. All writes are automatically synced to all drives in a RAID-1 array.

If one of the drives dies, replace it ASAP. Until it is replaced, there will be no redundancy of the data and no ability to correct errors. To be really safe, keep a spare drive so you can swap it in immediately (although it's debatable whether a cold spare is better than just having a three-drive RAID-1. A cold spare will be effectively brand-new when swapped in, while a third drive in the array will suffer normal wear and tear...but will increase redundancy and improve read performance).
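To put a rough number on the risk during that window, here is a back-of-the-envelope calculation in Python. It borrows the 10⁻¹⁷ error rate quoted in the comments on the question and assumes a 4 TB drive (my number, purely for illustration). It models only read errors at that quoted rate with a Poisson approximation, so treat it as an order-of-magnitude sketch rather than a reliability prediction.

```python
import math

UBER = 1e-17              # uncorrectable errors per bit read (figure from the comments)
DRIVE_BYTES = 4 * 10**12  # assumed 4 TB drive, hypothetical


def p_unreadable(bytes_read: int, uber: float = UBER) -> float:
    """Probability of at least one uncorrectable error while reading bytes_read bytes."""
    expected_errors = bytes_read * 8 * uber
    # Poisson approximation: P(at least one error) = 1 - exp(-expected_errors)
    return -math.expm1(-expected_errors)


# Reading the whole surviving drive once, e.g. to rebuild onto a replacement.
p_rebuild = p_unreadable(DRIVE_BYTES)
```

Under these assumptions `p_rebuild` comes out at roughly 0.03%, which is why the exposure while you wait for a replacement drive is small but not zero.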

Note that, as with any other form of RAID or RAID-like filesystem such as btrfs, RAID is NOT A SUBSTITUTE FOR REGULAR BACKUPS.

BTW, using RAID-1 with two partitions on the same drive is a really bad idea. You still lose half of your available space, and although btrfs can both detect and correct errors because there's a redundant copy, it doesn't protect against drive failure. It also trashes write performance because each block of data needs to be written twice to the same drive, on sectors located far from each other: the drive will be constantly thrashing, seeking from one end of the drive to the other (of course, this would only be a severe problem on HDDs, not with SSD or NVMe drives). And you wouldn't get any read performance benefit either, as the RAID-1 wouldn't be able to spread the read load over two (or more) drives.

Thanks for the answer, but I think it only works when I detect the corruption before one of the two disks fails: not the situation of my question because I'd need to read the corrupted data before one of the two disks fails. When you say dup is not preferable to raid-1 I think you didn't see that I meant raid-1 on two partitions of the same drive. Finally, I don't think it's possible to combine dup and raid-1, at least not by standard means...
– user324831, Commented Jun 12 at 16:32
YOU don't need to detect corruption. That's the filesystem's job, and it does it automatically when a file is read (which is why daily scrubs are a good idea). Your job is to notice when a drive fails and replace it ASAP. And yes, RAID-1 and dup can be used together. I don't see much point in doing so (except maybe for a sub-volume containing highly valuable data), but it's possible.
– cas, Commented Jun 12 at 16:45
Thanks for your comment, but because corruption is only detected on files when they are read, data integrity is not guaranteed at the point in time when one of the drives fails, hence my question.
– user324831, Commented Jun 12 at 16:47
