After using bcache daily for ~6 months I decided to enable discard (echo 1 > discard) on the caching device (the filesystem used on bcache itself was btrfs with discard disabled). At reboot bcache failed with:

prio_read() bad csum reading priorities
prio_read() bad magic reading priorities

The caching SSD is "ata2.00: ATA-8: OCZ-OCTANE, 1.14.1, max UDMA/133". It was completely stable for years with/without TRIM, and it does not show any signs of issues (SMART is clean, and btrfs+discard works stably after re-installing Fedora).
Created attachment 260519: screenshot of syslog
Does this problem still show up in Linux v5.5 or v5.6-rc?

Thanks.

Coly Li
I just ran into this on 5.11. There are also mailing list threads going back years with people running into this issue:

https://www.spinics.net/lists/linux-bcache/msg02712.html
https://www.spinics.net/lists/linux-bcache/msg02954.html
https://www.spinics.net/lists/linux-bcache/msg04668.html
https://www.spinics.net/lists/linux-bcache/msg05279.html

I may be facing data loss, or at least a lengthy recovery. At the very least, can a huge warning about discard be added to the docs?
(In reply to Sidney San Martín from comment #3)
> I just ran into this on 5.11. There are also mailing list threads going
> back years with people running into this issue:
>
> https://www.spinics.net/lists/linux-bcache/msg02712.html
> https://www.spinics.net/lists/linux-bcache/msg02954.html
> https://www.spinics.net/lists/linux-bcache/msg04668.html
> https://www.spinics.net/lists/linux-bcache/msg05279.html
>
> I may be facing data loss, or at least a lengthy recovery. At the very
> least, can a huge warning about discard be added to the docs?

It seems the problem is from bcache journal discard. I will post a fix and Cc you as Reported-by.

Thank you for the follow-up and the above hint :-)

Coly Li
(In reply to Coly Li from comment #4)
> It seems the problem is from bcache journal discard. I will post a fix and
> Cc you as Reported-by.

Thanks Coly. Do you have any thoughts on how I or someone else in this state can approach recovery? Even if it involves manual work tweaking code or data structures, I'd really like to avoid having to recover everything from backup. In my case, since btrfs has some room to recover from failure, I just want to get things as clean as possible before I try.
The problem is a discarded journal set not being identified as useless. I believe this issue dates back to the very early version in which journal discard was implemented. Your data should still be intact; the checksum failure is probably from a discarded journal set (not a really corrupted one). Maybe (it is really risky) you can try commenting out the checksum checking of the priorities; then the cache set should come up if you are lucky. Then try to attach the cache from the backing device, and try to recover your data without the attached cache device.

The idea of the above steps is to ignore the unmatched checksum for the discarded journal set, but this is only a quick brainstorm on my part. It is really risky (you may lose data, or more) and I have not tried it before.

Coly Li
(In reply to Coly Li from comment #6)
> Then try to attach the cache from the backing device, and try to recover
> your data without the attached cache device.

The last sentence should be: Then try to detach the cache from the backing device, and try to recover your data without attaching the cache device.
(In reply to Coly Li from comment #7)
> > Maybe (it is really risky) you can try commenting out the checksum
> > checking of the priorities; then the cache set should come up if you
> > are lucky.

A more accurate description is: if the checksum check fails, just skip it and do not trigger a fault. The checksum check itself is still necessary.

Coly Li
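To make that concrete, here is a minimal sketch of the idea against the v5.11-era prio_read() in drivers/md/bcache/super.c. The context lines are reconstructed from that era of the code and may not match other kernel versions exactly; this is a recovery-only hack, not a proposed fix, and the priorities read from a discarded bucket may well be garbage:

--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ static int prio_read(struct cache *ca, uint64_t bucket)
 			if (p->csum !=
 			    bch_crc64(&p->magic, meta_bucket_bytes(&ca->sb) - 8)) {
 				pr_warn("bad csum reading priorities\n");
-				goto out;
+				/* RECOVERY HACK: warn, but do not fault */
 			}
 
 			if (p->magic != pset_magic(&ca->sb)) {
 				pr_warn("bad magic reading priorities\n");
-				goto out;
+				/* RECOVERY HACK: warn, but do not fault */
 			}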
(In reply to Coly Li from comment #6)
> Maybe (it is really risky) you can try commenting out the checksum
> checking of the priorities; then the cache set should come up if you are
> lucky.

Thanks Coly. I did try this, and I'm hitting a new failure (I also enabled debug logging):

kernel: bcache: read_super() read sb version 3, flags 3, seq 6, journal size 256
kernel: bcache: bch_journal_read() 256 journal buckets
kernel: bcache: journal_read_bucket() reading 0
kernel: bcache: bch_journal_read() starting binary search, l 0 r 256
kernel: bcache: journal_read_bucket() reading 128
kernel: bcache: journal_read_bucket() 128: bad magic
kernel: bcache: journal_read_bucket() reading 64
kernel: bcache: journal_read_bucket() reading 96
kernel: bcache: journal_read_bucket() reading 112
kernel: bcache: journal_read_bucket() 112: bad magic
kernel: bcache: journal_read_bucket() reading 104
kernel: bcache: journal_read_bucket() 104: bad magic
kernel: bcache: journal_read_bucket() reading 100
kernel: bcache: journal_read_bucket() reading 102
kernel: bcache: journal_read_bucket() 102: bad magic
kernel: bcache: journal_read_bucket() reading 101
kernel: bcache: journal_read_bucket() 101: bad magic
kernel: bcache: bch_journal_read() finishing up: m 101 njournal_buckets 256
kernel: bcache: journal_read_bucket() reading 99
kernel: bcache: run_cache_set() btree_journal_read() done
kernel: bcache: bch_cache_set_error() error on 6432e656-f28e-49f8-943d-6307d42d37e9: unsupported bset version at bucket 108857, block 0, 79054670 keys, disabling caching

Any ideas on what to poke next?

FYI, I'm running the whole set of drives as dm snapshots with files as the COW devices (one way to set that up is sketched below), so these attempts are all safe and nondestructive. I'm continuing to explore the bcache source, but ideas would be much appreciated.
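For reference, a sketch of that kind of nondestructive setup using the device-mapper snapshot target. The device name, COW file path, and sizes here are placeholders; repeat per drive:

# Sparse file as the copy-on-write store, attached via a loop device
truncate -s 20G /var/tmp/sdb-cow
cow=$(losetup -f --show /var/tmp/sdb-cow)

# Persistent (P) snapshot with 8-sector chunks: all writes land in the
# COW file, /dev/sdb itself is never written
sectors=$(blockdev --getsz /dev/sdb)
echo "0 $sectors snapshot /dev/sdb $cow P 8" | dmsetup create sdb-snap

# Experiment on /dev/mapper/sdb-snap; delete the COW file to start over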
I have the bcache back online (still using COW snapshots, so the original devices are untouched). To summarize a lot of debugging, I determined that bch_journal_read() is choosing the wrong bucket. The buckets on disk are laid out like this:

[lots of valid, but old, jsets]
[a bunch of completely empty buckets]
[newest bucket, contains its own last_seq, with a bunch of valid jsets and one final garbage jset]
[lots of valid, but old, jsets]

If I hard-code that bucket, like this:

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index aefbdb7e0..838494dcf 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -193,6 +193,11 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 	 * Read journal buckets ordered by golden ratio hash to quickly
 	 * find a sequence of buckets with valid journal entries
 	 */
+
+	l = 206;
+	if (read_bucket(l))
+		goto yolofs;
+
 	for (i = 0; i < ca->sb.njournal_buckets; i++) {
 		/*
 		 * We must try the index l with ZERO first for
@@ -265,6 +270,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 			break;
 	}
 
+yolofs:
 	seq = 0;
 	for (i = 0; i < ca->sb.njournal_buckets; i++)

…then the cache comes up successfully with no other code changes. I can turn off writeback and get all of the backing devices into a clean state (by waiting, or by setting them to writethrough; see the sketch below). However, even if I turn off discard, the cache does not seem to end up back in a state where an unmodified module can load it.

I'm curious whether there's a way I can prod the module to rewrite things such that the bcache is in a good state again, but I'll *probably* want to find a way to blow away the existing bcache instead, and start fresh with writeback and discard turned off.

(a) Does this sound legit? (b) What can I preserve from my bcache that would be useful for you and the other devs?
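For anyone following along, the flush-and-detach step can be done through the backing device's sysfs knobs. A minimal sketch, assuming the backing device shows up as bcache0 (adjust to your setup):

# Stop caching new writes and flush existing dirty data
echo writethrough > /sys/block/bcache0/bcache/cache_mode

# Poll until the backing device reports "clean" (no dirty data left)
cat /sys/block/bcache0/bcache/state

# Then detach the backing device from the (suspect) cache set
echo 1 > /sys/block/bcache0/bcache/detach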