Bug 197377

Summary: [BCache] Enabling discard broke caching device
Product: IO/Storage Reporter: Clemens Eisserer (linuxhippy)
Component: Other Assignee: io_other
Status: NEW ---    
Severity: normal CC: colyli, sidney
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.13.5 Subsystem:
Regression: No Bisected commit-id:
Attachments: screenshot of syslog

Description Clemens Eisserer 2017-10-24 19:24:05 UTC
After using bcache daily for ~6 months I decided to enable discard (echo 1 > discard) on the caching device (the filesystem used on bcache itself was btrfs with discard disabled).

At reboot bcache failed with:
prio_read() bad csum reading priorities
prio_read() bad magic reading priorities

The caching SSD is: "ata2.00: ATA-8: OCZ-OCTANE, 1.14.1, max UDMA/133", which was completely stable for years with/without TRIM, and it does not show any signs of issues (SMART is clean, btrfs+discard has been stable since re-installing Fedora).
Comment 1 Clemens Eisserer 2017-11-06 10:02:40 UTC
Created attachment 260519
screenshot of syslog
Comment 2 Coly Li 2020-02-15 17:05:02 UTC
Does this problem still show up in Linux v5.5 or v5.6-rc ?

Thanks.

Coly Li
Comment 3 Sidney San Martín 2021-03-04 19:17:37 UTC
I just ran into this on 5.11. There are also mailing list threads going back years with people running into this issue:

https://www.spinics.net/lists/linux-bcache/msg02712.html
https://www.spinics.net/lists/linux-bcache/msg02954.html
https://www.spinics.net/lists/linux-bcache/msg04668.html
https://www.spinics.net/lists/linux-bcache/msg05279.html

I may be facing data loss, or at least a lengthy recovery. At the very least can a huge warning about discard be added to the docs?
Comment 4 Coly Li 2021-03-05 11:58:36 UTC
(In reply to Sidney San Martín from comment #3)
> I just ran into this on 5.11. There are also mailing list threads going back
> years with people running into this issue:
> 
> https://www.spinics.net/lists/linux-bcache/msg02712.html
> https://www.spinics.net/lists/linux-bcache/msg02954.html
> https://www.spinics.net/lists/linux-bcache/msg04668.html
> https://www.spinics.net/lists/linux-bcache/msg05279.html
> 
> I may be facing data loss, or at least a lengthy recovery. At the very least
> can a huge warning about discard be added to the docs?

It seems the problem is from bcache journal discard. I will post a fix and Cc you as Reported-by.

Thank you for the follow up and the above hint :-)

Coly Li
Comment 5 Sidney San Martín 2021-03-06 00:49:19 UTC
(In reply to Coly Li from comment #4)
> It seems the problem is from bcache journal discard. I will post a fix and
> Cc you as Reported-by.
> 
> Thank you for the follow up and the above hint :-)
> 
> Coly Li

Thanks Coly. Do you have any thoughts on how I or someone else in this state can approach recovery? Even if it involves manual work tweaking code or data structures, I'd really like to avoid having to recover everything from backup.

In my case, since BTRFS has some room to recover from failure, I just want to get things as clean as possible before I try.
Comment 6 Coly Li 2021-03-08 16:39:28 UTC
The problem is a discarded journal set not being identified as useless. I feel this issue was introduced in the very early version when journal discard was implemented. Your data should still be intact; the checksum failure is probably from a discarded journal set (not a really corrupted one). Maybe (it is really risky) you may try to comment out the checksum checking of the priority, then the cache set should come up if you are lucky. Then try to attach the cache from backing device, and try to recover your data without attached cache device.

The idea of the above steps is to ignore the unmatched checksum for the discarded journal set, but this is only a quick brainstorm; there is a real risk of losing data (maybe more) and I have not tried it before.

Coly Li
Comment 7 Coly Li 2021-03-08 16:41:06 UTC
(In reply to Coly Li from comment #6)
> The problem is a discarded journal set not identified as useless. I feel
> this issue was introduced from the very early version when journal discard
> was implemented. Your data should be still intact, the checksum failure is
> probably from a discarded journal set (not a really corrupted one), maybe
> (it is really risky) you may try to comment out the checksum checking of the
> priority, then the cache set should run up if you are lucky. Then try to
> attach the cache from backing device, and try to recovery your data without
> attached cache device.

The last sentence should be: Then try to detach the cache from the backing device, and try to recover your data without attaching the cache device.
Comment 8 Coly Li 2021-03-09 05:17:28 UTC
(In reply to Coly Li from comment #7)
> (In reply to Coly Li from comment #6)
> > The problem is a discarded journal set not identified as useless. I feel
> > this issue was introduced from the very early version when journal discard
> > was implemented. Your data should be still intact, the checksum failure is
> > probably from a discarded journal set (not a really corrupted one), maybe
> > (it is really risky) you may try to comment out the checksum checking of the
> > priority, then the cache set should run up if you are lucky. Then try to

A more accurate description is: if the checksum check fails, just skip it and do not trigger a fault. The checksum check is still necessary.

Coly Li
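
The pattern described here (keep verifying, but treat a mismatch as skippable instead of fatal) might look roughly like the following userspace sketch. It is only an illustration and not bcache code: `struct record`, `toy_csum()`, `check_record()` and the `tolerate_bad_csum` flag are all hypothetical names, and FNV-1a stands in for whatever checksum the real on-disk format uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout for this sketch; NOT the bcache on-disk format. */
struct record {
    uint64_t csum;
    uint8_t payload[32];
};

/* 64-bit FNV-1a, used only as a stand-in checksum (assumption). */
static uint64_t toy_csum(const uint8_t *p, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;    /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;             /* FNV prime */
    }
    return h;
}

enum verdict { RECORD_OK, RECORD_SKIPPED, RECORD_FATAL };

/*
 * The checksum is always computed and compared. The flag only decides
 * whether a mismatch is reported as "skip this record" or as a hard error.
 */
static enum verdict check_record(const struct record *r, bool tolerate_bad_csum)
{
    assert(r != NULL);

    if (r->csum == toy_csum(r->payload, sizeof(r->payload)))
        return RECORD_OK;

    return tolerate_bad_csum ? RECORD_SKIPPED : RECORD_FATAL;
}
```

A caller in recovery mode would pass `true` and ignore records that come back as `RECORD_SKIPPED`, while normal operation would keep passing `false` so that real corruption still stops the load.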
Comment 9 Sidney San Martín 2021-03-10 07:05:27 UTC
(In reply to Coly Li from comment #6)
> The problem is a discarded journal set not identified as useless. I feel
> this issue was introduced from the very early version when journal discard
> was implemented. Your data should be still intact, the checksum failure is
> probably from a discarded journal set (not a really corrupted one), maybe
> (it is really risky) you may try to comment out the checksum checking of the
> priority, then the cache set should run up if you are lucky. Then try to
> attach the cache from backing device, and try to recovery your data without
> attached cache device.
> 
> The idea of the above steps is to ignore the unmatched checksum for
> discarded journal set, but this is only my fast brain storm, it is really
> risky to loose data (maybe more) and I don't try it before.
> 
> Coly Li

Thanks Coly. I did try this, and I'm hitting a new failure (also enabled debug logging):

  kernel: bcache: read_super() read sb version 3, flags 3, seq 6, journal size 256
  kernel: bcache: bch_journal_read() 256 journal buckets
  kernel: bcache: journal_read_bucket() reading 0
  kernel: bcache: bch_journal_read() starting binary search, l 0 r 256
  kernel: bcache: journal_read_bucket() reading 128
  kernel: bcache: journal_read_bucket() 128: bad magic
  kernel: bcache: journal_read_bucket() reading 64
  kernel: bcache: journal_read_bucket() reading 96
  kernel: bcache: journal_read_bucket() reading 112
  kernel: bcache: journal_read_bucket() 112: bad magic
  kernel: bcache: journal_read_bucket() reading 104
  kernel: bcache: journal_read_bucket() 104: bad magic
  kernel: bcache: journal_read_bucket() reading 100
  kernel: bcache: journal_read_bucket() reading 102
  kernel: bcache: journal_read_bucket() 102: bad magic
  kernel: bcache: journal_read_bucket() reading 101
  kernel: bcache: journal_read_bucket() 101: bad magic
  kernel: bcache: bch_journal_read() finishing up: m 101 njournal_buckets 256
  kernel: bcache: journal_read_bucket() reading 99
  kernel: bcache: run_cache_set() btree_journal_read() done
  kernel: bcache: bch_cache_set_error() error on 6432e656-f28e-49f8-943d-6307d42d37e9: unsupported bset version at bucket 108857, block 0, 79054670 keys, disabling caching

Any ideas on what to poke next? FYI, I'm running the whole set of drives as dm snapshots with files as the COW devices, so these attempts are all safe and nondestructive.

I'm continuing to explore the bcache source but ideas would be much appreciated.
Comment 10 Sidney San Martín 2021-03-18 03:17:25 UTC
I have the bcache back online (still using COW snapshots, so the original devices are untouched).

To summarize a lot of debugging, I determined that bch_journal_read() is choosing the wrong bucket. The buckets on disk are laid out like this:

  [lots of valid, but old, jsets]
  [a bunch of completely empty buckets]
  [newest bucket, contains its own last_seq, with a bunch of valid jsets and one final garbage jset]
  [lots of valid, but old, jsets]
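
To illustrate how a bisection can miss the newest bucket in a layout like this, here is a minimal userspace sketch in C. It is a simplified model and not the kernel code. The empty range 101..205 is an assumption inferred from the probes in the debug log in comment 9, bucket 206 is taken as the newest bucket, and all sequence numbers are made up. `fill_layout()`, `bisect_newest()` and `scan_newest()` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 256u

/*
 * seq[i] is the newest sequence number stored in bucket i, or 0 if the
 * bucket is empty. Boundaries and values are assumptions for this sketch.
 */
static void fill_layout(uint64_t seq[NBUCKETS])
{
    for (unsigned i = 0; i < NBUCKETS; i++) {
        if (i <= 100)
            seq[i] = 50 + i;     /* valid but old: 50..150 */
        else if (i <= 205)
            seq[i] = 0;          /* completely empty buckets */
        else if (i == 206)
            seq[i] = 1000;       /* newest bucket */
        else
            seq[i] = i - 206;    /* valid but old: 1..49 */
    }
}

/*
 * Simplified bisection: starting from a bucket with valid entries, move the
 * left bound whenever the probed bucket holds something newer than anything
 * seen so far, otherwise move the right bound. Returns the bucket with the
 * newest sequence number it saw. If probes is non-NULL it must have room
 * for at least 8 entries and receives the probed bucket indices.
 */
static unsigned bisect_newest(const uint64_t seq[NBUCKETS], unsigned start,
                              unsigned *probes, size_t *nprobes)
{
    assert(start < NBUCKETS);
    assert(seq[start] != 0);

    unsigned l = start, r = NBUCKETS, best = start;
    uint64_t newest = seq[start];

    while (l + 1 < r) {
        unsigned m = (l + r) / 2;

        if (probes && nprobes)
            probes[(*nprobes)++] = m;

        if (seq[m] > newest) {
            newest = seq[m];
            best = m;
            l = m;
        } else {
            r = m;
        }
    }
    return best;
}

/* Reference answer: look at every bucket and keep the highest sequence. */
static unsigned scan_newest(const uint64_t seq[NBUCKETS])
{
    unsigned best = 0;

    for (unsigned i = 1; i < NBUCKETS; i++)
        if (seq[i] > seq[best])
            best = i;
    return best;
}
```

With this model, `bisect_newest(seq, 0, ...)` probes buckets 128, 64, 96, 112, 104, 100, 102 and 101, the same order as in the debug log, and settles on bucket 100, while `scan_newest(seq)` returns 206. The bisection only works if valid buckets form one contiguous run after the starting point, and the empty range in the middle breaks that assumption, so the newest bucket (206 here) is never considered as the end of the journal.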

If I hard-code that bucket, like this:

  diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
  index aefbdb7e0..838494dcf 100644
  --- a/drivers/md/bcache/journal.c
  +++ b/drivers/md/bcache/journal.c
  @@ -193,6 +193,11 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
     * Read journal buckets ordered by golden ratio hash to quickly
     * find a sequence of buckets with valid journal entries
     */
  +
  +  l = 206;
  +  if (read_bucket(l))
  +    goto yolofs;
  +
    for (i = 0; i < ca->sb.njournal_buckets; i++) {
      /*
       * We must try the index l with ZERO first for
  @@ -265,6 +270,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
        break;
    }

  +yolofs:
    seq = 0;

    for (i = 0; i < ca->sb.njournal_buckets; i++)

…then the cache comes up successfully with no other code changes. I can turn off writeback and get all of the backing devices into a clean state (by waiting or by setting them to writethrough). However, even if I turn off discard, the cache does not seem to end up back in a state where an un-modified module can load it. I'm curious if there's a way I can prod the module to rewrite things such that the bcache is in a good state again, but I'll *probably* want to find a way to blow away the existing bcache, instead, and start fresh with writeback and discard turned off.

(a) does this sound legit?
(b) what can I preserve from my bcache that would be useful for you and the other devs?