Bug 203573 - bcache consistently corrupts filesystem beyond repair
Summary: bcache consistently corrupts filesystem beyond repair
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Other (show other bugs)
Hardware: x86-64 Linux
: P1 blocking
Assignee: io_other
URL: https://bugzilla.redhat.com/show_bug....
Depends on:
Reported: 2019-05-11 08:01 UTC by Rolf Fokkens
Modified: 2019-07-03 20:12 UTC (History)
21 users (show)

See Also:
Kernel Version: 5.0.11
Regression: No
Bisected commit-id:

preprocessed output of bch_btree_insert_key() (3.87 KB, text/plain)
2019-06-07 09:46 UTC, Lukáš Krejčí
quick patch for the issue (1.75 KB, patch)
2019-06-07 13:11 UTC, Lukáš Krejčí
Details | Diff

Description Rolf Fokkens 2019-05-11 08:01:35 UTC
Upgrading a Fedora system using bcache to Fedora 30 results in massive filestem corruption. This is caused specifically caused by kernels built for Fedora 30, while the same kernels built for Fedora 29 work just fine.

A notable difference between Fedora 29 and Fedora 30 is the fact that Fedora 29 uses gcc8 and Fedora uses gcc9.
Comment 1 Rolf Fokkens 2019-05-11 08:03:20 UTC
More details can be found in the linked Fedora bug report.
Comment 2 Pierre Juhen 2019-05-12 06:57:27 UTC
Same issue for me.
Comment 3 Fredrik Chabot 2019-05-14 08:10:08 UTC
I can confirm this issue
Comment 4 Pierre Juhen 2019-05-14 09:56:14 UTC
Please look at https://bugzilla.redhat.com/show_bug.cgi?id=1708315 to see the progresses that have been made (Thahks to Rolf).

In summary :

The problem occurs only if the kernel has been compiled with gcc 9.x.y.

The problem doesn't occur in "none" mode, but in writethrough or in writeback mode.

This shows that the frontend metadata is not read properly (driver returns a wrong block if the block data is present in the frontend).
Comment 5 László Szalma 2019-05-24 07:34:09 UTC
I confirm the issue kernel 5.1.3 built with gcc 9.1.0 causes data loss when bcache has cache device. Without attached cache device it works.
Comment 6 Lukáš Krejčí 2019-06-07 09:46:21 UTC
Created attachment 283141 [details]
preprocessed output of bch_btree_insert_key()

I believe I may have identified the issue.

The problematic file for me was bset.o (when bset.o was compiled by GCC 8.3.0 and the rest by GCC 9.1.0, the issue disappeared).
The bug was present even when bset.o was compiled by GCC 9.1.0 with -O0 (I had to comment out pr_debug() in btree_mergesort() to make it compile).

Looking at the assembly of bset.o, the only obvious changes were in btree_mergesort() and bch_btree_insert_key().
I then generated the assembly output of bset.c by both GCC versions.
When the code of bch_btree_insert_key() generated by GCC 9 was replaced by the code from GCC 8, the issue disappeared again.

Further examination of the assembly revealed that the uninitialized space of `iter` is now shared by another temporary (whereas previously they occupied two separate locations, at least in the assembly compiled with -O0).
As far as I can see, the real issue is in the PRECEDING_KEY() macro; it returns an address of a temporary which goes out of scope at the end of the compound statement expresion (see the attached preprocessed version of bch_btree_insert_key()).

As a result, the arguments `iter` and `search` of bch_btree_iter_init() point to overlapping memory locations. When bch_btree_iter_init() writes to `iter`, `search` is clobbered.

I did only a very limited testing in a virtual machine, with the cache and the backing device both being loop devices backed by files on tmpfs.
The actual test consisted of formatting /dev/bcache0 to Btrfs and mounting it:
# mkfs.btrfs -f /dev/bcache0
# mount /dev/bcache0 /mnt     # <-- The failure was here.
# umount /mnt

There wasn't any success reproducing it with the fio script provided in Red Hat's Bugzilla.
Comment 7 Pierre Juhen 2019-06-07 10:20:38 UTC
Congratulations Lukãs,

Could you send a pacth, such a way we could test it ?

Thank you
Comment 8 Lukáš Krejčí 2019-06-07 13:11:48 UTC
Created attachment 283145 [details]
quick patch for the issue

I was just initializing `iter` to zero in bch_btree_insert_key() to force the compiler to not reuse the memory location (it worked for -O0) like so:
  struct btree_iter iter = {0};

Anyway, you can test the attached patch I made just now.
Comment 9 Hristo Venev 2019-06-07 16:41:02 UTC
This doesn't remove the problem of keeping a pointer to a temporary for too long - the pointer on line 444 is dangling when used on line 446.

I'm working on a patch that removes all uses of &KEY(...) because I find the exact rules for when it is OK confusing.
Comment 10 Lukáš Krejčí 2019-06-07 17:06:57 UTC
I don't think that's the case, the KEY() macro does not use compound statement expression (it's not in its own block, this should be clearly visible from the preprocessed source), so the temporary from it should last until the end of the inline function.
Comment 11 Hristo Venev 2019-06-07 21:38:28 UTC
You are right, compound literals last until the end of the block they are created in.

Anyway, I've tested your patch, and things looked fine. I will do more testing tomorrow.
Comment 12 Coly Li 2019-06-08 07:36:10 UTC
Here I post my reply on linux-bcache@mailing list to here. Thanks to Rolf Fokkens <rolf@rolffokkens.nl> to notice me for this bugzilla entry.

Hi folks,

From the above bugzilla.kernel.org links, it seems people are all on the
same/correct track :-)

I don't analyze the assembly code, because I can stably reproduce the
problem: two adjacent keys don't merge. And I find the problem is from
macro PRECEDING_KEY(), which misleading bch_btree_insert_key() when it
calls bch_btree_iter_init(). Misleading means returned 'm' always
indicates the inserting key is the first key in the btree node because
prev in the while-loop is NULL.

For the second adjacent key, obviously prev point should not be NULL, so
the suspicious point is PRECEDING_KEY(), its code is,
437 #define PRECEDING_KEY(_k)                                       \
438 ({                                                              \
439         struct bkey *_ret = NULL;                               \
440                                                                 \
441         if (KEY_INODE(_k) || KEY_OFFSET(_k)) {                  \
442                 _ret = &KEY(KEY_INODE(_k), KEY_OFFSET(_k), 0);  \
443                                                                 \
444                 if (!_ret->low)                                 \
445                         _ret->high--;                           \
446                 _ret->low--;                                    \
447         }                                                       \
448                                                                 \
449         _ret;                                                   \
450 })

The problem is at line 442, KEY() announced a on-stack temporary
variable in struct bkey {high = KEY_INODE(_k), low = KEY_OFFSET(_k), ptr
= NULL}. But its life range is in line 442-447. When address of this
on-stack variable returns from PRECEDING_KEY() and sent into
bch_btree_iter_init(), this stack space of this on-stack variable is
overwritten by stackframe of bch_btree_iter_init().

To confirm the above suspicion, I modify bch_btree_insert_key() to not
use PRECEDING_KEY(), and store the preceding key in a local variable of
bch_btree_insert_key() like this,

@@ -884,27 +894,78 @@ unsigned int bch_btree_insert_key(struct
btree_keys *b, struct bkey *k,
        struct bset *i = bset_tree_last(b)->data;
        struct bkey *m, *prev = NULL;
        struct btree_iter iter;
+       struct bkey preceding_key = ZERO_KEY;
+       struct bkey *preceding_key_p = NULL;

        BUG_ON(b->ops->is_extents && !KEY_SIZE(k));

-       m = bch_btree_iter_init(b, &iter, b->ops->is_extents
-                               ? PRECEDING_KEY(&START_KEY(k))
-                               : PRECEDING_KEY(k));
+       if (b->ops->is_extents) {
+               struct bkey tmp_k;
+               tmp_k = START_KEY(k);
+               if (KEY_INODE(&tmp_k) || KEY_OFFSET(&tmp_k)) {
+                       preceding_key = KEY(KEY_INODE(&tmp_k),
KEY_OFFSET(&tmp_k), 0);
+                       if (!preceding_key.low)
+                               preceding_key.high--;
+                       preceding_key.low--;
+               }
+               preceding_key_p = &preceding_key;
+       } else {
+               if (KEY_INODE(k) || KEY_OFFSET(k)) {
+                       preceding_key = KEY(KEY_INODE(k), KEY_OFFSET(k), 0);
+                       if (!preceding_key.low)
+                               preceding_key.high--;
+                       preceding_key.low--;
+               }
+               preceding_key_p = &preceding_key;
+       }
+       m = bch_btree_iter_init(b, &iter, preceding_key_p);

Now I can see the preceding key is reported and adjacent keys can be
merged. So far my simple fio testing works fine, but I need to go
through all locations where KEY() and PROCEDING_KEY() are used and find
a proper way to fix.

As I guessed this is a bcache bug and triggered by gcc9. I have no clear
idea why it works fine before (maybe depends on some compiling option
?), but definitely this code should be fixed.



Coly Li
Comment 13 Coly Li 2019-06-08 07:38:14 UTC
The above change is only a POC, not a final patch. Now I am thinking of how to fix it in a more elegant way, once done I will post patches to linux-bcache@vger.kernel.org and linux-block@vger.kernel.org.


Coly Li
Comment 14 Coly Li 2019-07-03 04:48:27 UTC
This bug is fixed and commit is merged in Linux v5.2 by the following commit,

commit 31b90956b124240aa8c63250243ae1a53585c5e2
Author: Coly Li <colyli@suse.de>
Date:   Mon Jun 10 06:13:34 2019 +0800

    bcache: fix stack corruption by PRECEDING_KEY()

It seems I am not able to set the bug status to fixed, so I leave message here to indicate this one can be closed as fixed.


Coly Li
Comment 15 Rolf Fokkens 2019-07-03 06:54:34 UTC
As the reporter I apparently am able to change the status.
Comment 16 Fredrik Chabot 2019-07-03 07:27:01 UTC
Will there be an updated installer image?
Comment 17 Pierre Juhen 2019-07-03 16:07:30 UTC
I don't know the answer.

However, there is a workaround.

1/ Disable caching on your bcache device(s) (cache_mode = none)

2/ Upgrade your system

3/ Update your installation

4/ A soon as your kernel version is at least 5.1.12, re-enable caching (cache_mode = writethrough or writeback)

5/ get rid of any kernel version  that can contain the issue (i.e. kernel version <= 5.1.11 compiled with gcc9)

Comment 18 Tomasz Torcz 2019-07-03 20:12:56 UTC
Not official, but there's a SIG creating updated images:
Current has 5.1.15-200 kernel - https://dl.fedoraproject.org/pub/alt/live-respins/

Note You need to log in before you can comment on or make changes to this bug.