Bug 213181

Summary: kernel BUG at block/bio.c:52!
Product: IO/Storage Reporter: Jens Stutte (jens)
Component: Block Layer Assignee: Jens Axboe (axboe)
Status: NEW ---    
Severity: high CC: arne_woerner, guoqing.jiang, ian, jens, troytjh98
Priority: P1    
Hardware: All   
OS: Linux   
URL: https://bugs.archlinux.org/task/70992
Kernel Version: 5.12.5-arch1-1 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: kernel BUG at block/bio.c:52!
Still crash with 5.14.7-arch1-1
Check max_sectors always if there is a bitmap.
patch that allow write behind only if write mostly is set

Description Jens Stutte 2021-05-23 12:10:38 UTC
Created attachment 296953 [details]
kernel BUG at block/bio.c:52!

The patch "block: use bi_max_vecs to find the bvec pool" (https://www.spinics.net/lists/linux-block/msg64802.html) causes/reveals a kernel bug (see attachment).

I assume there has always been something wrong somewhere in md that was quietly tolerated before. I am unsure whether this is really serious enough to justify a BUG() here.

I have a RAID 1 setup with two 2TB SSDs here. I am happy to provide any further information, of course.
Comment 2 Arne Woerner 2021-06-03 21:39:32 UTC
*** Bug 213303 has been marked as a duplicate of this bug. ***
Comment 3 Troy 2021-06-05 16:31:40 UTC
*** Bug 213249 has been marked as a duplicate of this bug. ***
Comment 4 Troy 2021-06-05 22:53:24 UTC
This bug appears to be mentioned here:
https://www.spinics.net/lists/linux-bcache/msg10077.html

They provided a temporary patch further along in the thread:
https://www.spinics.net/lists/linux-bcache/msg10108.html
Comment 5 Arne Woerner 2021-06-12 08:50:36 UTC
the patch seems to work fine... -arne
Comment 6 Arne Woerner 2021-06-17 07:16:28 UTC
5.12.11 has been working fine for hours... -arne
Comment 7 Jens Stutte 2021-06-18 06:27:41 UTC
Hi Arne, did you test with a bcache setup or plain md? I see this patch touches only bcache-related files, and I am not using bcache here at all (and I'd rather test on that machine only if I can hope that it works :-) ). Thanks!
Comment 8 Arne Woerner 2021-06-18 17:25:47 UTC
Hi Jens, I only tested it with bcache.
the changelog doesn't say anything about md, right?
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.12.11
so md might still be affected.
-arne
Comment 9 Jens Stutte 2021-07-08 16:45:05 UTC
The last comment from Ilgiz on https://bugs.archlinux.org/task/70992 indicates that it is not solved yet for md users.
Comment 10 Guoqing Jiang 2021-07-21 02:20:27 UTC
Does the raid1 array mentioned in the Arch Linux bug have a write-behind device?

Anyway, could you try the small change below? Thanks.

--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1456,6 +1456,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
                goto retry_write;
        }
 
+       max_sectors = min_t(uint32_t, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
        if (max_sectors < bio_sectors(bio)) {
                struct bio *split = bio_split(bio, max_sectors,
                                              GFP_NOIO, &conf->bio_split);
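
For context on why such a clamp helps (a rough sketch, not part of the original comment): alloc_behind_master_bio() copies the write payload into newly allocated pages, one bvec per page, so a bio of more than BIO_MAX_VECS * PAGE_SECTORS sectors would need more bvecs than bvec_alloc() can provide and trips the BUG at block/bio.c:52. With 4 KiB pages and 512-byte sectors the cap is 256 * 8 = 2048 sectors, i.e. 1 MiB per write-behind bio. The helper below is purely illustrative and not kernel code:

/*
 * Illustrative only: number of bvecs a write-behind copy of 'sectors'
 * sectors needs when the payload is copied one page per bvec.  Keeping
 * the bio at or below BIO_MAX_VECS * PAGE_SECTORS sectors keeps this
 * result within what bvec_alloc() accepts.
 */
static inline unsigned int behind_bvecs_needed(unsigned int sectors)
{
	unsigned int page_sectors = PAGE_SIZE >> 9;	/* 8 with 4 KiB pages */

	return DIV_ROUND_UP(sectors, page_sectors);	/* must stay <= BIO_MAX_VECS */
}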
Comment 11 Jens Stutte 2021-07-22 06:05:57 UTC
Hi Guoqing, yes, the devices in this array seem to have write-behind enabled:

# mdadm -X /dev/sdb1
        Filename : /dev/sdb1
           Magic : 6d746962
         Version : 4
            UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
          Events : 32707
  Events Cleared : 32707
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Allow write behind, max 4096
       Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 0 dirty (0.0%)
# mdadm -X /dev/sdc1
        Filename : /dev/sdc1
           Magic : 6d746962
         Version : 4
            UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
          Events : 32707
  Events Cleared : 32707
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Allow write behind, max 4096
       Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 0 dirty (0.0%)

I cannot safely try this patch any time soon, as I am away from the machine and I need it up and running.
Comment 12 Jens Stutte 2021-08-10 11:32:21 UTC
Trying the patch above leads to a compile error, as PAGE_SECTORS is not defined. I can find its definition inside "bcache/util.h", but it feels kind of wrong to include that inside raid1.c?
Comment 13 Guoqing Jiang 2021-08-11 00:40:45 UTC
That is why I sent the patch.

https://lore.kernel.org/linux-block/20210721025315.1729118-1-guoqing.jiang@linux.dev/

Or you can just define it in raid1.c temporarily to verify the change.
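
A minimal sketch of such a temporary define, mirroring the one in drivers/md/bcache/util.h (for local testing only; the proper fix in the lore link above moves the definition into a shared header):

/* temporary, for testing only -- same as drivers/md/bcache/util.h */
#define PAGE_SECTORS		(PAGE_SIZE / 512)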
Comment 14 Jens Stutte 2021-08-11 16:03:26 UTC
OK, I just copied the define into raid1.c for testing. So far the machine runs well with this kernel (2h uptime for now, compiling Firefox seems to work). I just found one stack trace via journalctl, but I think it is unrelated:

Process 1039 (baloo_file) of user 1000 dumped core.
                                            
Stack trace of thread 1918:
#0  0x00007fd34828f3a8 n/a (liblmdb.so + 0x43a8)
#1  0x00007fd348291efe n/a (liblmdb.so + 0x6efe)
#2  0x00007fd348292644 n/a (liblmdb.so + 0x7644)
#3  0x00007fd348292c50 mdb_get (liblmdb.so + 0x7c50)
#4  0x00007fd3497a466a _ZN5Baloo10DocumentDB3getEy (libKF5BalooEngine.so.5 + 0xd66a)

I'll keep it running for a while, but without the patch it crashed immediately and repeatedly, so I am kind of optimistic that this helps.
Comment 15 Guoqing Jiang 2021-08-12 02:41:58 UTC
Okay, let's wait. I will send a formal patch with your Reported-by and Tested-by once you confirm the problem no longer happens. Thanks for testing it!
Comment 16 Jens Stutte 2021-08-12 10:37:25 UTC
FWIW, it is still running well.
Comment 17 Jens Stutte 2021-08-21 18:57:29 UTC
Hi, the custom kernel has been running without problems ever since. If I understand the discussion around the patch on the mailing list right, it should now be reviewed as the correct fix, too?
Thank you
Comment 18 Jens Stutte 2021-08-21 19:00:22 UTC
BTW: I was wondering how to disable write-behind, as it seems to be a leftover from an old configuration (when I used an SSD and a spinning disk), but there seems to be no easy way. Re-adding the device with --readwrite does not change anything, AFAICS.
Comment 19 Guoqing Jiang 2021-08-23 07:48:11 UTC
(In reply to Jens Stutte from comment #17)
> Hi, the custom kernel has been running without problems ever since. If I
> understand the discussion around the patch on the mailing list right, it
> should now be reviewed as the correct fix, too?
> Thank you

Yes, I think so.
Comment 20 Guoqing Jiang 2021-08-23 07:50:07 UTC
(In reply to Jens Stutte from comment #18)
> BTW: I was wondering how to disable write-behind, as it seems to be a
> leftover from an old configuration (when I used an SSD and a spinning disk),
> but there seems to be no easy way. Re-adding the device with --readwrite
> does not change anything, AFAICS.

Hi, I'd suggest either

1. file another bug, then describe the issue in that bug

or

2. report it to the linux-raid list.

Thanks.
Comment 21 Jens Stutte 2021-09-24 08:09:42 UTC
Created attachment 298947 [details]
Still crash with 5.14.7-arch1-1

Hello,

I just tried kernel 5.14.7-arch1-1, which is supposed to contain the patch, IIUC. Unfortunately the problem persists; see the attached log.
The custom kernel I built from the initial tentative patch (which looks a bit different from what went into the kernel sources) still works, though.
Comment 22 Jens Stutte 2021-09-24 11:37:41 UTC
AFAICS the condition that decides whether alloc_behind_master_bio is called does not use the pre-calculated write_behind flag that the patch uses (together with a NULL check on the bitmap) to decide whether to limit max_sectors. I am not sure how this is possible, but I would then assume we should just check for the bitmap.
Comment 23 Jens Stutte 2021-09-24 13:54:23 UTC
Created attachment 298951 [details]
Check max_sectors always if there is a bitmap.

With the attached patch it works again.
The condition used to decide whether we need to split differed from the condition used to decide when to alloc.
This patch just always splits if there is a bitmap and max_sectors is too big.
Not sure if it is the most optimized way, but it works. (Actually I first tried a variant that would only alloc if the WriteMostly bit is set for that specific drive, but that seemed to have a negative effect on performance, though I did no measurements.)
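
The attached patch itself is not reproduced here. As a sketch, assuming the clamp that went into 5.14.7 is guarded by "if (write_behind && bitmap)", the idea is to drop the write_behind part so the split condition matches the later alloc_behind_master_bio() call:

-	if (write_behind && bitmap)
+	if (bitmap)
 		max_sectors = min_t(int, max_sectors,
 				    BIO_MAX_VECS * (PAGE_SIZE >> 9));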
Comment 24 Guoqing Jiang 2021-09-25 09:25:52 UTC
Hmm, it means write-behind IO happens even without a write-mostly device... perhaps something is wrong somewhere.

Could you post the output of these commands?

1. cat /proc/mdstat
2. cat /sys/block/md0/md/bitmap/{backlog,max_backlog_used} - replace md0 with the actual md node.
Comment 25 Jens Stutte 2021-09-25 09:40:35 UTC
Sure:

root@vdr jens]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb1[3] sdc1[2]
      1953381440 blocks super 1.2 [2/2] [UU]
      bitmap: 1/15 pages [4KB], 65536KB chunk

unused devices: <none>

[root@vdr jens]# cat /sys/block/md0/md/bitmap/backlog 
4096

[root@vdr jens]# cat /sys/block/md0/md/bitmap/max_backlog_used 
4096

and for completeness:

[root@vdr jens]# mdadm -X /dev/sdb1
        Filename : /dev/sdb1
           Magic : 6d746962
         Version : 4
            UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
          Events : 32804
  Events Cleared : 32804
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Allow write behind, max 4096
       Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 0 dirty (0.0%)

[root@vdr jens]# mdadm -X /dev/sdc1
        Filename : /dev/sdc1
           Magic : 6d746962
         Version : 4
            UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
          Events : 32804
  Events Cleared : 32804
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Allow write behind, max 4096
       Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 0 dirty (0.0%)
Comment 26 Guoqing Jiang 2021-09-25 10:11:20 UTC
(In reply to Jens Stutte from comment #25)
> Sure:
> 
> root@vdr jens]# cat /proc/mdstat 
> Personalities : [raid1] 
> md0 : active raid1 sdb1[3] sdc1[2]
>       1953381440 blocks super 1.2 [2/2] [UU]
>       bitmap: 1/15 pages [4KB], 65536KB chunk
> 
> unused devices: <none>

There is no write-mostly device above; however, the backlog is not zero, per below:

> 
> [root@vdr jens]# cat /sys/block/md0/md/bitmap/backlog 
> 4096
> 
> [root@vdr jens]# cat /sys/block/md0/md/bitmap/max_backlog_used 
> 4096
> 
> and for completeness:
> 
> [root@vdr jens]# mdadm -X /dev/sdb1
>         Filename : /dev/sdb1
>            Magic : 6d746962
>          Version : 4
>             UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
>           Events : 32804
>   Events Cleared : 32804
>            State : OK
>        Chunksize : 64 MB
>           Daemon : 5s flush period
>       Write Mode : Allow write behind, max 4096
>        Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
>           Bitmap : 29807 bits (chunks), 0 dirty (0.0%)
> 
> [root@vdr jens]# mdadm -X /dev/sdc1
>         Filename : /dev/sdc1
>            Magic : 6d746962
>          Version : 4
>             UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
>           Events : 32804
>   Events Cleared : 32804
>            State : OK
>        Chunksize : 64 MB
>           Daemon : 5s flush period
>       Write Mode : Allow write behind, max 4096
>        Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
>           Bitmap : 29807 bits (chunks), 0 dirty (0.0%)

Err, I suppose you added "--write-behind=4096" when creating the array, but you didn't specify a write-mostly device. Per "man mdadm":

--write-behind=
    Specify that write-behind mode should be enabled (valid for RAID1 only). If an argument is specified, it will set the maximum number of outstanding writes allowed. The default value is 256. A write-intent bitmap is required in order to use write-behind mode, and write-behind is only attempted on drives marked as write-mostly. 

If my guess is correct, then we need to prohibit mdadm from creating an array like this. Also, we should not write to the backlog node if no write-mostly device is available.

OTOH, could you test the change below? Thanks.

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 19598bd38939..6ba12f0f0f03 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1496,7 +1496,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
                if (!r1_bio->bios[i])
                        continue;
 
-               if (first_clone) {
+               if (first_clone && test_bit(WriteMostly, &rdev->flags)) {
Comment 27 Jens Stutte 2021-09-25 10:23:03 UTC
Will do. I propose also moving the first_clone = 0 assignment out of that if condition (actually I am wondering why we do not use a boolean here, too).
Comment 28 Jens Stutte 2021-09-25 10:27:22 UTC
Actually I do not really understand what that first_clone is supposed to mean. I would have assumed that we attempt write_behind on all devices flagged with write_mostly. This variable seems to allow it only for the device that happens to be first in the list, which seems like a random choice to me. Am I missing something?
Comment 29 Jens Stutte 2021-09-25 10:37:24 UTC
(In reply to Jens Stutte from comment #28)
> Actually I do not really understand what that first_clone is supposed to
> mean. I would have assumed that we attempt write_behind on all devices
> flagged with write_mostly. This variable seems to allow it only for the
> device that happens to be first in the list, which seems like a random
> choice to me. Am I missing something?

Sorry for the noise, the

		if (!r1_bio->bios[i])
			continue;

happens before, and we want to initialize that memory just once, if it is needed at all (it is part of device-independent structures).
Comment 30 Guoqing Jiang 2021-09-25 10:45:22 UTC
We only need to allocate the behind bio one time; bio_clone_fast then clones the behind bio.
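
A simplified sketch of that pattern in raid1_write_request() (paraphrased, not verbatim kernel code): the behind master bio is built at most once, guarded by first_clone, and each per-device write bio is then a cheap clone:

		if (first_clone) {
			/* build the write-behind copy of the payload once,
			 * if a bitmap exists and the backlog allows it */
			if (bitmap &&
			    atomic_read(&bitmap->behind_writes) <
				mddev->bitmap_info.max_write_behind)
				alloc_behind_master_bio(r1_bio, bio);
			first_clone = 0;
		}

		/* clone from the behind master if it was allocated,
		 * otherwise from the original bio */
		if (r1_bio->behind_master_bio)
			mbio = bio_clone_fast(r1_bio->behind_master_bio,
					      GFP_NOIO, &mddev->bio_set);
		else
			mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);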
Comment 31 Jens Stutte 2021-09-25 12:42:41 UTC
Created attachment 298969 [details]
patch that allow write behind only if write mostly is set

I tried to make variable names more understandable, too. Hope I got it right!
Comment 32 Jens Stutte 2021-09-25 12:43:23 UTC
And yes, it seems to work for me (I'll let the machine run for a while, though).
Comment 33 Jens Stutte 2021-09-26 09:28:00 UTC
(In reply to Jens Stutte from comment #32)
> And yes, it seems to work for me (I'll let the machine run for a while,
> though).

FWIW, still running fine here.
Comment 34 Jens Stutte 2021-09-26 09:41:16 UTC
Recap:

In 5.12, bvec_alloc introduced a BUG() when called with a too-large vector count. This triggered a bug in raid1_write_request that happens only with a RAID1 configuration that has write-behind enabled but no write-mostly device, which is an unexpected configuration, as write-behind on a device that is also read from is probably unsafe for reading immediately after a write. (I assume this is mostly mitigated by the Linux page cache, such that I did not notice any problems, but under high memory pressure it might still cause trouble.)
The patch that went into 5.14.7-arch1 instead kept the assumption that write-behind is only active with at least one write-mostly device in the array, and thus failed to repair the situation on this bogus configuration.
The last patch submitted here ensures that such a misconfiguration does not lead to problems; the consequence is that no write-behind will be used at all with this kind of configuration.
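
For reference, the BUG() in question lives in the bvec pool lookup that the 5.12 patch introduced; the function looks roughly like this (paraphrased from block/bio.c in 5.12), with the default branch being the reported line 52:

static struct biovec_slab *biovec_slab(unsigned short nr_vecs)
{
	switch (nr_vecs) {
	/* smaller bios use inline vecs */
	case 5 ... 16:
		return &bvec_slabs[0];
	case 17 ... 64:
		return &bvec_slabs[1];
	case 65 ... 128:
		return &bvec_slabs[2];
	case 129 ... BIO_MAX_VECS:
		return &bvec_slabs[3];
	default:
		BUG();		/* "kernel BUG at block/bio.c:52!" */
		return NULL;
	}
}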
Comment 35 Ian Dall 2021-09-29 02:39:51 UTC
I'm seeing a similar problem, but not sure that it is the same.

I do get the kernel BUG at block/bio.c:52! in my logs. I do have a raid1 configuration with a write-behind device, but it IS also write-mostly, so it doesn't fit the above profile.

Approximately once per week I get a freeze, with messages scrolling continuously on the console (far too fast to read), and the only solution is to reboot. The last logged message is always the "kernel BUG at block/bio.c:52!" message.

This is with kernel 5.13.14-200.fc34. I think this has the bcache patch commit 41fe8d088e96472f63164e213de44ec77be69478 in it.
Comment 36 Guoqing Jiang 2021-09-30 15:59:03 UTC
(In reply to Ian Dall from comment #35)
> I'm seeing a similar problem, but not sure that it is the same.
> 
> I do get the kernel BUG at block/bio.c:52! in my logs. I do have a raid1
> configuration with a write-behind device, but it IS also write-mostly, so
> it doesn't fit the above profile.

Given your array has a write-mostly device, I suppose your problem can be fixed by commit 6607cd319b6b91bff94e90f798a61c031650b514; let me know if it doesn't work.
Comment 37 Jens Stutte 2021-11-21 16:08:10 UTC
I just installed 5.15.3.arch1-1, which contains the patch https://github.com/archlinux/linux/commit/3c0049a01b2176413696177f837b3ef33860ec13.

So far it seems to have solved the problem for good. Thanks!
Comment 38 Jens Stutte 2021-12-04 15:50:31 UTC
Continues to work for me here.