Created attachment 296953 [details]
kernel BUG at block/bio.c:52!

The patch [`block: use bi_max_vecs to find the bvec pool`](https://www.spinics.net/lists/linux-block/msg64802.html) causes (or rather reveals) a kernel bug (see attachment). I assume there has always been something wrong somewhere in md that was silently masked before. I am not sure whether it is really serious enough to justify a BUG() here. I have a RAID 1 setup with two 2TB SSDs. I am happy to provide any further information, of course.
Link to the diff of the patch that introduced this BUG(): https://github.com/torvalds/linux/commit/7a800a20ae6329e803c5c646b20811a6ae9ca136#diff-8192395c9eecb8e11533bf38d0ee498d5778724d859ca0a7cb2fdbabe8002eb2
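For context, the BUG() appears to fire in biovec_slab(), the helper bvec_alloc() uses to pick a bvec pool. The following is a paraphrase from the 5.12 sources after the commit above (not a verbatim copy, and the exact line number may differ between releases): any request for more bio vectors than BIO_MAX_VECS falls through to the default case.

/* Paraphrased from block/bio.c after commit 7a800a20ae63 (not verbatim):
 * asking for more than BIO_MAX_VECS bio vectors hits the BUG(). */
static struct biovec_slab *biovec_slab(unsigned short nr_vecs)
{
	switch (nr_vecs) {
	/* smaller bios use inline vecs */
	case 5 ... 16:
		return &bvec_slabs[0];
	case 17 ... 64:
		return &bvec_slabs[1];
	case 65 ... 128:
		return &bvec_slabs[2];
	case 129 ... BIO_MAX_VECS:
		return &bvec_slabs[3];
	default:
		BUG();
		return NULL;
	}
}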
*** Bug 213303 has been marked as a duplicate of this bug. ***
*** Bug 213249 has been marked as a duplicate of this bug. ***
This bug appears to be mentioned here: https://www.spinics.net/lists/linux-bcache/msg10077.html They provided a temporary patch further down the thread: https://www.spinics.net/lists/linux-bcache/msg10108.html
the patch seems to work fine... -arne
5.12.11 has been working fine for hours now... -arne
Hi Arne, did you test with a bcache setup or with plain md? I see this patch only touches bcache-related files, and I am not using bcache here at all (and I'd rather test on that machine only if I can hope that it works :-) ). Thanks!
Hi Jens, I only tested it with bcache. The changelog doesn't say anything about md, right? https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.12.11 So md might still be affected. -arne
The last comment from Ilgiz on https://bugs.archlinux.org/task/70992 indicates that it is not solved yet for md users.
Does the raid1 mentioned in archlinux's bug have a write-behind device? Anyway, could you try this small change? Thanks.

--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1456,6 +1456,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		goto retry_write;
 	}
 
+	max_sectors = min_t(uint32_t, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
 	if (max_sectors < bio_sectors(bio)) {
 		struct bio *split = bio_split(bio, max_sectors,
 					      GFP_NOIO, &conf->bio_split);
Hi Guoqing, yes, the devices in this array seem to have write-behind enabled:

# mdadm -X /dev/sdb1
        Filename : /dev/sdb1
           Magic : 6d746962
         Version : 4
            UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
          Events : 32707
  Events Cleared : 32707
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Allow write behind, max 4096
       Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 0 dirty (0.0%)

# mdadm -X /dev/sdc1
        Filename : /dev/sdc1
           Magic : 6d746962
         Version : 4
            UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
          Events : 32707
  Events Cleared : 32707
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Allow write behind, max 4096
       Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 0 dirty (0.0%)

I cannot safely try this patch any time soon, as I am away from the machine and I need it up and running.
Trying the patch above leads to a compile error, as PAGE_SECTORS is not defined. I can find its definition inside "bcache/util.h", but it feels kind of wrong to include that inside raid1.c?
That is why I sent this patch: https://lore.kernel.org/linux-block/20210721025315.1729118-1-guoqing.jiang@linux.dev/ Or you can just define it in raid1.c temporarily, just to verify the change.
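For testing only, something like the following minimal stop-gap should be enough; it is just a sketch that mirrors the bcache definition and assumes 512-byte sectors, not the proper fix (the proper fix is the lore patch above):

/* Throwaway define for testing only, copied in spirit from
 * drivers/md/bcache/util.h: number of 512-byte sectors per page. */
#ifndef PAGE_SECTORS
#define PAGE_SECTORS	(PAGE_SIZE / 512)
#endif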
OK, I just copied the define into raid1.c for testing. So far the machine runs well with the patched kernel (2h uptime for now; compiling firefox seems to work). I just found one stack trace via journalctl, but I think it is unrelated:

Process 1039 (baloo_file) of user 1000 dumped core.

Stack trace of thread 1918:
#0  0x00007fd34828f3a8 n/a (liblmdb.so + 0x43a8)
#1  0x00007fd348291efe n/a (liblmdb.so + 0x6efe)
#2  0x00007fd348292644 n/a (liblmdb.so + 0x7644)
#3  0x00007fd348292c50 mdb_get (liblmdb.so + 0x7c50)
#4  0x00007fd3497a466a _ZN5Baloo10DocumentDB3getEy (libKF5BalooEngine.so.5 + 0xd66a)

I'll keep it running for a while, but without the patch it crashed immediately and repeatedly, so I am kind of optimistic that this helps.
Okay, let's wait. I will send a formal patch with your Reported-by and Tested-by once you confirm the problem no longer happens. Thanks for testing it!
FWIW, it is still running well.
Hi, the custom kernel has been running without problems ever since. If I understand the discussion around the patch on the mailing list correctly, it should by now have been reviewed as the correct fix, too? Thank you
BTW: I was wondering how to disable write-behind as it seems to be a remainder of an old configuration (when I used an SSD and a spinning disk), but there seems to be no easy way. Re-adding the device with --readwrite does not change anything, AFAICS.
(In reply to Jens Stutte from comment #17)
> Hi, the custom kernel has been running without problems ever since. If I
> understand the discussion around the patch on the mailing list correctly, it
> should by now have been reviewed as the correct fix, too?
> Thank you

Yes, I think so.
(In reply to Jens Stutte from comment #18)
> BTW: I was wondering how to disable write-behind as it seems to be a
> remainder of an old configuration (when I used an SSD and a spinning disk),
> but there seems to be no easy way. Re-adding the device with --readwrite
> does not change anything, AFAICS.

Hi, I'd suggest to either

1. file another bug and describe the issue there, or
2. report it to the linux-raid list.

Thanks.
Created attachment 298947 [details]
Still crash with 5.14.7-arch1-1

Hello, I just tried kernel 5.14.7-arch1-1, which is supposed to contain the patch, IIUC. Unfortunately the problem persists, see the attached log. The custom kernel I built from the initial tentative patch (which looks a bit different from what went into the kernel sources) still works, though.
AFAICS the condition that decides whether alloc_behind_master_bio is called does not use the pre-calculated write_behind flag that the patch uses (together with a NULL check on bitmap) to decide whether to limit max_sectors. I am not sure how this is possible, but I would then assume we should just check for the bitmap.
Created attachment 298951 [details]
Check max_sectors always if there is a bitmap.

With the attached patch it works again. The condition used to decide whether we need to split differed from the condition used to decide when to allocate. This patch simply splits whenever there is a bitmap and max_sectors is too big. Not sure if it is the most optimized way, but it works. (Actually I first tried a variant that would only allocate the behind bio if the WriteMostly bit is set for that specific drive, but that seemed to have negative effects on performance, though I did no measurements.)
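The attachment is not inlined here, but as a rough sketch of the idea (not the exact patch), the clamp suggested earlier is simply made unconditional on a bitmap being present, so a later alloc_behind_master_bio() can never need more than BIO_MAX_VECS pages:

/* Rough illustration of the idea, not the exact attachment: whenever the
 * array has a bitmap, clamp the request to what one behind master bio
 * can hold (BIO_MAX_VECS pages). */
if (bitmap)
	max_sectors = min_t(uint32_t, max_sectors,
			    BIO_MAX_VECS * (PAGE_SIZE >> SECTOR_SHIFT));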
Hmm, that means write-behind I/O happens even without a write-mostly device..., so perhaps something is wrong somewhere. Could you post the output of these commands?

1. cat /proc/mdstat
2. cat /sys/block/md0/md/bitmap/{backlog,max_backlog_used} - replace md0 with the actual md node.
Sure:

[root@vdr jens]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[3] sdc1[2]
      1953381440 blocks super 1.2 [2/2] [UU]
      bitmap: 1/15 pages [4KB], 65536KB chunk

unused devices: <none>

[root@vdr jens]# cat /sys/block/md0/md/bitmap/backlog
4096

[root@vdr jens]# cat /sys/block/md0/md/bitmap/max_backlog_used
4096

and for completeness:

[root@vdr jens]# mdadm -X /dev/sdb1
        Filename : /dev/sdb1
           Magic : 6d746962
         Version : 4
            UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
          Events : 32804
  Events Cleared : 32804
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Allow write behind, max 4096
       Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 0 dirty (0.0%)

[root@vdr jens]# mdadm -X /dev/sdc1
        Filename : /dev/sdc1
           Magic : 6d746962
         Version : 4
            UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
          Events : 32804
  Events Cleared : 32804
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Allow write behind, max 4096
       Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 0 dirty (0.0%)
(In reply to Jens Stutte from comment #25)
> Sure:
>
> [root@vdr jens]# cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdb1[3] sdc1[2]
>       1953381440 blocks super 1.2 [2/2] [UU]
>       bitmap: 1/15 pages [4KB], 65536KB chunk
>
> unused devices: <none>

There is no write-mostly device above, however the backlog is not zero per below.

> [root@vdr jens]# cat /sys/block/md0/md/bitmap/backlog
> 4096
>
> [root@vdr jens]# cat /sys/block/md0/md/bitmap/max_backlog_used
> 4096
>
> and for completeness:
>
> [root@vdr jens]# mdadm -X /dev/sdb1
>         Filename : /dev/sdb1
>            Magic : 6d746962
>          Version : 4
>             UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
>           Events : 32804
>   Events Cleared : 32804
>            State : OK
>        Chunksize : 64 MB
>           Daemon : 5s flush period
>       Write Mode : Allow write behind, max 4096
>        Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
>           Bitmap : 29807 bits (chunks), 0 dirty (0.0%)
>
> [root@vdr jens]# mdadm -X /dev/sdc1
>         Filename : /dev/sdc1
>            Magic : 6d746962
>          Version : 4
>             UUID : 5532ffda:ccbc790f:b50c4959:8f0fd43f
>           Events : 32804
>   Events Cleared : 32804
>            State : OK
>        Chunksize : 64 MB
>           Daemon : 5s flush period
>       Write Mode : Allow write behind, max 4096
>        Sync Size : 1953381440 (1862.89 GiB 2000.26 GB)
>           Bitmap : 29807 bits (chunks), 0 dirty (0.0%)

Err, I suppose you added "--write-behind=4096" when creating the array, but you didn't specify a write-mostly device. Per "man mdadm":

--write-behind=
    Specify that write-behind mode should be enabled (valid for RAID1 only). If an argument is specified, it will set the maximum number of outstanding writes allowed. The default value is 256. A write-intent bitmap is required in order to use write-behind mode, and write-behind is only attempted on drives marked as write-mostly.

If my guess is correct, then we need to prohibit mdadm from creating an array like this. Also we should not write the backlog node if no write-mostly device is available. OTOH, could you test the change below? Thanks.

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 19598bd38939..6ba12f0f0f03 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1496,7 +1496,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		if (!r1_bio->bios[i])
 			continue;
 
-		if (first_clone) {
+		if (first_clone && test_bit(WriteMostly, &rdev->flags)) {
Will do. I propose to also move the first_clone = 0 out of that if condition (actually I am wondering why we do not use a boolean here, too).
Actually I do not really understand what first_clone is supposed to mean. I would have assumed that we try write-behind on all devices flagged with write-mostly. This variable seems to allow it only for the device that happens to be first in the list, which seems like a random choice to me. Am I missing something?
(In reply to Jens Stutte from comment #28)
> Actually I do not really understand what first_clone is supposed to mean. I
> would have assumed that we try write-behind on all devices flagged with
> write-mostly. This variable seems to allow it only for the device that
> happens to be first in the list, which seems like a random choice to me. Am
> I missing something?

Sorry for the noise, the

	if (!r1_bio->bios[i])
		continue;

happens before, and we only want to initialize that memory once, if it is needed at all (it is part of device-independent structures).
We only need to allocate the behind bio once; bio_clone_fast then clones the behind bio for each device.
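In other words, the clone loop looks roughly like the following paraphrased sketch (not the exact raid1.c code): the behind master bio is built once, guarded by first_clone, and every per-device write bio is a clone of either the behind master bio or the original bio.

/* Paraphrased sketch of raid1_write_request()'s clone loop, not the
 * exact kernel code: the behind master bio is allocated only once
 * (guarded by first_clone), and each per-device write bio is a
 * bio_clone_fast() of either the behind master bio or the original bio. */
for (i = 0; i < disks; i++) {
	struct bio *mbio;

	if (!r1_bio->bios[i])
		continue;

	if (first_clone) {
		if (write_behind && bitmap)
			alloc_behind_master_bio(r1_bio, bio);
		first_clone = 0;
	}

	if (r1_bio->behind_master_bio)
		mbio = bio_clone_fast(r1_bio->behind_master_bio,
				      GFP_NOIO, &mddev->bio_set);
	else
		mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);

	/* ... per-device setup and submission of mbio ... */
}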
Created attachment 298969 [details]
Patch that allows write-behind only if write-mostly is set

I also tried to make the variable names more understandable. Hope I got it right!
And yes, it seems to work for me (I'll let the machine run for a while, though).
(In reply to Jens Stutte from comment #32)
> And yes, it seems to work for me (I'll let the machine run for a while,
> though).

FWIW, still running fine here.
Recap: In 5.12, bvec_alloc introduced a BUG() if called with too many vectors. This triggered a bug in raid1_write_request that happens only with a RAID 1 configuration that has no write-mostly device but has write-behind enabled, which is an unexpected configuration, as write-behind on a device we also read from is probably unsafe for reading immediately after a write. (I assume this is mostly mitigated by the Linux page cache, such that I did not notice any problems, but under high memory pressure it might still cause trouble.)

The patch that went into 5.14.7-arch1 instead kept the assumption that write-behind is only active with at least one write-mostly device in the array, and thus failed to repair the situation for this bogus configuration. The last patch submitted here ensures that such a misconfiguration does not lead to problems; the consequence is that no write-behind will be used at all on this kind of configuration.
I'm seeing a similar problem, but I'm not sure it is the same. I do get kernel BUG at block/bio.c:52! in my logs. I do have a raid1 configuration with a write-behind device, but it IS also write-mostly, so it doesn't fit the above profile. Approximately once per week I get a freeze, where some messages scroll continuously on the console (far too fast to read) and the only solution is to reboot. The last logged message is always the `BUG at block/bio.c:52!` message. This is with kernel 5.13.14-200.fc34. I think this has the bcache patch commit 41fe8d088e96472f63164e213de44ec77be69478 in it.
(In reply to Ian Dall from comment #35)
> I'm seeing a similar problem, but I'm not sure it is the same.
>
> I do get kernel BUG at block/bio.c:52! in my logs. I do have a raid1
> configuration with a write-behind device, but it IS also write-mostly, so it
> doesn't fit the above profile.

Given your array has a write-mostly device, I suppose your problem can be fixed by 6607cd319b6b91bff94e90f798a61c031650b514; let me know if it doesn't work.
I just installed 5.15.3.arch1-1 which contains patch https://github.com/archlinux/linux/commit/3c0049a01b2176413696177f837b3ef33860ec13. So far it seems to solve the problem for good now. Thanks!
Continues to work for me here.